MOVI very aggressive about matching utterances to sentence vocabulary

Hello again,


I’ve just started trying to program my MOVI to select various “diagnostic modes” in my Arduino project. So far I have found that, after receiving the call sign, MOVI is very “aggressive” about matching any further utterance to words trained in the “addSentence” commands in setup(). For example, with only two sentences trained, 1) “spark advance” and 2) “solenoids”, if I say a word like “nothing” or “cancel”, getResult() (called after poll() returns RAW_WORDS) reports that I said “advance”, and the positive value I then get from poll() is the sentence number for “spark advance”. The false positive rate I am getting with this behaviour is undesirable.


I did try the “trick” in the WordSpotter example where one trains a “background model” with a bunch of sentences that contain frequently used English words, but then I get the opposite problem: even if I clearly say “spark advance”, MOVI thinks I said some combination of the background model words instead. This only seems to serve to flip the problem on its head where I get almost everything coming back false negative.


I am working in a very quiet “lab” environment with virtually no background noise. I hope to move MOVI into my car where there is going to be a lot more ambient noise around, but for now, I think I can safely say that I’m not suffering from background noise issues. I am a native English speaker; if anything I would think my “Canadian articulation” (not maritime) should be about as optimal as it gets.


I have another post with respect to the use of an external microphone - this is unrelated as the behaviour I am describing here is observed with MOVI’s built in microphone.


Is this expected behaviour from MOVI, that it would very aggressively try to match utterances to trained words? Any advice on improving the strength of the phonetic matching? Would it be better if I put together some code sketch examples to evaluate?


Dylan.

[Last edited Apr 07, 2016 03:24:21]

As usual, the caveat is that speech recognition is not a solved problem and many tasks will require ‘tinkering’. Which, in my humble opinion, makes speech recognition a perfect area for some fun maker projects… Anyway, MOVI, like any speech recognizer, will always try to match everything it ‘hears’ to the expected words as a first step and to the expected sentences as a second step. Also, if you haven’t already, please take a look at Chapter 4 of our user manual.


If you can, a code example would indeed help, as from your description I am pretty sure these are not microphone problems. Without having seen your code: if you keep the background model as individual words and make the phrases you want recognized three- to five-word sentences, you will probably get much better results.


Hope that helped…

Thanks again Gerald. I have tried to read the user manual in detail. My feedback would be that it does not provide a lot of strong guidance in this realm.


It is fun to tinker with it to a point, then without a great deal of transparency into how the mechanisms work and why decisions are being made, it becomes difficult to remain optimistic without any grasp on what to try next.


I have completed a sketch for my application that includes all the “diagnostic commands” that I desire for now. I have coded my own timeout between callsign to successful command recognition. I have added compensation for utterances heard before the callsign that end up being falsely interpreted as commands following the call sign by adjusting the threshold at particular times. I want to field test this a bit before sending any code sketches or asking anything further on this topic. I know for sure that I’m going to have some suggestions/feedback for you with respect to the lack of sentence listening timeouts, interpretation of utterances which precede the callsign, and CALLSIGN_DETECTED event not always preceding BEGIN_LISTEN, but it’s too early just yet without some further validation.


Unfortunately this is now complicated by the fact that my application requires a reliable external microphone in a somewhat noisy environment at a distance of about 2 feet. I may need to break through that problem first, we will see.


Dylan.

Dylan,


The reason why we can’t give a lot of detail in the manual is that every case is different. It’s like being a lawyer or a tax accountant: certain standard rules apply, but then every project is different. This is why companies pay hundreds of thousands of dollars to get a telephone dialog system running.


Anyway, I am looking forward to seeing some code when you are ready and I probably will have some ideas. I would in general recommend not using your own timeouts and not tinkering too much with the threshold. But again, it may make sense for you. In your case, however, it will definitely make sense to make sure your sentences are phonetically wide apart. For example, “turn the light on” and “turn the light off” are two sentences that are super close together, as only “n” and “f” make the difference. Our standard example “let there be light” and “go dark” shows how to make them widely different. Then, if you can include context, you can help increase the accuracy. To stay with the light example, you could just use the sentence “switch the light” and have a bool variable store the state of the light, which leaves one fewer sentence to (mis-)recognize.
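
The “switch the light” idea could look roughly like the following. This is only a minimal sketch: LightController and onSentence are hypothetical names that are not part of the MOVI library, and it assumes that sentence number 1 was trained as “switch the light” and that the number passed in is the positive value returned by poll().

```cpp
// Hypothetical helper (not part of the MOVI library): a single trained
// sentence toggles a stored state, so "on" and "off" never have to be
// told apart acoustically.
struct LightController {
    bool lightOn = false;  // current state of the light

    // sentenceId is assumed to be the positive value returned by poll();
    // 1 is assumed to be the trained sentence "switch the light".
    // Returns true if the sentence was handled.
    bool onSentence(int sentenceId) {
        if (sentenceId != 1) {
            return false;      // some other sentence, leave the state alone
        }
        lightOn = !lightOn;    // same sentence switches on or off
        return true;
    }
};
```

In an Arduino sketch, loop() would pass each positive poll() result to onSentence() and then drive the output pin from lightOn.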


Another tip: the micdebug command (see the microphone thread) is probably your friend, and looking at the raw results in the console will probably help a lot too.


But again, this is general advice… hope it helps anyway.

[Last edited Apr 08, 2016 16:43:40]

My code is quite extensive because it contains all sorts of diagnostics and integrations, but the main loop() routine where all the MOVI interaction takes place isn’t too bad. I need to solve my microphone problem before I’d feel confident in sharing the code.


Perhaps I can pose a couple of general scenarios for your comment in the meantime, in case it causes me to change course.


Timeout problem: I find that after the callsign is received, MOVI will wait indefinitely for my voice. Because of the aggressive matching I am experiencing, I wanted to have a 5 second timeout between callsign and sentence recognition. So after receiving the -140 from poll(), if MOVI is still waiting for me to say something, I do movi.stopDialog() followed by movi.restartDialog(). This has so far been successful for me to get MOVI to just go back to listening for the callsign after 5 seconds of hearing nothing. Reasonable?
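
The timeout described above could be tracked with something like the following. It is a sketch under assumptions, not code from the actual project: ListenTimeout is a hypothetical helper, the constants are named differently from the library's BEGIN_LISTEN and END_LISTEN to avoid clashing with them, and the values -140 and -141 are simply the ones seen from poll() in this thread. Time is passed in explicitly, so on an Arduino the caller would supply millis().

```cpp
#include <cstdint>

// Event codes as they appear in this thread's poll() output.
const int BEGIN_LISTEN_EVENT = -140;
const int END_LISTEN_EVENT   = -141;

// Hypothetical helper, not part of the MOVI library.
class ListenTimeout {
public:
    explicit ListenTimeout(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Feed every non-zero poll() result here together with the current time.
    void onEvent(int pollResult, uint32_t nowMs) {
        if (pollResult == BEGIN_LISTEN_EVENT) {
            waiting_ = true;          // MOVI started waiting for a sentence
            startedMs_ = nowMs;
        } else if (pollResult == END_LISTEN_EVENT || pollResult > 0) {
            waiting_ = false;         // something was heard or matched
        }
    }

    // True once, when the window has elapsed without END_LISTEN or a sentence.
    // The caller would then issue stopDialog() followed by restartDialog().
    bool expired(uint32_t nowMs) {
        // Unsigned subtraction stays correct across millis() wraparound.
        if (waiting_ && (nowMs - startedMs_) >= timeoutMs_) {
            waiting_ = false;
            return true;
        }
        return false;
    }

private:
    uint32_t timeoutMs_;
    uint32_t startedMs_ = 0;
    bool waiting_ = false;
};
```

Keeping the bookkeeping in one small object means loop() only has to call onEvent() for each poll() result and check expired() once per pass.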


Utterances preceding callsign problem: I’m actually using a Sparkfun MP3 shield to announce results of diagnostics triggered by MOVI commands. This predates me having MOVI - I was using the other Arduino voice recognition board. MOVI’s voice is reasonably impressive, but the MP3 voices are that much more lifelike, so I’d like to keep things this way. So the first time I say the callsign, MOVI waits up to 5 seconds (my custom timeout) for my first command, then the MP3 announces the results of the requested diagnostic. The second time I give MOVI the callsign, it responds with -140 and then IMMEDIATELY with -141, choosing one of the available sentences, usually a smaller/simpler one. What I did was call setThreshold(95) before playing an MP3 voice and then set the threshold back to the default after the MP3 has finished playing. This seems to have solved the problem for now (though I’m worried that once I have a mic that I don’t have to “eat” to make it work with MOVI, MOVI may hear the MP3 even with a 95 threshold). I tried the stopDialog/restartDialog thing after playing the MP3, as I did for the timeout, but it made no difference to the undesired behaviour. Reasonable? Is it expected behaviour that MOVI would cache utterances prior to the callsign and use those in sentence matching?


CALLSIGN_DETECTED not sent problem: A large percentage of the time, poll() doesn’t return CALLSIGN_DETECTED prior to BEGIN_LISTEN. For this one you may need to see my code to comment. For the time being, I’m using BEGIN_LISTEN as an indication that the callsign was detected.


If it helps to get an idea of how my application is intended to work here is a video with the old voice recognition board:


https://www.youtube.com/watch?v=6Y2CrRenziA

There is no need to watch the whole video, you’ll get the idea after the first 30 seconds or so. I was very lucky with that run by the way; this is not meant to say that I think the other board worked well at all. On the contrary, the other board I speak of is extremely unreliable in my experience. False positive rates were very frustrating. I am hopeful, based on what I saw on the “bench” using MOVI’s integrated mic, that MOVI can do much better.


Thanks for your advice,

Dylan.

[Last edited Apr 08, 2016 20:49:21]

If MOVI is waiting a long time for you to finish your voice input, it means you need to increase the threshold. Too much noise is making MOVI think somebody is still speaking. Resetting MOVI’s recognizer with stopDialog() and restartDialog() is possible, but it shouldn’t be necessary if you tune the threshold right.


I am not really understanding your second paragraph. But rather than fiddling with setting the threshold to 95 (which pretty much will kill any recognition, no matter how loud), I wonder if you could make the loudspeakers not be heard by MOVI. I guess you are right: this means you need to fiddle with your microphone first. MOVI doesn’t ‘cache’ any sentences. After stopDialog() all results should be gone.


Using BEGIN_LISTEN is fine, and safer with respect to race conditions! Not sure why you wouldn’t see CALLSIGN_DETECTED at all though.


Seeing your video, your best bet is a directed microphone that catches as little noise as possible from the car and other external sources. It’s impressive though!

The desire I have for MOVI not to wait too long for a sentence after a callsign stems from a legacy of false positives on the old board’s “Trigger” (their word for callsign). Maybe I can do away with this whole notion with MOVI if I can count on it not to false positive the callsign. I never could figure out how the old board would get “okaymontycarlow” from casual conversation, but it would, and very regularly. Anyway, it’s not that MOVI is waiting too long for me to finish my sentence after the call sign, it’s waiting too long for me to start talking. At least in the old paradigm I have in my head anyway.


You may not understand the second paragraph entirely because the behaviour is very wonky. In my application, MOVI will quickly (subsecond) return BEGIN_LISTEN, END_LISTEN, RAW_WORDS, and a positive sentence number after a call sign, not nearly enough time to have said anything. This would only happen after the MP3 voice responded. I figured it must be because MOVI heard the MP3 voice speaking prior to the callsign because nothing else that I could think of made sense. If it continues to bother me once my microphone issues are resolved then I’ll take a video of it to show you.


With respect to the CALLSIGN_DETECTED issue, once I’m comfortable in sharing my code you can have a look, and hopefully I can screen capture my Serial monitor into a little video to show what MOVI is returning from poll(). It’s low priority though seeing as how BEGIN_LISTEN works just fine and you actually think it is preferable. I doubt I’d hit a race condition either way, it’s just the indicator I use to have the MP3 player say “Yes Dylan”.


This project has been a long time in the works. One could accomplish the same on a modern OBD-II car computer much more easily. I challenged myself to hack into an old 6801 based OBD-1 car computer. Now I just need MOVI to help me make it shine.


Thanks again, hopefully I can move forward again soon with a microphone solution.


Dylan.

Okay, so I’m back again with a bunch of progress and, I believe, some more finely tuned questions. I’m going to post them here in order of priority (to me anyway), and try to be as succinct as I can:


QUESTION 1 - Customized externally generated sound between callsign and sentence recognition:


You can totally disregard my misunderstanding above about MOVI “caching” the sound prior to the callsign. Let’s start again on that one as I now understand the true nature of the problem:


If you watch my video above with me using my old voice recognition board, you’ll see that I say the “callsign”, then my Sparkfun MP3 says “Yes Dylan”, then I say my command. Since making that video, I replaced the “Yes Dylan” with a “Whoosh Whoosh” sound because the old board was so darn bad about false positiving and I got tired of hearing my name when I didn’t ask for it. For what it’s worth the sound file is here:


http://m1knight.prekfun.com/m1KnightRider/download/audio/soundEffects/scannerSweep.wav

But the problem I run into with MOVI is the same regardless of what sound I play. When I say the callsign to MOVI and the Sparkfun MP3 plays “Whoosh Whoosh”, MOVI comes back and thinks I said “Oh Two”. When I changed “Oh Two” to “Oxygen”, it comes back with “Coolant”. I tried doing this (see code snip below) but it has no effect. I have a feeling MOVI does not obey setThreshold if it has already reached a state of BEGIN_LISTEN. Am I right there?



…etc…
if ((movi_res = movi.poll()) != 0 ) {
  Serial.print(F("movi.poll() = "));
  Serial.println(movi_res);

  if (movi_res == BEGIN_LISTEN) {
    movi.setThreshold(95);
    mp3.stopTrack();
    mp3.setVolume(40, 40);
    mp3.playMP3("TRIGGER.MP3");
    vr_mode = VR_MODE_SPEAKING_TRIGGER;

…etc…

Dylan,


First observation:

The setThreshold in this context doesn’t seem to be working. I put MICDEBUG ON and can usually hear the MP3 playing back from MOVI’s headphone jack.

You are indeed correct: setThreshold takes a while because the internal speech/non-speech model needs to be reset. So it will not be set quickly enough to ignore the sound you are playing. You could try to recognize that particular sound (using a bogus sentence like “oh two”) and then ignore it, or you could set MOVI’s output volume to 0 (which is immediate) and SAY a sentence about as long as you need it (and then set the output volume high again), but these are nasty workarounds. Here is the good news: the MOVI 1.1 firmware update includes a PLAY command. This means you will be able to ask MOVI to play back that file. And of course, MOVI knows not to accidentally process its own output. So your best bet is to stay tuned and wait for us to push the update in a couple of weeks (we want to have it for Maker Faire Bay Area).


Question 2:

Yes, this is a reasonable feature request. I am actually already working on something similar for MOVI 1.1. So, I need to have you wait for MOVI 1.1 on this one too. Sorry.


Question 3:

At this point you are doing everything right. I’ve tried, but unfortunately, I can’t reproduce the problem on my end using this snippet.

Thanks again for your response and support Gerald. Just some brief feedback:


1) I thought about listening for my “whoosh” sound, or even “Yes Dylan” as I had in my video, but then MOVI will go back to listening for the callsign if it gets that. I may try that method again using a hacked version of “ask()” (ask takes too long to get started as you have it because it drops into say() even if the string to say is empty) to get back into listening for a true sentence. You didn’t like my idea of a toggle to just have MOVI report CALLSIGN_DETECTED and not automatically go to BEGIN_LISTEN? That would solve my problem if I could manually control when MOVI starts listening after the callsign, though I need to fix question #3 in that case.


2) I will wait patiently for that, thanks!


3) I’ll see if I can throw together a self-contained sketch that illustrates the problem with only the base Arduino and MOVI libraries required. This will happen faster if I can turn you on to my idea in question #1 :)


Thanks again, looking forward to the firmware update.

Dylan.

Dylan,


I think the fix that I have in mind for 2) will also help with 1)


3) That would be great!


Gerald

Okay, I reproduced the missing CALLSIGN_DETECTED in a small sketch. It actually wasn’t so easy. Once I ripped out all the vehicle diagnostics checks etc., the problem went away. After scratching my head for a while, I figured that probably meant a timing issue. So I put a delay(10) at the very end of loop(), and there it is - missing -200 CALLSIGN_DETECTED every time I test it. 10 millis to do a bunch of stuff in loop() doesn’t seem excessive to me.


First I need to confess something about how I’m using MOVI, though I would think it would be BETTER than the planned implementation. I’m using a Mega 2560 R3 and wanted to make use of one of the 3 otherwise unused additional hardware serial lines. I chose Serial1. I hacked your library slightly to take a HardwareSerial * as the constructor parameter. I’m linking those files here for your consideration, and for possible use with my sketch if you are inclined to test it with a Mega and wire your MOVI to RX1/TX1. One would think that the same or worse symptoms as I am seeing would happen with SoftwareSerial.


My callsign is “okaymontycarlo”. If you say Okay followed by the name of that famous region of Monaco in relatively rapid succession, MOVI generally picks it up. I’d bet the sketch will yield the same problem with any callsign, but I wanted to give it to you closest to the actual conditions I am using as possible.


movi_callsign_missing.ino

MOVIShield.cpp (modified for HardwareSerial)

MOVISheild.h (modified for HardwareSerial)

Here’s where I start crossing my fingers for a fix and a feature to allow me to tell MOVI not to drop straight to BEGIN_LISTEN after CALLSIGN_DETECTED. :-)-X


Dylan.

Thanks, Dylan. I’ll take a look.


One other thing crossed my mind: If you do a sendCommand("ASK") rather than using the ask() method from the library, MOVI will immediately go into active listen mode and not “try to say something”.


Gerald

Thanks Gerald, I’ll be looking forward to your observations. I have a good guess what the problem is, but I don’t want to “lead the witness” inappropriately. :)


That’s essentially what I meant about a hacked version of “ask()”, the only difference being I envisioned adding a method to the MOVI class seeing as I already hacked it a bit for HardwareSerial. That will be exactly what I need if I can reliably catch the CALLSIGN_DETECTED and have a way to turn off the automatic BEGIN_LISTEN.


Dylan.

Okay, CALLSIGN_DETECTED missing problem:


I thought it might help to increase the buffer sizes for HardwareSerial since MOVI seems kind of “chatty” with its -230 responses. So I boosted the buffer sizes from 64 to 256. It didn’t really help unfortunately. There is some kind of strange timing issue that I do not understand, so I guess I’ll need to wait for you to have a run at my sketch to consider a proper fix.


The problem stems from me sending “setThreshold” at the beginning and end of each MP3 that I play. The sketch I uploaded simulates this with some delays. I set debug=true in MOVI’s init() routine and here is what I see in my actual sketch:


movi.poll() = -141

MOVIEvent[201]: SPEED


movi.poll() = -201

MOVIEvent[202]: #1


movi.poll() = 2

Heard: SPEED, MOVI selected trained sentence: SPEED

Got VR command 0

0

SPEED1.MP3

0

SPEED2.MP3

0

begin say: SPEED1.MP3

end say: speed1.mp3

begin say: NUMBERS/0/00/0.MP3

end say: numbers/0/00/0.mp3

begin say: SPEED2.MP3

end say: speed2.mp3

done saying non-special case

MOVIEvent[230]: Set THRESHOLD to 95 percent.


movi.poll() = -230

MOVIEvent[230]: Set THRESHOLD to 5 percent.


movi.poll() = -230

MOVIEvent[230]: Set THRESHOLD to 95 percent.


movi.poll() = -230

MOVIEvent[230]: Set THRESHOLD to 5 percent.


movi.poll() = -230

MOVIEvent[230]: Set THRESHOLD to 95 percent.


movi.poll() = -230

MOVIEvent[230]: Set THRESHOMOVIEvent[200]: CALLSIGN DETECTED <-- Here’s the problem.


movi.poll() = -230

MOVIEvent[140]: ACTIVELISTEN


movi.poll() = -140

MOVIEvent[230]: Set THRESHOLD to 95 percent.


movi.poll() = -230

MOVIEvent[230]: Set THRESHOLD to 5 percent.


movi.poll() = -230

MOVIEvent[141]: END ACTIVELISTEN


movi.poll() = -141

MOVIEvent[201]: AVERAGES


This is the tail end of a callsign/command sequence followed by another callsign/command. You can see the issue on the line where I put “Here’s the problem”.


I even tried to put a while (movi.poll() == -230); after every setThreshold to try to clear those out, but for some reason I do not understand, the -230’s seem to come through well after all of my MP3 responses are played, with the last one getting buffered somehow and then being muddled up with the next CALLSIGN_DETECTED.


I suspect there is a little bug in MOVI where it’s buffering the -230 responses. I can work around it for now by hacking poll() to use lastIndexOf("MOVIEvent[") instead of indexOf, so that it picks up the last event tag rather than the first in a string received from MOVI. For a real solid fix, I think I’ll need to wait for you to run my sample sketch and consider how MOVI’s firmware is reacting.
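
To illustrate the workaround, here is a rough standalone sketch of “take the last event tag in the line”. It is not the library's poll() code: lastEventNumber is a hypothetical helper, and std::string::rfind stands in for the Arduino String lastIndexOf call. It returns the bracketed event number as printed in the debug output (the log above shows poll() reporting event 200 as -200).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only.
// Returns the number inside the LAST "MOVIEvent[NNN]" tag found in the line,
// or 0 if the line contains no complete, purely numeric tag.
int lastEventNumber(const std::string& line) {
    const std::string tag = "MOVIEvent[";
    std::size_t start = line.rfind(tag);       // last occurrence, not first
    if (start == std::string::npos) {
        return 0;
    }
    start += tag.size();
    const std::size_t end = line.find(']', start);
    if (end == std::string::npos || end == start) {
        return 0;                              // tag was cut off or empty
    }
    int value = 0;
    for (std::size_t i = start; i < end; ++i) {
        if (line[i] < '0' || line[i] > '9') {
            return 0;                          // not a plain number
        }
        value = value * 10 + (line[i] - '0');
    }
    return value;
}
```

Applied to the muddled line from the log, this picks out the CALLSIGN DETECTED event (200) and discards the truncated THRESHOLD message in front of it, which is also the weakness of the approach: whatever came first in the line is silently lost.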


Thanks for your continued support!!

Dylan.

I had a closer look at the MOVI API poll() routine on a hunch I had recalling the hack I did to use lastIndexOf. A big part of my problem here is that a single call to poll() retrieves at most one character from the serial variable that the class is initialized with. That means that in order to get a non-zero response to (for example):


MOVIEvent[230]: Set THRESHOLD to 95 percent.


It takes at least 58 calls to poll() before a non-zero result is returned.


Why would you use:


if (mySerial->available()) {


instead of:


while (mySerial->available()) {


in poll() to make it more of a hungry algorithm?


Dylan.

So this is the decision between using a blocking and a non-blocking IO call. I am using if, i.e. making it a non-blocking call, because I want to give control back to loop() immediately.


A non-blocking call allows you to use the CPU right away, for example if your Arduino has other things to do as well (such as updating a display every tenth of a second). If you make it a while() loop then poll() would always return results but you couldn’t do anything else in your main loop() method. But yes, if you want poll() to always give you some result you could do this:


int res = recognizer.poll();   // 0 means no complete event yet
while (res == 0) {             // block until MOVI reports something
  delay(50);
  res = recognizer.poll();
}
if (res == 1) {
  // sentence 1 was recognized
}

What I’m suggesting won’t block. With a while instead of an if, poll() would just get as many characters as it has available before returning.
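
The difference between the two variants can be shown with a small standalone sketch. Everything here is a stand-in rather than the library's code: FakeSerial mimics only available() and read() of an Arduino serial port, LineReader is a hypothetical reader, and events are assumed to arrive as newline-terminated lines.

```cpp
#include <deque>
#include <string>

// Stand-in for an Arduino serial port; only available() and read() are mimicked.
struct FakeSerial {
    std::deque<char> rx;  // bytes received but not yet read
    void feed(const std::string& s) { rx.insert(rx.end(), s.begin(), s.end()); }
    int available() const { return static_cast<int>(rx.size()); }
    int read() {
        if (rx.empty()) return -1;   // same convention as Arduino's read()
        char c = rx.front();
        rx.pop_front();
        return c;
    }
};

// Hypothetical reader that assembles newline-terminated lines.
struct LineReader {
    std::string partial;  // characters of the line received so far
    std::string line;     // last complete line

    // "if" variant: consumes at most one character per call.
    bool pollOneChar(FakeSerial& s) {
        if (s.available()) {
            return consume(static_cast<char>(s.read()));
        }
        return false;
    }

    // "while" variant: consumes whatever is already buffered, stopping at the
    // end of a line. It never waits for more data, so it does not block.
    bool pollDrain(FakeSerial& s) {
        while (s.available()) {
            if (consume(static_cast<char>(s.read()))) return true;
        }
        return false;
    }

private:
    // Adds one character; returns true when a line has just been completed.
    bool consume(char c) {
        if (c == '\n') {
            line = partial;
            partial.clear();
            return true;
        }
        partial += c;
        return false;
    }
};
```

With a full line already in the buffer, pollOneChar() needs one call per character before it reports the line, while pollDrain() reports it on the first call and returns immediately when the buffer is empty.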


Dylan.

Let me take a look.


Gerald

Please do for the potential benefit of the API - I am confident you will agree that “poll” is an incorrect semantic definition for what is happening in poll() as it currently exists, at least if you consider a Serial instance that has any amount of buffer. poll() as it stands right now is more like “readSingleCharFromMOVI”.


I tried the change I am recommending today. It works way better in terms of flushing through MOVI’s responses, and it does not block; I can be certain of that because I observed my loop() cruising on and doing its usual thing, updating my LCD etc., in the absence of a response from MOVI.


Dylan.