Speech Onset

Dear Audeme Team,

I have a question about speech onset detection. I would like to detect a speech pattern in real time, but since the computational overhead seems to be too high, this might not be possible. Instead, I would like to detect any speech pattern as quickly as possible. It seems this could be feasible, since there is a threshold for speech detection. Is there a way to get the point in time at which a speech pattern is detected, as opposed to noise or background tones?

About the only thing I can think of is to set the call sign to be empty. Then MOVI reacts to any pattern. Together with the noise threshold, this can be tuned to be more specific. Other than that, MOVI does not allow tuning any of the timings.
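For reference, a minimal sketch of that configuration, following the structure of the MOVI library examples (whether callSign() accepts an empty string and the valid range of setThreshold() should be checked against the documentation; the sentence and values are placeholders, and depending on the board the SoftwareSerial include from the examples may also be needed):

#include "MOVIShield.h"

MOVI recognizer(true);    // true enables debug output on the serial monitor

void setup() {
  recognizer.init();                         // boot and initialize the shield
  recognizer.callSign("");                   // empty call sign: react to any speech pattern
  recognizer.addSentence("start movement");  // placeholder sentence
  recognizer.train();                        // only retrains if sentences or call sign changed
  recognizer.setThreshold(15);               // noise threshold, tune for the environment
}

void loop() {
  signed int res = recognizer.poll();        // > 0 means that sentence number was matched
  if (res > 0) {
    // react to the recognized sentence here
  }
}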

Hope this sheds some light,
Gerald

I am not using a call sign for my tests, and I have already tuned the noise threshold. My problem is that I need to know when a word is spoken, before any classification happens. I would like to use MOVI to trigger a movement of an exoskeleton, but to do so it is necessary to detect the speech onset (not the classification) as soon as possible. As a workaround, I attached another microphone to the Arduino and use a simple threshold to detect the speech onset (after smoothing the audio signal). It works well, but I would like to use the microphone on the MOVI to detect the onset. Is it possible to get feedback from MOVI when the speech onset is detected? If not, is it possible to get the raw audio signal from the microphone?
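For reference, my workaround looks roughly like this (pin, bias level, smoothing factor and threshold are placeholders that I tune by experiment):

const int MIC_PIN = A0;          // analog input of the extra microphone
const int BIAS = 512;            // mid-rail bias of the mic module on a 10-bit ADC
const float ALPHA = 0.05;        // smoothing factor for the envelope
const int ONSET_THRESHOLD = 60;  // envelope level that counts as speech

float envelope = 0;
bool speaking = false;

void setup() {
  Serial.begin(9600);
}

void loop() {
  int sample = analogRead(MIC_PIN) - BIAS;                  // remove DC bias
  float magnitude = abs(sample);                            // rectify
  envelope = ALPHA * magnitude + (1.0 - ALPHA) * envelope;  // exponential smoothing

  if (!speaking && envelope > ONSET_THRESHOLD) {
    speaking = true;
    Serial.print("onset at ms: ");
    Serial.println(millis());                               // timestamp of the detected onset
    // trigger the exoskeleton movement here
  } else if (speaking && envelope < ONSET_THRESHOLD / 2) {
    speaking = false;                                       // simple hysteresis to re-arm
  }
}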

Best regards,
Niklas

No and no, unfortunately. Using a Raspberry Pi to get the raw signal from the microphone and doing threshold detection there is your best bet. An Arduino probably doesn’t have enough power to do pattern recognition on raw signals. MOVI doesn’t allow access to the raw signal and also doesn’t give you trigger thresholds. Again, the 9600 bps connection between the Arduino and MOVI would simply not be fast enough for things like that.
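A rough outline of that approach on the Pi, capturing raw audio via ALSA and applying the same kind of threshold test (device name, sample rate and threshold are placeholders, and this is only a sketch, not tested code; build with g++ onset.cpp -lasound -o onset):

#include <alsa/asoundlib.h>
#include <cmath>
#include <cstdio>

int main() {
    snd_pcm_t *pcm;
    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_CAPTURE, 0) < 0) {
        fprintf(stderr, "cannot open capture device\n");
        return 1;
    }
    // 16 kHz, mono, 16-bit signed little endian, 0.5 s internal latency
    snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE, SND_PCM_ACCESS_RW_INTERLEAVED,
                       1, 16000, 1, 500000);

    const int FRAMES = 160;          // 10 ms chunks at 16 kHz
    short buf[FRAMES];
    const double THRESHOLD = 500.0;  // RMS level that counts as speech, tune by experiment

    while (true) {
        snd_pcm_sframes_t n = snd_pcm_readi(pcm, buf, FRAMES);
        if (n < 0) { snd_pcm_recover(pcm, n, 0); continue; }
        double sum = 0;
        for (int i = 0; i < n; i++) sum += (double)buf[i] * buf[i];
        double rms = sqrt(sum / n);
        if (rms > THRESHOLD) {
            printf("speech onset detected (RMS %.0f)\n", rms);
            // signal the Arduino here, e.g. over a GPIO line or serial
            break;
        }
    }
    snd_pcm_close(pcm);
    return 0;
}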

Okay, thank you for your fast reply! Maybe another approach would work as well. I measured the time between a spoken word and the correct classification (the detection of the defined word by MOVI); it was about 1.2 seconds. If this delay were constant and known, that would work for my use case. So I am asking myself which conditions affect the time to a classification result (e.g. distance to the microphone or magnitude of the audio signal).
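To illustrate what I have in mind: if the delay were constant, I could back-date the trigger from the classification event, roughly like this (the 1200 ms figure is just my measurement, the MOVI calls follow the library examples, and the sentence is a placeholder):

#include "MOVIShield.h"

const unsigned long CLASSIFICATION_DELAY_MS = 1200;  // measured delay, assumed constant here

MOVI recognizer(true);

void setup() {
  recognizer.init();
  recognizer.callSign("");               // empty call sign, as suggested above
  recognizer.addSentence("move arm");    // placeholder sentence
  recognizer.train();
}

void loop() {
  signed int res = recognizer.poll();
  if (res > 0) {                         // a trained sentence was recognized
    unsigned long estimatedOnset = millis() - CLASSIFICATION_DELAY_MS;
    // back-date the exoskeleton trigger relative to estimatedOnset
  }
}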

I can look up the timings but it will be a few days. 1.2 sec sounds about right.

That would be great! It would be most helpful to know how much the delay varies under different conditions. It would also be important to know if the classification algorithm is deterministic, since computation time would vary otherwise.