Speech recognition
rPman, 2013-07-18 13:19:27

'Low-level' speech recognition (sounds)?

I haven't dug very deep, but as I understand it, ready-made speech recognition solutions handle everything internally and give the user only the final text, unfortunately with a delay (and often only after the entire phrase has been received).
But is it possible to get a stream of recognized sounds (not even letters), i.e. a transcription, in real time? And most importantly, annotated with per-sound timestamps and properties such as timbre and tone, and even language affiliation (or at least some parameter that would let you determine which language group a sound belongs to), etc.
Naturally, I'm talking about offline libraries and frameworks. What is possible, and to what extent? Open-source and cross-platform solutions are preferred.
Paid solutions are also an option, but I wouldn't want to 'buy a plane with an airport just to go get bread on the next street'.


2 answers
merlin-vrn, 2013-07-19
@merlin-vrn

You may have already been pointed to Sphinx; read about it. There is a Java version (Sphinx4) and a C version (pocketsphinx).
But.
The fact is that the speech recognition scheme itself is based on hidden Markov models (HMMs).
This is how Sphinx works: first the sound is processed (filtering, computing the cepstrum), then features are extracted from that cepstrum; the result, if I'm not mistaken, is a stream of 13-dimensional feature vectors at 100 Hz. The vectors in this stream correspond to specific sounds: either a transient process associated with a consonant, or, for a sustained vowel, many similar vectors in a row.
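The front end described above (overlapping frames at a 10 ms hop, giving 100 vectors per second, reduced to 13 cepstral coefficients each) can be sketched in a few lines of NumPy. This is a simplified illustration, not Sphinx's actual code: it skips the mel filterbank and pre-emphasis that a real MFCC front end uses, and the 25 ms window / 10 ms hop are the conventional defaults, assumed here rather than taken from Sphinx.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slice audio into overlapping frames: a 10 ms hop gives 100 frames/s."""
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

def cepstral_features(frames, n_ceps=13):
    """Windowed frames -> log-power spectrum -> DCT -> 13 cepstral coefficients."""
    spectrum = np.abs(np.fft.rfft(frames * np.hamming(frames.shape[1]), axis=1))
    log_power = np.log(spectrum ** 2 + 1e-10)
    # Type-II DCT written out directly to avoid a SciPy dependency
    n = log_power.shape[1]
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return log_power @ basis.T

sig = np.random.randn(16000)                  # one second of noise at 16 kHz
feats = cepstral_features(frame_signal(sig))
print(feats.shape)                            # (98, 13): ~100 frames x 13 coefficients
```

Each row of `feats` is one of the feature vectors the answer talks about; a sustained vowel shows up as many similar consecutive rows.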
The problem is that this stream is very dirty in terms of information quality. What exactly those extracted features capture, only Dan Jurafsky knows. The stream is then usually fed to the HMM, which knows the words, in the sense of which sounds usually follow which, and based on that knowledge it infers what the output actually should have been (what was 'meant'). I have a hard time imagining how you could do anything here without the HMM's filtering.
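The HMM step amounts to finding the most likely hidden-state sequence behind the noisy observation stream, which is what the Viterbi algorithm does. Below is a toy sketch, not Sphinx's decoder: two made-up states (vowel/consonant) with invented transition and emission probabilities, and observations quantized to two symbols instead of real 13-dimensional vectors.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence (log domain)."""
    n_states, T = len(start_p), len(obs)
    logp = np.full((T, n_states), -np.inf)   # best log-probability per (time, state)
    back = np.zeros((T, n_states), dtype=int)  # backpointers for path recovery
    logp[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logp[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(scores)
            logp[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(logp[-1]))]        # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: state 0 = vowel, state 1 = consonant (probabilities invented)
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])   # vowels tend to persist
emit  = np.array([[0.9, 0.1], [0.2, 0.8]])   # symbol 0 suggests a vowel
print(viterbi([0, 0, 1, 0], start, trans, emit))   # -> [0, 0, 1, 0]
```

The "filtering" the answer mentions is visible here: even if a frame's observation is ambiguous, the transition probabilities pull the decoded path toward sequences the model considers plausible.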

lightcaster, 2013-07-18
@lightcaster

Dig in the direction of CMU Sphinx. As far as I remember, it was possible to get a list of phonemes at the output there, without further decoding. How real-time it is, I can't judge.
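Pocketsphinx exposes this as "allphone" decoding: given a phonetic language model instead of a word one, the decoder emits phoneme segments with frame-level timestamps. A rough sketch with the Python bindings follows; the model file paths and audio file name are assumptions, and the exact keyword names may differ between pocketsphinx versions, so treat this as an outline rather than a recipe.

```python
def frame_to_seconds(frame, frames_per_second=100):
    """Sphinx segment timestamps are frame indices at 100 frames/s."""
    return frame / frames_per_second

try:
    from pocketsphinx import Decoder  # pip install pocketsphinx

    decoder = Decoder(
        hmm="model/en-us",             # acoustic model directory (assumed path)
        allphone="model/en-us.lm.bin", # phonetic LM switches on phoneme output
    )
    decoder.start_utt()
    with open("utterance.raw", "rb") as f:   # 16 kHz, 16-bit mono PCM (assumed)
        decoder.process_raw(f.read(), full_utt=False)
    decoder.end_utt()
    for seg in decoder.seg():                # one segment per recognized phoneme
        print(seg.word,
              frame_to_seconds(seg.start_frame),
              frame_to_seconds(seg.end_frame))
except Exception:
    pass  # pocketsphinx or model files not available; the helper above still works
```

Feeding audio in small chunks via repeated `process_raw` calls is how you would approach the real-time requirement from the question, though latency still depends on the decoder's internal buffering.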
