Better speech recognition. How to turn a recognized string into commands. Generating subtitles for films

“I would like to say right away that this is my first time dealing with recognition services, so I will describe them from a layman's point of view,” noted our expert. “To test recognition, I used three services: Google, Yandex and Azure.”

Google

The well-known IT corporation offers to test its Google Cloud Platform product online. Anyone can try the service for free. The product itself is convenient and easy to use.

Pros:

  • support for more than 80 languages;
  • fast name processing;
  • high-quality recognition in conditions of poor communication and in the presence of extraneous sounds.

Cons:

  • difficulty recognizing accented or poorly articulated speech, which makes the system hard to use for anyone other than native speakers;
  • lack of clear technical support for the service.

Yandex

Speech recognition from Yandex is available in several options:

  • Cloud
  • Library for access from mobile applications
  • "Boxed" version
  • JavaScript API

But let's be objective. We are primarily interested not in the variety of usage possibilities, but in the quality of speech recognition. Therefore, we used the trial version of SpeechKit.

Pros:

  • ease of use and configuration;
  • good text recognition in Russian;
  • the system returns several candidate transcriptions and uses neural networks to pick the one most likely to be correct.

Cons:

  • during streaming recognition, some words may be recognized incorrectly.

Azure

Azure is developed by Microsoft. It stands out from its competitors on price, but be prepared to face some difficulties: the instructions on the official website are either incomplete or outdated. We were unable to get the service running properly, so we had to use a third-party launch window; even there, however, you will need an Azure service key for testing.

Pros:

  • Compared to other services, Azure processes messages very quickly in real time.

Cons:

  • the system is very sensitive to accent and has difficulty recognizing speech from non-native speakers;
  • The system operates only in English.

Review results:

After weighing all the pros and cons, we settled on Yandex. SpeechKit is more expensive than Azure but cheaper than Google Cloud Platform. Google's service is constantly improving in recognition quality and accuracy; it improves itself using machine learning technologies. However, Yandex's recognition of Russian-language words and phrases is a cut above.

How to use voice recognition in business?

There are a lot of options for using recognition, but we will focus your attention on the one that will primarily affect your company’s sales. For clarity, let’s look at the recognition process using a real example.

Not so long ago, a well-known SaaS service became our client (at the company's request, the name of the service is not disclosed). With the help of F1Golos, they recorded two audio clips: one aimed at extending the lifetime of warm leads, the other at processing customer requests.

How to extend customer life using voice recognition?

SaaS services often operate on a monthly subscription. Sooner or later the trial period or the paid traffic runs out, and the service needs to be renewed. The company decided to warn users about the end of their traffic two days before the expiration of the term of use. Users were notified by a voice call that sounded like this: “Good afternoon, we remind you that your paid period for using the XXX service is ending. To renew the service, say yes; to cancel the services, say no.”

Calls from users who said the code words YES, RENEW, I WANT, or MORE DETAILS were automatically transferred to the company's operators. As a result, about 18% of users renewed their registration thanks to just one call.

How to simplify a data processing system using speech recognition?

The second audio clip, launched by the same company, was of a different kind. They used voice messaging to reduce the cost of verifying phone numbers. Previously, they verified user numbers with a robocall that asked users to press certain keys on the phone. With the advent of recognition technologies, the company changed tactics. The text of the new message was: “You have registered on the XXX portal. If you confirm your registration, say yes. If you did not submit a registration request, say no.” If the client uttered the words YES, I CONFIRM, AHA or OF COURSE, this was instantly passed to the company's CRM system, and the registration request was confirmed automatically within a couple of minutes. The introduction of recognition technologies reduced the duration of a call from 30 to 17 seconds, cutting the company's costs almost in half.
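As a minimal sketch of the kind of keyword matching described above (F1Golos's actual implementation is not shown in the article, so the class and method names here are purely illustrative), checking a recognized reply against a list of confirmation words can be as simple as this:

import java.util.Arrays;
import java.util.List;

public class ConfirmationMatcher {

    // code words taken from the example above
    private static final List<String> CONFIRM_WORDS =
            Arrays.asList("YES", "I CONFIRM", "AHA", "OF COURSE");

    // returns true if the recognized utterance contains any confirmation keyword
    public static boolean isConfirmation(String recognizedText) {
        String normalized = recognizedText.trim().toUpperCase();
        for (String word : CONFIRM_WORDS) {
            if (normalized.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isConfirmation("aha, of course")); // true
        System.out.println(isConfirmation("no"));             // false
    }
}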

If you are interested in other ways to use voice recognition, or want to learn more about voice messaging, follow the link. On F1Golos you can sign up for your first newsletter for free and learn for yourself how new recognition technologies work.

We were asked a question on Facebook:
“To work with text, I need to transcribe 3 hours of voice recording. I tried to upload an audio file with a picture to YouTube and use their text decoder, but it turned out to be some kind of gobbledygook. Tell me, how can I solve this technically? Thank you!
Alexander Konovalov"

Alexander, there is a simple technical solution - but the result will depend solely on the quality of your recording. Let me explain what quality we are talking about.

In recent years, Russian speech recognition technologies have made great progress. The error rate has dropped to the point where it has become easier to dictate a text into a special mobile app or web service, manually correcting the occasional “typo”, than to type the whole text on the keyboard.

But for the artificial intelligence of the recognition system to do its job, the user must do his. Namely: speak into the microphone clearly and evenly, avoid strong background noise, and if possible use a stereo headset or an external microphone clipped to your lapel (for recognition quality it is important that the microphone is always at the same distance from your lips and that you speak at the same volume). Naturally, the higher the class of the audio equipment, the better.

It is not difficult to meet these conditions if, instead of talking to an online speech recognition service directly, you use a voice recorder as an intermediate device. By the way, such a “personal secretary” is especially indispensable when you have no Internet access. Naturally, it is better to use at least an inexpensive professional voice recorder rather than the recording device built into a cheap MP3 player or smartphone. This will give a much better chance of “feeding” the resulting recordings to the speech recognition service.

It’s difficult, but you can persuade the interlocutor you’re interviewing to follow these rules (one more tip: if you don’t have an external clip-on microphone in your kit, at least keep the recorder next to the interlocutor, and not with you).

But “taking notes” at the required level automatically at a conference or seminar is, in my opinion, almost unrealistic (after all, you will not be able to control the speech of the speakers and the reaction of the listeners). Although there is a rather interesting option: turning professionally recorded audio lectures and audio books into text (if they were not superimposed with background music and noise).

Let's hope that the quality of your voice recording is high enough so that it can be transcribed in automatic mode.

If not, you can transcribe a recording of almost any quality in semi-automatic mode.

Moreover, in some situations the greatest saving of time and effort will, paradoxically, come from transcribing in manual mode. More precisely, from the variant of it that I myself have been using for ten years. 🙂

So, in order.

1. Automatic speech recognition

Many people advise transcribing voice recordings on YouTube. But this method forces the user to waste time at the stage of loading the audio file and background image, and then during the process of clearing the resulting text from timestamps. Meanwhile, it’s easy to save this time. 🙂

You can recognize audio recordings directly from your computer using one of the Internet services running on the Google recognition engine (I recommend Speechpad.ru or Speechlogger.com). All you need is a little trick: instead of feeding the service your voice from the microphone, redirect to it the audio stream played by your computer's player.

This trick is performed using a software stereo mixer (normally used to record music on a computer or stream it from the computer to the Internet).

The stereo mixer was included in Windows XP, but the developers removed it from later versions of the operating system (reportedly for copyright-protection purposes: to keep gamers from ripping music from games, and so on). However, a stereo mixer often comes with the audio card drivers (for example, the Realtek chips built into motherboards). If you cannot find the stereo mixer on your PC, try reinstalling the audio drivers from the CD that came with the motherboard or from the manufacturer's website.

If this does not help, install an alternative program on your computer. For example, the free VB-CABLE Virtual Audio Device: the owner of the above-mentioned Speechpad.ru service recommends using it.

The first step. Disable the microphone as a recording device and enable the stereo mixer (or the virtual VB-CABLE) instead.

To do this, click the speaker icon in the lower right corner (near the clock), or open the “Sound” section in the “Control Panel”. In the “Recording” tab of the window that opens, right-click and check the boxes “Show Disabled Devices” and “Show Disconnected Devices”. Right-click the microphone icon and select “Disable” (in general, disable all devices marked with a green icon).

Right-click the stereo mixer icon and select “Enable”. A green mark will appear on its icon, indicating that the stereo mixer has become the default device.

If you decide to use VB-CABLE, then enable it in the “Recording” tab in the same way.

And also in the “Playback” tab.

The second step. Start the audio playing in any player (if you need to transcribe the audio track of a video, you can launch a video player as well). At the same time, open the Speechpad.ru service in the Chrome browser and click its “Enable recording” button. If the recording is of sufficiently high quality, you will see the service turn the speech into meaningful text, close to the original, before your eyes. True, without punctuation marks, which you will have to insert yourself.

I recommend AIMP as the audio player; it is discussed in more detail in the third sub-chapter. For now I will just note that this player lets you slow down the recording without distorting the speech, as well as correct some other defects. This can somewhat improve recognition of recordings that are not of the best quality. (Sometimes people even advise pre-processing bad recordings in professional audio editors; in my opinion, though, that is too time-consuming for most users, who would type the text by hand much faster. :)

2. Semi-automatic speech recognition

Everything is simple here. If the recording is of poor quality and recognition “chokes”, or the service makes too many errors, help it along by inserting yourself into the chain: “audio player - narrator - recognition system”.

Your task: listen to recorded speech using headphones and at the same time dictate it through a microphone to an online recognition service. (Of course, you don’t need to switch from microphone to stereo mixer or virtual cable in the list of recording devices, as in the previous section). And as an alternative to the Internet services mentioned above, you can use smartphone applications like the free Yandex.Dictovka or the dictation function on an iPhone with the iOS 8 operating system and higher.

I note that in semi-automatic mode you have the opportunity to immediately dictate punctuation marks, which services are not yet capable of placing in automatic mode.

If you manage to dictate synchronously with the recording playing on the player, the preliminary transcription will take almost as much time as the recording itself (not counting the subsequent time spent correcting spelling and grammatical errors). But even working according to the scheme: “listen to a phrase - dictate - listen to a phrase - dictate” can give you a good saving of time compared to traditional typing.

I recommend using the same AIMP as an audio player. First, you can use it to slow down the playback to a speed at which you are comfortable working in simultaneous dictation mode. Secondly, this player can return the recording for a specified number of seconds: this is sometimes necessary in order to better hear an illegible phrase.

3. Transcript of voice recording manually

You may find in practice that you tire of dictating in semi-automatic mode too quickly. Or that you and the service together make too many mistakes. Or that, thanks to your fast typing skills, it is much easier for you to produce ready, corrected text on the keyboard than by dictation. Or your voice recorder, the microphone of your stereo headset or your audio card do not provide sound quality acceptable to the service. Or perhaps you simply cannot dictate out loud in your work or home office.

In all these cases, my home-grown method of manual transcription will help you (listen to the recording in AIMP, type the text in Word). With it you will turn a recording into text faster than many professional journalists whose typing speed is similar to yours, and you will spend far less effort and nerves doing it. 🙂

Where does the main loss of energy and time come from when transcribing audio recordings in the traditional way? From the fact that the user makes a lot of unnecessary movements.

The user constantly reaches for either the voice recorder or the computer keyboard: stop playback, type the passage just heard into a text editor, start playback again, rewind an illegible spot, and so on.

Using a regular software player on a computer does not make the process much easier: the user has to constantly minimize/expand Word, stop/start the player, and even move the player slider back and forth to find an illegible fragment, and then return to the last listened place in the recording.

To eliminate this and other wasted time, specialized IT companies develop software and hardware transcribers. These are quite expensive solutions for professionals: journalists, court stenographers, investigators and so on. But for our purposes, only two functions are actually required:

  • the ability to slow down the playback of a voice recording without distorting it or lowering the tone (many players allow you to slow down the playback speed - but, alas, in this case the human voice turns into a monstrous robotic voice, which is difficult to perceive by ear for a long time);
  • the ability to stop playback or roll it back a specified number of seconds and then resume it, without interrupting typing or minimizing the text editor window.

In my time, I tested dozens of audio programs - and found only two available paid applications that met these requirements. I bought one of them. I searched a little more for my dear readers 🙂 - and found a wonderful free solution - the AIMP player, which I still use myself.

“Once you enter the AIMP settings, find the Global Keys section and reconfigure Stop/Start to the Escape (Esc) key. Believe me, this is the most convenient, since you don’t have to think about it and your finger won’t accidentally hit other keys. Set the items “Move backward a little” and “Move forward a little”, respectively, to the Ctrl keys + back/forward cursor keys (you have four arrow keys on your keyboard - select two of them). This function is needed to re-listen to the last fragment or move forward a little.

Then, by calling up the equalizer, you can reduce the Speed and Tempo values and increase the Pitch value. You will notice that playback slows down, but the pitch of the voice (if you pick the Pitch value well) does not change. Choose these two parameters so that you can type almost synchronously with the recording, stopping it only occasionally.

Once everything is set up, typing will take you less time and your hands will be less tired. You will be able to transcribe the audio recording calmly and comfortably, practically without lifting your fingers from typing on the keyboard.”

I can only add to what has been said that if the recording is not of very high quality, you can try to improve its playback by experimenting with other settings in the AIMP Sound Effects Manager.

The number of seconds by which it is most convenient for you to jump backward or forward through a recording with the hotkeys is set in the “Player” section of the “Settings” window (opened with the Ctrl+P hotkey).

I wish you to save more time on routine tasks - and use it fruitfully for important things! 🙂 And don’t forget to turn on the microphone in the list of recording devices when you get ready to talk on Skype! 😉

3 ways to transcribe voice recordings: speech recognition, dictation, manual mode

Man has always been attracted to the idea of ​​controlling a machine using natural language. Perhaps this is partly due to the desire of man to be ABOVE the machine. So to speak, to feel superior. But the main message is to simplify human interaction with artificial intelligence. Voice control in Linux has been implemented with varying degrees of success for almost a quarter of a century. Let's look into the issue and try to get as close to our OS as possible.

The crux of the matter

Systems for working with human voice for Linux have been around for a long time, and there are a great many of them. But not all of them process Russian speech correctly. Some were completely abandoned by the developers. In the first part of our review, we will talk directly about speech recognition systems and voice assistants, and in the second, we will look at specific examples of their use on a Linux desktop.

One should distinguish between speech recognition systems proper (which turn speech into text or commands), such as CMU Sphinx or Julius, and the applications built on these two engines, on the one hand, and voice assistants, which became popular with the spread of smartphones and tablets, on the other. Voice assistants are rather a by-product of speech recognition systems, their further development and the practical application of their successful ideas. For Linux desktops there are still few of them.

You need to understand that the speech recognition engine and the interface to it are two different things. This is the basic principle of Linux architecture - dividing a complex mechanism into simpler components. The most difficult work falls on the shoulders of the engines. This is usually a boring console program that runs unnoticed by the user. The user interacts mainly with the interface program. Creating an interface is not difficult, so developers focus their main efforts on developing open-source speech recognition engines.

What happened before

Historically, all speech processing systems in Linux developed slowly and in fits and starts. The reason is not the incompetence of the developers but the high barrier to entry for this kind of development: writing system code for working with voice requires a highly qualified programmer. Therefore, before digging into speech systems on Linux, a short excursion into history is in order. IBM once had a wonderful operating system, OS/2 Warp (Merlin), released back in September 1996. Besides its obvious advantages over all other operating systems of the time, OS/2 was equipped with a very advanced speech recognition system, IBM ViaVoice. For that time this was very impressive, considering that the OS ran on systems with a 486 processor and 8 MB of RAM (!).

As you know, OS/2 lost the battle to Windows, but many of its components continued to exist independently. One of these components was the same IBM ViaVoice, which turned into an independent product. Since IBM always loved Linux, ViaVoice was ported to this OS, which gave the brainchild of Linus Torvalds the most advanced speech recognition system of its time.

Unfortunately, the fate of ViaVoice did not turn out the way Linux users would have liked. The engine itself was distributed free of charge, but its sources remained closed. In 2003, IBM sold the rights to the technology to the Canadian-American company Nuance. Nuance, which developed perhaps the most successful commercial speech recognition product, Dragon NaturallySpeaking, is still alive today. That is almost the whole inglorious history of ViaVoice on Linux. During the short time that ViaVoice was free and available to Linux users, several interfaces were developed for it, such as Xvoice. However, that project has long been abandoned and is now practically inoperable.

INFO

The most difficult part of machine speech recognition is natural human language.

What today?

Today everything is much better. In recent years, after the Google Voice API was opened up, the situation with speech recognition systems on Linux has improved considerably, and the quality of recognition has grown. For example, the Linux Speech Recognition project based on the Google Voice API shows very good results for Russian. All engines work in roughly the same way: the sound from the microphone of the user's device first enters the recognition system, after which the voice is either processed on the local device or the recording is sent to a remote server for further processing. The second option is better suited to smartphones and tablets. In fact, this is exactly how the commercial engines Siri, Google Now and Cortana work.

Of the variety of engines for working with the human voice, there are several that are currently active.

WARNING

Installing many of the described speech recognition systems is a non-trivial task!

CMU Sphinx

Much of the development of CMU Sphinx takes place at Carnegie Mellon University. At different times the Massachusetts Institute of Technology and the now-defunct Sun Microsystems also worked on the project. The engine sources are distributed under the BSD license and are available for both commercial and non-commercial use. Sphinx is not an end-user application but a set of tools that can be used to develop end-user applications. Sphinx is currently the largest speech recognition project. It consists of several parts:

  • Pocketsphinx is a small, fast program that processes sound, acoustic models, grammars and dictionaries;
  • Sphinxbase library, required for Pocketsphinx to work;
  • Sphinx4 - the actual recognition library;
  • Sphinxtrain is a program for training acoustic models (recordings of the human voice).

The project is developing slowly but surely, and most importantly, it can be used in practice, not only on PCs but also on mobile devices. What's more, the engine handles Russian speech very well. With straight hands and a clear head, you can set up Russian speech recognition with Sphinx to control home appliances or a smart home; in fact, you can turn an ordinary apartment into a smart home, which is what we will do in the second part of this review. Sphinx implementations exist for Android, iOS and even Windows Phone. Unlike the cloud approach, where speech recognition falls on the shoulders of Google ASR or Yandex SpeechKit servers, Sphinx works more accurately, faster and cheaper, and entirely locally. If you wish, you can teach Sphinx a Russian language model and a grammar of user queries. Yes, you will have to put in a little work during installation, and configuring Sphinx voice models and libraries is not an activity for beginners. Because the core of CMU Sphinx, the Sphinx4 library, is written in Java, you can include its code in your own speech recognition applications. Specific usage examples will be described in the second part of our review.
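Since the Sphinx4 library is ordinary Java, here is a minimal sketch of how it could be embedded in a desktop application. It uses the stock English models shipped in the sphinx4-data package; for Russian you would point the same Configuration calls at a Russian acoustic model, dictionary and language model.

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class Sphinx4Demo {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // stock English models from the sphinx4-data artifact
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true); // true clears previously cached microphone data

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.println("You said: " + result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}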

VoxForge

Let us especially highlight the concept of a speech corpus. A speech corpus is a structured set of speech fragments, which is provided with software for accessing individual elements of the corpus. In other words, it is a set of human voices in different languages. Without a speech corpus, no speech recognition system can operate. It is difficult to create a high-quality open speech corpus alone or even with a small team, so a special project is collecting recordings of human voices - VoxForge.

Anyone with access to the Internet can contribute to the creation of a speech corpus by simply recording and submitting a speech fragment. This can be done even by phone, but it is more convenient to use the website. Of course, in addition to the audio recording itself, the speech corpus must include additional information, such as phonetic transcription. Without this, speech recording is meaningless for the recognition system.


HTK, Julius and Simon

HTK, the Hidden Markov Model Toolkit, is a toolkit for research and development of speech recognition tools based on hidden Markov models. It is developed at the University of Cambridge under the patronage of Microsoft (Microsoft once bought this code from the commercial company Entropic Cambridge Research Laboratory Ltd and then returned it to Cambridge together with a restrictive license). The project's sources are available to everyone, but the use of HTK code in products intended for end users is prohibited by the license.

However, this does not mean that HTK is useless for Linux developers: it can be used as an auxiliary tool when developing open-source (and commercial) speech recognition tools, which is exactly what the developers of the open-source Julius engine, developed in Japan, do. Julius works best with Japanese. Russian, the “great and mighty”, is not left out either, since the same VoxForge is used as the voice corpus.


01. Express Scribe

Perhaps the most convenient transcription program for Windows and Mac OS, combining an audio player and a text editor. The principle is very simple: load an audio file into the program, listen to it using keyboard hotkeys (you can assign them yourself), and type the text at the same time. Playback speed and volume are also adjusted from the keyboard. This way your hands stay on the keyboard, with no need to use the mouse or switch between programs. Keep in mind that the built-in text editor does not check for errors and lacks many other familiar functions, for example automatic conversion of hyphens into dashes. However, you can use other text editors alongside Express Scribe and still control audio playback with the hotkeys. The program is shareware; the full version costs $17-50.


02. Transcriber-pro



A Russian-language program for Windows that allows you to listen not only to audio, but also to view video files. The built-in text editor has the ability to add timestamps and names of interlocutors. The resulting text can be imported into “interactive transcripts” and can also be adjusted as part of a group project. The application is available only with an annual subscription, the cost is 689 rubles per year.


03. RPlayer V1.4



A simple program for processing and transcribing audio files with hotkey support and the ability to type in Microsoft Word. Unlike previous similar programs, it can be downloaded for free, but it is unstable on new versions of Windows.

04. Voco

Professional Windows application for converting speech to text. Supports voice typing in any text editor and browser, has a large collection of thematic dictionaries and does not require an Internet connection for speech recognition. The extended versions “Voco.Professional” and “Voco.Enterprise” can work with ready-made audio files. The only drawback is the high cost of the application.


05. Dragon Dictation



Free mobile application for dictated speech recognition. The program recognizes about 40 languages and their varieties, lets you edit the text and send it by email or to social networks, or copy it to the clipboard. An Internet connection is required for operation.


06. RealSpeaker



A unique application that can recognize not only audio files but also live speech spoken into the camera. Thanks to a special video extension, RealSpeaker reads lip movements, improving recognition accuracy by 20-30% compared to similar algorithms. The application currently supports 11 languages: Russian, English (American and British dialects), French, German, Chinese, Korean, Japanese, Turkish, Spanish, Italian and Ukrainian. The program is shareware; the cost depends on the subscription period, and the unlimited version costs about 2,000 rubles.

In this article I will show how to quickly and accurately recognize speech directly on an Android device, using a real Hello World example of home appliance control.
Why home appliances? Because with such an example you can appreciate the speed and accuracy that can be achieved with completely local speech recognition, without servers like Google ASR or Yandex SpeechKit.
I am also attaching to the article the full source code of the program and the Android build itself.

Why suddenly?

Having recently come across another voice-control project, I asked its author why he had chosen server-based speech recognition for his program (in my opinion, this was unnecessary and caused some problems). In response I got a counter-question: could I describe in more detail the use of alternative methods for projects where there is no need to recognize arbitrary speech and the dictionary consists of a finite set of words? And preferably with an example of practical application...

Why do we need anything else besides Yandex and Google?

For that very “practical application” I chose the topic of voice control for a smart home.
Why this example? Because it shows several advantages of completely local speech recognition over recognition via cloud solutions. Namely:
  • Speed: we do not depend on servers, and therefore not on their availability, bandwidth and other such factors;
  • Accuracy: our engine works only with the dictionary relevant to our application, which improves recognition quality;
  • Price: we do not have to pay for each request to the server;
  • Voice activation: as a bonus to the points above, we can constantly “listen to the air” without wasting traffic or loading the servers.

Note

Let me say right away that these advantages only count as advantages for a certain class of projects, where we know for sure in advance what dictionary and what grammar the user will operate with. That is, when we do not need to recognize arbitrary text (for example, an SMS message or a search query). Otherwise, cloud recognition is indispensable.

So Android can recognize speech without the Internet!
Yes, yes... only on Jelly Bean, and only from half a meter away, no more. And this recognition is the same dictation, just with a much smaller model, so we cannot manage or configure it either. What it will return to us next time is anyone's guess. Although it is just right for SMS!

What do we do?

We will implement a voice remote control for home appliances that works accurately and quickly, from a few meters away, even on cheap, truly low-end Android smartphones, tablets and watches.
The logic will be simple but very practical. We activate the microphone and say one or more device names. The application recognizes them and switches them on or off depending on their current state. Or it queries their state and announces it in a pleasant female voice, for example the current temperature in the room.

Practical options abound

In the morning, without opening your eyes, you slap your palm on the smartphone screen on the nightstand and command “Good morning!” The script starts, the coffee maker switches on and hums, pleasant music plays, the curtains open.
Let's hang a cheap smartphone (2 thousand rubles, no more) on the wall in each room. We come home after work and command into the void: “Smart home! Lights, TV!” I don't think there is any need to say what happens next.

Transcriptions



The grammar describes what the user can say. For Pocketsphinx to know how the user will pronounce it, each word in the grammar needs a description of how it sounds in the corresponding language model, that is, a transcription of every word. This set of transcriptions is called a dictionary.

Transcriptions are described using a special syntax. For example, for the Russian words "умный" (smart) and "дом" (house):

умный  uu m n ay j
дом    d oo m

In principle, nothing complicated. A double vowel in a transcription marks the stressed vowel. A double consonant marks a soft consonant followed by a vowel. These are all the possible combinations for all the sounds of the Russian language.

It is clear that we cannot describe all the transcriptions in our application in advance, because we do not know in advance what names the user will give to their devices. We will therefore generate such transcriptions “on the fly”, according to some rules of Russian phonetics. To do this, one can implement a PhonMapper class that takes a string as input and generates the correct transcription for it; a simplified sketch of the idea is given below.
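The project's real PhonMapper is not reproduced in this article, so here is only a deliberately simplified sketch of the letter-to-phone idea: it covers just a handful of Cyrillic letters and ignores stress and soft consonants, which an actual implementation would have to handle.

import java.util.HashMap;
import java.util.Map;

public class SimplePhonMapper {

    // a tiny letter-to-phone table; a real mapper covers the whole alphabet
    // and applies the stress and palatalization rules described above
    private static final Map<Character, String> PHONES = new HashMap<>();
    static {
        PHONES.put('а', "a");  PHONES.put('о', "o");  PHONES.put('у', "u");
        PHONES.put('ы', "y");  PHONES.put('э', "e");  PHONES.put('и', "i");
        PHONES.put('д', "d");  PHONES.put('м', "m");  PHONES.put('н', "n");
        PHONES.put('с', "s");  PHONES.put('т', "t");  PHONES.put('в', "v");
        PHONES.put('й', "j");
    }

    // returns a space-separated phone string for a single lower-case word
    public String transcribe(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toLowerCase().toCharArray()) {
            String phone = PHONES.get(c);
            if (phone != null) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(phone);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "d o m" (the real dictionary entry is "d oo m", with stress marked)
        System.out.println(new SimplePhonMapper().transcribe("дом"));
    }
}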

Voice activation

This is the ability of the speech recognition engine to “listen to the air” all the time and react to a predefined phrase (or phrases), while discarding all other sounds and speech. It is not the same as describing a grammar and simply turning on the microphone. I will not go into the theory of this task or the mechanics of how it works here. Let me just say that the programmers working on Pocketsphinx recently implemented such a function, and it is now available out of the box in the API.

One thing is definitely worth mentioning. For an activation phrase you need not only to specify its transcription, but also to choose an appropriate sensitivity threshold value. Too low a value leads to many false positives (the system triggers when you did not actually say the activation phrase); too high a value makes it fail to trigger at all. So this setting is particularly important. The approximate range of values is from 1e-1 to 1e-40, depending on the activation phrase.

Proximity sensor activation

This task is specific to our project and is not directly related to recognition. The code can be seen directly in the main activity.
The main activity implements SensorEventListener: at the moment of approach (the sensor value is less than its maximum) it starts a timer that checks after a certain delay whether the sensor is still covered. This is done to eliminate false positives.
When the sensor is uncovered again, we stop recognition and get the result (see the description below).
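As a rough illustration (this is not the project's actual code), a proximity-based trigger along these lines might look as follows. The 500 ms debounce delay is an assumption, and the recognizer calls are left as comments because they depend on fields defined elsewhere in the application.

import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;
import android.os.Handler;

public class ProximityActivation implements SensorEventListener {

    private final Handler handler = new Handler();
    private final float maxRange;
    private boolean covered;

    public ProximityActivation(SensorManager sensorManager) {
        Sensor proximity = sensorManager.getDefaultSensor(Sensor.TYPE_PROXIMITY);
        maxRange = proximity.getMaximumRange();
        sensorManager.registerListener(this, proximity, SensorManager.SENSOR_DELAY_NORMAL);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        boolean nowCovered = event.values[0] < maxRange;
        if (nowCovered && !covered) {
            // something approached: wait a bit and check the sensor is still covered
            handler.postDelayed(new Runnable() {
                @Override
                public void run() {
                    if (covered) {
                        // still covered: start listening for a command, e.g.
                        // mRecognizer.startListening(COMMAND_SEARCH, 3000);
                    }
                }
            }, 500); // hypothetical debounce delay
        } else if (!nowCovered && covered) {
            // sensor released: stop recognition and let onResult deliver the result
            // mRecognizer.stop();
        }
        covered = nowCovered;
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) { }
}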

Let's start recognition

Pocketsphinx provides a convenient API for configuring and running the recognition process. These are the SpeechRecognizer and SpeechRecognizerSetup classes.
This is what the configuration and launch of recognition looks like:

PhonMapper phonMapper = new PhonMapper(getAssets().open("dict/ru/hotwords"));
Grammar grammar = new Grammar(names, phonMapper);
grammar.addWords(hotword);
DataFiles dataFiles = new DataFiles(getPackageName(), "ru");
File hmmDir = new File(dataFiles.getHmm());
File dict = new File(dataFiles.getDict());
File jsgf = new File(dataFiles.getJsgf());
copyAssets(hmmDir);
saveFile(jsgf, grammar.getJsgf());
saveFile(dict, grammar.getDict());
mRecognizer = SpeechRecognizerSetup.defaultSetup()
        .setAcousticModel(hmmDir)
        .setDictionary(dict)
        .setBoolean("-remove_noise", false)
        .setKeywordThreshold(1e-7f)
        .getRecognizer();
mRecognizer.addKeyphraseSearch(KWS_SEARCH, hotword);
mRecognizer.addGrammarSearch(COMMAND_SEARCH, jsgf);

Here we first copy all the necessary files to disk (Pocketpshinx requires an acoustic model, grammar and dictionary with transcriptions to be on disk). Then the recognition engine itself is configured. The paths to the model and dictionary files are indicated, as well as some parameters (sensitivity threshold for the activation phrase). Next, the path to the file with the grammar, as well as the activation phrase, is configured.

As you can see from this code, one engine is configured for both grammar and activation phrase recognition. Why is this done? So that we can quickly switch between what we currently need to recognize. This is what starting the activation phrase recognition process looks like:

mRecognizer.startListening(KWS_SEARCH);
And this is how speech is recognized according to a given grammar:

mRecognizer.startListening(COMMAND_SEARCH, 3000);
The second argument (optional) is the number of milliseconds after which recognition will automatically end if no one says anything.
As you can see, you can use only one engine to solve both problems.

How to get the recognition result

To get the recognition result, you must also specify an event listener that implements the interface RecognitionListener.
It has several methods that are called by pocketsphinx when one of the events occurs:
  • onBeginningOfSpeech: the engine heard some sound; it may or may not be speech;
  • onEndOfSpeech: the sound has ended;
  • onPartialResult: intermediate recognition results are available. For an activation phrase this means it has been triggered. The Hypothesis argument contains the recognition data (the string and its score);
  • onResult: the final recognition result. This method is called after the stop method of SpeechRecognizer is called. The Hypothesis argument contains the recognition data (the string and its score).

By implementing the onPartialResult and onResult methods in one way or another, you can change the recognition logic and obtain the final result. Here's how it's done in the case of our application:

@Override
public void onEndOfSpeech() {
    Log.d(TAG, "onEndOfSpeech");
    if (mRecognizer.getSearchName().equals(COMMAND_SEARCH)) {
        mRecognizer.stop();
    }
}

@Override
public void onPartialResult(Hypothesis hypothesis) {
    if (hypothesis == null) return;
    String text = hypothesis.getHypstr();
    if (KWS_SEARCH.equals(mRecognizer.getSearchName())) {
        startRecognition();
    } else {
        Log.d(TAG, text);
    }
}

@Override
public void onResult(Hypothesis hypothesis) {
    mMicView.setBackgroundResource(R.drawable.background_big_mic);
    mHandler.removeCallbacks(mStopRecognitionCallback);
    String text = hypothesis != null ? hypothesis.getHypstr() : null;
    Log.d(TAG, "onResult " + text);
    if (COMMAND_SEARCH.equals(mRecognizer.getSearchName())) {
        if (text != null) {
            Toast.makeText(this, text, Toast.LENGTH_SHORT).show();
            process(text);
        }
        mRecognizer.startListening(KWS_SEARCH);
    }
}

When we receive the onEndOfSpeech event, and if at that moment we are recognizing a command to be executed, we need to stop recognition, after which onResult will be called immediately.
In onResult we need to check what has just been recognized. If it is a command, we need to execute it and switch the engine back to recognizing the activation phrase.
In onPartialResult we are only interested in detecting the activation phrase. If we detect it, we immediately start the command recognition process. Here is what it looks like:

private synchronized void startRecognition() {
    if (mRecognizer == null || COMMAND_SEARCH.equals(mRecognizer.getSearchName())) return;
    mRecognizer.cancel();
    new ToneGenerator(AudioManager.STREAM_MUSIC, ToneGenerator.MAX_VOLUME)
            .startTone(ToneGenerator.TONE_CDMA_PIP, 200);
    post(400, new Runnable() {
        @Override
        public void run() {
            mMicView.setBackgroundResource(R.drawable.background_big_mic_green);
            mRecognizer.startListening(COMMAND_SEARCH, 3000);
            Log.d(TAG, "Listen commands");
            post(4000, mStopRecognitionCallback);
        }
    });
}
Here we first play a small signal to notify the user that we have heard him and are ready for his command. During this time, the microphone should be turned off. Therefore, we start recognition after a short timeout (slightly longer than the duration of the signal, so as not to hear its echo). It also starts a thread that will forcefully stop recognition if the user speaks for too long. In this case it is 3 seconds.

How to turn recognized string into commands

Well, everything here is specific to a particular application. In our bare-bones example, we simply pull the device names out of the recognized string, search for the desired device, and either change its state with an HTTP request to the smart home controller or report its current state (as in the case of the thermostat). This logic can be seen in the Controller class.
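The Controller class itself is not shown in the article, so here is only a rough sketch of the idea with assumptions of my own: a name-to-id map, a hypothetical REST endpoint on the smart home controller, and plain HttpURLConnection (which on Android must be called off the main thread).

import android.util.Log;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class Controller {

    // recognized device name -> device id on the controller (illustrative)
    private final Map<String, String> devices = new HashMap<>();

    public void process(String recognizedText) {
        for (Map.Entry<String, String> entry : devices.entrySet()) {
            if (recognizedText.contains(entry.getKey())) {
                toggle(entry.getValue());
            }
        }
    }

    private void toggle(String deviceId) {
        try {
            // hypothetical REST endpoint of the smart home controller
            URL url = new URL("http://192.168.1.10/api/devices/" + deviceId + "/toggle");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("POST");
            connection.getResponseCode(); // fire the request and ignore the body
            connection.disconnect();
        } catch (IOException e) {
            Log.e("Controller", "failed to switch device " + deviceId, e);
        }
    }
}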

How to synthesize speech

Speech synthesis is the inverse operation of recognition. Here it’s the other way around - you need to turn a line of text into speech so that the user can hear it.
In the case of the thermostat, we have to make our Android device speak the current temperature. Using the TextToSpeech API this is quite easy to do (thanks to Google for the wonderful female TTS voice for Russian):

private void speak(String text) {
    synchronized (mSpeechQueue) {
        mRecognizer.stop();
        mSpeechQueue.add(text);
        HashMap<String, String> params = new HashMap<String, String>(2);
        params.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID, UUID.randomUUID().toString());
        params.put(TextToSpeech.Engine.KEY_PARAM_STREAM, String.valueOf(AudioManager.STREAM_MUSIC));
        params.put(TextToSpeech.Engine.KEY_FEATURE_NETWORK_SYNTHESIS, "true");
        mTextToSpeech.speak(text, TextToSpeech.QUEUE_ADD, params);
    }
}

I'll probably say something banal, but before the synthesis process, it is necessary to disable recognition. On some devices (for example, all Samsung devices) it is generally impossible to listen to the microphone and synthesize something at the same time.
The end of speech synthesis (that is, the end of the process of speaking text by a synthesizer) can be tracked in the listener:

private final TextToSpeech.OnUtteranceCompletedListener mUtteranceCompletedListener =
        new TextToSpeech.OnUtteranceCompletedListener() {
    @Override
    public void onUtteranceCompleted(String utteranceId) {
        synchronized (mSpeechQueue) {
            mSpeechQueue.poll();
            if (mSpeechQueue.isEmpty()) {
                mRecognizer.startListening(KWS_SEARCH);
            }
        }
    }
};

In it, we simply check if there is anything else in the synthesis queue, and enable activation phrase recognition if there is nothing else.

And that's all?

Yes! As you can see, quickly and efficiently recognizing speech directly on the device is not difficult at all, thanks to the presence of such wonderful projects as Pocketsphinx. It provides a very convenient API that can be used in solving problems related to recognizing voice commands.

In this example we attached recognition to a very specific task: voice control of smart home devices. Thanks to local recognition, we achieved very high speed and minimized errors.
It is clear that the same code can be used for other voice-related tasks; it does not have to be a smart home.