Data packages for offline Russian speech recognition. Speech recognition in .NET desktop applications. How to synthesize speech

Products and technologies:

Visual Studio, C#, .NET Speech Libraries

The article discusses:

  • adding speech recognition support to a console application;
  • processing recognized speech;
  • installing the speech recognition libraries;
  • comparing Microsoft.Speech and System.Speech;
  • adding speech recognition support to a Windows Forms application.

With the advent of Cortana on Windows Phone, a speech-activated personal assistant (as well as its counterpart from a certain fruit company that shall not be named in vain), speech-enabled apps have become increasingly prominent in software development. In this article, I'll show you how to get started with speech recognition and synthesis in Windows console applications, Windows Forms applications, and Windows Presentation Foundation (WPF) applications.

Note that you can also add speech capabilities to Windows Phone apps, ASP.NET web apps, Windows Store apps, Windows RT, and Xbox Kinect, but the techniques differ from those discussed in this article.

A good way to get an idea of what exactly this article will be discussing is to look at the screenshots of the two demo programs in Fig. 1 and Fig. 2. After launching, the console application in Fig. 1 immediately says the phrase "I am awake." Of course, you won't be able to hear the demo application while reading this article, so it displays the text of what the computer is saying. Then the user says the command "Speech on." The demo application echoes the recognized text and then listens for, and responds to, requests to add two numbers.

Fig. 1. Speech recognition and synthesis in a console application


Fig. 2. Speech recognition in a Windows Forms application

The user asked the app to add one and two, then two and three. The application recognized the spoken commands and gave answers by voice. I'll describe more useful ways to use speech recognition later.

The user then said “Speech off,” a voice command that disables listening to addition commands, but does not completely disable speech recognition. After this verbal command, the next command to add one and two was ignored. Finally, the user turned on command listening again and uttered the meaningless command “Klatu barada nikto”, which the application recognized as a command to completely deactivate speech recognition and terminate itself.

Fig. 2 shows a Windows Forms application with dummy speech support. This application recognizes spoken commands, but does not respond to them with voice output. When the app was first launched, the Speech On checkbox was not checked, indicating that speech recognition was not active. The user checked this checkbox and then said "Hello." The application responded by displaying the recognized text in the ListBox control at the bottom of the window.

The user then said, "Set text box 1 to red." The application recognized the speech and responded: "Set text box 1 red," which is almost (but not quite) exactly what the user said. Although you can't see it in Fig. 2, the text in the TextBox control at the top of the window really is red.

Then the user said: “Please set text box 1 to white.” The app recognized this as "set text box 1 white" and did just that. The user concluded by saying, “Good-bye,” and the application displayed that text, but did nothing with Windows Forms, although it could have cleared the Speech On checkbox, for example.


In the following sections, I'll walk you through the process of creating both demo programs, including installing the necessary .NET speech libraries. This article assumes that you have at least intermediate programming skills, but know nothing about speech recognition and synthesis.

Adding speech recognition support to a console application

To create the demo shown in Fig. 1, I launched Visual Studio and created a new C# console application called ConsoleSpeech. I've used the speech tools successfully with Visual Studio 2010 and 2012, but any relatively recent version should be fine. After loading the template code into the editor, I renamed the Program.cs file in the Solution Explorer window to the more descriptive ConsoleSpeechProgram.cs, and Visual Studio renamed the Program class for me.

Next, I added a reference to the Microsoft.Speech.dll file, which is located in C:\Program Files (x86)\Microsoft SDKs\Speech\v11.0\Assembly. This DLL was missing from my computer and I had to download it. Installing the files needed to add speech recognition and synthesis to an application is not entirely trivial. I will explain the installation process in detail in the next section, but for now let's assume that Microsoft.Speech.dll is on your system.

After adding the reference to the speech DLL, I removed all the using statements from the top of the code except the one for the top-level System namespace. Then I added using statements for the Microsoft.Speech.Recognition, Microsoft.Speech.Synthesis, and System.Globalization namespaces. The first two namespaces are associated with the speech DLL. Note that there are also namespaces named System.Speech.Recognition and System.Speech.Synthesis, which can be confusing; I'll explain the difference between them shortly. The Globalization namespace was available by default and did not require adding a new reference to the project.

All the source code for the demo console application is shown in Fig. 3 and is also available in the source package accompanying this article. I've removed all the standard error handling to keep the main ideas as clear as possible.

Fig. 3. Demo console application source code

using System;
using Microsoft.Speech.Recognition;
using Microsoft.Speech.Synthesis;
using System.Globalization;

namespace ConsoleSpeech
{
  class ConsoleSpeechProgram
  {
    static SpeechSynthesizer ss = new SpeechSynthesizer();
    static SpeechRecognitionEngine sre;
    static bool done = false;
    static bool speechOn = true;

    static void Main(string[] args)
    {
      try
      {
        ss.SetOutputToDefaultAudioDevice();
        Console.WriteLine("\n(Speaking: I am awake)");
        ss.Speak("I am awake");

        CultureInfo ci = new CultureInfo("en-us");
        sre = new SpeechRecognitionEngine(ci);
        sre.SetInputToDefaultAudioDevice();
        sre.SpeechRecognized += sre_SpeechRecognized;

        Choices ch_StartStopCommands = new Choices();
        ch_StartStopCommands.Add("speech on");
        ch_StartStopCommands.Add("speech off");
        ch_StartStopCommands.Add("klatu barada nikto");
        GrammarBuilder gb_StartStop = new GrammarBuilder();
        gb_StartStop.Append(ch_StartStopCommands);
        Grammar g_StartStop = new Grammar(gb_StartStop);

        Choices ch_Numbers = new Choices();
        ch_Numbers.Add("1");
        ch_Numbers.Add("2");
        ch_Numbers.Add("3");
        ch_Numbers.Add("4");
        GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
        gb_WhatIsXplusY.Append("What is");
        gb_WhatIsXplusY.Append(ch_Numbers);
        gb_WhatIsXplusY.Append("plus");
        gb_WhatIsXplusY.Append(ch_Numbers);
        Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);

        sre.LoadGrammarAsync(g_StartStop);
        sre.LoadGrammarAsync(g_WhatIsXplusY);
        sre.RecognizeAsync(RecognizeMode.Multiple);

        while (done == false) { ; }

        Console.WriteLine("\nHit <enter> to close shell\n");
        Console.ReadLine();
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
        Console.ReadLine();
      }
    } // Main

    static void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
      string txt = e.Result.Text;
      float confidence = e.Result.Confidence;
      Console.WriteLine("\nRecognized: " + txt);
      if (confidence < 0.60) return;

      if (txt.IndexOf("speech on") >= 0)
      {
        Console.WriteLine("Speech is now ON");
        speechOn = true;
      }
      if (txt.IndexOf("speech off") >= 0)
      {
        Console.WriteLine("Speech is now OFF");
        speechOn = false;
      }
      if (speechOn == false) return;

      if (txt.IndexOf("klatu") >= 0 && txt.IndexOf("barada") >= 0)
      {
        ((SpeechRecognitionEngine)sender).RecognizeAsyncCancel();
        done = true;
        Console.WriteLine("(Speaking: Farewell)");
        ss.Speak("Farewell");
      }

      if (txt.IndexOf("What") >= 0 && txt.IndexOf("plus") >= 0)
      {
        string[] words = txt.Split(' ');
        int num1 = int.Parse(words[2]);
        int num2 = int.Parse(words[4]);
        int sum = num1 + num2;
        Console.WriteLine("(Speaking: " + words[2] + " plus " + words[4] +
          " equals " + sum + ")");
        ss.SpeakAsync(words[2] + " plus " + words[4] + " equals " + sum);
      }
    } // sre_SpeechRecognized
  } // Program
} // ns

After the using statements, the demo code starts like this:

namespace ConsoleSpeech
{
  class ConsoleSpeechProgram
  {
    static SpeechSynthesizer ss = new SpeechSynthesizer();
    static SpeechRecognitionEngine sre;
    static bool done = false;
    static bool speechOn = true;

    static void Main(string[] args)
    {
      ...

The SpeechSynthesizer object, at the class level, enables an application to synthesize speech. The SpeechRecognitionEngine object allows an application to listen to and recognize spoken words or phrases. The boolean variable done determines when the entire application terminates. The speechOn boolean variable controls whether the application listens to any commands other than the command to exit the program.

The idea here is that the console application does not accept keyboard input, so it is always listening for commands. However, if speechOn is false, only the command to exit the program is recognized and executed; other commands are recognized but ignored.

The Main method starts like this:

try
{
  ss.SetOutputToDefaultAudioDevice();
  Console.WriteLine("\n(Speaking: I am awake)");
  ss.Speak("I am awake");

An instance of the SpeechSynthesizer object was created when it was declared. Using the synthesizer object is quite simple. The SetOutputToDefaultAudioDevice method sends output to speakers connected to your computer (you can also send output to a file). The Speak method takes a string and then speaks it. That's how easy it is.

Speech recognition is much more complex than speech synthesis. The Main method continues by creating the recognizer object:

CultureInfo ci = new CultureInfo("en-us");
sre = new SpeechRecognitionEngine(ci);
sre.SetInputToDefaultAudioDevice();
sre.SpeechRecognized += sre_SpeechRecognized;

First, the CultureInfo object specifies the language to be recognized, in this case United States English. The CultureInfo object is in the Globalization namespace, which we referenced with a using statement. Then, after calling the SpeechRecognitionEngine constructor, the voice input is assigned to the default audio device - most often the microphone. Note that most laptops have a built-in microphone, but desktops will require an external microphone (often combined with headphones these days).

The key method for the recognizer object is the SpeechRecognized event handler. When using Visual Studio, if you type "sre.SpeechRecognized +=" and wait a split second, IntelliSense will automatically complete your expression with the event handler name sre_SpeechRecognized. I suggest you press the Tab key to accept the suggestion and use this default name. Next, the demo program creates the grammar for the add-two-numbers commands:

Choices ch_Numbers = new Choices();
ch_Numbers.Add("1");
ch_Numbers.Add("2");
ch_Numbers.Add("3");
ch_Numbers.Add("4"); // technically, this is Add(new string[] { "4" })
GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
gb_WhatIsXplusY.Append("What is");
gb_WhatIsXplusY.Append(ch_Numbers);
gb_WhatIsXplusY.Append("plus");
gb_WhatIsXplusY.Append(ch_Numbers);
Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);

The three main objects here are the Choices set, the GrammarBuilder template, and the Grammar control. When I create a Grammar for recognition, I start by listing some specific examples of what I need to recognize. Let's say, "What is one plus two?" and “What is three plus four?”

Then I define the corresponding generic template, for example "What is <x> plus <y>?". The template is a GrammarBuilder, and the specific values that can fill the template are a Choices set. The Grammar object encapsulates the template and the Choices.

In the demo program, I limit the addends to 1 through 4 and add them as strings to the Choices set. A more efficient approach is:

string[] numbers = new string[] { "1", "2", "3", "4" };
Choices ch_Numbers = new Choices(numbers);

I'm presenting the less efficient approach to creating a Choices set for two reasons. First, adding one string at a time was the only approach I'd seen in other speech recognition examples. Second, you might think that adding one string at a time shouldn't work at all; Visual Studio IntelliSense shows in real time that one of the Add overloads accepts a parameter of type params string[] phrases. If you didn't notice the params keyword, you might have assumed that the Add method accepts only arrays of strings and not a single string. But that's not so: it accepts both. I recommend passing an array.
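
For example, both of the following calls compile against that same Add overload (a tiny sketch; ch is just an illustrative variable name):

Choices ch = new Choices();
ch.Add("1");                             // single string: params wraps it into a one-element array
ch.Add(new string[] { "2", "3", "4" });  // explicit string array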

Creating a set of Choices from sequential numbers is somewhat of a special case and allows for a programmatic approach like:

string[] numbers = new string[100];
for (int i = 0; i < 100; ++i)
  numbers[i] = i.ToString();
Choices ch_Numbers = new Choices(numbers);

After creating Choices to fill the GrammarBuilder slots, the demo program creates a GrammarBuilder and then a controlling Grammar:

GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
gb_WhatIsXplusY.Append("What is");
gb_WhatIsXplusY.Append(ch_Numbers);
gb_WhatIsXplusY.Append("plus");
gb_WhatIsXplusY.Append(ch_Numbers);
Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);

The demo program uses a similar template to create the Grammar for start and stop commands:

Choices ch_StartStopCommands = new Choices();
ch_StartStopCommands.Add("speech on");
ch_StartStopCommands.Add("speech off");
ch_StartStopCommands.Add("klatu barada nikto");
GrammarBuilder gb_StartStop = new GrammarBuilder();
gb_StartStop.Append(ch_StartStopCommands);
Grammar g_StartStop = new Grammar(gb_StartStop);

Grammars can be defined very flexibly. Here the commands “speech on”, “speech off” and “klatu barada nikto” are placed in one grammar, since they are logically related. These three commands could be defined in three different grammars, or the commands "speech on" and "speech off" could be placed in one grammar and the command "klatu barada nikto" in a second.
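
For example, a minimal sketch of that second arrangement might look like this (same Choices, GrammarBuilder, and Grammar types as before; the variable names here are purely illustrative):

Choices ch_OnOff = new Choices("speech on", "speech off");
Grammar g_OnOff = new Grammar(new GrammarBuilder(ch_OnOff));
Grammar g_Exit = new Grammar(new GrammarBuilder("klatu barada nikto"));
sre.LoadGrammarAsync(g_OnOff);  // two separate grammars instead of one combined g_StartStop
sre.LoadGrammarAsync(g_Exit);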

Once you've created all the Grammar objects, you put them into the speech recognizer and speech recognition is activated:

sre.LoadGrammarAsync(g_StartStop);
sre.LoadGrammarAsync(g_WhatIsXplusY);
sre.RecognizeAsync(RecognizeMode.Multiple);

The RecognizeMode.Multiple argument is needed when you have more than one grammar, which will be the case in all but the simplest programs. The Main method completes as follows:

while (done == false) { ; }

Console.WriteLine("\nHit <enter> to close shell\n");
Console.ReadLine();
}
catch (Exception ex)
{
  Console.WriteLine(ex.Message);
  Console.ReadLine();
}
} // Main

The strange-looking empty while loop keeps the console application shell alive. The loop exits when the class-level boolean variable done is set to true by the speech recognition event handler.

Recognized speech processing

The code for handling speech recognition events starts like this:

static void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  string txt = e.Result.Text;
  float confidence = e.Result.Confidence;
  Console.WriteLine("\nRecognized: " + txt);
  if (confidence < 0.60) return;
  ...

The recognized text is stored in the Result.Text property of the SpeechRecognizedEventArgs object. Alternatively, you can use the Result.Words set. The Result.Confidence property holds a value between 0.0 and 1.0 that is a rough estimate of how well the spoken text matches any of the grammars associated with the recognizer. The demo's event handler ignores recognized text whose confidence value is too low.

Confidence values vary greatly depending on the complexity of your grammars, microphone quality, and other factors. For example, if the demo program only has to recognize the numbers 1 through 4, the confidence values on my computer are typically around 0.75. But if the grammar has to recognize numbers from 1 to 100, the confidence values drop to about 0.25. In short, you should usually experiment with confidence thresholds to get good speech recognition results.
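
One practical way to pick a cutoff (my own suggestion, not part of the demo) is to use a variant of the handler that logs every recognition together with its confidence for a while, and choose the threshold from that log:

static void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  // Log everything first so the cutoff can be chosen from real data.
  Console.WriteLine("heard: '" + e.Result.Text +
    "'  confidence: " + e.Result.Confidence.ToString("F2"));
  if (e.Result.Confidence < 0.60) return; // raise or lower after reviewing the log
  // ... normal command processing goes here ...
}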

if (txt.IndexOf("speech on") >= 0) ( Console.WriteLine("Speech is now ON"); speechOn = true; ) if (txt.IndexOf("speech off") >= 0) ( Console .WriteLine("Speech is now OFF"); speechOn = false; if (speechOn == false) return;

Although it may not be entirely obvious at first, this logic should make sense if you think about it. The secret exit command is then processed:

if (txt.IndexOf("klatu") >= 0 && txt.IndexOf("barada") >= 0) ( ((SpeechRecognitionEngine)sender).RecognizeAsyncCancel(); done = true; Console.WriteLine("(Speaking: Farewell)"); ss.Speak("Farewell"); )

Note that the speech recognition engine can actually recognize nonsense words. If a Grammar object contains words that are not in the engine's built-in dictionary, the Grammar tries to identify those words using phonetic heuristics if possible, and is usually quite successful. That's why I used "klatu" instead of the correct "klaatu" (from an old science-fiction movie).

Also note that you are not required to process all of the text recognized by Grammar (“klatu barada nikto”) - you just need to have enough information to uniquely identify the grammatical phrase (“klatu” and “barada”).

If (txt.IndexOf("What") >= 0 && txt.IndexOf("plus") >= 0) ( string words = txt.Split(" "); int num1 = int.Parse(words); int num2 = int.Parse(words); int sum = num1 + num2; Console.WriteLine("(Speaking: " + words + " plus " + words + " equals " + sum + ")"); " plus " + words + " equals " + sum); ) ) // sre_SpeechRecognized ) // Program ) // ns

Note that the text in Result.Text is case sensitive ("What" versus "what"). Having recognized a phrase, you can parse it into individual words. In this case the recognized text has the form "What is x plus y", so "What" ends up in words[0], and the two numbers to add (as strings) end up in words[2] and words[4].

Installing libraries

The explanation of the demo program assumes that all the necessary speech libraries are installed on your computer. To create and run the demo programs, you need to install four packages: the SDK (which lets you create the demos in Visual Studio), the runtime (which runs the demos after they are built), a recognition language, and a synthesis (text-to-speech) language.

To install the SDK, search the Internet for "Speech Platform 11 SDK." This will take you to the correct page in the Microsoft Download Center (Fig. 4). Clicking the Download button shows the options in Fig. 5. The SDK comes in 32-bit and 64-bit versions. I strongly advise using the 32-bit version regardless of your system's bitness; the 64-bit version does not work with some applications.


Fig. 4. Main SDK installation page in the Microsoft Download Center


Fig. 5. Installing the Speech SDK

You don't need anything more than the single .msi file for x86 (32-bit systems). Select it, click Next, and you can run the installer directly from there. The speech libraries don't give much feedback about when the installation is complete, so don't look for a success message.


After the SDK, download and install the speech runtime from the Download Center (Fig. 6); the process is the same as for the SDK.

Fig. 6. Installing the runtime environment

It is extremely important to select the same platform version (11 in the demo) and bit depth (32 or 64) as the SDK. Again, I strongly recommend the 32-bit version, even if you're running on a 64-bit system.

You can then install the recognition language. The download page is shown in Fig. 7. The demo program uses the file MSSpeech_SR_en-us_TELE.msi (English - U.S.). SR stands for speech recognition, and TELE stands for telephony; this means that the recognition language is designed to work with low-quality audio input, such as from a telephone or a desktop microphone.


Fig. 7. Installing a recognition language

Finally, you can install the language and voice for speech synthesis. The download page is shown in Fig. 8. The demo program uses the MSSpeech_TTS_en-us_Helen.msi file. TTS (text-to-speech) is essentially synonymous with speech synthesis. Note the two available English (U.S.) voices; there are other English voices, but not U.S. ones. Creating synthesis language files is a very difficult task. However, you can purchase and install other voices from a variety of companies.


Fig. 8. Installing the synthesis language and voice

Interestingly, although a recognition language and a synthesis language/voice are actually completely different things, both packages are options on the same download page. The Download Center UI lets you check both a recognition language and a synthesis language, but trying to install them at the same time was disastrous for me, so I recommend installing them separately.

Comparing Microsoft.Speech with System.Speech

If you're new to speech recognition and synthesis for Windows applications, you can easily get confused by the documentation because there are multiple speech platforms. Specifically, in addition to the Microsoft.Speech.dll library used by the demo programs in this article, there is a library called System.Speech.dll that is part of the Windows operating system. The two libraries are similar in the sense that their APIs are almost, but not completely, identical. Therefore, if you look for speech processing examples on the Internet and see code snippets rather than complete programs, it is not at all obvious whether the example is System.Speech or Microsoft.Speech.

If you are new to speech processing, use the Microsoft.Speech library rather than System.Speech to add speech support to your .NET application.

Although both libraries share a common core code base and similar APIs, they are definitely different. Some key differences are summarized in table 1.

Table 1. Main differences between Microsoft.Speech and System.Speech

  • System.Speech.dll is part of the OS, so it is installed on every Windows system; Microsoft.Speech.dll (and its associated runtime and languages) must be downloaded and installed separately.
  • Recognition with System.Speech usually requires training for a specific user: the user reads some text and the system learns that user's pronunciation. Recognition with Microsoft.Speech works immediately for any user.
  • System.Speech can recognize almost any word (this is called free dictation); Microsoft.Speech recognizes only the words and phrases in the Grammar objects defined by the program.

Adding speech recognition support to a Windows Forms application

The process for adding speech recognition and synthesis support to a Windows Forms or WPF application is similar to that for a console application. To create the demo program shown in Fig. 2, I launched Visual Studio, created a new C# Windows Forms application, and renamed it WinFormSpeech.

After loading the template code into the editor, I added a reference to the Microsoft.Speech.dll file in the Solution Explorer window, just as I did for the console program. At the top of the source code, I removed unnecessary using statements, leaving only references to the System, Data, Drawing, and Forms namespaces. Then I added two using statements for the Microsoft.Speech.Recognition and System.Globalization namespaces.

The Windows Forms-based demo does not use speech synthesis, so I do not link to the Microsoft.Speech.Synthesis library. Adding speech synthesis to a Windows Forms application is the same as in a console application.

In the Visual Studio designer, I dragged TextBox, CheckBox, and ListBox controls onto the Form. I double-clicked the CheckBox, and Visual Studio automatically created the skeleton of the CheckedChanged event handler method.

Recall that the demo console program immediately began listening for spoken commands and continued to do so until it terminated. This approach could be used in a Windows Forms application, but instead I decided to allow the user to turn speech recognition on and off using a CheckBox control (i.e., a check box).

The source code in the demo program's Form1.cs file, where the partial class is defined, is presented in Fig. 9. A speech recognition engine object is declared and created as a Form member. In the Form's constructor, I hook up the SpeechRecognized event handler and then create and load two Grammar objects:

public Form1()
{
  InitializeComponent();
  sre.SetInputToDefaultAudioDevice();
  sre.SpeechRecognized += sre_SpeechRecognized;
  Grammar g_HelloGoodbye = GetHelloGoodbyeGrammar();
  Grammar g_SetTextBox = GetTextBox1TextGrammar();
  sre.LoadGrammarAsync(g_HelloGoodbye);
  sre.LoadGrammarAsync(g_SetTextBox);
  // sre.RecognizeAsync() is
  // in the CheckBox event handler
}

Fig. 9. Adding speech recognition support to Windows Forms

using System;
using System.Data;
using System.Drawing;
using System.Windows.Forms;
using Microsoft.Speech.Recognition;
using System.Globalization;

namespace WinFormSpeech
{
  public partial class Form1 : Form
  {
    static CultureInfo ci = new CultureInfo("en-us");
    static SpeechRecognitionEngine sre = new SpeechRecognitionEngine(ci);

    public Form1()
    {
      InitializeComponent();
      sre.SetInputToDefaultAudioDevice();
      sre.SpeechRecognized += sre_SpeechRecognized;
      Grammar g_HelloGoodbye = GetHelloGoodbyeGrammar();
      Grammar g_SetTextBox = GetTextBox1TextGrammar();
      sre.LoadGrammarAsync(g_HelloGoodbye);
      sre.LoadGrammarAsync(g_SetTextBox);
      // sre.RecognizeAsync() is
      // in the CheckBox event handler
    }

    static Grammar GetHelloGoodbyeGrammar()
    {
      Choices ch_HelloGoodbye = new Choices();
      ch_HelloGoodbye.Add("hello");
      ch_HelloGoodbye.Add("goodbye");
      GrammarBuilder gb_result = new GrammarBuilder(ch_HelloGoodbye);
      Grammar g_result = new Grammar(gb_result);
      return g_result;
    }

    static Grammar GetTextBox1TextGrammar()
    {
      Choices ch_Colors = new Choices();
      ch_Colors.Add(new string[] { "red", "white", "blue" });
      GrammarBuilder gb_result = new GrammarBuilder();
      gb_result.Append("set text box 1");
      gb_result.Append(ch_Colors);
      Grammar g_result = new Grammar(gb_result);
      return g_result;
    }

    private void checkBox1_CheckedChanged(object sender, EventArgs e)
    {
      if (checkBox1.Checked == true)
        sre.RecognizeAsync(RecognizeMode.Multiple);
      else if (checkBox1.Checked == false) // disabled
        sre.RecognizeAsyncCancel();
    }

    void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
      string txt = e.Result.Text;
      float conf = e.Result.Confidence;
      if (conf < 0.65) return;
      this.Invoke(new MethodInvoker(() =>
        { listBox1.Items.Add("I heard you say: " + txt); })); // WinForm specifics
      if (txt.IndexOf("text") >= 0 && txt.IndexOf("box") >= 0 && txt.IndexOf("1") >= 0)
      {
        string[] words = txt.Split(' ');
        this.Invoke(new MethodInvoker(() =>
          { textBox1.Text = words[4]; })); // WinForm specifics
      }
    }
  } // Form
} // ns

I could have created two Grammar objects directly, like in a console program, but instead, to make the code a little clearer, I defined two helper methods (GetHelloGoodbyeGrammar and GetTextBox1TextGrammar) that do the job.

static Grammar GetTextBox1TextGrammar()
{
  Choices ch_Colors = new Choices();
  ch_Colors.Add(new string[] { "red", "white", "blue" });
  GrammarBuilder gb_result = new GrammarBuilder();
  gb_result.Append("set text box 1");
  gb_result.Append(ch_Colors);
  Grammar g_result = new Grammar(gb_result);
  return g_result;
}

This helper method will recognize the phrase "set text box 1 red". However, the user is not required to pronounce this phrase exactly. For example, he could say "Please set the text in text box 1 to red" and the speech recognition engine would still recognize the phrase as "set text box 1 red" - albeit with a lower confidence value than an exact match with the Grammar template. In other words, when creating Grammar objects, you are not required to take into account all variations of a phrase. This radically simplifies the use of speech recognition.

The event handler for CheckBox is defined like this:

private void checkBox1_CheckedChanged(object sender, EventArgs e)
{
  if (checkBox1.Checked == true)
    sre.RecognizeAsync(RecognizeMode.Multiple);
  else if (checkBox1.Checked == false) // disabled
    sre.RecognizeAsyncCancel();
}

The speech recognition engine object, sre (speech recognition engine), always exists for the life of a Windows Forms application. This object is activated and deactivated by calls to the RecognizeAsync and RecognizeAsyncCancel methods when the user toggles the CheckBox respectively.

The SpeechRecognized event handler definition starts with:

void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  string txt = e.Result.Text;
  float conf = e.Result.Confidence;
  if (conf < 0.65) return;
  ...

Besides the more or less constantly used Result.Text and Result.Confidence properties, the Result object has several other useful but more complex properties that you may want to explore; for example, Homophones and ReplacementWordUnits. In addition, the speech recognition engine provides several useful events like SpeechHypothesized.
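
For example, a minimal sketch of subscribing to SpeechHypothesized, which is not something the demo does (I'm assuming it is wired up in the Form constructor next to SpeechRecognized):

sre.SpeechHypothesized += (sender, e) =>
{
  // Partial, possibly wrong guesses arrive here while the user is still speaking.
  this.Invoke(new MethodInvoker(() =>
    { listBox1.Items.Add("(hypothesis: " + e.Result.Text + ")"); }));
};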

The handler then adds the recognized text to the ListBox. In Fig. 9 this is done through a MethodInvoker delegate; an equivalent version using the Action delegate looks like this:

this.Invoke((Action)(() => listBox1.Items.Add("I heard you say: " + txt)));

In theory, using the MethodInvoker delegate is slightly more efficient than Action in this situation, because MethodInvoker is part of the System.Windows.Forms namespace and is therefore specific to Windows Forms applications; the Action delegate is more general-purpose. This example shows that you can manipulate a Windows Forms application entirely through the speech recognition engine, which is an incredibly powerful and useful capability.

Conclusion

The information presented in this article should get you started right away if you want to explore speech synthesis and recognition in .NET applications. Mastering the technology itself is straightforward once you get past the bumps of the initial learning curve and component installation. The real challenge is understanding when speech synthesis and recognition are actually useful.

With console programs, you can create interesting back-and-forth conversations where the user asks a question and the program answers, resulting in essentially a Cortana-like environment. You must exercise some caution because when speech comes from your computer's speakers, it will be picked up by the microphone and may be recognized again. I've found myself in some pretty funny situations where I asked a question, the app recognized it and responded, but the spoken answer triggered the next recognition event, and I ended up with a funny, endless speech loop.
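
One simple mitigation, offered here only as a sketch of an idea rather than what the demo does, is to suspend recognition while the synthesizer is speaking and resume it afterwards:

static void SpeakWithoutFeedback(string text)
{
  sre.RecognizeAsyncCancel();                 // stop listening so the speakers can't trigger recognition
  ss.Speak(text);                             // synchronous; returns once the phrase has been spoken
  sre.RecognizeAsync(RecognizeMode.Multiple); // resume listening
}

In practice you may need to wait for the recognizer's RecognizeCompleted event before restarting, since cancellation is asynchronous.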

Another possible use of speech in a console program is recognizing commands like "Launch Notepad" and "Launch Word". In other words, such a console program can be used on your computer to perform actions that would otherwise require a lot of manipulation of the keyboard and mouse.
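
A minimal sketch of such a handler, assuming a grammar containing phrases like "launch notepad" and "launch word" has already been loaded (the handler name and the winword.exe path are illustrative assumptions):

static void launcher_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  if (e.Result.Confidence < 0.60) return;
  string txt = e.Result.Text.ToLower();
  if (txt.IndexOf("launch notepad") >= 0)
    System.Diagnostics.Process.Start("notepad.exe");
  else if (txt.IndexOf("launch word") >= 0)
    System.Diagnostics.Process.Start("winword.exe"); // assumes Word's folder is on the PATH
}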

Dr. James McCaffrey works for Microsoft Research in Redmond, Washington. He has taken part in the creation of several Microsoft products, including Internet Explorer and Bing. He can be contacted at [email protected].

Thanks to Microsoft Research experts Rob Gruen, Mark Marron, and Curtis von Veh for reviewing this article.

In this part, I'll show how to add fast, fully offline speech recognition to an Android application (using the Pocketsphinx engine) with a real Hello World example of home appliance control.
Why home appliances? Because such an example lets you appreciate the speed and accuracy that can be achieved with completely local speech recognition, without servers such as Google ASR or Yandex SpeechKit.
I'm also attaching to the article all the source code of the program and the Android build itself.

Why suddenly?

Having recently come across this, I asked the author why he used server-based speech recognition for his program (in my opinion it was unnecessary and led to some problems). In response, I was asked whether I could describe in more detail the use of alternative methods for projects where there is no need to recognize arbitrary text and the dictionary consists of a finite set of words, ideally with an example of practical application...

Why do we need anything else besides Yandex and Google?

For that very "practical application" I chose the topic of voice control for a smart home.
Why this example? Because it demonstrates several advantages of completely local speech recognition over recognition using cloud solutions. Namely:
  • Speed: we do not depend on servers, and therefore not on their availability, bandwidth, and other such factors.
  • Accuracy: our engine works only with the dictionary that our application cares about, which improves recognition quality.
  • Price: we don't have to pay for each request to a server.
  • Voice activation: as an additional bonus to the points above, we can "listen to the air" constantly without wasting our traffic or loading the servers.

Note

Let me say right away that these advantages count as advantages only for a certain class of projects, where we know for sure in advance what dictionary and what grammar the user will operate with. That is, when we do not need to recognize arbitrary text (for example, an SMS message or a search query). Otherwise, cloud recognition is indispensable.

So Android can recognize speech without the Internet!
Yes, yes... but only on Jelly Bean, and only from half a meter away, no more. And that recognition is the same dictation, only using a much smaller model, so we can't manage or configure it either, and what it will return next time is unknown. Although it's just right for SMS!

What do we do?

We will implement a voice remote control for home appliances that works accurately and quickly, from a few meters away, even on cheap, low-end Android smartphones, tablets, and watches.
The logic will be simple but very practical. We activate the microphone and say one or more device names. The application recognizes them and turns them on or off depending on their current state. Or it queries their state and announces it in a pleasant female voice, for example the current temperature in the room.

Practical applications abound

In the morning, without opening your eyes, you slap your palm on the smartphone screen on the nightstand and command "Good morning!" - a script starts, the coffee maker turns on and hums, pleasant music plays, the curtains open.
Let's hang a cheap smartphone (a couple of thousand rubles, no more) on the wall in each room. We come home after work and command into the void: "Smart home! Lights, TV!" - I don't think there's any need to say what happens next.

Transcriptions



Grammar describes what the user can say. For Pocketsphinx to know how it will be pronounced, each word in the grammar must have a description of how it sounds in the corresponding language model, that is, a transcription of every word. The set of transcriptions is called a dictionary.

Transcriptions are described using a special syntax. For example, for the Russian words "умный" (smart) and "дом" (house):

умный uu m n ay j
дом d oo m

In principle, nothing complicated. A double vowel in a transcription indicates stress. A double consonant is a soft consonant followed by a vowel. All possible combinations for all the sounds of the Russian language can be found in the language model itself.

It is clear that we cannot describe all the transcriptions in our application in advance, because we do not know in advance the names the user will give to their devices. Therefore, we generate such transcriptions "on the fly" according to some rules of Russian phonetics. To do this, you can implement a PhonMapper class that takes a string as input and generates the correct transcription for it.

Voice activation

This is the ability of the speech recognition engine to "listen to the air" all the time and react to a predetermined phrase (or phrases). All other sounds and speech are discarded. This is not the same as describing a grammar and just turning on the microphone. I won't present the theory behind this task or the mechanics of how it works here. Let me just say that the programmers working on Pocketsphinx recently implemented such a function, and now it is available out of the box in the API.

One thing is definitely worth mentioning. For an activation phrase you need not only to specify the transcription, but also to select a suitable sensitivity threshold value. A value that is too small will lead to many false positives (when you did not say the activation phrase but the system recognizes it anyway), while a value that is too high will make the system unresponsive. This setting is therefore of particular importance. The approximate range of values is from 1e-1 to 1e-40, depending on the activation phrase.

Proximity sensor activation

This task is specific to our project and is not directly related to recognition. The code can be seen directly in the main activity.
It implements SensorEventListener: at the moment of approach (when the sensor value is less than the maximum) it starts a timer that checks after a certain delay whether the sensor is still blocked. This is done to eliminate false positives.
When the sensor is unblocked again, we stop recognition and get the result (see the description below).

Let's start recognition

Pocketsphinx provides a convenient API for configuring and running the recognition process. These are the classes SpeechRecognizer and SpeechRecognizerSetup.
This is what the configuration and launch of recognition looks like:

PhonMapper phonMapper = new PhonMapper(getAssets().open("dict/ru/hotwords"));
Grammar grammar = new Grammar(names, phonMapper);
grammar.addWords(hotword);
DataFiles dataFiles = new DataFiles(getPackageName(), "ru");
File hmmDir = new File(dataFiles.getHmm());
File dict = new File(dataFiles.getDict());
File jsgf = new File(dataFiles.getJsgf());
copyAssets(hmmDir);
saveFile(jsgf, grammar.getJsgf());
saveFile(dict, grammar.getDict());
mRecognizer = SpeechRecognizerSetup.defaultSetup()
        .setAcousticModel(hmmDir)
        .setDictionary(dict)
        .setBoolean("-remove_noise", false)
        .setKeywordThreshold(1e-7f)
        .getRecognizer();
mRecognizer.addKeyphraseSearch(KWS_SEARCH, hotword);
mRecognizer.addGrammarSearch(COMMAND_SEARCH, jsgf);

Here we first copy all the necessary files to disk (Pocketsphinx requires an acoustic model, a grammar, and a dictionary with transcriptions to be on disk). Then the recognition engine itself is configured: the paths to the model and dictionary files are specified, along with some parameters (the sensitivity threshold for the activation phrase). Next, the path to the grammar file, as well as the activation phrase itself, is configured.

As you can see from this code, one engine is configured for both grammar and activation phrase recognition. Why is this done? So that we can quickly switch between what we currently need to recognize. This is what starting the activation phrase recognition process looks like:

mRecognizer.startListening(KWS_SEARCH);
And this is how speech is recognized according to a given grammar:

mRecognizer.startListening(COMMAND_SEARCH, 3000);
The second argument (optional) is the number of milliseconds after which recognition will automatically end if no one says anything.
As you can see, you can use only one engine to solve both problems.

How to get the recognition result

To get the recognition result, you must also specify an event listener that implements the interface RecognitionListener.
It has several methods that Pocketsphinx calls when one of the following events occurs:
  • onBeginningOfSpeech: the engine heard some sound, which may or may not be speech;
  • onEndOfSpeech: the sound has ended;
  • onPartialResult: intermediate recognition results are available. For an activation phrase, this means it has been triggered. The Hypothesis argument contains the recognition data (the string and a score);
  • onResult: the final recognition result. This method is called after the stop method of SpeechRecognizer is called. The Hypothesis argument contains the recognition data (the string and a score).

By implementing the onPartialResult and onResult methods in one way or another, you can change the recognition logic and obtain the final result. Here's how it's done in the case of our application:

@Override
public void onEndOfSpeech() {
    Log.d(TAG, "onEndOfSpeech");
    if (mRecognizer.getSearchName().equals(COMMAND_SEARCH)) {
        mRecognizer.stop();
    }
}

@Override
public void onPartialResult(Hypothesis hypothesis) {
    if (hypothesis == null) return;
    String text = hypothesis.getHypstr();
    if (KWS_SEARCH.equals(mRecognizer.getSearchName())) {
        startRecognition();
    } else {
        Log.d(TAG, text);
    }
}

@Override
public void onResult(Hypothesis hypothesis) {
    mMicView.setBackgroundResource(R.drawable.background_big_mic);
    mHandler.removeCallbacks(mStopRecognitionCallback);
    String text = hypothesis != null ? hypothesis.getHypstr() : null;
    Log.d(TAG, "onResult " + text);
    if (COMMAND_SEARCH.equals(mRecognizer.getSearchName())) {
        if (text != null) {
            Toast.makeText(this, text, Toast.LENGTH_SHORT).show();
            process(text);
        }
        mRecognizer.startListening(KWS_SEARCH);
    }
}

When we receive the onEndOfSpeech event while we are recognizing a command to execute, we need to stop recognition, after which onResult will be called immediately.
In onResult you need to check what was just recognized. If this is a command, then you need to launch it for execution and switch the engine to recognize the activation phrase.
In onPartialResult we are only interested in recognizing the activation phrase. If we detect it, we immediately start the command recognition process. Here's what it looks like:

private synchronized void startRecognition() {
    if (mRecognizer == null || COMMAND_SEARCH.equals(mRecognizer.getSearchName())) return;
    mRecognizer.cancel();
    new ToneGenerator(AudioManager.STREAM_MUSIC, ToneGenerator.MAX_VOLUME)
            .startTone(ToneGenerator.TONE_CDMA_PIP, 200);
    post(400, new Runnable() {
        @Override
        public void run() {
            mMicView.setBackgroundResource(R.drawable.background_big_mic_green);
            mRecognizer.startListening(COMMAND_SEARCH, 3000);
            Log.d(TAG, "Listen commands");
            post(4000, mStopRecognitionCallback);
        }
    });
}
Here we first play a short tone to notify the user that we have heard them and are ready for their command. While the tone is playing, the microphone should not be listening, so we start recognition after a short timeout (slightly longer than the duration of the tone, so that we don't hear its echo). We also schedule a callback that will forcefully stop recognition if the user talks for too long. In this case it is 3 seconds.

How to turn recognized string into commands

Well, everything here is specific to a particular application. In the case of our example, we simply pull the device names out of the recognized string, search for the desired device, and either change its state using an HTTP request to the smart home controller or report its current state (as in the case of a thermostat). This logic can be seen in the Controller class.

How to synthesize speech

Speech synthesis is the inverse of recognition: here you need to turn a line of text into speech so that the user can hear it.
In the case of the thermostat, we make our Android device speak the current temperature. Using the TextToSpeech API, this is quite easy to do (thanks to Google for the wonderful female TTS voice for Russian):

private void speak(String text) {
    synchronized (mSpeechQueue) {
        mRecognizer.stop();
        mSpeechQueue.add(text);
        HashMap<String, String> params = new HashMap<String, String>(2);
        params.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID, UUID.randomUUID().toString());
        params.put(TextToSpeech.Engine.KEY_PARAM_STREAM, String.valueOf(AudioManager.STREAM_MUSIC));
        params.put(TextToSpeech.Engine.KEY_FEATURE_NETWORK_SYNTHESIS, "true");
        mTextToSpeech.speak(text, TextToSpeech.QUEUE_ADD, params);
    }
}

It may sound banal, but before the synthesis process starts, recognition must be disabled. On some devices (for example, all Samsung devices) it is impossible to listen to the microphone and synthesize speech at the same time.
The end of speech synthesis (that is, the end of the process of the synthesizer speaking the text) can be tracked in a listener:

private final TextToSpeech.OnUtteranceCompletedListener mUtteranceCompletedListener =
        new TextToSpeech.OnUtteranceCompletedListener() {
    @Override
    public void onUtteranceCompleted(String utteranceId) {
        synchronized (mSpeechQueue) {
            mSpeechQueue.poll();
            if (mSpeechQueue.isEmpty()) {
                mRecognizer.startListening(KWS_SEARCH);
            }
        }
    }
};

In it, we simply check if there is anything else in the synthesis queue, and enable activation phrase recognition if there is nothing else.

And it's all?

Yes! As you can see, quickly and efficiently recognizing speech directly on the device is not difficult at all, thanks to the presence of such wonderful projects as Pocketsphinx. It provides a very convenient API that can be used in solving problems related to recognizing voice commands.

In this example, we have attached recognition to a completely specific task - voice control of smart home devices. Due to local recognition, we achieved very high speed and minimized errors.
It is clear that the same code can be used for other voice-related tasks. It doesn't have to be a smart home.


This phone has speech recognition, or voice input, but it only works via the Internet, connecting to Google services. However, the phone can be taught to recognize speech without the Internet; we will look at how to enable offline Russian-language recognition. For this method to work, you must have two applications installed: Voice Search and Google Search, although these programs are usually already present in the factory firmware.

For firmware 2.2

Go to your phone settings and open the "Offline speech recognition" menu item.

Select the Russian language pack and download it.

    For firmware 2.8B

In the new firmware the menu item "Offline speech recognition" is absent.

If you had offline packages installed before the firmware update and did not wipe (reset settings) during the update, they should have been preserved. Otherwise you will have to roll back to firmware 2.2, install the voice packages, and only then update the system to 2.8B.

    For Rev.B devices

We install the update through recovery and enjoy offline voice recognition.

    2. Download the database for Russian speech and copy it to the SD card

Download Russian_offline.zip

3. Enter recovery by holding Volume+ and the Power button with the phone turned off.

    4. Select Apply update from external storage and select the downloaded archive.

    No program can completely replace the manual work of transcribing recorded speech. However, there are solutions that can significantly speed up and facilitate the translation of speech into text, that is, simplify transcription.

Transcription is converting the content of an audio or video file into text form. There are paid tasks on the Internet where the performer is paid a certain amount of money for transcribing text.

Speech-to-text conversion is useful for:

    • students to translate recorded audio or video lectures into text,
    • bloggers running websites and blogs,
  • writers and journalists for writing books and texts,
    • information businessmen who need a text after their webinar, speech, etc.,
    • people who have difficulty typing - they can dictate a letter and send it to family or friends,
    • other options.

    We will describe the most effective tools available on PCs, mobile applications and online services.

    1 Website speechpad.ru

    This is an online service that allows you to translate speech into text using the Google Chrome browser. The service works with a microphone and ready-made files. Of course, the quality will be much higher if you use an external microphone and dictate yourself. However, the service does a good job even with YouTube videos.

Click "Enable recording" and answer the question about "Using a microphone" by clicking "Allow".

    The long instructions about using the service can be collapsed by clicking on button 1 in Fig. 3. You can get rid of advertising by completing a simple registration.

Fig. 3. The Speechpad service

    The finished result is easy to edit. To do this, you need to either manually correct the highlighted word or dictate it again. The results of the work are saved in your personal account, they can also be downloaded to your computer.

    List of video lessons on working with speechpad:

    You can transcribe videos from Youtube or from your computer, however, you will need a mixer, more details:

    Video "audio transcription"

The service operates in seven languages. There is a small drawback: if you need to transcribe a finished audio file, its sound is played through the speakers, which creates additional interference in the form of an echo.

    2 Service dictation.io

    A wonderful online service that allows you to translate speech into text for free and easily.

Fig. 4. The dictation.io service

Number 1 in Fig. 4 marks where the Russian language can be selected, at the bottom of the page. In the Google Chrome browser the language can be selected, but for some reason in Mozilla there is no such option.

It is noteworthy that auto-saving of the finished result is implemented. This prevents accidental deletion when a tab or the browser is closed. This service does not recognize ready-made files; it works with a microphone. You need to name punctuation marks when dictating.

    The text is recognized quite correctly, there are no spelling errors. You can insert punctuation marks yourself from the keyboard. The finished result can be saved on your computer.

    3 RealSpeaker

This program makes it easy to translate human speech into text. It is designed to work on different systems: Windows, Android, Linux, and Mac. With it, you can convert speech spoken into a microphone (for example, one built into a laptop) as well as speech recorded in audio files.

    Can understand 13 world languages. There is a beta version of the program that works as an online service:

    You need to follow the link above, select the Russian language, upload your audio or video file to the online service and pay for its transcription. After transcription, you can copy the resulting text. The larger the file for transcription, the more time it will take to process it, more details:

    In 2017 there was a free transcription option using RealSpeaker, but in 2018 there is no such option. It is very confusing that the transcribed file is available to all users for downloading; perhaps this will be improved.

    Contacts of the developer (VKontakte, Facebook, Youtube, Twitter, email, phone) of the program can be found on the page of his website (more precisely, in the footer of the site):

    4 Speechlogger

    An alternative to the previous application for mobile devices running on Android. Available for free in the app store:

    The text is edited automatically and punctuation marks are added. Very convenient for dictating notes to yourself or making lists. As a result, the text will be of very decent quality.

    5 Dragon Dictation

    This is an application that is distributed free of charge for mobile devices from Apple.

    The program can work with 15 languages. It allows you to edit the result and select the desired words from the list. You need to clearly pronounce all sounds, not make unnecessary pauses and avoid intonation. Sometimes there are mistakes in the endings of words.

Owners use the Dragon Dictation application, for example, to dictate a shopping list while moving around the apartment; at the store they can then look at the text in the note rather than listen to a recording.

    Whatever program you use in your practice, be prepared to double-check the results and make certain adjustments. This is the only way to get a flawless text without errors.
