Searching for an optimal speech recognition system: closed source code, but with an open API for integration

Google Speech API: Google's voice recognition service.

Speech recognition lets you build automatic customer service systems in cases where touch-tone control is not practical. As an example, consider an airline ticket booking service that involves selecting from a large number of cities. A tone menu in such a service is inconvenient, so voice control is the most effective option. The dialogue between the system and the subscriber may look like this:

System: Hello. Where do you want to fly?
Subscriber: Kazan
System: Where do you want to fly from?
Subscriber: Moscow
System: Specify the departure date.
Subscriber: April 10

Typical applications of speech recognition include:
  • Voice navigation in multi-level IVR menus and automatic connection to the right employee
  • Address recognition for delivery
  • Automatic voice authentication of users when they request personalized or confidential information by phone or via the Internet
  • Help systems for information services
  • Corporate customer voice self-service (balance requests, personal account checks, ticket booking)

A speech recognition system typically consists of the following parts:

  • Recording a message from a subscriber
  • Voice recognition and receiving text data from the service
  • Analyzing the information received and taking the necessary actions
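
To make these three stages concrete, here is a minimal Python sketch of the loop. Every name in it (record_from_subscriber, recognize, handle_intent) is a hypothetical placeholder for your telephony platform, recognition service and business logic, not a real API:

def record_from_subscriber():
    """Stage 1: record the caller's message (stubbed here with silence)."""
    return b"\x00" * 32000  # a real system would return audio captured from the call

def recognize(audio_bytes):
    """Stage 2: send the recording to the recognition service, receive text.

    Stubbed here; in this article the real call is an HTTP request to the
    Google Speech API (see the scripts below).
    """
    return "kazan"

def handle_intent(text):
    """Stage 3: analyze the recognized text and take the necessary action."""
    if "kazan" in text.lower():
        print("Selected destination: Kazan")
    else:
        print("Destination not recognized, repeating the question")

if __name__ == "__main__":
    handle_intent(recognize(record_from_subscriber()))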

To use the Google Speech API in your Oktell system, do the following:

Step 1. Download and import the scripts into your Oktell system.

Download the script (for Oktell versions older than 2.10):

The archive contains two scripts:

  • Google_Speech_API_main: records the voice message; an example of proper use of the recognition service in the main scenario.
  • Google_Speech_API: sends the recording to the Google service and receives the recognized message.

After importing the scripts into Oktell, save them with "To server".

NOTE: The Google Speech API is a paid product. The script (the GoogleVoice web request component) uses a trial key, which may be blocked after a certain number of requests. During testing, no request limit was encountered. If you want to purchase the paid version of the Google Speech API, contact Google support.

Step 2. In the "Administration" - "External numbers" module, add an extension number of the "Launching IVR" type. Select the Google_Speech_API_main IVR scenario.

Here is information from the site vorabota.ru:

To start converting voice to text, you will need a microphone (built into laptops), a reasonably fast internet connection, and the Google Chrome browser, version 25 or later. Unfortunately, the voice typing feature does not work in other browsers.

Open the voice input page in Chrome. At the bottom of the window, select the language in which you plan to dictate. Click the microphone icon in the top right corner, and in the pop-up bar click "Allow" so the browser can use the microphone.

Now you can slowly and clearly pronounce short phrases. After you finish dictating, select the text, copy it to the clipboard with Ctrl+C, and paste it into any editor for processing. If desired, the text can be sent immediately by email.

The Web Speech API is perhaps the simplest and a fairly high-quality way to convert speech into text, since there is no need for extra keyboard manipulation: just turn on the microphone and speak.

In any case, you will have to use a separate text editor to correct the dictated text afterwards.

I opened the page http://vorabota.ru/voice/text.html in Google Chrome and tried voice text input. I read the phrase "Web Speech API Voice typing. Select all. Send Email" and received "Websphere api voice typing select all send email". Second try: I said "Click the Allow button to unmute the microphone" and received "Click the Allow button to enable the microphone".

Comparing the original phrases with the results shows that: a) a Russian phrase is converted into Russian text with sufficient quality; b) an English phrase is converted into English text with errors that are easy to correct; c) manual correction is still required to fix errors and add punctuation marks and capital letters; d) this implementation of voice text input differs from the others available on the Internet in its extreme simplicity: there is nothing superfluous in it, which makes it easy to learn and use.

My conclusion: it makes sense to implement this kind of voice text input on your website to make entering text on its pages easier.

You just need to insert the required code on the appropriate page of the site.

I created a separate page intended only for voice text input and began debugging it.

Here is the code of the "Dictate the text" page:

Debugging code...

You can use the given code on your website, transforming it as you see fit.

I invite everyone to speak out in the comments.

As of today, independent developers have access to the Cloud Speech API, the speech recognition technology on which Google products are built. The updated product is now available on Google Cloud.

An open beta version of Cloud Speech was released last summer. This technology, with its simple API, allows developers to convert audio to text. The neural network models can recognize more than 80 languages and dialects, and the finished transcription appears as soon as the text is spoken.

The API is built on the technology that provides speech recognition in Google Assistant, Search and Now, but the new version includes changes that adapt it to the needs of Cloud users.

What's different about the new version of the Cloud Speech API?

Thanks to developer feedback, the Google team improved the accuracy of transcribing long audio recordings and sped up data processing by 3 times compared to the original version. Support for more audio formats has also been added, including WAV, OPUS and Speex.
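
For illustration, the audio format is chosen through the encoding field of the recognition config. A hedged sketch of what such configs might look like; the enum names OGG_OPUS and SPEEX_WITH_HEADER_BYTE and the v1 field name sampleRateHertz are assumptions based on the Cloud Speech REST documentation, so verify them before use:

# possible v1 recognition configs for the supported formats (names are assumptions)
config_wav = {"encoding": "LINEAR16",  # WAV / raw 16-bit PCM
              "sampleRateHertz": 16000, "languageCode": "en-US"}
config_opus = {"encoding": "OGG_OPUS",  # Opus in an Ogg container
               "sampleRateHertz": 16000, "languageCode": "en-US"}
config_speex = {"encoding": "SPEEX_WITH_HEADER_BYTE",  # Speex with header byte
                "sampleRateHertz": 16000, "languageCode": "en-US"}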

According to statistics, in the past this API was most often used to control applications and devices: voice search, speech commands and voice menus. But Cloud Speech can be used in a wide range of IoT devices, including cars, TVs, speakers and, of course, phones and PCs.

Among the common uses of the technology, it is worth noting its use in organizations to analyze call center performance, track communication with customers and increase sales.

Choosing a Speech Recognition API

Tags: Asterisk, Google API, Yandex API

    I only considered API options; boxed solutions were not needed: they require resources, the recognition data is not business-critical, and using them is much more complicated and takes more man-hours.

    The first was Yandex SpeechKit Cloud. I immediately liked it because of its ease of use:

    curl -X POST -H "Content-Type: audio/x-wav" --data-binary "@speech.wav" "https://asr.yandex.net/asr_xml?uuid=<user ID>&key=<API key>&topic=queries"
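
    Roughly the same request can be made from Python. A minimal sketch using the requests library (the endpoint and parameters come from the curl line above; the uuid and key placeholders are yours to fill in):

    import requests

    # the same SpeechKit Cloud request as the curl example above
    with open("speech.wav", "rb") as f:
        audio = f.read()

    params = {
        "uuid": "<user ID>",  # your user identifier
        "key": "<API key>",   # your SpeechKit Cloud key
        "topic": "queries",
    }
    resp = requests.post("https://asr.yandex.net/asr_xml",
                         params=params,
                         data=audio,
                         headers={"Content-Type": "audio/x-wav"})
    print(resp.text)  # the service answers with XML containing recognition variants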
    Pricing: 400 rubles per 1000 requests, and the first month is free. But after that there were only disappointments:

    - When a long sentence was submitted, a response of only 2-3 words came back
    - Those words were recognized in a strange order
    - Attempts to change the topic parameter did not improve the results

    Perhaps this was due to the mediocre recording quality; we tested everything through voice gateways and ancient Panasonic phones. Still, I plan to use this service in the future to build an IVR.

    Next was the service from Google. The Internet is full of articles suggesting the API intended for Chromium developers, but keys for that API can no longer be obtained so easily. So we will use the commercial platform instead.

    Pricing: 0-60 minutes per month are free; after that, $0.006 per 15 seconds of speech, with each request rounded up to a multiple of 15 seconds (for example, a 20-second clip is billed as 30 seconds, i.e. 2 × $0.006 = $0.012). The first two months are free; a credit card is required to create a project. The API use cases in the underlying documentation are varied. We will use a Python script:

    Script from documentation

    """Google Cloud Speech API sample application using the REST API for batch processing.""" import argparse import base64 import json from googleapiclient import discovery import httplib2 from oauth2client.client import GoogleCredentials DISCOVERY_URL = ("https://(api). googleapis.com/$discovery/rest?" "version=(apiVersion)") def get_speech_service(): credentials = GoogleCredentials.get_application_default().create_scoped(["https://www.googleapis.com/auth/cloud-platform "]) http = httplib2.Http() credentials.authorize(http) return discovery.build("speech", "v1beta1", http=http, discoveryServiceUrl=DISCOVERY_URL) def main(speech_file): """Transcribe the given audio file. Args: speech_file: the name of the audio file. """ with open(speech_file, "rb") as speech: speech_content = base64.b64encode(speech.read()) service = get_speech_service() service_request = service.speech ().syncrecognize(body=( "config": ( "encoding": "LINEAR16", # raw 16-bit signed LE samples "sampleRate": 16000, # 16 khz "languageCode": "en-US", # a BCP-47 language tag ), "audio": ( "content": speech_content.decode("UTF-8") ) )) response = service_request.execute() print(json.dumps(response)) if __name__ == " __main__": parser = argparse.ArgumentParser() parser.add_argument("speech_file", help="Full path of audio file to be recognized") args = parser.parse_args() main(args.speech_file)

    Preparing to use the Google Cloud Speech API

    We need to register a project and create a service account key for authorization. Here is the link to get the trial; you need a Google account. After registration, activate the API and create an authorization key, then copy the key to the server.

    Let's move on to setting up the server itself. We will need:

    - python
    - python-pip
    - the Google API client for Python (google-api-python-client)

    sudo apt-get install -y python python-pip
    pip install --upgrade google-api-python-client
    Now we need to export two environment variables to successfully work with the API. The first is the path to the service key, the second is the name of your project.

    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_file.json
    export GCLOUD_PROJECT=your-project-id
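
    Before running the recognition itself, it is worth checking that the key is actually picked up. A minimal sketch using the same oauth2client stack as the scripts in this article (the scope URL is the one used above):

    from oauth2client.client import GoogleCredentials

    # GoogleCredentials reads the file pointed to by GOOGLE_APPLICATION_CREDENTIALS
    credentials = GoogleCredentials.get_application_default().create_scoped(
        ["https://www.googleapis.com/auth/cloud-platform"])
    # successfully fetching an access token proves the key and scope are accepted
    token = credentials.get_access_token()
    print("got access token, expires in %s seconds" % token.expires_in)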
    Let's download the test audio file and try to run the script:

    wget https://cloud.google.com/speech/docs/samples/audio.raw
    python voice.py audio.raw
    {"results": [{"alternatives": [{"confidence": 0.98267895, "transcript": "how old is the Brooklyn Bridge"}]}]}
    Great! The first test is successful. Now let’s change the text recognition language in the script and try to recognize it:

    nano voice.py

    service_request = service.speech().syncrecognize(
        body={
            "config": {
                "encoding": "LINEAR16",   # raw 16-bit signed LE samples
                "sampleRate": 16000,      # 16 khz
                "languageCode": "ru-RU",  # a BCP-47 language tag
    We need a .raw audio file; we use sox for the conversion:

    apt-get install -y sox
    sox test.wav -r 16000 -b 16 -c 1 test.raw
    python voice.py test.raw
    {"results": [{"alternatives": [{"confidence": 0.96161985, "transcript": "\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435 \u0412\u0430\u0441 \u043f\u0440\u0438\u0432\u0435\u0442\u0441\u0442\u0432\u0443\u0435\u0442 \u043a\u043e\u043c\u043f\u0430\u043d\u0438\u044f"}]}]}
    Google returns the answer to us in Unicode. But we want to see normal letters. Let's change our voice.py a little:

    print(json.dumps(response))
    We will use:

    s = simplejson.dumps({"var": response}, ensure_ascii=False)
    print s
    Let's add import simplejson. The final script is below the cut:

    voice.py

    """Google Cloud Speech API sample application using the REST API for batch processing.""" import argparse import base64 import json import simplejson from googleapiclient import discovery import httplib2 from oauth2client.client import GoogleCredentials DISCOVERY_URL = ("https://(api ).googleapis.com/$discovery/rest?" "version=(apiVersion)") def get_speech_service(): credentials = GoogleCredentials.get_application_default().create_scoped(["https://www.googleapis.com/auth/cloud -platform"]) http = httplib2.Http() credentials.authorize(http) return discovery.build("speech", "v1beta1", http=http, discoveryServiceUrl=DISCOVERY_URL) def main(speech_file): """Transcribe the given audio file. Args: speech_file: the name of the audio file. """ with open(speech_file, "rb") as speech: speech_content = base64.b64encode(speech.read()) service = get_speech_service() service_request = service .speech().syncrecognize(body=( "config": ( "encoding": "LINEAR16", # raw 16-bit signed LE samples "sampleRate": 16000, # 16 khz "languageCode": "en-US", # a BCP-47 language tag ), "audio": ( "content": speech_content.decode("UTF-8") ) )) response = service_request.execute() s = simplejson.dumps(("var": response ), ensure_ascii=False) print s if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("speech_file", help="Full path of audio file to be recognized") args = parser.parse_args( ) main(args.speech_file)


    But before running it, you will need to export one more environment variable, PYTHONIOENCODING=UTF-8. Without it, I had problems with stdout when the script was called from other scripts.

    export PYTHONIOENCODING=UTF-8
    python voice.py test.raw
    {"var": {"results": [{"alternatives": [{"confidence": 0.96161985, "transcript": "Hello, welcome to the company"}]}]}}
    Great. Now we can call this script in the dialplan.

    Asterisk dialplan example

    To call the script, I will use a simple dialplan:

    exten => 1234,1,Answer
    exten => 1234,n,Wait(1)
    exten => 1234,n,Playback(howtomaketicket)
    exten => 1234,n,Playback(beep)
    exten => 1234,n,Set(FILE=${CALLERID(num)}--${EXTEN}--${STRFTIME(${EPOCH},,%d-%m-%Y--%H-%M-%S)}.wav)
    exten => 1234,n,MixMonitor(${FILE},,/opt/test/send.sh [email protected] "${CDR(src)}" "${CALLERID(name)}" "${FILE}")
    exten => 1234,n,Wait(28)
    exten => 1234,n,Playback(beep)
    exten => 1234,n,Playback(Thankyou!)
    exten => 1234,n,Hangup()
    I use MixMonitor to record the call and run the script when the recording finishes. You could use Record() instead, and that would probably be better. Here is an example send.sh for sending the result; it assumes you already have mutt configured:

    #!/bin/bash
    # script for sending notifications
    # export the necessary environment variables
    # Google license file
    export GOOGLE_APPLICATION_CREDENTIALS=/opt/test/project.json
    # project name
    export GCLOUD_PROJECT=project-id
    # python encoding
    export PYTHONIOENCODING=UTF-8
    # list of input variables
    EMAIL=$1
    CALLERIDNUM=$2
    CALLERIDNAME=$3
    FILE=$4
    # recode the sound file to raw so it can be given to the Google API
    sox /var/spool/asterisk/monitor/$FILE -r 16000 -b 16 -c 1 /var/spool/asterisk/monitor/$FILE.raw
    # run the speech-to-text script and cut off everything except the transcript
    TEXT=`python /opt/test/voice.py /var/spool/asterisk/monitor/$FILE.raw | sed -e 's/.*"transcript"://' -e 's/}]}]}}//'`
    # send the letter, including the recognized text
    echo "new notification from the number: $CALLERIDNUM $CALLERIDNAME $TEXT" | mutt -s "This is the header of the letter" -e 'set from=[email protected] realname="I am sending alerts"' -a "/var/spool/asterisk/monitor/$FILE" -- $EMAIL
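
    As an aside, instead of trimming the JSON with sed, voice.py itself could print only the transcript. A minimal sketch of such an extraction (extract_transcript is a made-up name; it assumes the v1beta1 response shape shown in the examples above):

    def extract_transcript(response):
        """Pull the first transcript out of a syncrecognize response dict."""
        try:
            return response["results"][0]["alternatives"][0]["transcript"]
        except (KeyError, IndexError):
            return ""  # nothing was recognized

    # in voice.py, instead of dumping the whole JSON:
    # print extract_transcript(response)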

    Conclusion

    Thus, we solved the problem. I hope my experience is useful to someone. I will be glad to receive comments (perhaps this is the only reason why it’s worth reading Habr!). In the future, I plan to implement an IVR with voice control elements based on this.

    The Web Speech API enables you to incorporate voice data into web apps. The Web Speech API has two parts: SpeechSynthesis (Text-to-Speech) and SpeechRecognition (Asynchronous Speech Recognition).

    Web Speech Concepts and Usage

    The Web Speech API makes web apps able to handle voice data. There are two components to this API:

    • Speech recognition is accessed via the SpeechRecognition interface, which provides the ability to recognize voice context from an audio input (normally via the device's default speech recognition service) and respond appropriately. Generally you'll use the interface's constructor to create a new SpeechRecognition object, which has a number of event handlers available for detecting when speech is input through the device's microphone. The SpeechGrammar interface represents a container for a particular set of grammar that your app should recognise. Grammar is defined using JSpeech Grammar Format (JSGF).
    • Speech synthesis is accessed via the SpeechSynthesis interface, a text-to-speech component that allows programs to read out their text content (normally via the device's default speech synthesiser). Different voice types are represented by SpeechSynthesisVoice objects, and different parts of text that you want to be spoken are represented by SpeechSynthesisUtterance objects. You can get these spoken by passing them to the SpeechSynthesis.speak() method.

    Web Speech API Interfaces

    Speech recognition

    SpeechRecognition: The controller interface for the recognition service; this also handles the SpeechRecognitionEvent sent from the recognition service.

    SpeechRecognitionAlternative: Represents a single word that has been recognized by the speech recognition service.

    Speech synthesis

    SpeechSynthesis: The controller interface for the speech service; this can be used to retrieve information about the synthesis voices available on the device, start and pause speech, and other commands besides.

    SpeechSynthesisErrorEvent: Contains information about any errors that occur while processing SpeechSynthesisUtterance objects in the speech service.

    SpeechSynthesisEvent: Contains information about the current state of SpeechSynthesisUtterance objects that have been processed in the speech service.

    SpeechSynthesisUtterance: Represents a speech request. It contains the content the speech service should read and information about how to read it (e.g. language, pitch and volume).

    SpeechSynthesisVoice: Represents a voice that the system supports. Every SpeechSynthesisVoice has its own relative speech service, including information about language, name and URI.

    Window.speechSynthesis: Specced out as part of an interface called SpeechSynthesisGetter and implemented by the Window object, the speechSynthesis property provides access to the SpeechSynthesis controller, and is therefore the entry point to speech synthesis functionality.

    Examples

    The Web Speech API repo on GitHub contains demos to illustrate speech recognition and synthesis.
    Specifications

    Web Speech API: Draft. Initial definition.

    Browser compatibility (condensed from the MDN compatibility table): SpeechRecognition is experimental. It is supported in Chrome from version 33 and in Chrome for Android, implemented with the webkit vendor prefix; note that you'll need to serve your code through a web server for recognition to work. Support in Edge, Android WebView and Samsung Internet is unknown, and Firefox, Internet Explorer, Opera and Safari (desktop and iOS) do not support it. The compatibility table is generated from structured data; to contribute, check out https://github.com/mdn/browser-compat-data and send a pull request.