How to fix Azure Speech-To-Text audio format and sampling rate limitations
Laurent Egbakou
March 03, 2023

Table Of Contents

1. Introduction
2. Prerequisites
3. Support all audio formats
4. Using ffmpeg to adjust the sampling rate
5. Speech SDK in action
6. Support
7. Useful links

Introduction

Speech recognition technology has become increasingly prevalent in various applications, including virtual assistants, subtitling, transcription, customer service, accessibility, and more. Azure Cognitive Services provides a powerful and accurate Speech-to-Text (STT) API that can recognize speech from audio files of various formats, languages, and quality levels. With this API and its SDKs, users can easily transcribe speech from WAV audio files.

However, the STT service's SDK only supports audio in WAV format (16 kHz or 8 kHz, 16-bit, mono PCM) out of the box. If your audio is not in WAV or PCM format, you must use additional tools like GStreamer or ffmpeg.

In this blog post, I will share with you how to easily transcribe audio of any format and with different sampling rates using Python.

Prerequisites

To follow the steps in this blog post, you will need to have the following prerequisites:

  • A Speech resource created in the Azure portal, along with its key and region.
  • Python installed on your local machine.
  • ffmpeg installed on your machine. ffmpeg is a cross-platform solution for recording, converting, and streaming audio and video. We'll use it to convert the audio sampling rate to a rate supported by Azure Cognitive Services. Make sure the ffmpeg command line works after installation.
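As a quick sanity check before going further, you can verify from Python that ffmpeg is actually on your PATH. This is a minimal sketch using only the standard library; the helper name is illustrative:

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if the ffmpeg executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found - install it and make sure it is on your PATH")
```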

Support all audio formats

To support all audio formats, we can use the popular pydub Python library. Note that pydub requires ffmpeg to be installed in order to open and save non-WAV files such as MP3.

You can install the pydub library using pip.

pip install pydub

What should we do next?

Once you receive the audio file from the user request, you should:

  • Get the extension of the audio file.

  • Rename the file using a unique name.

  • Save the audio file. We can use the aiofiles Python library to support asynchronous file operations. aiofiles can be installed with: pip install aiofiles

  • If the extension is not wav, convert the file to WAV and set the sampling rate to 16 kHz.

from pathlib import Path
import subprocess

import aiofiles
import aiofiles.os
from pydub import AudioSegment

# audio_file is the input (a FastAPI UploadFile object, for example)
extension = audio_file.filename.split(".")[-1]
filename_renamed = "random_unique_name"

# Where you want to store files temporarily
tmp_file_store_path = "uploads"

# Path of the file as received
user_file_path = f"{tmp_file_store_path}/{filename_renamed}.{extension}"

# Path of the file after conversion to WAV
file_path = f"{tmp_file_store_path}/{filename_renamed}.wav"

# File to pass to the Speech SDK to get a response from Azure Cognitive Services
speechsdk_input_file = file_path

# Save the file so pydub can convert it later
async with aiofiles.open(user_file_path, 'wb') as out_file:
    content = await audio_file.read()
    await out_file.write(content)

# If the extension is not wav, convert the file to WAV at 16 kHz
if extension != "wav":
    audio = AudioSegment.from_file(user_file_path, format=extension)
    audio = audio.set_frame_rate(16000)
    audio.export(file_path, format="wav")
# If it is already a WAV file, see the next part

# Here you can use the Speech SDK and pass file_path.
# Do not forget to delete all temporary files created as part of the user request.
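The snippet above uses the placeholder "random_unique_name". In practice, a collision-safe name can be generated with uuid4. The sketch below (function name and directory are illustrative, not part of the original code) builds the two temporary paths from an uploaded filename:

```python
import uuid
from pathlib import Path


def make_temp_paths(original_filename: str, tmp_dir: str = "uploads"):
    """Build the temporary upload path and the target WAV path for an uploaded file."""
    extension = original_filename.split(".")[-1].lower()
    unique_name = uuid.uuid4().hex  # collision-safe replacement for the user's filename
    user_file_path = f"{tmp_dir}/{unique_name}.{extension}"
    wav_file_path = f"{tmp_dir}/{unique_name}.wav"
    return extension, user_file_path, wav_file_path


extension, user_path, wav_path = make_temp_paths("interview.MP3")
# extension is "mp3"; both paths share the same unique basename
```

Using a generated name also avoids path traversal issues from user-supplied filenames.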

Using ffmpeg to adjust the sampling rate

If the user file is a WAV file, we can use ffmpeg as a subprocess to convert it to another 16kHz WAV file.

# Python identifiers cannot start with a digit, so use a valid name
file_path_16khz = f"{new_file_path_without_extension}_16kHz.wav"

if extension == 'wav':
    speechsdk_input_file = file_path_16khz
    subprocess.call(["ffmpeg", "-i", file_path, "-ar", "16000", file_path_16khz, "-y"])

That's great, but what about containerized applications? Does the base image already include ffmpeg?

You will need to install it after specifying the base image inside your Dockerfile.

The following example shows how to do it:

FROM <base image>

# ...

RUN export DEBIAN_FRONTEND=noninteractive \
    && apt-get -qq update \
    && apt-get -qq install --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# ...

WORKDIR /work-dir

# ...

Speech SDK in action

Now, we can use the Speech SDK to recognize the converted audio file.

To install the Speech SDK, run the following command:

pip install azure-cognitiveservices-speech

Here is an example of how you can use the SDK to pass the speechsdk_input_file and get a response from Azure Cognitive Services:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=settings.AZURE_SPEECH_KEY,
                                       region=settings.AZURE_SPEECH_REGION)

# Create the audio config
audio_config = speechsdk.audio.AudioConfig(filename=f"{speechsdk_input_file}")

# Create the speech recognizer
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                               audio_config=audio_config)

# Perform recognition. `recognize_once_async` does not block until recognition is complete,
# so other tasks can be performed while recognition is running.
# However, recognition stops when the first utterance has been recognized.
# For long-running recognition, use continuous recognition instead.
result_future = speech_recognizer.recognize_once_async()

# Wait for the recognition to complete
result = result_future.get()

# Check the result
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    # Delete temporary files
    await delete_file_from_temp_async(Path(file_path))
    await delete_file_from_temp_async(Path(file_path_16khz))
    logger.info("Recognized: {}".format(result.text))
    # return result.text
elif result.reason == speechsdk.ResultReason.NoMatch:
    logger.warning("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    logger.error("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        logger.info("Error details: {}".format(cancellation_details.error_details))
    # return None

The delete_file_from_temp_async function can be written using aiofiles as follows:

async def delete_file_from_temp_async(file_path: Path):
    try:
        if file_path.exists():
            await aiofiles.os.remove(file_path)
    except Exception as e:
        logger.error(f"Error while deleting file from temp: {e}")

Support

If you found this blog post helpful, please share it on your favorite social media platform. Also, don't forget to follow me on GitHub and Twitter.

To send me a message, please use the contact form or direct message me on Twitter.
