Azure TTS – it is easy and it is fast

Microsoft offers convenient text-to-speech (TTS) synthesis via Azure.

It is easy: create the resource in Azure and then use the API (send web requests from your preferred script or application and get a sound file in return).

An example PowerShell script:



# Synthesize one MP3 file per word with Azure TTS
# v3: the output folder ($outputFilePath) is set once up front; the loop builds each file name as folder + word + ".mp3"

$outputFilePath = "C:\temp\some_folder\azure services\temp_test\"

$words = "зебра
жираф
кон
мишка
компютър
лаптоп
таблет
телефон"
# One word per line. Split on whitespace, drop duplicates and skip empty or single-character entries
$wordsArr = $words.Split() | Sort-Object | Get-Unique | Where-Object { $_.Length -gt 1 }

# Request headers (identical for every word, so they are built once)
$headers = @{
    # The subscription key (key 1 in this case)
    'Ocp-Apim-Subscription-Key' = 'the-key'
    # The output audio format
    'X-Microsoft-OutputFormat'  = 'audio-24khz-48kbitrate-mono-mp3'
    # A user agent string (this one was probably copied from Chrome)
    'User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
}

# The region in the request URI must match the location of the TTS resource
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-text-to-speech?tabs=streaming#regions-and-endpoints
$reqURI = "https://germanywestcentral.tts.speech.microsoft.com/cognitiveservices/v1"

Clear-Host
$wordsArr | ForEach-Object {
    $word = $_
    Write-Host $word

    # SSML request body with the current word inserted
    $postBody = '
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="bg-BG">
<voice name="bg-BG-KalinaNeural">
<lang xml:lang="bg-BG">
'+$word+'
</lang>
</voice></speak>'

    # Send the body as UTF-8 bytes so the Cyrillic text arrives intact
    # (Windows PowerShell 5.1 may otherwise encode string bodies differently)
    $bodyBytes = [System.Text.Encoding]::UTF8.GetBytes($postBody)

    $outputFile = $outputFilePath + $word + ".mp3"
    # -OutFile writes the audio straight to disk, so there is no response object to keep
    Invoke-WebRequest -Uri $reqURI -Method Post -Body $bodyBytes -Headers $headers -OutFile $outputFile -ContentType 'application/ssml+xml; charset=utf-8'
}

What does the script do? It splits the newline-separated words (it can also be used for longer texts, but I needed it for single words), removes duplicates, builds the headers and the body, and then sends the requests. Each reply is saved to its own file for reuse.
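To make the split-and-deduplicate step easier to see in isolation, here is a minimal sketch. The sample input is made up for illustration. The second part is an assumption on my side rather than something the script above does: if free text is sent instead of single words, escaping XML special characters first should keep the SSML well-formed.

```powershell
# Hypothetical sample input: one duplicate, one blank line and one single-letter entry
$sample = "кон
зебра

кон
а
мишка"

# Same pipeline as in the script: split on whitespace, sort, drop duplicates,
# then skip entries that are empty or only one character long
$unique = $sample.Split() | Sort-Object | Get-Unique | Where-Object { $_.Length -gt 1 }
Write-Host "$($unique.Count) unique words"    # 3 unique words

# Escape XML special characters before putting free text into the SSML body
$safe = [System.Security.SecurityElement]::Escape('Tom & Jerry <live>')
Write-Host $safe    # Tom &amp; Jerry &lt;live&gt;
```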

The key has been removed from the script, but it can be found in Azure -> the TTS service -> options -> Key.
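One way to keep the key out of the script file altogether is to read it from an environment variable. This is only a sketch: the function name Get-TtsHeaders and the variable name AZURE_TTS_KEY are hypothetical choices of mine, not something Azure prescribes.

```powershell
# Hypothetical helper: builds the request headers from a key that is passed in
# or, by default, read from the AZURE_TTS_KEY environment variable (assumed name)
function Get-TtsHeaders {
    param(
        [string]$Key = $env:AZURE_TTS_KEY
    )

    if ([string]::IsNullOrWhiteSpace($Key)) {
        throw "No key given and the AZURE_TTS_KEY environment variable is not set."
    }

    # Return the headers hashtable used for the TTS request
    @{
        'Ocp-Apim-Subscription-Key' = $Key
        'X-Microsoft-OutputFormat'  = 'audio-24khz-48kbitrate-mono-mp3'
    }
}
```

With the variable set for the current session ($env:AZURE_TTS_KEY = 'the-key'), calling $headers = Get-TtsHeaders gives a hashtable that can be passed to Invoke-WebRequest via -Headers.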

The network restrictions are easy to set. In my case access had to be enabled only from my own IP address (a real static IP). Afterwards the sound files can be served from Apache.

Microsoft also offers Azure AI Speech Studio, where the web frontend can be used to apply some voice modifications and to download the SSML (XML) file in order to inspect or modify the code (voice, volume, intonation, rate, pitch, etc.).
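The same kind of adjustments can be written by hand with the standard SSML prosody element. The sketch below is an illustration, not an export from Speech Studio: the function name New-ProsodySsml and the default values are made up, and the voice is the one used in the script above.

```powershell
# Hypothetical helper: wraps the text in a <prosody> element that sets rate, pitch and volume
function New-ProsodySsml {
    param(
        [string]$Text,
        [string]$Rate   = '-10%',   # slightly slower than the default
        [string]$Pitch  = '+5%',    # slightly higher than the default
        [string]$Volume = 'loud'    # one of the named SSML volume levels
    )

    # Escape XML special characters so the text cannot break the markup
    $escaped = [System.Security.SecurityElement]::Escape($Text)

    # Expandable here-string; the closing marker must start at the beginning of its line
    @"
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="bg-BG">
<voice name="bg-BG-KalinaNeural">
<prosody rate="$Rate" pitch="$Pitch" volume="$Volume">$escaped</prosody>
</voice></speak>
"@
}

$ssml = New-ProsodySsml -Text 'зебра'
Write-Host $ssml
```

The resulting string can be used as the request body in place of $postBody in the script above.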

Pricing (besides the free tier, which is suitable for many home projects) is “Neural (real-time and batch): $15 per 1M characters”. I cannot even work out what that means per sentence or per word. For whole books it might add up, but for small batches it is negligible.
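A rough estimate is still possible by counting characters. The sketch below uses the quoted price of $15 per 1M characters and counts only the words themselves. The function name Get-TtsCost is hypothetical, and the service's billing rules may also count whitespace and some markup, so treat the result as a lower bound.

```powershell
# Hypothetical helper: estimates the cost from the number of characters in the texts
function Get-TtsCost {
    param(
        [string[]]$Texts,
        [double]$PricePerMillion = 15.0   # USD per 1M characters, from the quoted price
    )

    # Total number of characters across all texts
    $chars = ($Texts | ForEach-Object { $_.Length } | Measure-Object -Sum).Sum

    [pscustomobject]@{
        Characters = $chars
        CostUsd    = $chars * $PricePerMillion / 1000000
    }
}

# The eight words from the script above
$sampleWords = 'зебра', 'жираф', 'кон', 'мишка', 'компютър', 'лаптоп', 'таблет', 'телефон'
$estimate = Get-TtsCost -Texts $sampleWords
Write-Host "$($estimate.Characters) characters, about $($estimate.CostUsd) USD"   # 45 characters, 0.000675 USD
```

By the same arithmetic, a text of 500,000 characters (a made-up size for a book) would come to 7.50 USD.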

Microsoft also offers speech to text, speaker identification and verification, custom voices for TTS and a lot more.