Freelance Traveller

The Electronic Fan-Supported Traveller® Resource

Vilani Speech Synthesis with SSML

This article originally appeared in the September/October 2019 issue.

Author’s Note: In this article, “Windows PowerShell” refers to the version of PowerShell distributed with/as part of Windows 7 or later, or which is included with the Windows Management Framework updated for those versions. “PowerShell Core” refers to those versions of PowerShell other than Windows PowerShell. “PowerShell”, not otherwise specified, refers to both Windows PowerShell and PowerShell Core.

This article makes significant use of the IPA characters in Unicode. Your browser should use a font for monospaced text that includes these characters; on Windows systems, both Courier New and Andale Mono will work.

All code from this article can be downloaded from https://www.freelancetraveller.com/infocenter/software/ssml.zip

If you’ve got a Windows computer (Windows 7 or later), you can make it talk pretty easily.

Start up a Windows PowerShell session (it doesn’t matter whether you use the ISE or the console version) and type the code in Listing 1 at the prompt.

# Listing 1: Basic Speech Commands (Windows PowerShell)

Add-Type -AssemblyName System.Speech
$voice = New-Object -TypeName System.Speech.Synthesis.SpeechSynthesizer
$voice.Speak("Good day, ladies and gentlemen")

The voice quality is pretty good, although the intonation is somewhat mechanical. The result actually sounds better than the voice from Stephen Hawking’s voder, though the rhythm and intonation are similar.

Other systems (e.g., Macintosh or Linux) have their own speech synthesis (sometimes called TTS—text-to-speech) systems, which may or may not be accessible from PowerShell Core on those systems. You will need to consult the documentation for your operating system and TTS software.

But even in Windows, it’s really only this simple if the text you use in the $voice.Speak(…) statement is in the language that your Windows system uses as the default user interface language—for me, US English. If you try to use text from a language whose orthographic conventions (that is, the way sounds are written) are significantly different from your system default language, you’ll get something that will sound badly wrong, and in fact you may even end up having part or all of your text spelled out. On my system, for example, trying to get the standard voice (for US English) to speak French has pretty horrible results. Trying to use the English TTS engine with a language that doesn’t even use the Latin alphabet (e.g., Russian, Hebrew, or Chinese) throws an error.

You can, of course, install additional voices for different languages and, in some languages, for different dialects or accents (for example, Windows has English voices for US, Canada, England, Ireland, Australia, and India) or for either gender. If you’re willing to pay for third-party voices, you can even get children’s voices or elderly voices. I’ve installed other Microsoft voices (free, and built into Windows) on my system, so if I wanted my computer to say something in French, I could enter the code in Listing 2.

# Listing 2: Windows PowerShell Speaks French

$voice.SelectVoice("Microsoft Hortense Desktop")
$voice.Speak("Bonjour mesdames et messieurs")

Naturally, you can incorporate these statements into a script, and have complex “canned” dialogues, or you can write a script that reads your input and then speaks it.
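The name passed to SelectVoice has to match an installed voice exactly, so it helps to see what is actually on your system first. As a rough sketch, the helper below lists what a synthesizer object reports through its GetInstalledVoices() method. Get-VoiceSummary is a name made up for this example; it is not part of ssml.ps1.

```powershell
# Sketch only: Get-VoiceSummary is a hypothetical helper, not part of ssml.ps1.
# $Voice is expected to be the SpeechSynthesizer object created in Listing 1.
function Get-VoiceSummary {
    param(
        [Parameter(Mandatory=$true)]
        $Voice
    )

    # Each installed voice carries a VoiceInfo object that describes it.
    ForEach ($installed in $Voice.GetInstalledVoices()) {
        $info = $installed.VoiceInfo
        [pscustomobject]@{
            Name    = $info.Name            # the string SelectVoice() expects
            Culture = "$($info.Culture)"    # for example en-US or fr-FR
            Gender  = "$($info.Gender)"
            Age     = "$($info.Age)"
        }
    }
}

# Usage in a Windows PowerShell session, after running Listing 1:
#   Get-VoiceSummary -Voice $voice | Format-Table Name, Culture, Gender, Age
```

Copying the Name column straight from this output into SelectVoice avoids typing errors in the voice name.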

What happens, though, if you want to use a language that isn’t available (for example, obscure languages like Xhosa, or fictional languages like Klingon), either as a free Microsoft voice or as a third-party voice? Or if you want to insert a single word or short phrase in one language into the middle of a text in another? For both situations, the World Wide Web Consortium (W3C) has defined Speech Synthesis Markup Language (SSML), based on XML and allowing the user to specify exact pronunciation using the International Phonetic Alphabet (IPA).

A full treatment of SSML is beyond the scope of this article; we will only be discussing how to generate an IPA pronunciation and insert it into an SSML framework.

Most TTS systems, not just those for Windows, will support SSML. PowerShell Core is available for Windows, Macintosh, and Linux systems, so the PowerShell code in the rest of this article is applicable to any system, unless otherwise noted.

A minimal SSML string for the Windows text-to-speech (TTS) subsystem is given in Listing 3a; Listing 3b adds the XML declaration and DOCTYPE declaration that TTS systems other than Windows may require.

# Listing 3a: Minimal SSML for Windows TTS (PowerShell $voice.SpeakSSML(…))

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Good Day, Ladies and Gentlemen</speak>

# Listing 3b: SSML with preambles for non-Windows TTS (check your TTS system documentation)

<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Good Day, Ladies and Gentlemen</speak>
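If you build these strings often, a small wrapper saves retyping the boilerplate. The function below is only a sketch: New-SsmlDocument is a hypothetical name (not part of ssml.ps1), and it assumes the text is plain text with no SSML tags of its own, because it escapes XML special characters before wrapping. The namespace URI and DOCTYPE are the ones given in the SSML 1.0 recommendation.

```powershell
# Sketch only: New-SsmlDocument is a hypothetical helper, not part of ssml.ps1.
function New-SsmlDocument {
    param(
        [Parameter(Mandatory=$true)]
        [string]$Text,

        [string]$Lang = 'en-US',

        # Add the XML declaration and DOCTYPE, as in Listing 3b.
        [switch]$IncludePreamble
    )

    # Escape &, <, >, " and ' so that plain text cannot break the markup.
    $escaped = [System.Security.SecurityElement]::Escape($Text)

    $ssml = ''
    if ($IncludePreamble) {
        $ssml += '<?xml version="1.0"?>' + "`n"
        $ssml += '<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">' + "`n"
    }
    $ssml += '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="' + $Lang + '">'
    $ssml += $escaped + '</speak>'
    return $ssml
}

# Prints a one-line SSML document with the ampersand written as &amp;
New-SsmlDocument -Text 'Ladies & gentlemen'
```

Because of the escaping, this wrapper is not suitable for text that already contains tags such as <phoneme>; those strings still have to be assembled by hand.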

To tell Windows PowerShell to use SSML for speech generation, use $voice.SpeakSSML(…) instead of $voice.Speak(…) (see Listing 4).

# Listing 4: Using $voice.SpeakSSML(…) in Windows PowerShell

$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Good day, ladies and gentlemen.</speak>'
$voice.SpeakSSML($ssml)

Doing this doesn’t get you anything beyond what we've already seen with $voice.Speak(…), however; we need to insert another SSML tag to use IPA: the <phoneme> tag.

Suppose we want our default US English voice to say “The French for ‘Hello’ is ‘Bonjour’.” If we simply pass that string to the TTS engine, it will completely mangle the French word. We use the <phoneme> tag to tell the (English) TTS engine how to pronounce the French word (see Listing 5).

# Listing 5: Using the <phoneme> tag to insert one language into another

$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
$ssml += 'The French for "Hello" is '
$ssml += '<phoneme alphabet="ipa" ph="boɴˈʒɯʁ">"Bonjour"</phoneme>.</speak>'

If we then feed this to the TTS engine, we will get what sounds like an American who knows French, but still has an American accent.

In the <phoneme> tag, we provide the ‘alphabet’ attribute to tell the TTS engine what phonetic transcription system we will be using to represent the pronunciation. All SSML processors that support the <phoneme> tag are required to support IPA; other phonetic representation systems may be supported at the TTS engine author’s discretion. The ‘ph’ attribute provides the pronunciation of the word or phrase, as represented in the phonetic transcription system named in the ‘alphabet’ attribute.
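When SSML is assembled from string fragments, a missing quote or an unclosed tag is easy to overlook. One way to catch that before handing the string to the TTS engine is to parse it as XML and read the attributes back. The following is a sketch of that check, not something the TTS engine requires; it rebuilds the string from Listing 5 so that it runs on its own.

```powershell
# Sketch only: parse a hand-built SSML string to confirm that it is
# well-formed XML, and read back the attributes of each <phoneme> element.
$ssml  = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
$ssml += 'The French for "Hello" is '
$ssml += '<phoneme alphabet="ipa" ph="boɴˈʒɯʁ">"Bonjour"</phoneme>.</speak>'

# The [xml] cast throws if a tag is left unclosed or a quote is missing.
$doc = [xml]$ssml

$phonemes = ForEach ($tag in $doc.GetElementsByTagName('phoneme')) {
    [pscustomobject]@{
        Alphabet = $tag.GetAttribute('alphabet')   # transcription system in use
        Ph       = $tag.GetAttribute('ph')         # the pronunciation itself
        Text     = $tag.InnerText                  # the written form being replaced
    }
}
$phonemes | Format-List Alphabet, Ph, Text
```

This only proves that the markup is well-formed; whether a given engine accepts every IPA symbol in the ph attribute is a separate question.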

We now have enough information on SSML to be able to have our computer insert individual Vilani words into phrases in our computer’s primary TTS language. What we don’t have is a way of transcribing Vilani into IPA. I went through extant information on the Vilani language, came up with the IPA equivalents for the “standard” Latin-alphabet orthography for Vilani, and wrote it out into a file that will be used by code in this article. That file, VILANI.IPA, is included in SSML.ZIP. See the boxed text below for how to create a language IPA definition file.

The PowerShell Advanced Function (also called a ‘script cmdlet’) in Listing 6 will take as parameters a language identifier and a string containing a word ostensibly in that language, and will use the rules defined in a file such as described in the sidebar to emit a string that contains the IPA for the correct pronunciation of the input word. Note that the rules file must be named «language».ipa, where «language» is the language with which you are working (Vilani, in our example).

# Listing 6: Convert Text to IPA according to language rules - This function is part of ssml.ps1 in the zip file
		
function ConvertTo-IPA {
    [CmdletBinding()]

    Param(
        [Parameter(Mandatory=$true)]
        [string]$language,

        [Parameter(Mandatory=$true)]
        [string]$word
    )

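    # The rules file is «language».ipa, read as '='-delimited rows
    # with an ortho column and an ipa column.
    $langfile = $language + ".ipa"
    $phonemetable = (Import-CSV -Path $langfile -Delimiter '=')
    # Apply every rule in file order; -replace treats ortho as a
    # case-insensitive regular expression.
    ForEach($phoneme in $phonemetable) {
        $word = $word -replace $phoneme.ortho,$phoneme.ipa
    }
    return $word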
}
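To see what the replacement loop in Listing 6 does, it can help to run it against a tiny rule set held in memory. The rules below are made up for illustration and are not the contents of VILANI.IPA; only the column names ortho and ipa are taken from Listing 6. The loop is repeated inline so that the sketch runs without a rules file.

```powershell
# Sketch only: these rules are invented for the example and are NOT the
# contents of VILANI.IPA. The header row supplies the ortho and ipa column
# names that Listing 6 reads from each rule.
$rulesText = @'
ortho=ipa
sh=ʃ
kh=x
aa=aː
'@

# Turn the text into rule objects, much as Import-CSV does for a rules file.
$rules = $rulesText -split "\r?\n" | ConvertFrom-Csv -Delimiter '='

# The same replacement loop as in Listing 6, applied to a made-up word.
$word = 'khishaa'
ForEach ($rule in $rules) {
    $word = $word -replace $rule.ortho, $rule.ipa
}
$word    # → xiʃaː
```

Because the rules are applied one after another, their order matters: a rule for a digraph such as sh has to come before any rule for s or h alone, and a later rule can also match IPA characters produced by an earlier one.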

Now, we need to insert this IPA string into a <phoneme> tag. The PowerShell Advanced Function/script cmdlet in Listing 7 will take as parameters a language identifier and a string containing a word ostensibly in that language, and will use the function from Listing 6 to generate an IPA string, and then emit the <phoneme> tag that will allow our TTS system to pronounce the word.

# Listing 7: Generate a <phoneme> tag with IPA pronunciation - This function is part of ssml.ps1
		
function New-SSMLPhonemeTag {
    [CmdletBinding()]

    Param(
       [Parameter(Mandatory=$true)]
       [string]$language,

       [Parameter(Mandatory=$true)]
       [string]$word
    )

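    # Open the tag, supply the IPA transcription as the ph attribute,
    # then close the tag around the original spelling.
    $phonemetag = '<phoneme alphabet="ipa" ph="'
    $phonemetag += (ConvertTo-IPA -word $word -language $language)
    $phonemetag += '">' + $word + '</phoneme>'

    return $phonemetag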
}

As this returns the string to be inserted into the SSML, you can call it as part of your effort to build the SSML string (see Listing 8).

# Listing 8: Generating SSML with <phoneme> tags

$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
$ssml += 'The Vilani word that means "a change in lighting that reveals new detail" is "' + (New-SSMLPhonemeTag -word kurishdam -language vilani) + '".'
$ssml += '</speak>'

NOTE: The pronunciation generated by these functions does not take into account any rules for stress or tone that may differ from those of the default TTS engine language. You may want to output the generated SSML (or, later on in this article, the PLS lexicon) to a file and hand-edit it to reflect those additional rules.
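Writing the SSML out for hand-editing and reading it back can be sketched as follows. The file name is a placeholder, and the sample string reuses the French example from earlier in the article rather than real Vilani output. The explicit UTF8 encoding is there because the default Set-Content encoding in Windows PowerShell may not preserve IPA characters.

```powershell
# Sketch only: the file name and the sample SSML are placeholders.
$ssml  = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
$ssml += '<phoneme alphabet="ipa" ph="boɴˈʒɯʁ">Bonjour</phoneme></speak>'

$ssmlFile = Join-Path ([System.IO.Path]::GetTempPath()) 'phrase.ssml'

# Write the string out as UTF-8 so the IPA characters survive.
Set-Content -LiteralPath $ssmlFile -Value $ssml -Encoding UTF8

# ... edit the file by hand here, for example to add stress marks ...

# -Raw returns the whole file as one string instead of an array of lines.
$edited = Get-Content -LiteralPath $ssmlFile -Raw -Encoding UTF8

# In Windows PowerShell, with $voice from Listing 1:
#   $voice.SpeakSSML($edited)
```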

The <phoneme> tag isn’t really the right solution for entire phrases or paragraphs in an unsupported language, however. The ideal solution would be to create or obtain a TTS engine for the language, but we are assuming that that’s not an option. You can instead add vocabulary to an existing TTS engine using a pronunciation lexicon. The W3C has a specification for this, the Pronunciation Lexicon Specification (PLS). This is an XML-based file format that pairs orthography with pronunciation, much like the <phoneme> tag in an SSML document does. When a pronunciation lexicon is active, however, one may pass strings in the lexicon’s language to the TTS engine, either directly or as part of an SSML document (depending on the TTS engine’s limitations), without individual <phoneme> tags, and have it pronounce the words correctly (see Listing 9).

<!-- Listing 9: SSML to load a pronunciation lexicon, then use it -->
		
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <lexicon uri="file:///usr/traveller/vilani/lexicon.pls"/>
    Dishimkhirni lekane baasa ka amaargi in disaninu ka iirbar in sisadikud. Dirgekii ka darkaamku in midu in dinekhinumninu ka khurer khinumash.
</speak>

(The Windows .NET SpeechSynthesizer class also has a method .AddLexicon(…) to load a PLS file. There is a known bug with the “Microsoft Zira Desktop” voice; this voice ignores loaded lexicons.)
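A call to .AddLexicon(…) takes a URI and a media type rather than a plain file path. The wrapper below is a sketch of one way to make that call from a path: Add-PlsLexicon is a hypothetical name (not part of ssml.ps1), and the media type string application/pls+xml is the one registered for PLS documents.

```powershell
# Sketch only: Add-PlsLexicon is a hypothetical helper, not part of ssml.ps1.
# $Voice is expected to be the SpeechSynthesizer object created in Listing 1.
function Add-PlsLexicon {
    param(
        [Parameter(Mandatory=$true)]
        $Voice,

        # Path to an existing .pls file; relative paths are resolved first.
        [Parameter(Mandatory=$true)]
        [string]$Path
    )

    # AddLexicon wants a URI, so turn the full file path into one.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $uri = New-Object -TypeName System.Uri -ArgumentList $fullPath

    $Voice.AddLexicon($uri, 'application/pls+xml')
    return $uri
}

# Usage in a Windows PowerShell session, after running Listing 1:
#   Add-PlsLexicon -Voice $voice -Path .\vilani.pls
#   $voice.Speak('kurishdam')
```

Resolving the path first matters because .NET methods do not follow the PowerShell current location, so a bare relative path could point somewhere unexpected.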

According to the W3C specification for PLS, a minimal PLS header would consist of the XML prolog, followed by the <lexicon> element defining the namespace, alphabet, and language (see listing 10).

<!-- Listing 10: A Minimal PLS Header -->

<?xml version="1.0"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
</lexicon>

Note that some TTS systems require the xml:lang attribute to match the ‘native’ language of the TTS voice (Windows is one such). In those cases, you will need separate copies of the lexicon for each language you wish to apply the lexicon to. As with the <phoneme> tag in SSML, support for IPA is mandated; support for other pronunciation representations is at the TTS engine author’s discretion.
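Making the per-language copies by hand gets tedious, so here is a sketch of one way to automate it: load the lexicon, change the xml:lang attribute on the root element, and save under a new name. Copy-PlsForCulture is a hypothetical helper name, not part of ssml.ps1.

```powershell
# Sketch only: Copy-PlsForCulture is a hypothetical helper, not part of ssml.ps1.
function Copy-PlsForCulture {
    param(
        [Parameter(Mandatory=$true)] [string]$Path,      # existing lexicon
        [Parameter(Mandatory=$true)] [string]$Culture,   # for example en-GB
        [Parameter(Mandatory=$true)] [string]$OutFile    # file to create
    )

    # Resolve both paths through PowerShell, because the .NET Load and Save
    # methods do not follow the PowerShell current location.
    $source = (Resolve-Path -LiteralPath $Path).ProviderPath
    $target = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($OutFile)

    $doc = New-Object -TypeName System.Xml.XmlDocument
    $doc.Load($source)

    # xml:lang sits on the root <lexicon> element.
    $doc.DocumentElement.SetAttribute('xml:lang', $Culture)
    $doc.Save($target)
}

# Usage:
#   Copy-PlsForCulture -Path .\vilani.pls -Culture en-GB -OutFile .\vilani-en-GB.pls
```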

The <lexicon> element encloses multiple <lexeme> elements, each representing a single “word” and its pronunciation. Each <lexeme> element encloses one or more <grapheme> elements, representing the way the word is written, and one or more <phoneme> elements, representing the pronunciation. For the purposes of this article, we will assume that a lexeme encloses exactly one grapheme and one phoneme (see Listing 11).

<!-- Listing 11: A Lexicon with a <lexeme> element -->
		
<?xml version="1.0"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
    <lexeme>
        <grapheme>bonjour</grapheme>
        <phoneme>boɴˈʒɯʁ</phoneme>
    </lexeme>
</lexicon>

Given the lexicon from listing 11, once loaded into an English voice, we could use the word “bonjour” without having to include pronunciation data “on the fly”.
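Before loading a lexicon into a voice, it can be useful to list what it actually contains. The sketch below parses the lexicon from Listing 11 (repeated inline so that it runs on its own) and prints each grapheme with its phoneme. It relies on the article’s assumption of exactly one grapheme and one phoneme per lexeme.

```powershell
# Sketch only: list the grapheme/phoneme pairs in a PLS lexicon.
# Assumes exactly one <grapheme> and one <phoneme> per <lexeme>.
$pls = @'
<?xml version="1.0"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
    <lexeme>
        <grapheme>bonjour</grapheme>
        <phoneme>boɴˈʒɯʁ</phoneme>
    </lexeme>
</lexicon>
'@

# The [xml] cast throws if the lexicon is not well-formed.
$doc = [xml]$pls

$entries = ForEach ($lexeme in $doc.GetElementsByTagName('lexeme')) {
    [pscustomobject]@{
        Grapheme = $lexeme['grapheme'].InnerText   # first <grapheme> child
        Phoneme  = $lexeme['phoneme'].InnerText    # first <phoneme> child
    }
}
$entries | Format-Table Grapheme, Phoneme
```

For a lexicon stored in a file, the same loop works after loading the file’s contents into $pls, for instance with Get-Content and its -Raw switch.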

The PowerShell Advanced Function/script cmdlet in Listing 12 takes a text file and a language IPA definition file, and uses the ConvertTo-IPA function from Listing 6 to generate a PLS lexicon for the language including all the words in the text file. It is assumed that the text file will contain one word per line. The only required parameter is the language name; if the vocabulary text file or output file names are omitted, they will default to the language name followed by .txt and .pls respectively (for example, if the language is vilani, the language data will be read from vilani.ipa, the vocabulary from vilani.txt, and the output lexicon will be vilani.pls).

# Listing 12: A PLS Lexicon Generator - This function is part of ssml.ps1 in the zip file
		
function New-PLSLexicon {
    [CmdletBinding()]

    param(
        [Parameter(Mandatory=$true)]
        [string]$language,

        [string]$wordfile,

        [string]$outfile
    )

    $lexicon = @()
    # Fall back to «language».txt and «language».pls when no file names are given.
    if ($wordfile -eq "") { $wordfile = $language + '.txt' }
    if ($outfile  -eq "") { $outfile  = $language + '.pls' }
    $wordlist = Get-Content $wordfile
    # PLS header: the XML prolog, then the root <lexicon> element.
    $lexicon += '<?xml version="1.0"?>'
    $lexicon += '<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">'
    # One <lexeme> per input word, pairing its spelling with the generated IPA.
    ForEach ($word in $wordlist) {
        $lexicon += '  <lexeme>'
        $lexicon += '    <grapheme>' + $word + '</grapheme>'
        $lexicon += '    <phoneme>' + (ConvertTo-IPA -word $word -language $language) + '</phoneme>'
        $lexicon += '  </lexeme>'
    }
    $lexicon += '</lexicon>'
    # -Encoding Unicode writes the file as UTF-16, which preserves the IPA characters.
    Set-Content -Encoding Unicode -Path $outfile -Value $lexicon
}
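Listing 12 writes one lexeme for every line of the vocabulary file, so blank lines, stray spaces, and repeated words all end up in the lexicon. A small clean-up pass over the word list beforehand can be sketched like this. Get-CleanWordList is a hypothetical helper name, not part of ssml.ps1.

```powershell
# Sketch only: Get-CleanWordList is a hypothetical helper, not part of ssml.ps1.
function Get-CleanWordList {
    param(
        # The raw lines of a vocabulary file, one word per line.
        [string[]]$Lines = @()
    )

    $Lines |
        ForEach-Object { $_.Trim() } |      # drop leading and trailing spaces
        Where-Object { $_ -ne '' } |        # drop blank lines
        Sort-Object -Unique                 # drop repeats (case-insensitive)
}

# Usage: clean a vocabulary file in place before generating the lexicon.
#   $clean = Get-CleanWordList -Lines (Get-Content .\vilani.txt)
#   Set-Content -Path .\vilani.txt -Value $clean -Encoding UTF8
```

Sort-Object compares without regard to case by default, so two spellings that differ only in capitalisation are treated as the same word here.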

References

SSML 1.0: https://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
SSML 1.1: https://www.w3.org/TR/speech-synthesis11/
IPA: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
PLS: https://www.w3.org/TR/pronunciation-lexicon/