Introduction

Thanks to WebAssembly, the Vocalizer Text-to-Speech engine can now run locally inside the user's browser. This avoids the need for a complex, expensive (and often slow) cloud-based solution for your web-based applications.

With our Vocalizer for WebApps SDK you will be able to create true multiplatform applications that can run on Windows, OSX, iPhone and Android with the same code base.

Click here to play with our online demo.

What is WebAssembly

From The WebAssembly Official Site:

WebAssembly (abbreviated WASM) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable target for compilation of high-level languages like C/C++/Rust, enabling deployment on the web for client and server applications.

WebAssembly is the resulting work of members representing the four major browsers. Currently, Chrome, Edge, Firefox and WebKit-based browsers (i.e. Safari) fully support WebAssembly.


Some WebAssembly Highlights:

  • Efficient and fast: WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms.

  • Safe: WebAssembly describes a memory-safe, sandboxed execution environment. When embedded in the web, WebAssembly will enforce the same-origin and permissions security policies of the browser.

  • Part of the open web platform: WebAssembly is designed to maintain the versionless, feature-tested, and backwards-compatible nature of the web. WebAssembly modules will be able to call into and out of the JavaScript context and access browser functionality through the same Web APIs accessible from JavaScript.

In addition to running inside a standard browser, you can also develop desktop and mobile applications using frameworks such as Electron or Ionic.

New opportunities

Nuance and Code Factory have once again partnered to bring the Vocalizer Embedded Text-to-Speech engine to this new technology which opens the door to a wide range of possibilities to our customers.

This SDK provides you with:

  • A simple and intuitive JavaScript API to make your web application speech-enabled.
  • Extensive Documentation and Examples (Web, Electron, Ionic and Node).
  • Access to the best Text-to-Speech engine with more than 70 voices in over 45 languages.
  • Optional integration support from Code Factory.

Browser and Framework Versions Required

  • Chrome: 57
  • Firefox: 52
  • Safari: 11
  • Safari Mobile: 11
  • Microsoft Edge: 16
  • Opera: 44
  • Samsung Internet: 7.0
  • Node.js: 8.0.0
  • Electron: any with Chromium >= 57
  • Ionic: 3

Vocalizer for WebApps SDK Version History

Date         SDK Version   TTS Version   Description
2019/05/06   3.3.5_r1      3.3.5         Initial SDK release

Setting up your server

In addition to your application, your server must also be configured to supply the Text-to-Speech modules.

These modules include:

  • A JavaScript file with the API that you will use to interact with the Text-to-Speech engine (webtts.js)
  • An additional WASM file (webtts.wasm) that contains the Text-to-Speech engine in WebAssembly format.
  • A file containing the metadata information of the data files (voices, languages, etc.) needed by the engine.
  • The data files containing voice and language data.

Directory Structure

Here is an example directory structure of a typical web application that uses the Text-to-Speech engine:

/index.html
/webtts.js
/webtts.wasm
/data/files.metadata
/data/languages/enu/speech/ve/ve_pipeline_enu_zoe_22_embedded-compact_2-2-1.hdr
/data/languages/enu/speech/components/enu_zoe_embedded-compact_2-2-1.dat
/data/languages/common/sysdct.dat
/data/languages/common/clm.dat
/data/languages/common/synth_med_fxd_bet3f22.dat
/data/languages/common/lid.dat

Here is a description of each component:

  • /index.html is the application's main page.
  • /webtts.js contains all the functions that you will call to interact with the Text-to-Speech engine.
  • /webtts.wasm contains additional components needed by the engine.
  • /data/files.metadata contains information (url, size, etc.) of all the files that are needed by the Text-to-Speech engine. As far as the engine is concerned, any file that is not listed in the metadata does not exist.
  • /data/languages/* contains the data files directly extracted from the voice ZIPs supplied together with the SDK.

You can choose any other directory structure that best suits your application needs. For example, you may want to store the engine JavaScript and WASM file in a separate directory, and store the metadata in another folder, or even on another server.

How the engine code is loaded into your Application

Like all the code of your application, the Text-to-Speech engine files also reside on your server. When your application is loaded by the browser, all the dependencies will be resolved, and the script files related to the Text-to-Speech engine will be automatically downloaded. (Note that this does not include the voice data files, which are covered in a later section.)

Typically, your application's main HTML page will load the Text-to-Speech API script like this:

<script type="text/javascript" src="webtts.js"></script>

The rest of the loading process happens transparently in the background. The WASM and other JavaScript files will be downloaded and processed by the browser when needed.

Once the HTML is completely loaded, the Text-to-Speech engine is ready to be used.
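
A complete minimal page might look like this (a sketch only; the button and handler names are illustrative, and the initialization parameters are explained in the following sections):

<html>
  <head>
    <script type="text/javascript" src="webtts.js"></script>
  </head>
  <body>
    <button onclick="startTts()">Start TTS</button>
    <script>
      // Illustrative handler: initialize the engine when the user clicks the button.
      function startTts() {
        ttsInitialize({
          env: 'web',
          metadata: '/data/files.metadata',
          data: 'remote'
        });
      }
    </script>
  </body>
</html>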

How data files are handled

The Text-to-Speech engine has two parts: code and data. As explained in the previous section, the engine code is loaded just like a normal JavaScript file by the browser, but the voice data still remains on the server.

The engine knows which files are available to it from the information contained in the metadata file. The metadata provides a list of files that will be needed by the Text-to-Speech engine during synthesis.

For example, a simple metadata file might look like this:

{
  "files": [{
    "url": "/data/languages/enu/speech/ve/ve_pipeline_enu_zoe.hdr",
    "md5": "80cee58b5999ddfe879eddbb289946af",
    "name": "ve_pipeline_enu_zoe-mlsc_22_embedded-premium_2-4-2.hdr",
    "size": 4856
  }, {
    "url": "/data/languages/enu/speech/components/enu_zoe.dat",
    "md5": "2f0b39db88c23f83363b1867cfebbc21",
    "name": "enu_zoe-mlsc_embedded-premium_2-4-2.dat",
    "size": 665773946
  }]
}

For each file, the metadata provides its URL relative to the application, the file name, the file size and an MD5 hash that uniquely identifies the contents of the file.

In order to tell the engine where to find the metadata, its URL must be passed to the initialization function. Here is an example:

ttsInitialize({
  env: 'web',
  metadata: '/data/files.metadata',
  ... });

Optionally, you can pass the metadata string itself instead of a URL to the file. The documentation for the engine initialization provides more information about this.
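
For instance, you could fetch the metadata yourself and pass its contents directly (a sketch; the fetch call and URL are only illustrative):

// Fetch the metadata manually and pass its contents as a string instead of a URL.
fetch('/data/files.metadata')
  .then((response) => response.text())
  .then((metadataStr) => ttsInitialize({
    env: 'web',
    metadata: metadataStr,
    data: 'remote'
  }));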

Generating a metadata file

The SDK comes with a Python command-line utility to automatically build the metadata from a given directory. The basic usage is as follows:

do_metadata.py --dir ./data --output ./data/files.metadata

The example above recursively scans the ./data directory searching for .dat and .hdr files (the default wildcards). Each file is processed and added to the metadata file named files.metadata.

This utility also has some additional command-line options:

  • --silent: silent execution. No output is shown.
  • --include wildcard: specifies the wildcards. For example --include *embedded-pro*
  • --baseurl urlstr: prepends the given URL string to the URL of each file in the metadata. For example: --baseurl http://dataserver.foo.com/voices

As a general rule, it's best that the data files reside in a directory relative to your application. If this is not possible, the --baseurl option will allow you to specify any URL for the data files. However, beware of Cross Origin Resource Sharing (CORS) restrictions.

See examples/pack_data.sh for a few examples on how to generate metadata files for various scenarios.

How Data Files are loaded

Through the metadata file, the engine is informed of the data files that are available. The engine can be configured in two possible ways, each handling data files differently:

  • local: All the data files reside on the local computer (Node and Electron environments). The URL property of each file in the metadata list points to a path in the local file system.

  • remote: Data files reside in the server and will be downloaded during the engine's initialization process.

The operating mode can be specified during initialization as follows:

ttsInitialize({
  env: 'web',
  metadata: '/data/files.metadata',
  data: 'remote'
});

Caching Data Files

When the data files are downloaded from the server, they will automatically be cached in the browser's Indexed DB. This prevents further launches of your application from re-downloading the data files.

If you want to avoid this default behavior, you can initialize the TTS like this:

ttsInitialize({
  env: 'web',
  metadata: '/data/files.metadata',
  data: 'remote',
  cache: 'none'
});

The user can, at any time, clear the data files if storage space is needed. Also, the Text-to-Speech API provides a function to programmatically remove all data files that may be stored in the browser's Indexed DB.
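
For example, a settings page could report how much space the cache uses and let the user clear it (a sketch that uses ttsGetLocalStorageInfo and ttsDeleteLocalStorage, documented in the JavaScript API section; deletion is only allowed while the engine is uninitialized):

// Show how much space the cached voice data is using.
var info = ttsGetLocalStorageInfo();
if (info != null) {
  console.log(info.assetCount + ' files cached, ' + info.assetsSize + ' bytes');
}

// Remove all cached voice data (only valid in the "uninitialized" state).
ttsDeleteLocalStorage().then((success) => {
  if (success) console.log('Cached voice data removed');
});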

Examples

The Vocalizer for WebApps SDK ships with several examples that demonstrate the functionality of the API in the various supported environments.

Web Example

This example is meant to run in a web browser and is the one that covers the widest range of functions in the TTS API. It's also the one with the simplest UI, because we want this example to show only how to use the API and remove any code that is not strictly necessary.

This example shows the following functionality:

  • Monitor progress of voice data download by using a progress bar.
  • Control audio playback (stop, pause and resume)
  • Cursor tracking by highlighting the word that is currently being spoken.
  • Real-time audio signal display through an oscilloscope.
  • Monitor engine state changes.
  • Real-time lip synchronization through a minimalistic face.

The example is written in HTML and JavaScript and can be found in the folder examples/web.

Demo Example

This example is very similar to the Web Example, but it has a more elaborate UI, showing progress in a circular bar. It also uses the Compact voices (as opposed to the Embedded Pro voices of the Web Example), so it's much faster to load.

A real-life implementation of this demo can be found on the Code Factory page.

The example is written in HTML, JavaScript and CSS and can be found in the folder examples/demo.

Electron Example

This example targets the Electron environment, which allows you to create Windows and OSX desktop applications.

Electron is a very interesting environment because it combines the advantages of a Node.js environment (access to the local file system) with those of the Web environment (the UI runs inside Chromium).

Therefore, the TTS in an Electron application can be initialized in either local or remote modes depending on where you want the voice data to reside.

In this particular example, the Text-to-Speech voice data is bundled in the application's package itself (in a folder called data). Therefore, the metadata file looks something like this:

{
  "files":[{
    "url":"../data/voicedata/languages/enu/speech/ve/ve_pipeline_enu_zoe_22_embedded-pro_2-2-1.hdr",
    "md5":"0af3f837b7d0886494b5ad7713466e00",
    "name":"ve_pipeline_enu_zoe_22_embedded-pro_2-2-1.hdr",
    "size":3896
    ...
  }]
}

Note how the URL property of each file in the metadata list contains a relative file path rather than a URL. To determine the folder from which each voice data file path is resolved, the engine uses the parameter localroot specified during initialization.

For example:

ttsInitialize({
 env: 'electron',
 data: 'local',
 localroot: `${__dirname}`,
 metadata: `${__dirname}/../data/files.local.metadata`});

__dirname is a Node variable that points to the current module's folder inside the local file system.
For example, if __dirname is /Users/Alex/Example/app and the entry in the metadata file is ../data/voicedata/languages/enu/speech/ve/ve_pipeline_enu_zoe_22_embedded-pro_2-2-1.hdr, the file will be loaded from /Users/Alex/Example/app/../data/voicedata/languages/enu/speech/ve/ve_pipeline_enu_zoe_22_embedded-pro_2-2-1.hdr

The example can be found in the examples/electron folder.

Ionic Example

This example shows you how to integrate the Text-to-Speech engine into an Ionic application that will run in both iOS and Android mobile phones.

It's not very different from the Demo Example: the data is downloaded from the server and cached inside the application for later use. The main difference is how the engine is initialized from the application:

ttsInitialize({
 env: 'ionic',
 scriptsRoot: 'assets/js',
 cache: 'browser',
 data: 'remote',
 metadata: "assets/files.server.compact.metadata"});

In this case, we tell the engine where webtts.js and webtts.wasm are located through the scriptsRoot property.

This example can be found in the examples/ionic folder, and most of the code is under examples/ionic/src/app/home.

Node.js Example

This example shows how the TTS can be integrated into a Node application.

Node does not support audio streaming by default, so this example shows how to get the raw PCM audio from the TTS and create a WAV file.

This example can be found in the examples/node folder.

License

The WebAssembly Text-to-Speech SDK for Vocalizer Embedded is copyrighted and proprietary software of Code Factory, SL.

Nuance, Vocalizer and their respective logos are registered trademarks of Nuance Communications, Inc. All rights reserved.

Chrome, Safari, Firefox, Microsoft Edge, Android, iPhone, OSX, iOS, WebAssembly, Ionic, Electron and WebGL are registered trademarks of their respective owners.

Contains code generated by Emscripten. Copyright © 2010-2014 Emscripten authors. Used under the MIT and the University of Illinois/NCSA Open Source Licenses. You can see the full license here.

Uses vc_vector, a fast and simple C vector implementation. Copyright © 2016 Skogorev Anton. Used under the MIT license.

Uses portions of code inspired by the Fetch Progress Indicators. Copyright © 2018 Anthum, Inc. Used under the MIT license. See full license here.

The Demo example uses and bundles the Code Mirror text editor. Copyright © 2017 by Marijn Haverbeke and others. Used under the MIT license. See full license here.

The Demo example uses and bundles the CSS Percentage Circle by Andre Firchow.

The Node.js example uses and bundles the Web Audio API library. Copyright © 2013 Sébastien Piquemal. Used under the MIT license. See full license here.

The Node.js example uses and bundles the node-speaker library. Copyright © their authors. Used under the MIT license.

The Electron and Ionic examples may contain dependency modules automatically generated by their respective build environments. Check the LICENSE.md file in the Electron and Ionic example folders respectively.

Code Factory, SL claims no ownership and provides no warranty of the third-party components and libraries listed above.

The information in this document is subject to change without notice and should not be construed as a commitment by Code Factory SL, who assumes no responsibility for any errors that may appear in this document. In no event shall Code Factory, SL be liable for incidental or consequential damages arising from use of this document or the software described in this document.

The use of the WebAssembly Text-to-Speech SDK for Vocalizer Embedded is not free and must always be integrated into an application after a proper legal contract has been signed with Code Factory. The contract will define the nature and scope of the application that integrates and uses the SDK.

Please visit our Legal Information page for more details about Code Factory and its Code of Conduct.

Talking 3D Head courtesy of Javier Arevalo Baeza (@TheJare). Developed in three.js and used under the MIT license.

Developed in the winter of 2019 by Eduard Sanchez Palazon.

JavaScript API

ttsInitialize

Initializes the Text-to-Speech engine.

ttsInitialize(params: Object, progressFunc: function): Promise
Parameters
params (Object) Initialization parameters.
progressFunc (function) [progressFunc=null] The callback function that will be invoked with progress information.
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.

The initialization parameters are passed through an object with key/value property pairs. Here are the supported initialization parameters:

  • env: Environment. Possible values are: web, node, electron and ionic.
  • metadata: URL of the metadata file that contains information about all the data files available during the Text-to-Speech session. Optionally, you can pass the metadata string itself instead of a URL.
  • data: Where the voice data files reside. Possible values are: local, and remote.
  • cache: If the voice data is downloaded from the server, this parameter tells the engine where the files should be cached. Possible values are: browser (files will be cached in the browser's Indexed DB) and none (data files will not be cached).
  • scriptsRoot: This parameter specifies the path to the webtts.js and webtts.wasm files. Note that the path can also be specified in the <script> tag of the HTML page.
  • localroot: If the voice data files are located in the local file system (data = local) this parameter tells the engine the root folder that contains the data. The 'URL' parameter of each file in the metadata is relative to this folder.

During initialization, if a progress function has been provided, it will be called back and given information about the status of the initialization process. This is mostly used to provide information to the user about the download of the data files, as it's the most time-consuming task during initialization.

Note that the initialization can be canceled by a call to ttsRelease

Example
initProgressFunc=function(msg) {
  if ( msg.task == "download" ) {
    if ( msg.name != null )
      document.getElementById('status').innerHTML='Downloading ' + msg.name;
    else document.getElementById('status').innerHTML='Downloading data...';

    // Update progress bar
    if ( msg.totalFilesProgress > 0 ) {
      document.getElementById('progress').value=msg.totalFilesProgress;
    }
  }
  else if ( msg.task == "save" ) {
    document.getElementById('status').innerHTML='Saving ' + msg.fileName + "...";
  }
  else if ( msg.task == "load" ) {
     document.getElementById('status').innerHTML='Loading ' + msg.fileName + "...";
  }
};

initCompleteFunc=function(success) {
  if ( success ) {
    // Show the engine version information if it's available
    var vi=ttsGetVersionInfo();

    if ( vi != null ) {
      document.getElementById('status').innerHTML="Engine version: "+vi.major+"."+vi.minor+" ("+vi.buildDay+"/"+vi.buildMonth+"/"+vi.buildYear+")<br>"+vi.sdkProvider;
    }

    // Set the current voice
    var voiceList=ttsGetVoiceList();

    if ( voiceList.length > 0 )
      ttsSetCurrentVoice(voiceList[0]);
  }
  else {
    document.getElementById('status').innerHTML='Error during initialization!';
  }
};

...

ttsInitialize({
    env: 'web',
    metadata: '/data/files.metadata',
    data: 'remote'
}, initProgressFunc)
.then((success) => initCompleteFunc(success));

ttsRelease

Uninitializes the Text-to-Speech engine previously initialized by ttsInitialize

ttsRelease(): Promise
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.

ttsSpeak

Speaks the given string. The function can optionally be passed callback functions that receive information about the progress of the synthesis and playback process.

ttsSpeak(str: string, progressFunc: function): Promise
Parameters
str (string) The text to be spoken.
progressFunc (function) [progressFunc=null] The callback function that will be called with progress information.
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.

If synthesis of a previous text is in progress, this function will add the text to the speech queue so that it is processed later. If ttsStop is called, Promises will not be resolved for either the current text or any of the queued texts.
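
For example, two consecutive calls simply queue the second text behind the first (a sketch):

ttsSpeak('This is spoken first.');
ttsSpeak('This is spoken next.').then((success) => {
  // Resolves once the audio for the second text has finished playing.
  if (success) console.log('Both texts have been spoken');
});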

Compatibility Note

In Safari and later versions of Chrome, playing audio must always be in response to a user-initiated action. Therefore, do not call ttsSpeak in response to an onLoad event, or any other event that is not in direct response to a user interaction (for example, a tap or a button click).

As a result, this won't work:

<BODY onload="ttsSpeak('Welcome to my page')">

Instead, do something like this:

<BUTTON onclick="ttsSpeak('Hello!')">Test me!</BUTTON>

Speech Events

The speech callback functions are always synchronized in real time with the audio device, so that events occur at the time the user hears them.

The Promise returned by ttsSpeak will resolve when the audio device is done playing all audio belonging to the given text string. The Promise is fulfilled with a boolean value indicating the success of the operation. Failure to synthesize text is likely due to ttsSpeak being called with a wrong engine state (such as "uninitialized"). See ttsGetState for more information about engine states.

The progressFunc function will be called with the data of the events (such as bookmarks, words, lip information, etc.) that occur during the synthesis process. The function is called with an object as a parameter, whose "type" property contains the type of the event.

Here is a list of events that occur during synthesis:

  • "word": Word event. Speech has reached a word in the input text.

    Properties of this event type:

      cntSrcPos: Character index of the beginning of the word.
      cntSrcTextLen: Length of the word in characters.

  • "bookmark": Speech has reached a bookmark event. Bookmarks are embedded in the input text by control sequences.

    Properties of this event type:

      cntSrcPos: Character index of the beginning of the bookmark.
      cntSrcTextLen: Length of the bookmark in characters.
      name: The name of the bookmark.

  • "lipsync": This even is fired to inform of the mouth position of the phoneme being synthesized.

    Properties of this event type:

      cntSrcPos: Character index of the beginning of the phoneme.
      cntSrcTextLen: Length of the phoneme in characters.
      sJawOpen: Opening angle of the jaw on a 0 to 255 linear scale, where 0 = fully closed, and 255 = completely open.
      sTeethUpVisible: Indicates if upper teeth are visible on a 0 to 255 linear scale, where 0 = upper teeth are completely hidden, 128 = only the teeth are visible, and 255 = upper teeth and gums are completely exposed.
      sTeethLoVisible: Indicates if lower teeth are visible on a 0 to 255 linear scale, where 0 = lower teeth are completely hidden, 128 = only the teeth are visible, and 255 = lower teeth and gums are completely exposed.
      sMouthHeight: Mouth height on a 0 to 255 linear scale, where 0 = minimum height (mouth and lips are closed) and 255 = maximum possible height for the mouth.
      sMouthWidth: Mouth or lips width on a 0 to 255 linear scale, where 0 = minimum width (mouth and lips are puckered) and 255 = maximum possible width for the mouth.
      sMouthUpturn: Indicates how much the mouth is turned up at the corners on a 0 to 255 linear scale, where 0 = mouth corners turning down, 128 = neutral, and 255 = mouth is fully upturned.
      sTonguePos: Indicates the tongue position relative to the upper teeth on a 0 to 255 linear scale, where 0 = tongue is completely relaxed, and 255 = tongue is against the upper teeth.
      sLipTension: Lip tension on a 0 to 255 linear scale, where 0 = lips are completely relaxed, and 255 = lips are very tense.
      LHPhoneme: Matching L&H+ phonetic symbol.

  • "audio": This event is fired to pass the audio samples currently being played.

    Properties of this event type:

      buffer: A Float32Array that contains the audio samples (-1.0 to 1.0 per sample).
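
As a sketch, a progress function that reacts to bookmark events might look like this (it relies on the <ESC>\mrk=name\ control sequence described in the Control Sequences section):

speakBookmarkFunc=function(msg) {
  if ( msg['type'] == 'bookmark' ) {
    // The 'name' property carries the bookmark string embedded in the text.
    console.log('Reached bookmark "' + msg['name'] + '" at character ' + msg['cntSrcPos']);
  }
};

ttsSpeak("First part. \x1B\\mrk=part_two\\ Second part.", speakBookmarkFunc);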

Control Sequences

A control sequence is a piece of text that is not to be read out, but instead offers the possibility to intervene in the automatic pronunciation process. In this way the user can alter the way in which a text will be read, and acquire full control over the pronunciation of the input text. Control sequences can also be used to insert bookmarks in the text.

Control sequences are always preceded by an escape character (hexadecimal 0x1B). You can directly embed control sequences in the input text like this:

ttsSpeak("Hello. \x1B\\pause=1000\\ This is some text with a pause in the middle.");

Setting the language of the text

Use the control sequence <ESC>\lang=lng_code\ to indicate that the input text starting at that location is in the language lng_code. The value lng_code is a 3-letter language code.

Example:

Follow \x1B\\lang=frf\\ \x1B\\toi=lhp\\ 'Ry_d$_la_vjE.jaR.'djER \x1B\\toi=orth\\ \x1B\\lang=enu\\ for 100 meter.

Note that it depends on the multilingual capabilities of the voice whether the voice can take the language of the text into account.

Setting the type of prosodic boundary

Insert <ESC>\nlu=BND:strength\ to set the type of prosodic boundary inserted after the following word.

Possible strength values are:

  • W: Weak phrase boundary (no silence in speech)
  • S: Strong phrase boundary (silence in speech)
  • N: No boundary

Example:

Ich sehe \x1B\\nlu=BND:S\\ Hans morgen im Kino.

Setting the word prominence level

Insert <ESC>\nlu=PRM:level\ to set the prominence level on the following word.

Possible prominence level values are:

  • 0: Reduced
  • 1: Stressed
  • 2: Accented
  • 3: Emphasized

Example:

Ich sehe \x1B\\nlu=PRM:3\\ Hans morgen im Kino.

Inserting a Pause

The control sequence <ESC>\pause=value\ inserts a pause of the specified duration (in milliseconds) at a specific location in the text. The supported range is 1 to 65535 milliseconds.

Example:

His name is \x1B\\pause=300\\ Michael.

Changing the volume

The control sequence <ESC>\vol=level\ sets the volume to the specified level, where level is a value between 0 (no volume) and 100 (the maximum volume). The default volume is 80.

Example:

\x1B\\vol=10\\ I can speak rather quietly, \x1B\\vol=90\\ but also very loudly.

Changing the pitch

The control sequence <ESC>\pitch=value\ scales the inherent pitch of the voice by the given factor. The value is between 50 (half the inherent pitch, i.e. one octave lower) and 200 (twice the inherent pitch, i.e. one octave higher). The default value is 100.

Example:

I can \x1B\\pitch=80\\ speak lower \x1B\\pitch=120\\ or speak higher.

Changing the speaking rate

The control sequence <ESC>\rate=level\ sets the speaking rate to the specified value, where level is between 50 (half the default rate) and 400 (four times the default rate), where 100 is the default speaking rate.

Example:

I can \x1B\\rate=150\\ speed up the rate \x1B\\rate=75\\ or slow it down.

Controlling end-of-sentence detection

The control sequences <ESC>\eos=1\ and <ESC>\eos=0\ control end of sentence detection, with <ESC>\eos=1\ forcing a sentence break and <ESC>\eos=0\ suppressing a sentence break. To suppress a sentence break, the <ESC>\eos=0\ must appear immediately after the symbol that triggers the break (such as after a period). To disable automatic end-of-sentence detection for a block of text, use <ESC>\readmode=explicit_eos\ as described below.

Example:

Tom lives in the U.S. \x1B\\eos=1\\ So does John. 180 Park Ave. \x1B\\eos=0\\ Room 24

Controlling the read mode

The control sequence <ESC>\readmode=mode\ can change the reading mode from sentence mode (the default) to various specialized modes:

Possible reading modes are:

  • sent: Sentence mode (the default)
  • char: Character mode (similar to spelling)
  • word: Word-by-word mode
  • line: Line-by-line mode
  • explicit_eos: Explicit end-of-sentence mode (sentence breaks only where indicated by <ESC>\eos=1\)

Example:

This input will be read sentence by sentence:

\x1B\\readmode=sent\\ Please buy green apples. You can also get pears.

The word "Apples" will be spelled:

\x1B\\readmode=char\\ Apples

This input will be read as a list, with a pause at the end of each line:

\x1B\\readmode=line\\
Bananas
Low-fat milk
Whole wheat flour

This input will be read as one sentence:

\x1B\\readmode=explicit_eos\\
Bananas.
Low-fat milk.
Whole wheat flour.

Resetting control sequences to the default

The control sequence <ESC>\rst\ resets all parameters to the original settings used at the start of synthesis.

Examples:

\x1B\\vol=10\\ The volume is set to a low value. \x1B\\rst\\ Now it is reset to its default value.
\x1B\\rate=75\\ The rate is set to a low value. \x1B\\rst\\ Now it is reset to its default value.

Guiding text normalization

The control sequence <ESC>\tn=type\ is used to guide the text normalization processing step.

Possible text normalization values are:

  • spell: Instruct text normalization to start spelling out the input text that follows.
  • address: Inform text normalization to expand the text that follows as an address.
  • sms: Inform text normalization to expand the text that follows as an SMS message.
  • normal: Reset to the regular text normalization.

The end of a text fragment that should be normalized in a special way is tagged with <ESC>\tn=normal\.

Examples:

\x1B\\tn=address\\ 244 Perryn Rd Ithaca, NY \x1B\\tn=normal\\ That’s spelled \x1B\\tn=spell\\Ithaca \x1B\\tn=normal\\

\x1B\\tn=sms\\ Carlo, can u give me a lift 2 Helena's house 2nite? David\x1B\\tn=normal\\

Changing the voice

The control sequence <ESC>\voice=name\ changes the speaking voice, which also forces a sentence break.

Examples:

\x1B\\voice=samantha\\ Hello, this is Samantha.
\x1B\\voice=tom\\ Hello, this is Tom.

Setting the spelling pause duration

The control sequence <ESC>\spell=value\ sets the inter-character pause to the specified value in milliseconds.

For example:

The part code is \x1B\\tn=spell\\ \x1B\\spell=200\\a134b \x1B\\tn=normal\\

Note: The spelling pause duration does not affect the spelling done by <ESC>\readmode=char\ because that mode treats each character as a separate sentence. To adjust the spelling pause duration for <ESC>\readmode=char\, set the end of sentence pause duration using the <ESC>\wait=value\ control sequence instead.

Inserting phonetic text, Pinyin text for Chinese languages or diacritized text

By default Vocalizer Embedded considers the input as orthographic text, but it also supports other types of input:
  • Phonetic text
  • Pinyin text for Chinese languages. Pinyin is a Romanized form that represents Chinese ideographs using Latin letters and numbers.
  • Diacritized orthographic text for languages like Arabic and Hebrew. In these languages regular written text may leave out the vowels. The diacritized form is the counterpart with all vowels explicitly represented by diacritics.

The control sequence <ESC>\toi=<type>\ marks the type of the input starting after the control sequence:

  • <ESC>\toi=lhp\ Phonetic text in the phonetic alphabet L&H+
  • <ESC>\toi=nts\ Phonetic text in the phonetic alphabet NT-SAMPA
  • <ESC>\toi=pyt\ Pinyin text in Chinese languages
  • <ESC>\toi=diacritized\ Diacritized text
  • <ESC>\toi=orth\ Orthographic text (default)

The control sequences that start phonetic text in L&H+ or NT-SAMPA can be extended as:

<ESC>\toi=<lhp | nts>:"<orth_text>"\<phon_text>

This defines <orth_text> as the orthographic counterpart of the phonetic fragment <phon_text>. Vocalizer Embedded uses such a phonetic + orthographic fragment similarly to a phonetic user dictionary entry.

It may also entirely fall back to the orthographic alternative if it can’t realize the phonetic fragment.

Example:

\x1B\\lang=iti\\ \x1B\\toi=nts:"Romano Prodi"\\ ro|'ma|no prO|di \x1B\\toi=orth\\

Note that Vocalizer Embedded does not support such an orthographic counterpart for Pinyin text or diacritized text. It is possible to provide Vocalizer Embedded with the Pinyin text for an orthographic character in Chinese input. Use the control sequence <ESC>\tagpyt=<pinyin>\ to define <pinyin> as the Pinyin text for the following Chinese character.

Example:

“基金大\x1B\\tagpyt=sha4\\厦”

is read as “ji1.jin1.da4.sha4”.

Entering phonetic Input

Nuance Vocalizer Embedded supports phonetic input, so that words whose spelling deviates from the pronunciation rules of a given language (e.g. foreign words or acronyms unknown to the system) can still be correctly pronounced.

The phonetic input is composed of symbols of a phonetic alphabet. Vocalizer Embedded supports 2 phonetic alphabets, both of which can conveniently be entered from a keyboard:

  • L&H+ is a Nuance specific alphabet. In the Language and voice documentation you will find the L&H+ Phonetic Alphabet of the language concerned.
  • The NT-SAMPA phonetic alphabet is a proprietary standard of NavTeq modeled after SAMPA and X-SAMPA. The NavTeq Voice Reference Guide defines the list of phonetic symbols per language.

Using the control sequence for phonetic text, a possible phonetic input (as a replacement for the English word “zero”) can be:

<ESC>\toi=lhp\ ‘zi.R+o&U \toi=orth\

Setting the end-of-sentence pause duration

The control sequence <ESC>\wait=value\ sets the end of sentence pause duration (wait period) to a value between 0 and 9, where the pause will be 200 msec multiplied by that number.

Examples:

\x1B\\wait=2\\ There will be a short wait period after this sentence.
\x1B\\wait=9\\ This sentence will be followed by a long wait period. Did you notice the difference?

Inserting a bookmark

The control sequence <ESC>\mrk=name\ marks the position where it appears in the input text with the bookmark string name, and has Vocalizer Embedded track this position throughout the Text-to-Speech conversion. During synthesis it delivers a bookmark event that refers to this position in the input text.

Example:

This bookmark \x1B\\mrk=ref_1\\ marks a reference point. Another \x1B\\mrk=ref_2\\ does the same.

Example
// Example of the ttsSpeak function. As text is being spoken, highlight the words in the text area, update the lips position and
// show the audio wave in an oscilloscope window.

speakProgressFunc=function(msg) {
  if ( msg['type'] == 'word' ) {
    var textArea=document.getElementById('input_id');

    // Highlight words as they are spoken.
    textArea.setSelectionRange(msg['cntSrcPos'], msg['cntSrcPos']+msg['cntSrcTextLen']);
  } else if ( msg['type'] == 'lipsync' ) {
    updateLips(msg);
  }
  else if ( msg['type'] == 'audio' ) {
    updateOsci(msg);
  }
};

var str=document.getElementById('input_id').value;
ttsSpeak(str, speakProgressFunc).then((success) => {
  if ( success )
    console.log("Text spoken successfuly");
});

ttsStop

Stops synthesis of the current text, clears the speech queue and resets the audio device.

Promises returned by ttsSpeak will not be fulfilled for either the current or any pending speech requests.

ttsStop(): Promise
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
Example
ttsStop().then((success) => {
   resetUi();
});

ttsGetVoiceList

Returns the list of available voices.

ttsGetVoiceList(): Array
Returns
Array: An array of voices.

For each voice, the following properties are returned:

  • name: The name of the voice. For example 'Zoe'.
  • language: The language of the voice. For example 'US English'.
  • vop: The voice operating point (quality). For example: embedded-pro
  • age: The age of the speaker.
  • type: Voice type of the speaker (male, female, or neutral).

ttsSetCurrentVoice

Sets the voice to be used for speech synthesis.

ttsSetCurrentVoice(voice: Object): Promise
Parameters
voice (Object) The voice to use for synthesis.
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.

The voice must be one returned by the ttsGetVoiceList function.
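
For example (a sketch that assumes a voice named 'Zoe' is present in the metadata):

var voiceList = ttsGetVoiceList();
// Pick a voice by name; fall back to the first voice in the list.
var voice = voiceList.find((v) => v.name === 'Zoe') || voiceList[0];

if ( voice != null ) {
  ttsSetCurrentVoice(voice).then((success) => {
    if (success) console.log('Current voice is now ' + voice.name);
  });
}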

ttsGetCurrentVoice

Gets the voice currently used for synthesis.

ttsGetCurrentVoice(): Object
Returns
Object: An object representing the voice.

ttsSetSpeechParams

Sets speech parameters to be used during speech synthesis.

ttsSetSpeechParams(params: Object): Promise
Parameters
params (Object) The parameters to be set.
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.

The parameters are set via an object of key/value pairs. Possible keys are:

  • speed: Speech rate level, which is a scale factor (in %) on the default speech rate of the current voice. The valid range is [50..400], with 50 having the voice speak 2x slower, and 400 having the voice speak 4x faster. Default value: 100 (%)

  • pitch: Pitch level, a scale factor (in %) on the inherent pitch of the current voice. The range is [50..200]; with value 50 the voice speaks one octave lower (pitch /2), with value 200 the voice speaks one octave higher (pitch x2). Default value: 100 (%)

  • volume: Volume level on a 0 to 100 scale. For each 10 points on the scale the volume changes by 3 dB. Default value: 80

  • waitFactor: Wait period inserted between two text units (e.g. sentences), on a scale from 0 to 9. Each unit is equivalent to 200ms of silence. Default value: 1

Example
ttsSetSpeechParams({
  speed: 120,
  pitch: 100,
});

ttsGetState

ttsGetState(): String
Returns
String: A string indicating the current state of the Text-to-Speech engine.

Possible values are:

  • "uninitialized": This is the default state. The engine is in an uninitialized state and can't synthesize any text. Call ttsInitialize to begin the initialization process.

  • "initializing": The engine is currently being initialized.

  • "initialized": The engine is initialized and ready to synthesize text. When text is done synthesizing, the engine returns to this state.

  • "speaking": The engine is currently synthesizing text and streaming it through the audio device.

  • "paused": The engine is currently paused in response to a call to ttsPause

  • "uninitializing": The engine is performing the uninitialization process.

ttsAddStateChangeListener

Adds a listener function that will be called whenever the state of the engine changes.

ttsAddStateChangeListener(listener: function)
Parameters
listener (function) The listener function.
Example
updateButtonState=function(state) {
  if ( state == "initialized" ) {
    // Enable the button to speak some text
    speakButton.disabled=false;
  }
}

...

ttsAddStateChangeListener(updateButtonState);

ttsRemoveStateChangeListener

Removes a state change listener function previously registered by ttsAddStateChangeListener

ttsRemoveStateChangeListener(listener: function)
Parameters
listener (function) The listener function.
Example
ttsRemoveStateChangeListener(updateButtonState);

ttsPause

Pauses the synthesis process.

ttsPause(): Promise
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.

Operation failure is likely due to the engine not being in a "speaking" state (see ttsGetState)

ttsResume

Resumes the synthesis process previously paused by a call to ttsPause

ttsResume(): Promise
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.

Operation failure is likely due to the engine not being in a "paused" state (see ttsGetState)

ttsIsPaused

Checks whether or not the engine is currently paused.

ttsIsPaused(): Boolean
Returns
Boolean: true or false

ttsIsSpeaking

Checks whether or not the engine is currently speaking.

ttsIsSpeaking(): Boolean
Returns
Boolean: true or false
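
Together with ttsPause and ttsResume, these checks make it easy to implement a single pause/resume toggle (a sketch; pauseButton is an illustrative DOM element):

pauseButton.onclick = function() {
  if ( ttsIsPaused() ) {
    // Currently paused: resume playback and relabel the button.
    ttsResume().then((success) => { if (success) pauseButton.textContent = 'Pause'; });
  } else if ( ttsIsSpeaking() ) {
    // Currently speaking: pause playback and relabel the button.
    ttsPause().then((success) => { if (success) pauseButton.textContent = 'Resume'; });
  }
};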

ttsIsInitialized

Checks whether or not the engine is initialized.

ttsIsInitialized(): Boolean
Returns
Boolean: true or false

ttsGetVersionInfo

Gets the version information of the Text-to-Speech engine.

ttsGetVersionInfo(): Object
Returns
Object: An object with version and copyright information with the following properties:
  • ttsProvider: Name and/or copyright information of the Text-to-Speech provider. For example: Nuance Communications, Inc.

  • sdkProvider: Name and/or copyright information of the WASM SDK provider. For example: Code Factory, S.L.

  • major: Major version of the engine. For example: 3

  • minor: Minor version of the engine. For example: 3

  • maint: Maintenance (build) number of the engine. For example: 4

  • buildInfoStr: String containing additional information of the build. For example: Vocalizer Embedded 3.3.4

  • buildDay: Day of the month when the build was made. For example: 1

  • buildMonth: Month when the build was made. For example: 6

  • buildYear: Year when the build was made. For example: 2019

ttsSetAudioVolume

Sets the audio device volume.

ttsSetAudioVolume(value: Float)
Parameters
value (Float) The volume level (0 to 1.0)
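
For example, the volume can be wired to a range input (a sketch; the element id is illustrative):

// Assumes <input type="range" id="volume" min="0" max="100"> somewhere in the page.
document.getElementById('volume').oninput = function(e) {
  ttsSetAudioVolume(e.target.value / 100);
};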

ttsDeleteLocalStorage

Deletes all files and data that have been stored in the Indexed DB for later use.

ttsDeleteLocalStorage(): Promise
Returns
Promise: A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.

This function effectively removes all cached data. Therefore, in the next Text-to-Speech session all the files will have to be downloaded from the server again.

Note that this function can only be called when the engine is in an "uninitialized" state (See ttsGetState)

ttsGetLocalStorageInfo

Gets information about the amount of space currently being used by the cached voice data in the browser's Indexed DB.

ttsGetLocalStorageInfo(): Object
Returns
Object: An object that contains information about the storage. null if the operation was not successful.

The object returned contains the following properties:

  • assetCount: The number of files currently stored.
  • assetsSize: The total size (in bytes) of the stored data.

ttsGetLastErrorMessage

Gets a string representation of the last error that occurred in the engine. You must call this function whenever an asynchronous Text-to-Speech function call is resolved with an error.

ttsGetLastErrorMessage(): String
Returns
String: The text of the error (in English).
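
For example (a sketch):

ttsSpeak('Hello!').then((success) => {
  if ( !success ) {
    console.error('Synthesis failed: ' + ttsGetLastErrorMessage());
  }
});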