Copyright © 2019 by Code Factory, SL.
All rights reserved.
Legal Info
Thanks to WebAssembly, the Vocalizer Text-to-Speech engine can now run locally inside the user's browser. This avoids the need for a complex, expensive (and often slow) cloud-based solution for your web-based applications.
With our Vocalizer for WebApps SDK you will be able to create true multiplatform applications that can run on Windows, OSX, iPhone and Android with the same code base.
Click here to play with our online demo.
From The WebAssembly Official Site:
WebAssembly is the resulting work of members representing the four major browsers. Currently, Chrome, Edge, Firefox and WebKit-based browsers (i.e. Safari) fully support WebAssembly.
Some WebAssembly Highlights:
In addition to running inside a standard browser, you can also develop desktop and mobile applications using frameworks such as Electron or Ionic.
Nuance and Code Factory have once again partnered to bring the Vocalizer Embedded Text-to-Speech engine to this new technology, which opens the door to a wide range of possibilities for our customers.
This SDK will provide you with:
Date | SDK Version | TTS Version | Description |
---|---|---|---|
2019/05/06 | 3.3.5_r1 | 3.3.5 | Initial SDK release |
In addition to your application, your server must also be configured to supply the Text-to-Speech modules.
These modules include:
Here is an example directory structure of a typical web application that uses the Text-to-Speech engine:
/index.html
/webtts.js
/webtts.wasm
/data/files.metadata
/data/languages/enu/speech/ve/ve_pipeline_enu_zoe_22_embedded-compact_2-2-1.hdr
/data/languages/enu/speech/components/enu_zoe_embedded-compact_2-2-1.dat
/data/languages/common/sysdct.dat
/data/languages/common/clm.dat
/data/languages/common/synth_med_fxd_bet3f22.dat
/data/languages/common/lid.dat
Here is a description of each component:
You can choose any other directory structure that best suits your application's needs. For example, you may want to store the engine JavaScript and WASM files in a separate directory, and store the metadata in another folder, or even on another server.
Like all the code of your application, the Text-to-Speech engine files also reside on your server. When your application is loaded by the browser, all the dependencies will be resolved, and the script files related to the Text-to-Speech engine will be automatically downloaded. (Note that this does not include the voice data files, which are covered in a later section.)
Typically, your application's main HTML page will load the Text-to-Speech API script like this:
<script type="text/javascript" src="webtts.js"></script>
The rest of the loading process happens transparently in the background. The WASM and other JavaScript files will be downloaded and processed by the browser when needed.
Once the HTML is completely loaded, the Text-to-Speech engine is ready to be used.
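For instance, a minimal sketch that initializes the engine once the page has loaded (the initialization parameters used here are the ones described later in this document):

// Initialize the engine as soon as the page (and webtts.js) has loaded.
window.addEventListener('load', () => {
    ttsInitialize({
        env: 'web',
        metadata: '/data/files.metadata',
        data: 'remote'
    }).then((success) => {
        console.log('TTS ready: ' + success);
    });
});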
The Text-to-Speech engine has two parts: code and data. As explained in the previous section, the engine code is loaded just like a normal JavaScript file by the browser, but the voice data still remains in the server.
The engine knows which files are available to it from the information contained in the metadata file. The metadata provides a list of files that will be needed by the Text-to-Speech engine during synthesis.
For example, a simple metadata file might look like this:
{
"files": [{
"url": "/data/languages/enu/speech/ve/ve_pipeline_enu_zoe.hdr",
"md5": "80cee58b5999ddfe879eddbb289946af",
"name": "ve_pipeline_enu_zoe-mlsc_22_embedded-premium_2-4-2.hdr",
"size": 4856
}, {
"url": "/data/languages/enu/speech/components/enu_zoe.dat",
"md5": "2f0b39db88c23f83363b1867cfebbc21",
"name": "enu_zoe-mlsc_embedded-premium_2-4-2.dat",
"size": 665773946
}]
}
For each file, the metadata provides its URL relative to the application, the file name, the file size, and a hash that uniquely identifies the contents of the file.
In order to tell the engine where to find the metadata, its URL must be passed to the initialization function. Here is an example:
ttsInitialize({
env: 'web',
metadata: '/data/files.metadata',
...
});
Optionally, you can pass the metadata string itself instead of a URL to the file. The documentation for the engine initialization provides more information about this.
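For instance, a minimal sketch of that variant, reusing the metadata fields from the example above (the metadata is built inline here purely for illustration):

// Pass the metadata contents directly instead of a URL to the file.
const metadataString = JSON.stringify({
    files: [{
        url: '/data/languages/enu/speech/ve/ve_pipeline_enu_zoe.hdr',
        md5: '80cee58b5999ddfe879eddbb289946af',
        name: 've_pipeline_enu_zoe-mlsc_22_embedded-premium_2-4-2.hdr',
        size: 4856
    }]
});

ttsInitialize({
    env: 'web',
    metadata: metadataString,
    data: 'remote'
});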
The SDK comes with a Python command-line utility to automatically build the metadata from a given directory. The basic usage is as follows:
do_metadata.py --dir ./data --output ./data/files.metadata
The example above will recursively scan the ./data directory in search of .dat and .hdr files (the default wildcards). Each file will be processed and added to the metadata file named files.metadata.
This utility also has some additional command-line options:
As a general rule, it's best that the data files reside in a directory relative to your application. If this is not possible, the --baseurl option will allow you to specify any URL for the data files. However, beware of Cross Origin Resource Sharing (CORS) restrictions.
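For instance, a hypothetical invocation that serves the data files from a separate host (the host name below is a placeholder, and the exact flag syntax may differ):

do_metadata.py --dir ./data --baseurl https://cdn.example.com/data --output ./data/files.metadata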
See examples/pack_data.sh for a few examples on how to generate metadata files for various scenarios.
Through the metadata file, the engine is informed of the data files that are available. The engine can be configured in two possible ways, each handling data files differently:
The operating mode can be specified during initialization as follows:
ttsInitialize({
env: 'web',
metadata: '/data/files.metadata',
data: 'remote'
});
When the data files are downloaded from the server, they will automatically be cached in the browser's Indexed DB. This prevents further launches of your application from re-downloading the data files.
If you want to avoid this default behavior, you can initialize the TTS like this:
ttsInitialize({
env: 'web',
metadata: '/data/files.metadata',
data: 'remote',
cache: 'none'
});
The user can, at any time, clear the data files if storage space is needed. Also, the Text-to-Speech API provides a function to programmatically remove all data files that may be stored in the browser's Indexed DB.
The Vocalizer for WebApps SDK ships with several examples that demonstrate the functionality of the API in the various supported environments.
This example is meant to run in a web browser and is the one that tries to cover the widest range of functions in the TTS API. It's also the one with the simplest UI, because we want this example to show only how to use the API, without any code that is not strictly necessary.
This example shows the following functionality:
The example is written in HTML and JavaScript and can be found in the folder examples/web.
This example is very similar to the Web Example, but it has a more elaborate UI, showing progress in a circular bar. It also uses the Compact voices (as opposed to the Embedded Pro voices of the Web Example), so it's much faster to load.
A real-life implementation of this demo can be found on the Code Factory page.
The example is written in HTML, JavaScript and CSS and can be found in the folder examples/demo.
This example targets the Electron environment, which allows you to create Windows and OSX desktop applications.
Electron is a very interesting environment because it has the advantages of a Node.js environment (that is, it allows access to the local file system) and also of the Web environment (because the UI runs inside Chromium).
Therefore, the TTS in an Electron application can be initialized in either local or remote modes depending on where you want the voice data to reside.
In this particular example, the Text-to-Speech voice data is bundled in the application's package itself (in a folder called data). Therefore, the metadata file looks something like this:
{
"files":[{
"url":"../data/voicedata/languages/enu/speech/ve/ve_pipeline_enu_zoe_22_embedded-pro_2-2-1.hdr",
"md5":"0af3f837b7d0886494b5ad7713466e00",
"name":"ve_pipeline_enu_zoe_22_embedded-pro_2-2-1.hdr",
"size":3896
...
}]
}
Note how the URL property of each file in the metadata list contains a relative file path rather than an absolute URL. To determine the folder against which each voice data file path is resolved, the engine uses the localroot parameter specified during initialization.
For example:
ttsInitialize({
env: 'electron',
data: 'local',
localroot: `${__dirname}`,
metadata: `${__dirname}/../data/files.local.metadata`});
__dirname is a Node.js variable that points to the current module's folder in the local file system.
For example, if __dirname is /Users/Alex/Example/app and the entry in the metadata file is ../data/voicedata/languages/enu/speech/ve/ve_pipeline_enu_zoe_22_embedded-pro_2-2-1.hdr,
the file will be loaded from /Users/Alex/Example/app/../data/voicedata/languages/enu/speech/ve/ve_pipeline_enu_zoe_22_embedded-pro_2-2-1.hdr
The example can be found in the examples/electron folder.
This example shows you how to integrate the Text-to-Speech engine into an Ionic application that will run on both iOS and Android mobile phones.
It's not very different from the Demo Example: the data is downloaded from the server and cached inside the application for later use. The main difference is how the engine is initialized from the application:
ttsInitialize({
env: 'ionic',
scriptsRoot: 'assets/js',
cache: 'browser',
data: 'remote',
metadata: "assets/files.server.compact.metadata"});
In this case, we tell the engine where webtts.js and webtts.wasm are located through the scriptsRoot property.
This example can be found in the examples/ionic folder, and most of the code can be found under examples/ionic/src/app/home.
This example shows how the TTS can be integrated into a Node application.
Node does not by default support audio streaming, so this example shows how to get the raw PCM audio from the TTS and create a wav file.
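The gist of that conversion looks roughly like the sketch below. Note that this is not the shipped example: the property names on the audio event (msg.samples as 16-bit mono PCM, msg.sampleRate) are assumptions here; check the actual example code for the real ones.

// A minimal sketch: collect raw PCM from the TTS audio events and write a WAV file.
// NOTE: msg.samples and msg.sampleRate are assumed property names.
const fs = require('fs');

const chunks = [];
let sampleRate = 22050; // assumed default; updated from the audio events

function audioProgressFunc(msg) {
    if (msg.type === 'audio') {
        sampleRate = msg.sampleRate || sampleRate;
        chunks.push(Buffer.from(msg.samples.buffer)); // raw 16-bit mono PCM
    }
}

function writeWav(path, pcm, rate) {
    // Standard 44-byte RIFF/WAVE header for 16-bit mono PCM.
    const header = Buffer.alloc(44);
    header.write('RIFF', 0);
    header.writeUInt32LE(36 + pcm.length, 4);
    header.write('WAVE', 8);
    header.write('fmt ', 12);
    header.writeUInt32LE(16, 16);       // fmt chunk size
    header.writeUInt16LE(1, 20);        // audio format: PCM
    header.writeUInt16LE(1, 22);        // channels: mono
    header.writeUInt32LE(rate, 24);     // sample rate
    header.writeUInt32LE(rate * 2, 28); // byte rate (mono, 16-bit)
    header.writeUInt16LE(2, 32);        // block align
    header.writeUInt16LE(16, 34);       // bits per sample
    header.write('data', 36);
    header.writeUInt32LE(pcm.length, 40);
    fs.writeFileSync(path, Buffer.concat([header, pcm]));
}

ttsSpeak('Hello from Node', audioProgressFunc).then(() => {
    writeWav('output.wav', Buffer.concat(chunks), sampleRate);
});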
This example can be found in the examples/node folder.
The WebAssembly Text-to-Speech SDK for Vocalizer Embedded is copyrighted and proprietary software of Code Factory, SL.
Nuance, Vocalizer and their respective logos are registered trademarks of Nuance Communications, Inc. All rights reserved.
Chrome, Safari, Firefox, Microsoft Edge, Android, iPhone, OSX, iOS, WebAssembly, Ionic, Electron and WebGL are registered trademarks of their respective owners.
Contains code generated by Emscripten. Copyright © 2010-2014 Emscripten authors. Used under the MIT and the University of Illinois/NCSA Open Source Licenses. You can see the full license here.
Uses vc_vector, a fast and simple C vector implementation. Copyright © 2016 Skogorev Anton. Used under the MIT license.
Uses portions of code inspired by the Fetch Progress Indicators. Copyright © 2018 Anthum, Inc. Used under the MIT license. See full license here.
The Demo example uses and bundles the Code Mirror text editor. Copyright © 2017 by Marijn Haverbeke and others. Used under the MIT license. See full license here.
The Demo example uses and bundles the CSS Percentage Circle by Andre Firchow.
The Node.js example uses and bundles the Web Audio API library. Copyright © 2013 Sébastien Piquemal. Used under the MIT license. See full license here.
The Node.js example uses and bundles the node-speaker library. Copyright © their authors. Used under the MIT license.
The Electron and Ionic examples may contain dependency modules automatically generated by their respective build environments. Check the LICENSE.md file in the Electron and Ionic example folders respectively.
Code Factory, SL claims no ownership and provides no warranty of the third-party components and libraries listed above.
The information in this document is subject to change without notice and should not be construed as a commitment by Code Factory SL, who assumes no responsibility for any errors that may appear in this document. In no event shall Code Factory, SL be liable for incidental or consequential damages arising from use of this document or the software described in this document.
The use of the WebAssembly Text-to-Speech SDK for Vocalizer Embedded is not free and must always be integrated into an application after a proper legal contract has been signed with Code Factory. The contract will define the nature and scope of the application that integrates and uses the SDK.
Please visit our Legal Information page for more details about Code Factory and its Code of Conduct.
Talking 3D Head courtesy of Javier Arevalo Baeza (@TheJare). Developed in three.js and used under the MIT license.
Developed in the winter of 2019 by Eduard Sanchez Palazon.
Initializes the Text-to-Speech engine.
(Object)
Initialization parameters.
(function)
[progressFunc=null]
The callback function that will be invoked with progress information.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
The initialization parameters are passed through an object with key/value property pairs. Here are the supported initialization parameters:
During initialization, if a progress function has been provided, it will be called back and given information about the status of the initialization process. This is mostly used to provide information to the user about the download of the data files, as it's the most time-consuming task during initialization.
Note that the initialization can be canceled by a call to ttsRelease.
initProgressFunc=function(msg) {
if ( msg.task == "download" ) {
if ( msg.name != null )
document.getElementById('status').innerHTML='Downloading ' + msg.name;
else document.getElementById('status').innerHTML='Downloading data...';
// Update progress bar
if ( msg.totalFilesProgress > 0 ) {
document.getElementById('progress').value=msg.totalFilesProgress;
}
}
else if ( msg.task == "save" ) {
document.getElementById('status').innerHTML='Saving ' + msg.fileName + "...";
}
else if ( msg.task == "load" ) {
document.getElementById('status').innerHTML='Loading ' + msg.fileName + "...";
}
};
initCompleteFunc=function(success) {
if ( success ) {
// Show the engine version information if it's available
var vi=ttsGetVersionInfo();
if ( vi != null ) {
document.getElementById('status').innerHTML="Engine version: "+vi.major+"."+vi.minor+" ("+vi.buildDay+"/"+vi.buildMonth+"/"+vi.buildYear+")<br>"+vi.sdkProvider;
}
// Set the current voice
var voiceList=ttsGetVoiceList();
if ( voiceList.length > 0 )
ttsSetCurrentVoice(voiceList[0]);
}
else {
document.getElementById('status').innerHTML='Error during initialization!';
}
};
...
ttsInitialize({
env: 'web',
metadata: '/data/files.metadata',
data: 'remote'
}, initProgressFunc)
.then((success) => initCompleteFunc(success));
Uninitializes the Text-to-Speech engine previously initialized by ttsInitialize.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
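Example:
ttsRelease().then((success) => {
    if ( success )
        console.log("Engine released");
});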
Speaks the given string. The function can optionally be passed callback functions that receive information about the progress of the synthesis and playback process.
(string)
The text to be spoken.
(function)
[progressFunc=null]
The callback function that will be called with progress information.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
If synthesis of a previous text is in progress, this function will add the text to the speech queue so that it is processed later. If ttsStop is called, Promises will not be resolved for either the current text or any of the queued texts.
As a result (browsers typically refuse to start audio playback without a prior user gesture), this won't work:
<BODY onload="ttsSpeak('Welcome to my page')">
Instead, do something like this:
<BUTTON onclick="ttsSpeak('Hello!')">Test me!</BUTTON>
The speech callback functions are always synchronized in real time with the audio device, so that events occur at the moment the user hears them.
The Promise returned by ttsSpeak will resolve when the audio device is done playing all audio belonging to the given text string. The Promise is fulfilled with a boolean value indicating the success of the operation. Failure to synthesize text is likely due to ttsSpeak being called with a wrong engine state (such as "uninitialized"). See ttsGetState for more information about engine states.
The progressFunc function will be called with data of events (such as bookmarks, words, lip information, etc.) that occur during the synthesis process. The function is called with an object as a parameter, with a property named "type" that contains the type of the event.
Here is a list of events that occur during synthesis:
Properties of this event type:
Properties of this event type:
Properties of this event type:
Properties of this event type:
A control sequence is a piece of text that is not to be read out, but instead offers the possibility to intervene in the automatic pronunciation process. In this way the user can alter the way in which a text will be read, and acquire full control over the pronunciation of the input text. Control sequences can also be used to insert bookmarks in the text.
Control sequences are always preceded by an escape character (hexadecimal 0x1B). You can directly embed control sequences in the input text like this:
ttsSpeak("Hello. \x1B\\pause=1000\\ This is some text with a pause in the middle.");
Example:
Follow \x1B\\lang=frf\\ \x1B\\toi=lhp\\ 'Ry_d$_la_vjE.jaR.'djER \x1B\\toi=orth\\ \x1B\\lang=enu\\ for 100 meter.
Note that it depends on the multilingual capabilities of the voice whether the voice can take the language of the text into account.
Possible strength values are:
Example:
Ich sehe \x1B\\nlu=BND:S\\ Hans morgen im Kino.
Possible prominence level values are:
Example:
Ich sehe \x1B\\nlu=PRM:3\\ Hans morgen im Kino.
Example:
His name is \x1B\\pause=300\\ Michael.
Example:
\x1B\\vol=10\\ I can speak rather quietly, \x1B\\vol=90\\ but also very loudly.
Example:
I can \x1B\\pitch=80\\ speak lower \x1B\\rate=120\\ or speak higher.
Example:
I can \x1B\\rate=150\\ speed up the rate \x1B\\rate=75\\ or slow it down.
Example:
Tom lives in the U.S. \x1B\\eos=1\\ So does John. 180 Park Ave. \x1B\\eos=0\\ Room 24
Possible reading modes are:
Example:
This input will be read sentence by sentence:
\x1B\\readmode=sent\\ Please buy green apples. You can also get pears.
The word "Apples" will be spelled:
\x1B\\readmode=char\\ Apples
This input will be read as a list, with a pause at the end of each line:
\x1B\\readmode=line\\
Bananas
Low-fat milk
Whole wheat flour
This input will be read as one sentence:
\x1B\\readmode=explicit_eos\\
Bananas.
Low-fat milk.
Whole wheat flour.
Examples:
\x1B\\vol=10\\ The volume is set to a low value. \x1B\\rst\\ Now it is reset to its default value.
\x1B\\rate=75\\ The rate is set to a low value. \x1B\\rst\\ Now it is reset to its default value.
Possible text normalization values are:
The end of a text fragment that should be normalized in a special way is tagged with <ESC>\tn=normal.
Examples:
\x1B\\tn=address\\ 244 Perryn Rd Ithaca, NY \x1B\\tn=normal\\ That’s spelled \x1B\\tn=spell\\Ithaca \x1B\\tn=normal\\
\x1B\\tn=sms\\ Carlo, can u give me a lift 2 Helena's house 2nite? David\x1B\\tn=normal\\
Examples:
\x1B\\voice=samantha\\ Hello, this is Samantha.
\x1B\\voice=tom\\ Hello, this is Tom.
For example:
The part code is \x1B\\tn=spell\\ \x1B\\spell=200\\a134b \x1B\\tn=normal\\
Note: The spelling pause duration does not affect the spelling done by <ESC>\readmode=char\ because that mode treats each character as a separate sentence. To adjust the spelling pause duration for <ESC>\readmode=char\, set the end of sentence pause duration using <ESC>\wait\ instead.
The control sequence <ESC>\toi=<type>\ marks the type of the input starting after the control sequence:
The control sequences that start phonetic text in L&H+ or NT-SAMPA can be extended as:
<ESC>\toi=<lhp | nts>:"<orth_text>"\<phon_text>
This defines <orth_text> as the orthographic counterpart of the phonetic fragment <phon_text>. Vocalizer Embedded uses such a phonetic + orthographic fragment similarly to a phonetic user dictionary entry.
It may also entirely fall back to the orthographic alternative if it can’t realize the phonetic fragment.
Example:
\x1B\\lang=iti\\ \x1B\\toi=nts:"Romano Prodi"\\ ro|'ma|no prO|di \x1B\\toi=orth\\
Note that Vocalizer Embedded does not support such an orthographic counterpart for Pinyin text or diacritized text. It is possible to provide Vocalizer Embedded with the Pinyin text for an orthographic character in Chinese input. Use the control sequence <ESC>\tagpyt=<pinyin>\ to define <pinyin> as the Pinyin text for the following Chinese character.
Example:
“基金大\x1B\\tagpyt=sha4\\厦”
is read as “ji1.jin1.da4.sha4”.
Nuance Vocalizer Embedded supports phonetic input, so that words whose spelling deviates from the pronunciation rules of a given language (e.g. foreign words or acronyms unknown to the system) can still be correctly pronounced.
The phonetic input is composed of symbols of a phonetic alphabet. Vocalizer Embedded supports two phonetic alphabets, both of which can conveniently be entered from a keyboard:
Using the control sequence for phonetic text, a possible phonetic input (as a replacement for the English word "zero") can be:
<ESC>\toi=lhp\ 'zi.R+o&U <ESC>\toi=orth\
Examples:
\x1B\\wait=2\\ There will be a short wait period after this sentence.
\x1B\\wait=9\\ This sentence will be followed by a long wait period. Did you notice the difference?
Example:
This bookmark \x1B\\mrk=ref_1\\ marks a reference point. Another \x1B\\mrk=ref_2\\ does the same.
// Example of the ttsSpeak function. As text is being spoken, highlight the words in the text area, update the lips position and
// show the audio wave in an oscilloscope window.
speakProgressFunc=function(msg) {
if ( msg['type'] == 'word' ) {
var textArea=document.getElementById('input_id');
// Highlight words as they are spoken.
textArea.setSelectionRange(msg['cntSrcPos'], msg['cntSrcPos']+msg['cntSrcTextLen']);
} else if ( msg['type'] == 'lipsync' ) {
updateLips(msg);
}
else if ( msg['type'] == 'audio' ) {
updateOsci(msg);
}
};
var str=document.getElementById('input_id').value;
ttsSpeak(str, speakProgressFunc).then((success) => {
if ( success )
console.log("Text spoken successfuly");
});
Stops synthesis of the current text, clears the speech queue and resets the audio device.
Promises returned by ttsSpeak will not be fulfilled for either the current or any pending speech requests.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
ttsStop().then((success) => {
resetUi();
});
Returns the list of available voices.
Array
:
An array of voices.
For each voice, the following properties are returned:
Sets the voice to be used for speech synthesis.
(Object)
The voice to use for synthesis.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
The voice must be one returned by the ttsGetVoiceList function.
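For example, selecting the first available voice (the same pattern used in the ttsInitialize example above):

// Pick the first voice from the list returned by the engine.
var voiceList=ttsGetVoiceList();
if ( voiceList.length > 0 ) {
    ttsSetCurrentVoice(voiceList[0]).then((success) => {
        if ( !success )
            console.log("Could not set the voice");
    });
}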
Gets the voice currently used for synthesis.
Object
:
An object representing the voice.
Sets speech parameters to be used during speech synthesis.
(Object)
The parameters to be set.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
The parameters are set via an object of key/value pairs. Possible key values are:
ttsSetSpeechParams({
speed: 120,
pitch: 100,
});
String
:
A string indicating the current state of the Text-to-Speech engine.
Possible values are:
Adds a listener function that will be called whenever the state of the engine changes.
(function)
The listener function.
updateButtonState=function(state) {
if ( state == "initialized" ) {
// Enable the button to speak some text
speakButton.disabled=false;
}
}
...
ttsAddStateChangeListener(updateButtonState);
Removes a state change listener function previously registered by ttsAddStateChangeListener.
(function)
The listener function.
ttsRemoveStateChangeListener(updateButtonState);
Pauses the synthesis process.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
Operation failure is likely due to the engine not being in a "speaking" state (see ttsGetState).
Resumes the synthesis process previously paused by a call to ttsPause.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
Operation failure is likely due to the engine not being in a "paused" state (see ttsGetState).
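A sketch of a pause/resume toggle follows. Caution: ttsPause is named above, but the name of the resume function is not shown in this extract; ttsResume below is an assumption.

// NOTE: ttsResume is an assumed name; check the API reference for the real one.
function togglePause() {
    if ( ttsGetState() == "paused" ) {
        ttsResume();
    } else if ( ttsGetState() == "speaking" ) {
        ttsPause();
    }
}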
Checks whether or not the engine is currently paused.
Boolean
:
true or false
Checks whether or not the engine is currently speaking.
Boolean
:
true or false
Checks whether or not the engine is initialized.
Boolean
:
true or false
Gets the version information of the Text-to-Speech engine.
Object
:
An object with version and copyright information with the following properties:
Sets the audio device volume.
(Float)
The volume level (0 to 1.0)
Deletes all files and data that have been stored in the Indexed DB for later use.
Promise
:
A Promise that will be fulfilled with a boolean value indicating whether or not the operation was successful.
This function effectively removes all cached data. Therefore, in the next Text-to-Speech session all the files will have to be downloaded from the server again.
Note that this function can only be called when the engine is in an "uninitialized" state (see ttsGetState).
Gets information about the amount of space currently being used by the cached voice data in the browser's Indexed DB.
Object
:
An object that contains information about the storage. null if the operation was not successful.
The object returned contains the following properties:
Gets a string representation of the last error that occurred in the engine. You must call this function whenever an asynchronous Text-to-Speech function call is resolved with an error.
String
:
The text of the error (in English).