

Documents

Supporting other TTS Engines

This document describes how to add support for additional TTS (Text-to-Speech) engines, which Xface uses to create audio when processing SMIL-Agent scripts. Xface currently supports Italian Flite and MS SAPI 5.1 for speech processing inside SMIL-Agent scripts. To support another TTS engine, you have to add some classes to the Xface codebase and make some minor additions to existing code. Unfortunately, there is no easier way, since SMIL-Agent script processing relies heavily on the TTS engine.

Note that all the SMIL-Agent and TTS related code resides in the XSmilAgent project.

1. Editing scriptProcs.xml
A copy of this file is found in each of the wxFaceEd, wxFacePlayer and wxFaceClient folders in the source distribution. They should all have the same content (in the final setup only one of them is used, and it lives in the bin folder of the installation). Find the scripter named SMIL-Agent (currently the last one) and note that the available TTS engines are listed there. For the TTSEngine tag, you can specify:

lang: Language this TTS engine will output. Use only one TTS engine for a specific language (one for English, one for Italian, etc.).

engine: Name of the engine. Give every TTS engine a unique name; the code uses it to hook up the TTS engine (explained in the following sections).

path: If the engine is a standalone executable (such as flite/festival), the system must be told the path/directory it is in. Unless the default path is used, this parameter has to be updated on every computer where Xface is installed. Note that SAPI does not need a path.

2. VisitorSpeech for SMIL-Agent processing
SMIL-Agent processing needs to know temporal information about the speech content that will be synthesized. During SMIL-Agent parsing/processing, for every speech tag, the content (text) is processed and sent to the TTS engine for synthesizing. However, in order to incorporate certain features such as silence or markers, one might need to do some preprocessing on the text before forwarding it to the TTS engine. Every TTS engine has its own input format and control mechanism, therefore you might need to add some code to Xface in order to have the correct behaviour.

The VisitorSpeech class is used for processing the text properly before sending it to the TTS engine. In some cases, such as flite, we can use the VisitorSpeech class directly. If your TTS engine cannot handle silence inside the text and does not support markers (see below for what silence and markers mean), you can simply use the VisitorSpeech class as is.

For a good example of a derived VisitorSpeech implementation, see the MS SAPI 5.1 compatible VisitorSAPISpeech class.

All three overloads for operator() should be implemented as can be seen from VisitorSAPISpeech, but the crucial one (that changes for every TTS engine) is the one that takes SMILSpeech as argument.
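The idea of a visitor with several operator() overloads, only one of which does real work, can be sketched as follows. This is a minimal illustration and not the Xface code: the node types SpeechNode, ActionNode and GroupNode, the TextCollector class and the collectSpeech function are all hypothetical names, and the sketch uses std::variant and std::visit (C++17), which may differ from the dispatch mechanism Xface actually uses.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

namespace sketch {

// Hypothetical script node types; these are NOT the Xface classes.
struct SpeechNode { std::string text; };
struct ActionNode { std::string name; };
struct GroupNode  { std::string label; };

using Node = std::variant<SpeechNode, ActionNode, GroupNode>;

// Visitor with one operator() overload per node type.
// Only the speech overload does real work, the others are no-ops.
class TextCollector {
public:
    void operator()(const SpeechNode& s) {
        if (!text_.empty()) text_ += ' ';
        text_ += s.text;  // engine-specific preprocessing would go here
    }
    void operator()(const ActionNode&) {}
    void operator()(const GroupNode&) {}

    const std::string& text() const { return text_; }

private:
    std::string text_;
};

// Walk the nodes in order and return the text collected from speech nodes.
inline std::string collectSpeech(const std::vector<Node>& nodes) {
    TextCollector visitor;
    for (const Node& n : nodes) std::visit(visitor, n);
    return visitor.text();
}

}  // namespace sketch
```

A derived visitor for a new engine would mostly change what happens inside the speech overload, which matches the remark above that this overload is the one that differs per TTS engine.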

Notes - silence: The text in your SMIL-Agent script can contain silent intervals between speech portions. You can explicitly set the start and end time of every text portion in the script, and in some cases you might want pauses in the speech while the animation continues. Note that the speech audio must end up in a single output wav file, so silence has to be injected between these text portions. See VisitorSAPISpeech for a sample implementation.

Notes - markers: A detailed explanation of markers is available in the SMIL-Agent documentation. See VisitorSAPISpeech for a sample implementation.
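As a rough sketch of the silence idea for an engine that accepts SAPI 5 style XML input, the gaps between timed text portions can be turned into silence tags before the text is sent to the engine. The TimedText struct and the buildSapiText function are hypothetical names, not Xface code, and the sketch assumes that each portion really finishes at its scripted end time, which a real engine does not guarantee. Only the silence tag with its msec attribute is taken from the SAPI 5 XML format.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical timed text portion; NOT an Xface type.
struct TimedText {
    int beginMs;       // scripted start time in milliseconds
    int endMs;         // scripted end time in milliseconds
    std::string text;  // plain text, assumed to need no XML escaping
};

// Builds one input string for the engine, so that a single audio
// file is produced. Gaps between portions become silence tags.
// Assumes the portions are sorted by beginMs.
inline std::string buildSapiText(const std::vector<TimedText>& portions) {
    std::ostringstream out;
    int cursorMs = 0;  // time covered by the output so far
    for (const TimedText& p : portions) {
        if (p.beginMs > cursorMs) {
            out << "<silence msec=\"" << (p.beginMs - cursorMs) << "\"/>";
        }
        out << p.text;
        cursorMs = std::max(cursorMs, p.endMs);
    }
    return out.str();
}

}  // namespace sketch
```

For two portions "Hello" (0 to 1000 ms) and "world" (1500 to 2500 ms), the function inserts one 500 ms silence tag between the two words. An engine without such a tag would need a different approach, for example padding the synthesized audio itself.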



3. ITTSEngine interface and TTSEngineMaker
For the automatic recognition of TTS engines, Xface uses the "Pluggable Factory" design pattern. See the links at the bottom of this page for more details about the pattern.

First of all, every TTS engine class has to be derived from the ITTSEngine abstract base class and implement its pure virtual methods. Your compiler will complain if you forget to implement any of these. There are two methods to implement (for the moment), "createSmilVisitor" and "speak":

createSmilVisitor():
This method creates and returns an object of a VisitorSpeech derived class; see step 2 above for what these are used for. The created object is stored as a member variable of the parent class.

speak(): This is the core of the whole story. Write your TTS engine specific code here!

As sample implementations, you can use the ItalianFlite and SAPI51 classes.

For every ITTSEngine derived class, you have to provide a TTSEngineMaker derived one too. This is also explained in the pluggable factory pattern links, and you can check the SAPI51Maker and ItalianFliteMaker classes as sample implementations.
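The engine/maker pairing can be illustrated with a minimal pluggable factory. This is a sketch of the general pattern under simplifying assumptions, not the actual Xface interfaces: Engine, EngineMaker, EchoEngine and EchoEngineMaker are hypothetical stand-ins, and the real ITTSEngine and TTSEngineMaker signatures are not reproduced here.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

namespace sketch {

// Hypothetical stand-in for an engine interface; NOT the Xface ITTSEngine.
class Engine {
public:
    virtual ~Engine() = default;
    virtual std::string speak(const std::string& text) = 0;
};

// Base maker: every derived maker registers itself under a unique name.
class EngineMaker {
public:
    virtual ~EngineMaker() = default;

    // Looks up the maker by name; returns nullptr for an unknown name.
    static std::unique_ptr<Engine> create(const std::string& name) {
        auto it = registry().find(name);
        if (it == registry().end()) return nullptr;
        return it->second->make();
    }

protected:
    explicit EngineMaker(const std::string& name) { registry()[name] = this; }
    virtual std::unique_ptr<Engine> make() const = 0;

private:
    // Function-local static, so the map exists before any maker registers.
    static std::map<std::string, const EngineMaker*>& registry() {
        static std::map<std::string, const EngineMaker*> r;
        return r;
    }
};

// A toy engine that only echoes its input.
class EchoEngine : public Engine {
public:
    std::string speak(const std::string& text) override { return "echo:" + text; }
};

// Its maker: the static instance registers the engine during startup.
class EchoEngineMaker : public EngineMaker {
public:
    EchoEngineMaker() : EngineMaker("echo") {}

protected:
    std::unique_ptr<Engine> make() const override {
        return std::make_unique<EchoEngine>();
    }

private:
    static const EchoEngineMaker instance_;
};
const EchoEngineMaker EchoEngineMaker::instance_;

}  // namespace sketch
```

With this in place, client code asks the factory for an engine by name, for example sketch::EngineMaker::create("echo"), and never mentions the concrete class. The name plays the same role as the engine attribute in scriptProcs.xml.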

A workaround for a problem I couldn't solve properly: in order to get everything going, your derived classes must be referenced from somewhere in the code. It's weird, and maybe it's my compiler, but I couldn't find any other way. So in the SMILProcessor constructor, I create two dummy instances just to let the library see the classes and use the statics inside them. One possible explanation, not verified here, is that linkers commonly drop object files of a static library that nothing references, so self-registering statics in them never run. Any suggestion for a proper solution will be appreciated.

4. Where do we handle SMIL-Agent processing?
It is done in the SMILManager class. Normally, you shouldn't need to be concerned with that class. If you follow the steps above, your TTS engine registers itself with the library (ITTSEngine and TTSEngineMaker derived classes), you provide the visitor that processes the speech tags in the script (VisitorSpeech derived class), you edit the scriptProcs.xml file, and the SMILManager class manages the rest.

I hope I haven't forgotten anything. Send me an e-mail or post a comment if you are lost or if I missed something.


Pluggable Factory Pattern Links:
Gamedev.net: Why Pluggable Factories Rock My Multiplayer World
Industriallogic.com article Part 1
Industriallogic.com article Part 2

Koray Balci (August 2006)