This document describes how to add support for additional
TTS (Text-to-Speech) engines, which Xface uses to create
audio when processing SMIL-Agent scripts. Currently, Xface
supports Italian Flite and MS SAPI 5.1 for speech
processing inside SMIL-Agent scripts. To support another
TTS engine, you have to add some classes to the Xface
codebase and make some minor additions to existing code.
Unfortunately, there is no easier way to support other TTS
engines, since SMIL-Agent script processing relies heavily
on the TTS engine. Note that all SMIL-Agent and TTS
related code resides in the XSmilAgent project.
1. Editing scriptProcs.xml
A copy of this file is found in each of the wxFaceEd,
wxFacePlayer and wxFaceClient folders in the source
distribution. They should all have the same content (in
the final setup only one of them is used, and it lives in
the bin folder of the installation). Find the scripter
named SMIL-Agent (currently the last one) and you will see
the available TTS engines listed there. For each TTSEngine
tag, you can specify:
lang: The language this TTS engine will output. Use only
one TTS engine per language (one for English, one for
Italian, etc.).
engine: The name of the engine. Give every TTS engine a
unique name; it is used in the code (explained in the
following sections) to hook up the TTS engine.
path: If the engine is a standalone executable (such as
flite/festival), the system needs to know the directory it
is in. Update this parameter on every computer where Xface
is installed (unless the default path is used). Note that
SAPI does not need a path.
2. VisitorSpeech for SMIL-Agent processing
SMIL-Agent processing needs to know temporal information
about the speech content that will be synthesized. During
SMIL-Agent parsing/processing, the content (text) of every
speech tag is processed and sent to the TTS engine for
synthesis. However, in order to incorporate certain
features such as silence or markers, you might need to do
some preprocessing on the text before forwarding it to the
TTS engine. Every TTS engine has its own input format and
control mechanism, so you might need to add some code to
Xface in order to get the correct behaviour.
The VisitorSpeech class is used for processing the text
properly before sending it to the TTS engine. In some
cases, such as flite, we can use the VisitorSpeech class
directly. If your TTS engine does not support silence
inside the text or markers (see below for what silence and
markers mean), you can simply use the VisitorSpeech class
as it is.
For a good example of a derived VisitorSpeech
implementation, see the MS SAPI 5.1 compatible
VisitorSAPISpeech class.
All three overloads of operator() should be implemented,
as can be seen in VisitorSAPISpeech, but the crucial one
(the one that changes for every TTS engine) is the
overload that takes SMILSpeech as its argument.
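
As a rough sketch of what a derived visitor can look like (this is not the actual VisitorSAPISpeech code), the example below overrides only the SMILSpeech overload and leaves out the other two. The fields of `SMILSpeech`, the `output()` accessor and the class name `VisitorSAPISpeechSketch` are assumptions made for illustration. The `<silence msec="..."/>` element is part of the SAPI 5 XML TTS markup.

```cpp
#include <string>

// Hypothetical, stripped-down stand-in for the SMIL-Agent speech node.
struct SMILSpeech {
    std::string text;
    unsigned leadingSilenceMs; // pause requested before this text, 0 if none
};

// Default behaviour: pass the text through untouched, which is enough for
// engines without silence or marker support.
class VisitorSpeech {
public:
    virtual ~VisitorSpeech() = default;
    virtual void operator()(const SMILSpeech& speech) { m_output += speech.text; }
    const std::string& output() const { return m_output; }
protected:
    std::string m_output; // text that will be handed to the TTS engine
};

// Engine-specific visitor (illustrative): emits SAPI 5 XML markup for pauses.
class VisitorSAPISpeechSketch : public VisitorSpeech {
public:
    void operator()(const SMILSpeech& speech) override {
        if (speech.leadingSilenceMs > 0)
            m_output += "<silence msec=\"" +
                        std::to_string(speech.leadingSilenceMs) + "\"/>";
        m_output += speech.text;
    }
};
```

Visiting a speech node with a 500 ms pause through the derived visitor prepends the silence element to its text, while the base visitor simply appends the text.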
Notes - silence:
The text in your SMIL-Agent script can have silence
between portions of speech. You can explicitly set the
start and end time of every text portion in your script,
and in some cases you might want pauses in the speech
while the animation continues. Since the speech audio must
end up in a single output wav file, an interval of silence
has to be injected between these text portions. See
VisitorSAPISpeech for a sample implementation.
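
For an engine that has no markup for pauses, the same effect can be had by padding the audio itself. The following is a minimal sketch under stated assumptions, not code from Xface: the `Segment` struct, the `joinWithSilence` function and the mono 16-bit PCM layout are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: one synthesized text portion and the time (in ms) at which
// the script says it should start.
struct Segment {
    unsigned startMs;                  // start time requested in the script
    std::vector<std::int16_t> samples; // mono 16-bit PCM from the TTS engine
};

// Concatenates the segments into one buffer, padding with zero samples
// (silence) so that each segment begins at its requested start time.
// Segments are assumed to be sorted by start time. If a segment runs past
// the start of the next one, the next one simply follows it without a gap.
std::vector<std::int16_t> joinWithSilence(const std::vector<Segment>& segments,
                                          unsigned sampleRate)
{
    assert(sampleRate > 0);
    std::vector<std::int16_t> out;
    for (const Segment& seg : segments) {
        const std::size_t startSample =
            static_cast<std::size_t>(seg.startMs) * sampleRate / 1000;
        if (startSample > out.size())
            out.resize(startSample, 0); // inject silence up to the start time
        out.insert(out.end(), seg.samples.begin(), seg.samples.end());
    }
    return out;
}
```

The resulting buffer can then be written out as the single wav file that SMIL-Agent processing expects.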
Notes - markers:
A detailed explanation of markers is available in the
SMIL-Agent documentation. See VisitorSAPISpeech for a
sample implementation.
3. ITTSEngine interface and TTSEngineMaker
Xface recognizes TTS engines automatically using the
"Pluggable Factory" design pattern. See the links at the
bottom of this post for more details about the pattern.
First of all, every TTS engine class has to be derived
from the ITTSEngine abstract base class and implement its
pure virtual methods. Your compiler will complain if you
forget to implement any of these. There are two methods to
implement (for the moment), "createSmilVisitor" and
"speak":
createSmilVisitor(): This method creates and returns an
object of a VisitorSpeech derived class; see step 2 above
for what these are used for. The created object is stored
as a member variable of the parent class.
speak(): This is the core of the whole story. Write your
TTS engine specific code here!
For sample implementations, see the ItalianFlite and
SAPI51 classes.
For every ITTSEngine derived class, you have to provide a
TTSEngineMaker derived class too. This is also explained
in the pluggable factory pattern links, and you can check
the SAPI51Maker and ItalianFliteMaker classes as sample
implementations.
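
The engine/maker pairing can be sketched as follows. This is a simplified illustration of the pluggable factory idea, not the real Xface classes: the `speak` signature, the `newEngine` and `make` methods, the registry, and the `EchoEngine` example engine are all assumptions.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical, simplified stand-in for the real ITTSEngine interface.
class ITTSEngine {
public:
    virtual ~ITTSEngine() = default;
    virtual std::string speak(const std::string& text) = 0;
};

// Base maker: each derived maker registers itself under an engine name.
class TTSEngineMaker {
public:
    // Looks up the maker registered for engineName and asks it for an engine.
    // Returns a null pointer if no maker is registered under that name.
    static std::unique_ptr<ITTSEngine> newEngine(const std::string& engineName) {
        auto it = registry().find(engineName);
        if (it == registry().end())
            return nullptr;
        return it->second->make();
    }
protected:
    explicit TTSEngineMaker(const std::string& engineName) {
        assert(!engineName.empty());
        registry()[engineName] = this;
    }
    virtual ~TTSEngineMaker() = default;
    virtual std::unique_ptr<ITTSEngine> make() const = 0;
private:
    // A function-local static avoids static initialization order problems.
    static std::map<std::string, TTSEngineMaker*>& registry() {
        static std::map<std::string, TTSEngineMaker*> r;
        return r;
    }
};

// A made-up engine used only for this example.
class EchoEngine : public ITTSEngine {
public:
    std::string speak(const std::string& text) override { return "echo:" + text; }
};

class EchoEngineMaker : public TTSEngineMaker {
public:
    EchoEngineMaker() : TTSEngineMaker("echo") {}
private:
    std::unique_ptr<ITTSEngine> make() const override {
        return std::unique_ptr<ITTSEngine>(new EchoEngine);
    }
    static const EchoEngineMaker s_instance; // constructing it registers the maker
};
const EchoEngineMaker EchoEngineMaker::s_instance;
```

With this layout, the code that reads the engine name from the configuration only needs to call `TTSEngineMaker::newEngine(name)`; it never has to mention a concrete engine class.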
A trick for a problem I couldn't find a proper solution
to: somehow, in order to get everything going, you have to
reference your derived classes somewhere in the code. It's
weird, maybe it's my compiler, but I couldn't find any
other way. So in the SMILProcessor constructor, I create
two dummy instances just to let the library see the
classes and use the statics inside. Any suggestion for a
proper solution will be appreciated.
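
One possible explanation, assuming XSmilAgent is built as a static library: linkers typically pull in only those object files of a static library that resolve a referenced symbol, so a maker whose static instance is never referenced may never be linked, and its self-registration never runs. Creating dummy instances works because it references those classes. An alternative is an explicit registration function, sketched below. The registry, the creator functions and the engine name strings are hypothetical, and a plain int stands in for the engine object to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical registry: engine name -> function that creates the engine.
using CreateFn = int (*)();

std::map<std::string, CreateFn>& engineRegistry() {
    static std::map<std::string, CreateFn> registry;
    return registry;
}

int createItalianFlite() { return 1; } // placeholder for creating ItalianFlite
int createSAPI51()       { return 2; } // placeholder for creating SAPI51

// Called once from code that is always linked in (for example a
// constructor). Because this function names every creator, the linker has
// to keep them. Returns the number of registered engines.
std::size_t registerBuiltinEngines() {
    std::map<std::string, CreateFn>& r = engineRegistry();
    r["italian_flite"] = &createItalianFlite; // name is illustrative
    r["sapi51"] = &createSAPI51;              // name is illustrative
    assert(!r.empty());
    return r.size();
}
```

The trade-off is that every new engine needs one extra line in `registerBuiltinEngines`, which gives up some of the "plug in without touching existing code" benefit of the pattern.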
4. Where do we handle SMIL-Agent processing?
It is done in the SMILManager class. Normally, you
shouldn't need to be concerned with that class. If you
follow the steps above, your TTS engine registers itself
with the library (ITTSEngine and TTSEngineMaker derived
classes), you provide the visitor that processes the
speech tags in the script (VisitorSpeech derived class),
you edit the scriptProcs.xml file, and the SMILManager
class manages the rest.
I hope I haven't forgotten anything. Send me an e-mail or
post a comment if you are lost or I missed something.
Pluggable Factory Pattern Links:
Gamedev.net: Why Pluggable Factories Rock My Multiplayer World
Industriallogic.com article Part 1
Industriallogic.com article Part 2
Koray Balci (August 2006)
