TTS Project Journal


→ 1: Macroscopic Machinery

It's a tradition in the Clock Crew to have the voices in your cartoon done with a text to speech program. Setting aside the cartoons that use voice acting or have no voices at all, I and others have grown dissatisfied with hearing the same few text to speech systems in all of these cartoons, chiefly Speakonia and Natural Voices.

This is easy to overcome for those who only want speech in their movies, but not for those of us who want the actual aesthetic character of a text to speech voice. For short movies, you can record your own voice and cut and paste the sounds fairly easily, creating a fine facsimile of a sample concatenation-based text to speech program.
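The cut-and-paste approach described above is, in effect, sample concatenation done by hand. A minimal Python sketch of the same idea follows. The phone labels, the tiny placeholder sample bank, and the function name `concatenate_phones` are all hypothetical; real recordings would replace the made-up numbers.

```python
from typing import Dict, List

# Hypothetical sample bank: each phone label maps to a short list of
# audio sample values. Real recordings would replace these placeholders.
SAMPLE_BANK: Dict[str, List[float]] = {
    "k": [0.1, -0.1],
    "l": [0.2, 0.3, 0.2],
    "o": [0.5, 0.4, 0.3, 0.2],
}


def concatenate_phones(phones: List[str],
                       bank: Dict[str, List[float]],
                       pause_len: int = 2) -> List[float]:
    """Join recorded phones end to end; a space becomes a run of silence."""
    audio: List[float] = []
    for phone in phones:
        if phone == " ":
            audio.extend([0.0] * pause_len)
        else:
            audio.extend(bank[phone])  # KeyError if the phone was never recorded
    return audio


word = concatenate_phones(["k", "l", "o", "k"], SAMPLE_BANK)
print(len(word))  # 2 + 3 + 4 + 2 = 11 samples
```

Doing this by hand means repeating the body of that loop once per sound, with a mouse.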

Imagine, though, the time it takes to do this. Assume you can find any sound in a list of audio samples of all the phonemes (the distinctive sounds) of the English language, copy the sample, and paste it onto the end of the audio in a generous three quarters of a second. That's 3 seconds of work for a typical 4-sound word. It's simply not practical to render any significant amount of speech in this manner.
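The estimate above is simple multiplication, sketched here in Python so other word lengths can be tried. The constant and function names are made up for illustration; the only figure taken from the text is three quarters of a second per sound.

```python
SECONDS_PER_SOUND = 0.75  # the "generous" per-sound figure from the text


def paste_time_seconds(n_sounds: int,
                       seconds_per_sound: float = SECONDS_PER_SOUND) -> float:
    """Time to find, copy, and paste n_sounds samples by hand."""
    return n_sounds * seconds_per_sound


def sounds_per_minute(seconds_per_sound: float = SECONDS_PER_SOUND) -> float:
    """Best-case throughput at the same per-sound cost."""
    return 60.0 / seconds_per_sound


print(paste_time_seconds(4))  # 3.0 seconds for a 4-sound word
print(sounds_per_minute())    # 80.0 sounds per minute, with no pauses or mistakes
```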

Some time back, I decided to create an educational spam series on the Norse language, the product being Old Norse Grammar Lesson 1. One of the ideas was to have the lessons delivered in a heavy Norse accent. Besides the aforementioned aesthetic motivation, no TTS that I know of supports the Old Norse language. Thus, I had to make the voice myself.

Now, I use a type of music program called a Tracker, which is similar to a sequencer. It has "instruments," each of which takes a sound sample and pitch shifts it to different notes, while your computer keyboard is mapped to a virtual representation of a piano keyboard. I mapped each phone (sound) in the Norse language to an instrument* in my Tracker. I selected each sound as an instrument and hit a key to record that sound to a list. Since I never had to take my mouse off the menu, it was decidedly more efficient than using a normal editing program, such as Audacity. Nonetheless, by the time I had entered the 1280 phones and spaces that make up the 2-3 minute cartoon, I had spent about 6 hours selecting and typing.

So the community needs personalized text to speech voices, and I need a series. I for sure wasn't going to do another Norse Grammar Lesson through the same means. I decided after publishing the cartoon to make a TTS of my own, particularly one for which the creation of new voices could be accomplished by the end user. It took me a while to get from pondering the idea to actually programming, but I did indeed start actually programming. Thus concludes the overview of the topic at hand, mine TTS program.

~LCK, 4/23/2010

*Though I could have mapped the samples to notes, giving me the ability to type the sounds without switching instruments, I would need to pitch shift each sample opposite the amount it would be pitch shifted by the program. This is equivalent to having each sound at a different sample rate, with great detriment to intelligibility. Imagine making an image in a computer program which has full resolution (in dots per inch) as long as you're using pure red, but while using green you draw at 30 dpi, and blue at 2 dpi. To gain any speed of input, extra work needs to be put in which ultimately degrades the quality of the end product.
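The sample-rate comparison in the footnote can be put into numbers. The Python sketch below assumes 12-tone equal temperament, where a shift of n semitones changes playback speed by a factor of 2^(n/12), and a 44100 Hz base rate chosen only as an example; how any particular Tracker actually resamples is not covered here, and the function names are hypothetical.

```python
SEMITONES_PER_OCTAVE = 12  # assumption: 12-tone equal temperament


def playback_ratio(semitones: float) -> float:
    """Speed factor applied to a sample played n semitones from its base note."""
    return 2.0 ** (semitones / SEMITONES_PER_OCTAVE)


def effective_sample_rate(base_rate_hz: float, semitones: float) -> float:
    """Rate at which the original recording is effectively stored once it has
    been pre-shifted to cancel the key's pitch shift."""
    return base_rate_hz * playback_ratio(semitones)


BASE_RATE_HZ = 44100.0  # example value, not from the original text

# A phone mapped one octave below the base note plays at half speed, so it
# must be stored at double speed, keeping only half of its samples.
print(effective_sample_rate(BASE_RATE_HZ, -12))  # 22050.0
```

Under these assumptions, phones on keys below the base note lose detail, while phones on keys above it gain only interpolated samples, so every phone ends up at a different effective resolution.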


All content © Casady Roy Kemper (a.k.a. Loki Clock) and protected by the Digital Millennium Copyright Act and the Berne Convention, unless otherwise stated or unless alternative authorship is indicated without explicit accompanying copyright claims on the part of Loki Clock.