TTS Project Journal

Macroscopic Machinery

← 0: Overview
→ 2: The Living LVS

A sample-concatenation TTS works by stringing pre-recorded phones of a language together into speech. To accomplish this, you need to be able to:

  1. Create a database of audio files.
  2. Create a database, or map, of strings of text that are considered equivalent to these audio files. An entry in such a map might link "ee" to the sound in "keep."
  3. Break up, or "tokenize," strings of text into strings from the database. "Keep" might be separated into "k," "ee," and "p."

In detail, the map not only controls how the program separates strings into substrings but also gives each string in its database a number. That number is the position of the audio file the string represents in a list of audio files, one for each phone of the language.
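The map and the tokenizing step can be sketched in a few lines of Python. This is only an illustration under assumptions: the entries, the indices, the name PHONE_MAP, and the longest-match rule are all invented here and are not taken from the program itself.

```python
# Hypothetical phone map: each string maps to the index of its audio file
# in the phone list. Entries and indices are invented for illustration.
PHONE_MAP = {"k": 0, "ee": 1, "p": 2, "e": 3}

def tokenize(text, phone_map):
    """Split text into map entries, always taking the longest match."""
    text = text.lower()
    longest = max(len(key) for key in phone_map)
    indices = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so "ee" wins over "e".
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in phone_map:
                indices.append(phone_map[piece])
                pos += size
                break
        else:
            raise ValueError(f"no map entry matches text at position {pos}")
    return indices

print(tokenize("Keep", PHONE_MAP))  # [0, 1, 2]
```

The returned numbers are the positions in the phone list, so playing back the word is just a matter of fetching those audio files in order.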

The tokenization is pretty much useless unless you have something to tokenize to, so you can start either with the map or the phone database. For instant gratification's sake, I started with the phone bank.

Since the actual purpose of the audio portion of the program, stringing together sound samples, is elementary, I made the unusual choice of programming file formats before the ability to do anything with their data. These file formats are, tentatively, the Loki Voice Sample and Loki Clock Voice. The former currently holds raw PCM as its only kind of audio data, along with various data counts and behavioral flags. The latter contains the supported audio data types as subblocks, which at the moment include only the LVS.

As of writing, implementation of those file formats is complete, as are means of generating a TTS voice of the LCV variety. This is the foundation of the program. Though it seems odd to say that a TTS doesn't need to be able to read text, the other two parts of the program can be seen as an interface to the bank of phones. A necessary interface for practicality, but nonetheless only a means to create phomnemonics for the indices of the database entries.

Having completed this, I am now working feverishly on the map. It requires two things: a data structure, now built, for storing the strings, and a means of sorting them. Why do they have to be sorted? Without sorting, the program knows nothing about the content of an entry in the database unless it's looking at it. Whenever it needs to find an entry, it has to check every entry in the database until one matches, and it cannot know that an entry isn't in the database until all of them come back negative for a match. With sorting, it can find out roughly where to look, then home in on a likely location. It can also rule out at once anything above the maximum or below the minimum value, and more by deduction.
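The difference between the two kinds of lookup can be sketched as follows. This is a minimal illustration using Python's standard bisect module, not the data structure the program actually uses; the entries and the function names are invented.

```python
from bisect import bisect_left

# Hypothetical map entries, kept in sorted order (invented for illustration).
entries = sorted(["ee", "k", "p", "sh", "th"])

def linear_find(items, target):
    """Unsorted lookup: check every entry until one matches."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1  # only known after every entry has been checked

def sorted_find(items, target):
    """Sorted lookup: halve the remaining search range on each comparison."""
    i = bisect_left(items, target)
    if i < len(items) and items[i] == target:
        return i
    return -1  # known as soon as the range closes on a non-match

print(sorted_find(entries, "p"))   # 2
print(sorted_find(entries, "zz"))  # -1
```

Both functions return the same answers on a sorted list; the sorted version simply needs far fewer comparisons as the database grows.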

To be able to sort things, you have to be able to compare them! Some way of looking at an entry's data has to indicate whether it is greater or lesser than something. So, to fix my comparisons, I employ an encoding, in particular the widespread encodings of Unicode.

Now, what text encodings, such as Unicode, do is assign a number to each letter or other character you can represent on a computer. After establishing that you are both using the same encoding, you can send someone else a message in this encoding. They will be able to map those numbers to images of the letters they are assigned to, just as I am mapping a number here to a sound the letters are assigned to.

So now that we have a numbering, we can compare the numbers. But there's a major problem. Unicode has multiple standard encodings. UTF-8, UTF-16, and UTF-32 are such encodings, taking up at minimum 8, 16, and 32 bits per character. Unicode code points run from U+0000 to U+10FFFF, so the largest needs 21 bits, and UTF-32 stores every one of them in a full 32 bits. Four bytes per character is very hefty, but to get lower minimums you also have to employ some witchcraft on the bits.
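To make the size trade-off concrete, here is a small Python check of how many bytes a single character occupies in each encoding. The helper name encoded_sizes and the sample characters are chosen only for illustration.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character occupies in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# Sample characters: U+006B, U+00E9, U+20AC, U+1D11E.
for ch in ["k", "\u00e9", "\u20ac", "\U0001d11e"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# U+006B  -> 1 byte in UTF-8, 2 in UTF-16, 4 in UTF-32
# U+1D11E -> 4 bytes in all three
```

The "-be" variants are used so that Python does not prepend a byte order mark, which would inflate the counts.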

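Why? Well, without it, you can't tell in running text whether the next byte belongs to a 3-byte character, a 2-byte character, or any other kind, because the characters in running text are mixed together without separators. As a result of this witchcraft, a character's encoded value need not rank the same way as its code point. In UTF-16, characters above U+FFFF are stored as surrogate pairs that begin with a unit in the range D800 to DBFF, so they compare lower than U+E000 through U+FFFF despite having higher code points. UTF-8 keeps code point order only if its bytes are compared one at a time as unsigned values; treat them as signed bytes, or pack them into little-endian words, and the order breaks there too.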
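One concrete pair shows the mismatch. The sketch below compares U+FF5E with U+1D11E, two characters picked purely as an example, first by code point and then by their encoded bytes.

```python
low = "\uff5e"        # U+FF5E,  UTF-16: FF5E (one unit)
high = "\U0001d11e"   # U+1D11E, UTF-16: D834 DD1E (surrogate pair)

def order_by_codepoint(a, b):
    """True if a sorts before b when comparing code points."""
    return ord(a) < ord(b)

def order_by_encoding(a, b, encoding):
    """True if a sorts before b when comparing encoded bytes as unsigned values."""
    return a.encode(encoding) < b.encode(encoding)

print(order_by_codepoint(low, high))              # True
print(order_by_encoding(low, high, "utf-16-be"))  # False: FF5E > D834
print(order_by_encoding(low, high, "utf-8"))      # True: EF.. < F0..
```

In this pair the UTF-16 ordering is reversed relative to the code points, while an unsigned byte-by-byte comparison of the UTF-8 forms agrees with them.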

Thus, a sort order is tied to whatever encoding was used to produce it. One reason this is a problem is that, while re-encoding a map is enough of a resource sink, having to re-sort it afterwards to match the new encoding is ridiculous. For another, these encodings only serve to represent the enumeration of Unicode. So sorting by a character's value in its present encoding in a way invalidates the universal enumeration that Unicode intended to establish to begin with, hence the name.

Programming the translations between these formats has taken the larger part of the work on this second part of the program. I say with glee that I am almost able to create a map from non-UTF-32 strings that are nonetheless sorted by their code points, not their encoded values. There's still a ways to go, but the milestone is close enough to taste.
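The goal described here, entries stored in one encoding but ordered by code point, can be sketched in Python. This leans on Python's built-in decoder rather than hand-written translation routines, so it illustrates the idea only and is not the program's own approach; the name codepoint_key and the sample entries are invented.

```python
def codepoint_key(raw, encoding="utf-16-be"):
    """Sort key: the entry's code points, independent of its stored encoding."""
    return [ord(ch) for ch in raw.decode(encoding)]

# Hypothetical entries, stored as UTF-16 bytes.
stored = [s.encode("utf-16-be") for s in ["\U0001d11e", "\uff5e", "k"]]

by_value = sorted(stored)                         # order of the raw bytes
by_codepoint = sorted(stored, key=codepoint_key)  # order of the code points

print([s.decode("utf-16-be") for s in by_value])      # k, U+1D11E, U+FF5E
print([s.decode("utf-16-be") for s in by_codepoint])  # k, U+FF5E, U+1D11E
```

Because the key depends only on code points, re-encoding the stored strings would leave the sorted order unchanged.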

~LCK, 4/29/2010

All content © Casady Roy Kemper (a.k.a. Loki Clock) and protected by the Digital Millennium Copyright Act and the Berne Convention, unless otherwise stated or unless alternative authorship is indicated without explicit accompanying copyright claims on the part of Loki Clock.