An Introduction to Multi-Lingual Speech Recognition

by Uros Rapajic

Supervisor: Dr P. Naylor



The aim of this article is to provide a brief introduction to speech recognition, with examples of how it relates to the language identification problem. We will look at the way speech is represented and produced by humans in order to introduce some basic concepts and terminology. Different ways in which the problem has been approached will then be broadly outlined, aiming to give the reader some examples of what the methods introduced actually mean in practice, rather than provide detailed analysis. Finally, the main causes of variability of speech will be summarised. Where possible, examples of how the theory applies to different languages and examples of ways in which they differ will be given.

Representation of speech

It is almost impossible to tackle the speech recognition problem without first establishing some way of representing spoken utterances by a group of symbols representing the sounds produced. The letter symbols we use for writing are obviously not adequate, as the way they are pronounced varies; for example, the letter "o" is pronounced differently in the words "pot", "most" and "one".

One way of representing speech sounds is by using phonemes. Formally, we can define the phoneme as a linguistic unit such that, if one phoneme is substituted for another in a word, the meaning of that word could change. This definition only holds within a single language, where a finite set of phonemes therefore exists. When different languages are compared, however, there are differences; for example, in English, /l/ and /r/ (as in "lot" and "rot") are two different phonemes, whereas in Japanese they are not. Similarly, individual sounds such as the "clicks" found in some sub-Saharan African languages, or the velar fricatives (introduced later) found in Arabic, are readily apparent to listeners fluent in languages that do not contain these phonemes. Still, as the vocal apparatus used in the production of languages is universal, there is much overlap between the phoneme sets, and the total number of phonemes is finite.

Table 1 in the Appendix shows how the phonemes are subdivided into groups based on the way they are produced. The variation between different sets will be dealt with later.

It is also possible to distinguish between speech sounds based solely on the way they are produced. The units in this case are known as phones. There are many more phones than phonemes, as some phonemes are realised in different ways depending on context. For example, the pronunciation of the phoneme /l/ differs slightly when it occurs before consonants and at the end of utterances (as in "people"), as opposed to other positions (e.g. in "politics"). The two phones are called the velarised and the non-velarised "l" respectively. As they are both different forms of the same phoneme, they form a set of allophones. Any machine-based speech recogniser needs to be aware of the existence of allophone sets.

It is not just the speech organs involved that influence the way an utterance is spoken and subsequently interpreted. The stress, rhythm and intonation of speech are its prosodic features.

Stress is used at two levels; in sentences, it indicates the most important words, while in words it indicates the prominent syllables - for example, the word "object" could be interpreted as either a noun or a verb, depending on whether the stress is placed on the first or second syllable.

Rhythm refers to the timing aspect of utterances. Some languages (e.g. English) are said to be stress-timed, with approximately equal time intervals between stresses (experiments have shown that, objectively, there is merely a tendency in this direction). The portion of an utterance beginning with one stressed syllable and ending with another is called a foot (by analogy with poetry). So, a four-syllable foot (1 stressed, 3 unstressed) would be longer than a single (stressed!) syllable foot, but not four times longer. Other languages, such as French, are described as syllable-timed.

Intonation, or pitch movement, is very important in indicating the meaning of an English sentence. In tonal languages, such as Mandarin and Vietnamese, the intonation also determines the meaning of individual words.

Speech production

Speech organs

The above diagram shows the organs involved in the production of speech, all of which probably evolved for purposes other than speech production (such as eating and breathing), thus limiting the range of sounds that we can produce. Basically, air is drawn into the lungs by inhaling, expanding the rib cage and lowering the diaphragm. The pressure in the lungs is then increased by the reverse process, which pushes the air up the trachea (windpipe). The larynx, a structure of cartilage and muscle containing a slit-like orifice, the glottis, bounded by the vocal cords, is situated at the top of the trachea. As the air flows through the glottis, the local pressure falls, eventually allowing the laryngeal muscles to close the glottis and interrupt the flow of air. This in turn causes the pressure to rise again, forcing the vocal cords apart. The cycle repeats itself, producing a train of pulses. This process is known as phonation.

The rest of the vocal tract, the oral and nasal passages, acts as a filter, allowing those harmonics of the glottal waveform which lie near the natural resonant frequencies of the tract to pass, whilst attenuating the others. Indeed, reasonable acoustic models of speech production have been created consisting of an excitation source driving a series of filters.
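This source-filter idea can be sketched numerically. The fragment below is a minimal illustration rather than a serious synthesiser: a 100 Hz impulse train stands in for the glottal pulses, and a single two-pole resonator stands in for one vocal-tract formant. All frequencies and bandwidths are arbitrary example values.

```python
import numpy as np

def resonator(x, freq, bw, fs):
    """Two-pole IIR filter with a resonance near freq Hz and bandwidth bw Hz."""
    r = np.exp(-np.pi * bw / fs)                  # pole radius from bandwidth
    a1 = -2 * r * np.cos(2 * np.pi * freq / fs)   # feedback coefficients
    a2 = r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y1 = y[n - 1] if n >= 1 else 0.0          # zero initial conditions
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = x[n] - a1 * y1 - a2 * y2
    return y

fs = 8000
excitation = np.zeros(800)          # 100 ms at 8 kHz
excitation[::80] = 1.0              # glottal pulses at 100 Hz (the fundamental)
speech = resonator(excitation, freq=700, bw=130, fs=fs)

# The strongest component of the output is the harmonic of 100 Hz
# that lies closest to the 700 Hz resonance.
spectrum = np.abs(np.fft.rfft(speech))
freqs = np.fft.rfftfreq(len(speech), d=1 / fs)
peak = freqs[np.argmax(spectrum)]
```

This matches the description in the text: the excitation fixes the harmonic spacing (the fundamental), while the filter decides which harmonics are emphasised (the envelope).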

So, what we get as a result of the above process is the acoustic wave radiated from the lips. To produce different sounds, we change the shape of the vocal tract by moving the jaw, tongue and lips so that the natural resonance occurs at different frequencies. In normal speech, the fundamental frequency will thus be changing all the time. However, the components of the larynx tone are always harmonics of the fundamental, and the effect of the resonances is to produce peaks in the spectrum of the output at the harmonics which are the closest to the true resonance. This ensures that the spectrum of the resulting sound always has the same envelope (or general outline), although the fundamental frequency is continually changing. Thus a certain sameness of quality is heard in a range of sounds with different fundamentals. If this were not the case, speech sounds could not fulfil the linguistic function that they in fact have.

The peaks in the spectra described above thus correspond to the basic frequencies of the vibration of air in the vocal tract. These peaks depend on the shape of the vocal tract, and the regions around them are called formants. Formants are most easily seen in sonagrams (also called spectrograms; the instrument that produces them is a spectrograph).

Sonagrams represented an important breakthrough in speech research when they were invented, because they could conveniently represent the way speech spectra vary with time. They are basically plots of frequency versus time, with the darkness of the trace showing the intensity of the sound at a particular frequency. The following diagram shows a sonagram for the four semivowels /w/, /r/, /l/ and /j/, as in the syllables "wer", "rer", "ler" and "yer". It can be seen that all initial semivowels have a rising first formant. The second formant of /w/ rises, while that of /j/ falls.

A Sonagram
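The computation behind such a plot can be sketched in a few lines: slice the signal into short overlapping frames, window each frame, and take the magnitude of its Fourier transform. The signal below is a synthetic rising tone rather than real speech, and the frame and hop sizes are arbitrary example values.

```python
import numpy as np

def sonagram(x, fs, frame_len=256, hop=128):
    """Rows = time frames, columns = frequency bins, values = magnitude."""
    window = np.hanning(frame_len)
    frames = [np.abs(np.fft.rfft(x[s:s + frame_len] * window))
              for s in range(0, len(x) - frame_len + 1, hop)]
    return np.array(frames)

fs = 8000
t = np.arange(fs) / fs                         # one second of samples
inst_freq = 500 + 1000 * t                     # frequency glides 500 -> 1500 Hz
x = np.sin(2 * np.pi * np.cumsum(inst_freq) / fs)

S = sonagram(x, fs)
hz_per_bin = fs / 256
start_hz = np.argmax(S[0]) * hz_per_bin        # dominant frequency, first frame
end_hz = np.argmax(S[-1]) * hz_per_bin         # dominant frequency, last frame
```

Tracing the per-frame maxima recovers the rising "formant-like" trajectory, which is exactly the kind of movement read off a sonagram by eye.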

Different groups of phonemes (as shown in Table 1) produce different sonagrams, but phonemes within a given group will usually have similar formants. Phone statistics may be used to distinguish between languages: the frequency of occurrence of a given phone typically differs from one language to another, even when the phone itself is common to both.
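A crude sketch of this frequency-of-occurrence idea follows. Single letters stand in for phones, and both "languages" and the sample are invented purely for illustration; a real system would use phone streams produced by a recogniser.

```python
import numpy as np
from collections import Counter

def phone_profile(phones, inventory):
    """Relative frequency of each phone in a stream of phone symbols."""
    counts = Counter(phones)
    return np.array([counts[p] / len(phones) for p in inventory])

inventory = list('abcdrst')
lang_a = phone_profile('ratratstarart', inventory)   # r/a/t-heavy "language"
lang_b = phone_profile('bcdbdcbbcd', inventory)      # b/c/d-heavy "language"
sample = phone_profile('tarrastrata', inventory)     # unknown utterance

# Classify by the nearest profile (Euclidean distance).
guess = 'A' if (np.linalg.norm(sample - lang_a)
                < np.linalg.norm(sample - lang_b)) else 'B'
print(guess)  # A
```

Even this toy nearest-profile rule separates the two inventories; real systems replace the letter counts with statistics over acoustically recognised phones.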

Speech perception

In speech research, a great deal of effort has gone into studying the way we as humans recognise and interpret speech, which makes sense, since the best and most accurate speech recognition (and language identification, for that matter) system in existence today is the one most of us possess. This field of study has yet to answer many crucial questions, but a lot has been achieved to date.

Research has shown that the two lowest formants are necessary to produce a complete set of English vowels, as well as that the three formants lowest in frequency are necessary for good speech intelligibility. More formants give more natural sounds.

The situation is made more complex when dealing with continuous speech, as the speed at which some articulators can move is limited by their inertia. Consequently, there is sometimes no time for a steady vowel configuration to be reached before the tract must be changed for the next consonant, and the formants do not reach their target values.

Other factors found to influence the perception of phonemes include duration and the frequency of the formants in the preceding utterance. Also, an interesting phenomenon which has been called the "cocktail party effect" has been investigated. When a number of conversations are being carried on simultaneously, it is usually possible to follow one, even if the total loudness of the others seems greater. Experiments have shown that the continuity of fundamental frequency groups events occurring at different times into the speech of a single speaker, and also that a common fundamental is a necessary (though not sufficient) condition for sounds to be grouped together as a stream.

We have already mentioned the effect that rhythm and intonation have on the way utterances are perceived in English. This is also important in other languages, as mentioned earlier.

History of speech recognition

Over the past few decades, there have been important changes in the way the problem has been approached. They are briefly summarised below.

The acoustic approach (pre-1960)
The patterns of formant movements were analysed in an attempt to recognise a word from a limited, predefined vocabulary (e.g. the digits 1 to 10). The systems performed well, but only when used by the speaker they were designed for. The usefulness of this method was limited by the fact that the acoustic patterns of a word spoken on different occasions differ in duration and intensity, and the same word produced by different speakers yields patterns differing in frequency content as well.
The pattern-recognition approach (1960-1968)
Attempts were made to normalise the speech waveform in some way, so that comparisons with pre-defined patterns (words) could be made for a range of speakers. In particular, it was noted that the fundamental frequency could be used to normalise formant frequencies. Also, ways of normalising the duration of patterns were investigated. The problem was still that such systems were only adequate for limited vocabularies.
The linguistic approach (1969-1976)
Early recognisers neglected the fact that, when two people communicate using speech, they must both use the same language. There are many sources of linguistic knowledge which could be used to enhance such systems, such as pre-stored dictionaries and the varying probabilities of a particular phoneme or word occurring after another. The latter is the domain of phonotactics, which deals with the rules governing the combinations of the different phones in a language. There is a wide variance in such rules across languages - for example, the phone clusters /sr/ and /schp/ are common in Tamil and German respectively, but do not exist in English.
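Phonotactic constraints of this kind can be captured with simple bigram counts. In the sketch below, letters stand in for phones and the tiny word list is an invented English-like example, not real training data.

```python
from collections import defaultdict

def train_bigrams(words):
    """Count adjacent phone pairs, with '#' marking word boundaries."""
    counts = defaultdict(int)
    for word in words:
        phones = ['#'] + list(word) + ['#']
        for a, b in zip(phones, phones[1:]):
            counts[(a, b)] += 1
    return counts

counts = train_bigrams(['strip', 'spring', 'splash', 'street'])
print(counts[('#', 's')])   # 4: every word in this toy list begins with /s/
print(counts[('s', 't')])   # 2: /st/ onsets are common in English
print(counts[('s', 'r')])   # 0: /sr/ is not a permitted English cluster
```

A language identifier can score an unknown phone stream against such tables for each candidate language: a stream full of /sr/ transitions is a poor fit for an English-trained table.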
The pragmatic approach (1977-1980s)
The major advance that took place in isolated word recognisers was the use of dynamic programming algorithms, which enabled optimum non-linear timescale distortions to be introduced in the matching process, improving accuracy. A number of more mathematically sophisticated algorithms were also devised for other methods.
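This dynamic programming technique is commonly known as dynamic time warping (DTW). A minimal sketch follows, using toy one-dimensional "feature" sequences in place of real spectral frames; the numbers are invented for illustration.

```python
import numpy as np

def dtw(a, b):
    """Cost of the best non-linear time alignment of two sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j] = best cost aligning a[:i], b[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allowed moves: stretch a, stretch b, or advance both.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = [1, 3, 5, 3, 1]                    # stored "word" pattern
slow = [1, 1, 3, 3, 5, 5, 3, 3, 1, 1]         # same pattern spoken more slowly
other = [5, 1, 2, 1, 5]                       # a different "word"
```

The point of the non-linear warp is that `dtw(template, slow)` is zero - the slow utterance aligns perfectly once the timescale is stretched - while the mismatched word keeps a positive cost.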

Nature of the problem

Finally, let us look at the main problems that have to be dealt with when designing continuous speech recognisers.

The first step is to decide which unit to recognise. Even though none of these units alone is sufficient for language identification, as each can occur in more than one language, they are still important for two reasons. Firstly, language identification systems can form part of speech recognition systems. Secondly, language identifiers which are essentially made up of speech recognisers for individual languages have been developed.

Words
As basic units of language, they are necessary to deduce the meaning of an utterance. However, a representation of each would have to be generated by a speaker and stored. As there are usually tens of thousands of words in a language, this is not practical unless the vocabulary can be reduced. Another problem is finding where one word ends and the next begins.
Syllables
They are an attractive option as they have a fixed structure. It has been estimated that there are around 10,000 syllables in English. However, problems could again arise when deciding whether a particular consonant belongs to the preceding or the following syllable.
Demisyllables
These consist of half a syllable, from the beginning of the syllable to the middle of the vowel, or from the middle of the vowel to the end of the syllable. Syllables are thus split at the point of maximum intensity. In English, the total number of demisyllables is around 2000. This number can be reduced further by recognising that consonant clusters in syllables are often preceded or followed by affixes, such as when /s/ is added to form the plural of a noun. If such affixes are removed from the consonant clusters used, we end up with around 1600 demisyllables. This is very close to the number of demisyllables in German, which was found to be around 1630.
Phonemes
There are few of them (40-60). However, the allophones (different forms of the same phoneme, as described earlier) must be taken into consideration as well, and there are some 100-200 of these. The main problem with this approach is segmentation, as it is often hard to tell where one phoneme ends and the next one starts.

We have already seen how variability can be introduced in speech by different speakers, due to different vocal tracts. A number of other factors must be considered, including different dialects of people from different parts of the country and different social / economic backgrounds, as well as people speaking a second language, many of whom partly use the prosodic patterns, the phoneme inventory and even (approximately) the phonotactics of their first language. This can clearly be a big problem when attempting to develop "universal" language recognisers.

Further variation is introduced by people speaking differently on formal and informal occasions, the speaking rate, and the background noise. Finally, the sound engineering aspects of recording speech must be looked at, such as the microphones used and the reverberation, which adds delayed and distorted versions of the signal to the original. Such aspects are however beyond the scope of this study.

Even if we managed to solve all of the problems outlined above, there are linguistic ambiguities which can only be resolved by considering meaning. Examples are words such as "two", "too" and "to", as well as phrases like "pitch shifter" and "pit shifter", whose acoustic realisations are nearly identical.


The article has covered a fairly broad range of topics, many of which are essential if multi-lingual speech recognition is to be investigated further. To summarise what all this means for language identification, the information we need to gather to determine the acoustic signature of a language includes the acoustic phonetics (the variation in phonemes and in the frequency of their occurrence), prosodics (stress, rhythm and intonation), phonotactics (how phonemes are grouped), and the vocabulary (needed in order to identify second-language speakers). In particular, the prosodic features, which have not been as useful in early language ID systems as designers may have hoped, are likely to contribute significantly to the way the problem is approached in the future.


Table 1

Phoneme categories of British English and examples of words in which they are used
Vowels: heed, hid, head, had, hard, hod, hoard, hood, who'd
Diphthongs: bay, by, bow, bough, beer, doer, boar, boy, bear
Semivowels: was, ran, lot, yacht
Fricatives: sail, ship, funnel, thick, hull, zoo, azure, that, valve
Nasals: am, an, sang
Plosives: bat, disc, goat, pool, tap, kite
Affricates: jaw, chore


