The aim of this article is to provide a brief introduction to speech recognition, with examples of how it relates to the language identification problem. We will look at the way speech is represented and produced by humans in order to introduce some basic concepts and terminology. Different ways in which the problem has been approached will then be broadly outlined, aiming to give the reader some examples of what the methods introduced actually mean in practice, rather than provide detailed analysis. Finally, the main causes of variability of speech will be summarised. Where possible, examples of how the theory applies to different languages and examples of ways in which they differ will be given.
It is almost impossible to tackle the speech recognition problem without first establishing some way of representing spoken utterances by a set of symbols that stand for the sounds produced. The letter symbols we use for writing are clearly inadequate, as the way they are pronounced varies; for example, the letter "o" is pronounced differently in the words "pot", "most" and "one".
One way of representing speech sounds is by using phonemes. Formally, we can define a phoneme as a linguistic unit such that, if one phoneme is substituted for another in a word, the meaning of that word could change. This definition only holds within a single language, which will therefore have a finite set of phonemes. When different languages are compared, however, there are differences; for example, in English, /l/ and /r/ (as in "lot" and "rot") are two different phonemes, whereas in Japanese they are not. Similarly, the presence of individual sounds, such as the "clicks" found in some sub-Saharan African languages, or the velar fricatives (introduced later) found in Arabic, is readily apparent to listeners fluent in languages that do not contain these phonemes. Still, as the vocal apparatus used in the production of languages is universal, there is much overlap between phoneme sets, and the total number of phonemes is finite.
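The idea of finite, overlapping phoneme inventories can be sketched with ordinary set operations. The inventories below are tiny illustrative fragments, not complete phoneme sets for either language; only the /l/–/r/ contrast is taken from the discussion above.

```python
# Illustrative fragments of two phoneme inventories (assumed data,
# not complete sets for either language).
english = {"l", "r", "p", "t", "k", "s"}
japanese = {"r", "p", "t", "k", "s"}  # /l/ and /r/ are not distinct phonemes

# The vocal apparatus is universal, so the overlap is typically large.
shared = english & japanese

# Phonemes present in one language but not the other can help
# distinguish them.
only_english = english - japanese

print(sorted(shared))        # → ['k', 'p', 'r', 's', 't']
print(sorted(only_english))  # → ['l']
```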
Table 1 in the Appendix shows how the phonemes are subdivided into groups based on the way they are produced. The variation between different sets will be dealt with later.
It is also possible to distinguish between speech sounds based solely on the way they are produced. The units in this case are known as phones. There are many more phones than phonemes, as some phonemes are produced in different ways depending on context. For example, the pronunciation of the phoneme /l/ differs slightly when it occurs before consonants and at the end of utterances (as in "people"), as opposed to other positions (e.g. in "politics"). The two phones are called the velarised and the non-velarised "l" respectively. As they are both forms of the same phoneme, they form a set of allophones. Any machine-based speech recogniser needs to be aware of the existence of allophone sets.
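A recogniser's awareness of allophones can be modelled as a context-dependent rule. The following is a deliberately simplified sketch: it treats letters as phonemes and uses a crude vowel set, both of which are assumptions for illustration only, but it captures the shape of the velarised/non-velarised /l/ rule described above.

```python
# A toy context-dependent allophone rule for English /l/:
# velarised ("dark") [l] before a consonant or at the end of an
# utterance, non-velarised ("clear") [l] elsewhere.
VOWELS = set("aeiou")  # crude stand-in for a real vowel inventory

def l_allophone(phonemes, i):
    """Return which allophone of /l/ to realise at position i."""
    assert phonemes[i] == "l"
    nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
    if nxt is None or nxt not in VOWELS:
        return "dark_l"   # velarised, as at the end of "people"
    return "clear_l"      # non-velarised, as in "politics"

print(l_allophone(list("pol"), 2))       # utterance-final → dark_l
print(l_allophone(list("politics"), 2))  # before a vowel → clear_l
```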
It is not just the speech organs involved that influence the way an utterance is spoken and subsequently interpreted. The stress, rhythm and intonation of speech are its prosodic features.
Stress is used at two levels; in sentences, it indicates the most important words, while in words it indicates the prominent syllables - for example, the word "object" could be interpreted as either a noun or a verb, depending on whether the stress is placed on the first or second syllable.
Rhythm refers to the timing aspect of utterances. Some languages (e.g. English) are said to be stress-timed, with approximately equal time intervals between stresses (experiments have shown that, objectively, there is merely a tendency in this direction). The portion of an utterance beginning with one stressed syllable and ending with another is called a foot (by analogy with poetry). So, a four-syllable foot (1 stressed, 3 unstressed) would be longer than a single (stressed!) syllable foot, but not four times longer. Other languages, such as French, are described as syllable-timed.
Intonation, or pitch movement, is very important in indicating the meaning of an English sentence. In tonal languages, such as Mandarin and Vietnamese, the intonation also determines the meaning of individual words.
The above diagram shows the organs involved in the production of speech, all of which probably evolved for purposes other than speech production (such as eating and breathing), thus limiting the range of sounds that we can produce. Basically, air is drawn into the lungs by inhaling, expanding the rib cage and lowering the diaphragm. The pressure in the lungs is then increased by the reverse process, which pushes the air up the trachea (windpipe). The larynx, a structure of cartilage covered by skin, is situated at the top of the trachea; it contains a slit-like orifice between the vocal cords, the glottis. As the air flows through the glottis, the local pressure falls, eventually allowing the laryngeal muscles to close the glottis and interrupt the flow of air. This in turn causes the pressure to rise again, forcing the vocal cords apart. The cycle repeats itself, producing a train of pulses. This process is known as phonation.
The rest of the vocal tract, the oral and nasal passages, acts as a filter, allowing those harmonics of the glottal waveform which lie near the natural resonant frequencies of the tract to pass, whilst attenuating the others. Indeed, reasonable acoustic models of speech production have been built from an excitation source driving a series of filters.
So, what we get as a result of the above process is the acoustic wave radiated from the lips. To produce different sounds, we change the shape of the vocal tract by moving the jaw, tongue and lips so that the natural resonance occurs at different frequencies. In normal speech, the fundamental frequency will thus be changing all the time. However, the components of the larynx tone are always harmonics of the fundamental, and the effect of the resonances is to produce peaks in the spectrum of the output at the harmonics which are the closest to the true resonance. This ensures that the spectrum of the resulting sound always has the same envelope (or general outline), although the fundamental frequency is continually changing. Thus a certain sameness of quality is heard in a range of sounds with different fundamentals. If this were not the case, speech sounds could not fulfil the linguistic function that they in fact have.
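The observation that spectral peaks land on whichever harmonic lies closest to each resonance can be checked with a few lines of arithmetic. The formant frequencies below are rough, assumed values for an /a/-like vowel, chosen purely for illustration; the point is that the peak locations barely move when the fundamental changes, which is what keeps the spectral envelope recognisably the same.

```python
# A numerical sketch of the source-filter claim: the excitation
# contains only harmonics of the fundamental f0, and each resonance
# (formant) boosts whichever harmonic lies closest to it.

def nearest_harmonic(f0, resonance):
    """Frequency of the harmonic of f0 closest to a given resonance."""
    n = max(1, round(resonance / f0))
    return n * f0

formants = [700.0, 1200.0, 2600.0]  # assumed resonances (Hz), /a/-like

for f0 in (100.0, 220.0):  # two different fundamentals
    peaks = [nearest_harmonic(f0, f) for f in formants]
    print(f0, peaks)
# → 100.0 [700.0, 1200.0, 2600.0]
# → 220.0 [660.0, 1100.0, 2640.0]
```

Although the individual harmonics move with the fundamental, the spectral peaks stay close to the fixed resonances, so the envelope, and hence the perceived vowel quality, is preserved.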
The peaks in the spectra described above thus correspond to the basic frequencies of the vibration of air in the vocal tract. These peaks depend on the shape of the vocal tract, and the regions around them are called formants. Formants are most easily seen in sonagrams (also called spectrograms; the instrument that produces them is a spectrograph).
Sonagrams represented an important breakthrough in speech research when they were invented, because they could conveniently represent the way speech spectra vary with time. They are basically plots of frequency versus time, with the darkness of the trace showing the intensity of the sound at a particular frequency. The following diagram shows a sonagram for the four semivowels /w/, /r/, /l/ and /j/, as in the syllables "wer", "rer", "ler" and "yer". It can be seen that all initial semivowels have a rising first formant. The second formant of /w/ rises, while that of /j/ falls.
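The computation behind a sonagram can be sketched briefly: slice the signal into short overlapping frames, window each frame, and take the magnitude of its discrete Fourier transform, so each frame becomes one vertical strip of the plot. Real systems use the FFT; the naive DFT and the toy frame sizes below are assumptions that simply keep the example self-contained.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Naive short-time spectrum: one magnitude column per frame."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Hann window reduces leakage between neighbouring frequency bins.
        frame = [signal[start + n] *
                 (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                 for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):  # one column of the sonagram
            s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            mags.append(abs(s))
        frames.append(mags)
    return frames  # time x frequency; "darkness" corresponds to magnitude

# A 500 Hz sinusoid sampled at 8 kHz: with 64-point frames the bin
# spacing is 125 Hz, so the energy should concentrate in bin 4.
sig = [math.sin(2 * math.pi * 500 * t / 8000) for t in range(256)]
cols = spectrogram(sig)
peak_bin = max(range(len(cols[0])), key=lambda k: cols[0][k])
print(peak_bin)  # → 4
```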
Different groups of phonemes (as shown in Table 1) produce different sonagrams, but phonemes within a given group will usually have similar formants. These groups may be used to help distinguish between languages by considering the frequency of occurrence of phones, which differs between languages even for phones they share.
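Comparing languages by frequency of occurrence amounts to estimating a relative-frequency distribution over phones for each language and measuring how far apart the distributions are. The sketch below uses total-variation distance and invented toy phone sequences; real systems would use much longer transcriptions and more sophisticated statistics.

```python
from collections import Counter

def phone_distribution(phones):
    """Relative frequency of each phone in a sequence."""
    counts = Counter(phones)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def total_variation(p, q):
    """Distance between two distributions: 0 = identical, 1 = disjoint."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)

# Invented toy data: the same three phones, used at different rates.
lang_a = phone_distribution("aabbbcc")
lang_b = phone_distribution("abbccc")
print(total_variation(lang_a, lang_a))            # → 0.0
print(round(total_variation(lang_a, lang_b), 3))  # → 0.214
```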
In speech research, a lot of effort has been put into studying the way we as humans recognise and interpret speech. This makes sense, since the best and most accurate speech recognition (and language identification, for that matter) system in existence today is the one most of us possess. This field of study has yet to answer many crucial questions, but a lot has been achieved to date.
Research has shown that the two lowest formants are necessary to produce a complete set of English vowels, and that the three lowest formants are necessary for good speech intelligibility. More formants give more natural-sounding speech.
The situation is made more complex when dealing with continuous speech, as the speed at which some articulators can move is limited by their inertia. Consequently, there is sometimes no time for a steady vowel configuration to be reached before the tract must be changed for the next consonant, and the formants do not reach their target values.
Other factors found to influence the perception of phonemes include duration and the frequency of the formants in the preceding utterance. Also, an interesting phenomenon which has been called the "cocktail party effect" has been investigated. When a number of conversations are being carried on simultaneously, it is usually possible to follow one, even if the total loudness of the others seems greater. Experiments have shown that the continuity of fundamental frequency groups events occurring at different times into the speech of a single speaker, and also that a common fundamental is a necessary (though not sufficient) condition for sounds to be grouped together as a stream.
We have already mentioned the effect that rhythm and intonation have on the way utterances are perceived in English. This is also important in other languages, as mentioned earlier.
Over the past few decades, there have been important changes in the way the problem has been approached. They are briefly summarised below.
Finally, let us look at the main problems that have to be dealt with when designing continuous speech recognisers.
The first step is to decide which unit to recognise. Even though no such unit alone is sufficient for language identification, as any of them can occur in more than one language, they are still important for two reasons. Firstly, language identification systems can form part of speech recognition systems. Secondly, language identifiers have been developed which essentially consist of speech recognisers for individual languages.
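The second approach can be sketched as running one model per language over the same utterance and picking the language whose model scores it highest. Each "model" below is just a table of log phone probabilities with invented numbers, a stand-in for a full recogniser; the structure of the decision, however, is the same.

```python
import math

# Invented per-language phone probabilities (illustrative only).
models = {
    "english": {"l": math.log(0.04), "r": math.log(0.05), "a": math.log(0.10)},
    "japanese": {"l": math.log(0.001), "r": math.log(0.06), "a": math.log(0.12)},
}

def score(model, phones, floor=math.log(1e-4)):
    """Log-likelihood of a phone sequence; unseen phones get a floor."""
    return sum(model.get(p, floor) for p in phones)

def identify(phones):
    """Pick the language whose recogniser scores the utterance highest."""
    return max(models, key=lambda lang: score(models[lang], phones))

print(identify(["l", "a", "l", "a"]))  # → english
```

Frequent /l/ sounds make the English model the better fit here, echoing the /l/–/r/ contrast discussed earlier; a real system would score full acoustic observations, not symbolic phones.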
We have already seen how variability can be introduced in speech by different speakers, due to their different vocal tracts. A number of other factors must also be considered, including the different dialects of people from different regions and different social and economic backgrounds, as well as people speaking a second language, many of whom partly carry over the prosodic patterns, the phoneme inventory and even (approximately) the phonotactics of their first language. This can clearly be a big problem when attempting to develop "universal" language recognisers.
Further variation is introduced by people speaking differently on formal and informal occasions, the speaking rate, and the background noise. Finally, the sound engineering aspects of recording speech must be looked at, such as the microphones used and the reverberation, which adds delayed and distorted versions of the signal to the original. Such aspects are however beyond the scope of this study.
Even if we managed to solve all of the problems outlined above, there are linguistic ambiguities which can only be resolved by considering meaning. Examples are words such as "two", "too" and "to", as well as phrases like "pitch shifter" and "pit shifter".
The article has covered a fairly broad range of topics, many of which are essential if multi-lingual speech recognition is to be investigated further. To summarise what all this means for language identification, the information we need to gather to determine the acoustic signature of a language includes the acoustic phonetics (the variation in phonemes and in the frequency of their occurrence), prosodics (stress, rhythm and intonation), phonotactics (how phonemes are grouped), and the vocabulary (needed in order to identify second-language speakers). In particular, the prosodic features, which have not been as useful in early language ID systems as designers may have hoped, are likely to contribute significantly to the way the problem is approached in the future.