Using Phone Recognition and Language Modelling (PRLM) for Automatic Language Identification

by Uros Rapajic

Supervisor: Dr. P. Naylor



Article 1 looked at some basic concepts of multi-lingual speech recognition and a number of examples of linguistic features which may be used to distinguish between languages. We shall now examine one of several methods for language identification in greater detail. Most of these methods are based on statistical models (PRLM is no exception), some of which are too complex to be introduced in a short article. Instead, we will concentrate on identifying the building blocks of the system, and on seeing how well it performs.

Appendix 1 contains brief definitions of several terms or concepts mentioned in the article, which are too detailed to be fully described and analysed here. For each term, a suggested text which covers the topic in more detail is given.


Most language ID systems can be divided into two phases: training and recognition. The training phase involves presenting the system with examples of speech from a variety of languages. Simple systems require only a sampled speech wave and the true identity of the language being spoken. Others may require a phonetic transcription (a sequence of symbols representing the sounds in each utterance), or an orthographic transcription (the text of the words spoken) together with a pronunciation dictionary mapping each word to its pronunciation. Producing such transcriptions and dictionaries is obviously a time-consuming and expensive process.

In such two-phase systems, the training speech for each language is analysed and one or more models are created. These models are intended to represent some set of language-dependent, fundamental characteristics of the training speech that can be used in the second (recognition) phase of the identification process. During that phase, a new utterance is compared to each of the language-dependent models, and the likelihood that the language of the utterance matches the language used to train each model is calculated. The model most likely to be correct is then selected.

The earliest automatic language ID systems were based on extracting and storing a set of prototypical frequency spectra from each language, each computed from a 10 ms excerpt of training speech. Test speech was analysed, compared to the pre-stored spectra, and classified on the results of the comparison. More recent work has focused on automatic spectral feature extraction (using feature vectors based on formants, prosody, etc.), unsupervised training, and maximum likelihood recognition, as indicated earlier. Such systems perform primarily static classification: the feature vectors are assumed independent of each other, and no use is made of feature vector sequences (i.e. the likelihood that one vector will follow another). A number of dynamic classification systems, based primarily on hidden Markov models (HMMs), have been designed and tested in an attempt to model the sequential characteristics of speech production. While initial experiments indicated that such systems had little or no advantage over static classifiers, researchers have recently managed to obtain better results from dynamic than from static classifiers.

Finally, over the past couple of years, efforts at a number of sites have focused on the use of continuous speech recognition systems for language ID. Training consists of creating one speech recogniser per language. During testing, all the recognisers are run in parallel, and the one yielding the output with the highest likelihood is selected; the language it was trained to recognise is chosen. While difficult to train (many hours of labelled speech in each of the languages are needed) and computationally complex, these systems produce a transcription of each utterance as a by-product of language identification. More importantly, they promise better performance, as they use higher-level knowledge (words and word sequences) rather than lower-level knowledge (phones and phone sequences).

The basic PRLM system

An approach related to the dynamic classification described above has been to use a single-language phone recogniser as a front end to a system that uses phonotactic scores to perform language ID. Phonotactics are the language-dependent rules specifying which phones (or phonemes) are allowed to follow which others; in English, for example, the phone /ŋ/ (as in "sing") never begins a word.

A single-language phone recogniser and an n-gram analyser form the two parts of the system. Note that a number of single-language recognisers can be used together for better performance - this will be dealt with later.

The system relies on acoustic preprocessing methods to perform feature extraction, whereby speech waveforms are converted from their digitised waveform representation into one or more streams of feature vectors. This topic is covered in detail in "A Method for Extracting Feature Vectors from Speech" by George Constantinides.
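
As a rough illustration only (the article cited above describes the actual method), the sketch below computes one common type of feature vector, mel-frequency cepstral coefficients, using the freely available librosa library. The 8 kHz sampling rate suits telephone speech, and the 25 ms window and 10 ms step are typical textbook values, not figures taken from this text.

    # Illustrative sketch: MFCC feature extraction with librosa. The actual
    # front end used by the systems described here may differ.
    import librosa

    def extract_features(wav_path, sr=8000):
        """Convert a digitised speech waveform into a stream of feature vectors."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),       # 25 ms window
                                    hop_length=int(0.010 * sr))  # 10 ms step
        return mfcc.T  # one 13-dimensional vector per frame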

Training consists of tokenising messages in each language (i.e. converting them into phone sequences), analysing the resulting symbol sequences, and estimating an n-gram probability model for each language. Recognition involves tokenising the test message and calculating the likelihood that its phone sequence was produced in each of the languages. Again, the language yielding the highest likelihood is selected. All this is summarised in the diagram below.

PRLM can thus be viewed as a compromise between modelling the sequence information using HMMs trained from unlabelled speech, and employing language-dependent phone recognisers trained from orthographically or phonetically labelled speech. Let us now examine the two phases more closely.

The front end: Single-language phone recognition

The phone recogniser can be trained in any language, regardless of whether that language is one of those we are trying to identify. For example, good language identification results have been obtained using a recogniser trained in English (as training speech in that language was the most readily available) to distinguish between Farsi, French and Tamil. The phone recogniser, which can be implemented using the Hidden Markov Model Toolkit (HTK), is a network of context-independent phones ("monophones"), in which each phone model contains three emitting states. The exact details of how the search is performed are beyond the scope of this article. Briefly, the output vector probability densities are modelled as Gaussian mixtures, with six underlying Gaussian densities per state per stream, and phone recognition is performed via a Viterbi search.
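
To make that structure concrete, here is a hypothetical sketch (in Python, with numpy) of the parameters such a monophone model carries. The names, shapes and example transition values are illustrative assumptions, not HTK's actual internal representation.

    # Hypothetical monophone model: a left-to-right HMM with three emitting
    # states, each modelling its output vectors with a six-component Gaussian
    # mixture (diagonal covariances), as described in the text above.
    import numpy as np
    from dataclasses import dataclass, field

    N_STATES, N_MIX, DIM = 3, 6, 13  # emitting states, mixtures per state, feature dim

    @dataclass
    class MonophoneHMM:
        name: str                     # phone label, e.g. "ae"
        # Transitions between the three emitting states; the last row's
        # missing 0.4 is the probability of exiting the model.
        trans: np.ndarray = field(default_factory=lambda: np.array(
            [[0.6, 0.4, 0.0],
             [0.0, 0.6, 0.4],
             [0.0, 0.0, 0.6]]))
        weights: np.ndarray = field(  # mixture weights, one row per state
            default_factory=lambda: np.full((N_STATES, N_MIX), 1.0 / N_MIX))
        means: np.ndarray = field(
            default_factory=lambda: np.zeros((N_STATES, N_MIX, DIM)))
        variances: np.ndarray = field(  # diagonal covariance terms
            default_factory=lambda: np.ones((N_STATES, N_MIX, DIM)))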

The phone recognition stage, which dominates PRLM processing time, was found to run at around 1.5 times real time on a Sun SPARCstation 10.

The back end: N-gram language modelling

Initially, training speech for each of the languages that we want the system to recognise is fed into the phone recogniser, and a model of the statistics of the phones and phone sequences is computed from the front-end output. N-grams are simply subsequences of n symbols (phones, in this case). Training is performed by accumulating a set of n-gram histograms, one per language, on the assumption that different languages will have different n-gram histograms. We then approximate the n-gram distribution as a weighted sum of the probabilities of the n-gram, the (n-1)-gram, and so on. To illustrate this, let us look at an example for a bigram model (an n-gram with n = 2). If we use w(t-1) and w(t) to represent consecutive symbols observed in the phone stream, and a2, a1 and a0 as weighting constants, the distribution is given by

P'(w(t) | w(t-1)) = a2 P(w(t) | w(t-1)) + a1 P(w(t)) + a0 P0

The Ps are ratios of counts observed in the training data, e.g.

P(w(t) | w(t-1)) = [C(w(t-1), w(t))] / [C(w(t-1))]

where C(w(t-1), w(t)) is the number of times symbol w(t-1) is followed by w(t), and C(w(t-1)) is the number of occurrences of w(t-1). P0 is the reciprocal of the number of symbol types.
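
The training step can be sketched in a few lines of Python. The function and variable names below are our own; phone_sequences is assumed to be a list of tokenised messages (lists of phone symbols) produced by the front end, and the default weights merely satisfy a2 + a1 + a0 = 1 with a0 = 0.001 as discussed later.

    # Sketch: accumulate bigram/unigram histograms for one language and
    # return the interpolated estimate P'(w(t) | w(t-1)) defined above.
    from collections import Counter

    def train_bigram_model(phone_sequences, symbol_inventory):
        bigrams, unigrams = Counter(), Counter()
        for seq in phone_sequences:
            unigrams.update(seq)
            bigrams.update(zip(seq, seq[1:]))   # consecutive symbol pairs
        total = sum(unigrams.values())
        p0 = 1.0 / len(symbol_inventory)        # reciprocal of the number of symbol types

        def p_interp(prev, cur, a2=0.5, a1=0.499, a0=0.001):
            p_bigram = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
            p_unigram = unigrams[cur] / total
            return a2 * p_bigram + a1 * p_unigram + a0 * p0

        return p_interp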

During recognition, the test utterances are first passed through the front-end phone recogniser, producing a phone sequence W = {w(0), w(1), w(2), ...}. The log likelihood, L, that the bigram model for language l, M(l), produced that particular phone sequence is obtained by summing the log probabilities over all consecutive pairs of phones, i.e.

L(W, M(l)) = SUM over t { log P'(w(t) | w(t-1), M(l)) }

Finally, the language of the model with the highest value of L is hypothesised as the language of the utterance.
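
Continuing the sketch, recognition then reduces to summing log probabilities under each language's model and taking the maximum. The models dictionary is assumed to map language names to the p_interp functions returned by train_bigram_model above.

    # Sketch: score a tokenised test utterance under each language model.
    import math

    def log_likelihood(phones, p_interp):
        return sum(math.log(p_interp(prev, cur))
                   for prev, cur in zip(phones, phones[1:]))

    def identify_language(phones, models):
        return max(models, key=lambda lang: log_likelihood(phones, models[lang]))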

It was found experimentally [1] that, for n = 2 and a0 = 0.001, peak performance was obtained for 0.3 < a1, a2 < 0.7. There was also little improvement for n > 2. The researcher in question noted, however, that one might weight the higher-order a's more heavily as the amount of training data increases.

Parallel PRLM

The sounds in the languages to be identified do not always occur in the language used to train the phone recogniser. We may therefore want to incorporate sounds from more than one language into a PRLM-like system. This can be achieved by running multiple PRLM systems in parallel, with the single-language front-end recognisers each trained in a different language. An example of this is shown below.

This system would be expected to perform better. Its only disadvantages are the need for labelled training speech in more than one language, and the increased processing time: the system represented above has three PRLM systems (and therefore three phone recognisers), and at around 1.5 times real time each would take around 4.5 times real time.
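
The text does not specify how the scores from the individual PRLM systems are combined. One plausible fusion rule, assumed here purely for illustration and reusing the log_likelihood helper from the previous sketch, is to sum each language's log likelihoods across the front ends before taking the maximum.

    # Sketch of one possible parallel PRLM fusion rule (an assumption, not a
    # rule stated in the text). Each front end tokenises the utterance with
    # its own phone recogniser and scores it with its own language models.
    def identify_language_parallel(waveform, front_ends):
        """front_ends: list of (tokenise, models) pairs, one per recogniser."""
        languages = list(front_ends[0][1])
        totals = {lang: 0.0 for lang in languages}
        for tokenise, models in front_ends:
            phones = tokenise(waveform)          # this leg's phone sequence
            for lang in languages:
                totals[lang] += log_likelihood(phones, models[lang])
        return max(totals, key=totals.get)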


Experiments have been carried out to determine to what extent parallel PRLM outperforms PRLM. The two systems were tested on pairs of languages, in each case with 10-second and 45-second utterances. They performed similarly on 45-second samples, while the parallel approach reduced the error rate for 10-second speech from 16% to 11%.

A parallel PRLM system with six front-end phone recognisers, trained in English, German, Hindi, Japanese, Mandarin and Spanish, was tested on the 11 languages of the OGI_TS corpus. The error rates obtained were 20% and 30% for 45- and 10-second utterances respectively. The effect of reducing the number of front-end phone recognisers was also investigated and can be seen from the graph.

It was also found that, for single-language front-end PRLMs, the language used to train the phone recogniser can affect performance. Each of the above six languages was used in turn to train the system, and tests were performed on the ten OGI_TS languages other than the training language. The error rate for 45-second samples varied from around 25% when English was used to around 35% for Japanese.

When compared to some other speech recognition systems, the phone recognition method used here was found to have a relatively high error rate, so it may seem strange that it can perform language ID effectively. To explain this, it is important to realise that what the language models require is consistency rather than accuracy. Thus, if phone a is always recognised by a two-phone front end as phone b, and vice versa, the accuracy might be zero, but the ability of the model to perform language ID will be unaffected.
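
This point can be demonstrated with a toy experiment, reusing train_bigram_model and identify_language from the sketches above. The training and test data are invented; swapping the labels a and b consistently in both training and test permutes the histograms identically, so the winning language does not change.

    # Toy demonstration: a consistent relabelling of the front end's output
    # leaves the language decision unchanged, even though phone "accuracy"
    # against the true labels would be zero.
    swap = {'a': 'b', 'b': 'a'}
    relabel = lambda seq: [swap.get(p, p) for p in seq]

    train_l1 = [['a', 'b', 'a', 'a']]    # hypothetical tokenised training data
    train_l2 = [['b', 'b', 'a', 'b']]
    test = ['a', 'a', 'b']

    for view in (lambda s: s, relabel):  # original, then consistently swapped
        m1 = train_bigram_model([view(s) for s in train_l1], ['a', 'b'])
        m2 = train_bigram_model([view(s) for s in train_l2], ['a', 'b'])
        print(identify_language(view(test), {'L1': m1, 'L2': m2}))  # same answer twice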

Finally, there is evidence that parallel PRLM systems may have trouble with non-native speakers of a language. Seventeen Spanish speakers, seven of whom were non-native, were asked to test one such system. The four speakers who were classified incorrectly had all learned Spanish as a second language. This is significant because, when speaking a foreign language, many people carry over the prosodic and phonetic features of their first language to a greater extent than its phonotactics, so phonotactic features would be expected to be the most reliable way of correctly identifying such speakers.


In this article, we have broadly outlined several ways of approaching automatic language identification, and then focused on two versions of one particular method. Some results indicating performance, and some problems observed in practice, have also been given. Because labelling speech phonetically or orthographically is expensive, the high performance obtained from the parallel PRLM system, which does not require (but can use) labelled speech for each language to be recognised, is encouraging.

Recently, attempts have been made to enhance the system even further by using gender-dependent front ends in parallel with gender-independent ones. This did result in better performance, with the best system yielding an 11% error rate in 11-language closed-set classification and a 2% error rate in distinguishing between languages in closed two-language sets.

As mentioned earlier, language-ID systems using language-dependent word spotters and continuous speech recognisers are evolving. In the near future, it will be interesting to see how the system we have looked at compares to these new systems, both in terms of performance and computational complexity.


Appendix 1: A brief introduction to some phone recognition concepts

Hidden Markov Models
A Markov model consists of a set of states and a set of possible outputs. The next state depends only on the present state and the transition probabilities, a(i,j), to the next possible states. In an HMM, the states of the model cannot be observed directly. Only the output symbols, which are determined by another probability function, can be observed. The probability that the output of the state q(j) is the pattern represented by the symbol s(k) is given by b(j,k). To complete the model, the initial state distribution is given by the vector V = (v(1), v(2), v(3), ...), where v(n) is the probability that the model is initially in state n. The model is thus defined by M = (V, A, B), where A = [a(i,j)] is the matrix of transition probabilities, and B = [b(j,k)] is the matrix of output probabilities. Given a set of observations, the problem of recognition becomes that of estimating which model is most likely to produce that particular sequence. The following diagram shows one example of a Markov model.
For more details, see A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition - L. R. Rabiner, Proceedings of the IEEE, 77(2):257-286, February 1989.
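As a minimal illustration of "estimating which model is most likely to produce a sequence", the following sketch computes the likelihood of an observation sequence under a discrete HMM M = (V, A, B) with the forward algorithm; V, A and B are assumed to be numpy arrays in the notation above.

    # Sketch: P(observations | M) for a discrete HMM, summed over all state paths.
    import numpy as np

    def forward_likelihood(V, A, B, observations):
        alpha = V * B[:, observations[0]]   # start: initial states x first output
        for o in observations[1:]:
            alpha = (alpha @ A) * B[:, o]   # propagate, then emit next symbol
        return alpha.sum()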
Hidden Markov Model Toolkit (HTK)
A selection of software tools and programs designed to facilitate the construction of systems using continuous-density Gaussian mixture HMMs. Details: The HTK HMM Toolkit: Design and Philosophy - S. J. Young, Cambridge Univ. Eng. Dept. Tech. Rpt. CUED/F-INFENG/TR.152, 1993. Alternatively, [3] contains a description of the system.
Viterbi search
A dynamic-programming technique for finding the most likely path through a network of states, for example the state sequence of an HMM most likely to have produced a given observation sequence. It is related to template-matching techniques that align patterns consisting of different numbers of feature vectors. See [2] for a brief introduction.
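For concreteness, here is a minimal sketch of the Viterbi search in the discrete HMM notation of the previous entry; it recovers the single most likely state sequence, rather than summing over all paths as the forward algorithm does.

    # Sketch: Viterbi decoding for a discrete HMM M = (V, A, B).
    import numpy as np

    def viterbi(V, A, B, observations):
        log = lambda x: np.log(np.maximum(x, 1e-300))  # avoid log(0) warnings
        delta = log(V) + log(B[:, observations[0]])
        backpointers = []
        for o in observations[1:]:
            scores = delta[:, None] + log(A)            # score of each i -> j move
            backpointers.append(scores.argmax(axis=0))  # best predecessor of j
            delta = scores.max(axis=0) + log(B[:, o])
        path = [int(delta.argmax())]                    # best final state...
        for ptr in reversed(backpointers):              # ...then trace back
            path.append(int(ptr[path[-1]]))
        return list(reversed(path)), float(delta.max())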


  1. Comparison of Four Approaches to Automatic Language Identification of Telephone Speech - Marc A. Zissman, IEEE Transactions on Speech and Audio Processing, Vol. 4 No. 1, January 1996, pp. 31-44.
  2. Speech Recognition by Machine - W. A. Ainsworth, Peter Peregrinus Ltd., 1988, pp. 64-89.
  3. Spontaneous Speech Recognition for the Credit Card Corpus Using the HTK Toolkit - Stephen J. Young, Philip C. Woodland, and William J. Byrne, IEEE Transactions on Speech and Audio Processing, Vol. 2 No. 4, October 1994, pp. 615-621.
  4. Language Identification Using Phoneme Recognition and Phonotactic Language Modelling - Marc A. Zissman, publication unknown. Author's e-mail address: