A Method For Extracting Feature Vectors From Speech

George Constantinides
Supervisor: Dr. P. Naylor



Feature Vectors

Many different algorithms exist for speech recognition and language identification. A common requirement among them is some form of parametrised representation of the speech input. These streams of feature vectors may then be used to train or interrogate the language models which follow the feature extraction module in a typical language identification system.

There are, of course, infinitely many ways to encode the speech, depending upon which particular numerical measures are deemed useful. I will examine one feature extraction scheme which has been widely used. A block diagram of the scheme is shown below.

This front-end has been studied by Davis and Mermelstein [1], and was shown to give the best performance of the options examined. Since then, it has been used as a base for the comparison of different approaches to language identification by Zissman [2].

I will examine each stage in this process, explaining why each operation was chosen and what purpose it serves.

Segmentation

Before the speech is fed into the above front-end system, it must first be segmented into the phone-sized units which are then processed. This is an ongoing topic of research. When Davis and Mermelstein studied this system, they proposed a segmentation scheme based on the work of Mermelstein [5]. However, much research has been done in this area since then, and it really deserves an article in its own right. I will not dwell on it here, and will instead assume that pre-segmented speech is available.

Speech Activity Detector

Often either the training or the test speech messages consist of speech segments separated by long periods of silence. It has been found [2] that under these circumstances it is desirable to use only the active segments. This is an obvious consideration for Language Identification, as the silent regions will generally contain no language-specific information. The speech activity detector used by Zissman [2] was developed by Reynolds [3] for a speaker identification system. It works by keeping a running estimate of the signal-to-noise ratio (SNR) of the signal, which is then used to divide the signal into regions of high SNR; only these regions are passed on for further processing.
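As a rough illustration of this idea, the sketch below marks frames whose energy exceeds a running noise-floor estimate by some margin. The frame length, threshold and adaptation constant are illustrative assumptions of mine, not the values used by Reynolds [3].

    import numpy as np

    def detect_active_frames(signal, frame_len=160, snr_threshold_db=10.0, noise_alpha=0.99):
        # Crude SNR-based speech activity detection (illustrative only).
        # The noise floor is tracked as a slowly adapting running estimate
        # of frame energy; frames whose energy exceeds it by more than
        # snr_threshold_db are marked as active speech.
        n_frames = len(signal) // frame_len
        frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
        energy = np.sum(frames.astype(float) ** 2, axis=1) + 1e-10

        noise = energy[0]                  # initialise noise estimate from the first frame
        active = np.zeros(n_frames, dtype=bool)
        for i, e in enumerate(energy):
            snr_db = 10.0 * np.log10(e / noise)
            active[i] = snr_db > snr_threshold_db
            if not active[i]:              # adapt the noise floor only on inactive frames
                noise = noise_alpha * noise + (1.0 - noise_alpha) * e
        return active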

Cepstral Processing

The part of the diagram consisting of the DFT, the logarithm and the inverse cosine transform computes a function of the input data known as the 'cepstrum'. The idea is to separate the excitation signal of a speech wave from the filter part. This makes it easier to estimate the frequencies of the formants and thus the phone being uttered.

The cepstrum, which has the dimensions of time, has a number of short-duration peaks clustered around the origin, corresponding to the formants of the speech. At higher cepstral coefficients there is also a more spread peak, corresponding to the fundamental frequency. For language ID, only the lowest 13 coefficients of the (mel-weighted) cepstrum are calculated, thereby retaining information relating to vocal tract shape while ignoring the excitation signal.
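The chain of operations can be sketched for a single frame as below. Windowing and the mel weighting described in the next section are omitted for clarity, and the function name and parameter choices are my own illustrative assumptions.

    import numpy as np
    from scipy.fft import dct

    def frame_cepstrum(frame, n_coeffs=13):
        # DFT magnitude of one speech frame
        spectrum = np.abs(np.fft.rfft(frame))
        # Logarithm turns the multiplicative excitation/filter combination
        # into an additive one
        log_spectrum = np.log(spectrum + 1e-10)
        # Cosine transform of the log spectrum gives the cepstrum;
        # keeping only the lowest coefficients retains the slowly varying
        # spectral envelope (vocal tract shape) and discards the excitation
        cepstrum = dct(log_spectrum, type=2, norm='ortho')
        return cepstrum[:n_coeffs]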

Mel-Scale Weighting

Many researchers have worked to determine the type of frequency analysis performed by the human ear. Speech signals reach the inner ear as pressure variations in the cochlear fluid, which in turn excite the basilar membrane. The mapping from frequency to position along the membrane is non-linear, so the ear's sensitivity to changes in frequency varies with the absolute frequency. This has led to the definition of scales of frequency based on human perception. Fant [6] played various tones to listeners and, from their judgements of what 'sounded like' a doubling of frequency and equal increments in frequency, determined the mel-scale. A diagram showing the commonly used subjective frequency units (Bark and Mel) plotted against a linear frequency scale is shown below.

Transformation to the mel-scale can be performed by a mathematical function which has been fitted to these experimental data [7]. This is what is meant by 'mel-scale weighting' in the block diagram.
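One widely quoted analytic fit to the mel scale is sketched below. I give it only as an illustration; it is not necessarily the exact function fitted in [7].

    import numpy as np

    def hz_to_mel(f_hz):
        # A common approximation to the mel scale (an assumption on my part,
        # not necessarily the fit used in [7])
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(m):
        # Inverse of the approximation above
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)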

The objective of this transformation is, of course, to retain the same information that the ear would, in the hope that this will improve performance. This has indeed been shown to be the case [1], where mel-scale weighting was compared against a scheme using a linear frequency scale.

RASTA

RASTA (in this context) stands for RelAtive SpecTrAl. It is a technique for minimising the effects of a transmission channel on parameter estimation, developed by Hermansky et al. [8]. Unfortunately much of the method described so far is adversely affected by variation in the characteristics of the channel over which the original signal is received. Since in the most general case (as is certainly the case in the OGI_TS data set) each particular message is received over a different channel, some method must be used to eliminate these differences. The RASTA process band-pass filters the time trajectory of each spectral parameter, removing near-DC (slowly varying) components, which include the effect of a fixed channel, together with some higher-frequency components. This is achieved with minimal computational expense.
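A minimal sketch of this idea is given below, applying a band-pass filter along the time axis of the log-spectral parameters. The coefficients are the commonly quoted RASTA filter, shown here as an illustration rather than as the exact filter of [8].

    import numpy as np
    from scipy.signal import lfilter

    # Commonly quoted RASTA band-pass filter (illustrative; see [8] for details)
    RASTA_NUM = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    RASTA_DEN = np.array([1.0, -0.98])

    def rasta_filter(log_spectra):
        # log_spectra has shape (n_frames, n_channels); each channel's
        # trajectory over time is filtered, suppressing slowly varying
        # (near-DC) components such as a fixed channel response, as well
        # as very rapid variations
        return lfilter(RASTA_NUM, RASTA_DEN, log_spectra, axis=0)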

Output Vectors

Important to Language ID systems are not only the particular mel-cepstral observation vectors described above, but also some information about the transitions between them. In an attempt to model this information simply, 'time-differencing' is applied: two neighbouring (in time) observation vectors are differenced, producing a new vector of data which is also provided to the language modelling system.
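A minimal sketch of this time-differencing step is given below; practical systems often use a regression over several frames rather than the simple first difference shown here.

    import numpy as np

    def append_deltas(cepstra):
        # cepstra has shape (n_frames, n_coeffs); the first difference of
        # neighbouring frames is appended, doubling the vector dimension
        deltas = np.diff(cepstra, axis=0, prepend=cepstra[:1])
        return np.hstack([cepstra, deltas])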

Summary

I have described a system which is capable of an extremely concise (25-D vector) parametric representation of a segment of speech. It is nevertheless flexible enough to successfully encode the information necessary for training and using language models which perform very well [2]. The final test for any such front-end is its effect on the accuracy of the overall Language ID system. In this respect the system compares favourably with any other I have come across in my investigation.

References

[1] S.B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", in IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-28, No. 4, August 1980.

[2] M.A. Zissman, "Comparison of Four Approaches to Automatic Language Identification of Telephone Speech", in IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 1, January 1996.

[3] D.A. Reynolds, R.C. Rose, and M.J.T. Smith, "PC-Based TMS320C30 implementation of the Gaussian mixture model text-independent speaker recognition system", in Proc. ICSPAT '92, Vol. 2, Nov. 1992, pp. 967-973.

[4] W.A. Ainsworth, "Speech Recognition by Machine", Peter Peregrinus Ltd., London, UK, 1988.

[5] P. Mermelstein, "Automatic Segmentation of Speech", J. Acoust. Soc. Amer., Vol. 58, pp. 880-883, Oct 1975.

[6] C.G. Fant, "Speech Sounds and Features", MIT Press, January 1973.

[7] C.J. van der Merwe and J.A. du Preez, "Calculation of LPC-based Cepstrum Coefficients using Mel-Scale Frequency Warping", IEEE 1991.

[8] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn, "RASTA-PLP Speech Analysis Technique", in Proc. ICASSP '92, Vol. 1, March 1992, pp. 121-124.


Email: gac1@doc.ic.ac.uk