Speaker Independent Speech Recognition of Isolated Words in Room Environment

— In this paper, the recognition of a small set of important words from a larger vocabulary is demonstrated based on the combination of dynamic and instantaneous features of the speech spectrum. Many procedures exist to recognize a word by its vowel; this paper presents a highly effective speaker-independent speech recognizer for typical room-environment noise conditions. To distinguish several isolated words containing different vowel sounds, two important features, pitch and formant, are extracted from speech signals collected from a number of random male and female speakers. The extracted features are then analyzed for the particular utterances to train the system. The specific objectives of this work are to implement an isolated-word, automatic speech recognizer capable of recognizing and responding to speech, and an audio interfacing system between human and machine for effective human-machine interaction. The whole system has been tested using computer code, and the result was satisfactory in almost 90% of cases. However, the system may occasionally be confused by similar vowel sounds.


I. INTRODUCTION
Speech is the primary means of expressing emotion and of communication between human beings. In 1773, Christian Kratzenstein, a professor of physiology in Copenhagen working for the Russian Imperial Academy, succeeded in producing vowel sounds using resonance tubes connected to organ pipes. Later, in Vienna, Wolfgang von Kempelen constructed an Acoustic-Mechanical Speech Machine in 1791 [1]. In 1881, Alexander Graham Bell, with his cousin Chichester Bell and Charles Sumner Tainter, invented a recording device that worked in much the same way as a microphone. Based on this invention, Tainter and Bell formed the Volta Graphophone Co. [2]. In 1952, Davis, Biddulph, and Balashek of Bell Laboratories built a system for isolated digit recognition for a single speaker, using the formant frequencies measured during the vowel regions of each digit. In the 1980s, speech recognition research was characterized by a shift in methodology from the more intuitive template-based approach towards a more rigorous statistical modelling framework [3].
Although the basic idea of the hidden Markov model (HMM) was understood in only a few laboratories, most speech recognition research up to 1980 considered the major research problem to be one of converting a speech waveform (as an acoustic realization of a linguistic event) into words (as a best-decoded sequence of linguistic units). The keyword spotting method found application in AT&T's Voice Recognition Call Processing (VRCP) system, and such deployments focused the attention of the research community on the area of dialog management.
Humans normally express their feelings, ideas, and thoughts orally to one another using a series of complex vocal movements. Speech is an information-rich, frequency-modulated signal exploiting time- and amplitude-modulated carriers (e.g., harmonics, noise, power, pitch, duration, resonance movements) to convey emotions and information about words, expression, accent, speaking style, speaker identity, the speaker's health condition, and so on [4].
Pitch is one of the important acoustic features in the speech recognition process. It is also considered the fundamental frequency of a complex speech signal. The pitch signal is produced by the vibration of the vocal folds, and it normally depends on the tension of the vocal folds as well as the subglottal air pressure during speech production. Pitch in the human voice also depends on the thickness and length of the vocal cords, and on the relaxation and tightening of the muscles surrounding them [3]. A speaker's gender can be identified easily based on pitch, sometimes on formants, or on a combination of both features. Pitch is the perceptual property of speech that allows sounds to be ordered on a frequency-based scale [4]. Fig. 1 (a) and (b) show the internal structure of the human vocal cords. Since women possess shorter vocal cords than men, they generally have a higher pitch. Thus, pitch plays a significant role in gender identification [4,5].
In addition, a formant is a harmonic of a note that is augmented by resonance. In acoustics, a very similar definition is widely used: the Acoustical Society of America defines a formant as "a range of frequencies [of a complex sound] in which there is an absolute or relative maximum in the sound spectrum" [6]. Formants are often measured as amplitude peaks in the frequency spectrum of the sound using a spectrogram or a spectrum analyzer and, in the case of the voice, this gives an estimate of the vocal tract resonances. In vowels spoken with a high fundamental frequency, as in a female or child voice, however, the frequency of the resonance may lie between the widely spaced harmonics, and hence no corresponding peak is visible [6,7].
Many studies have been carried out to investigate acoustic indicators for detecting features in speech. The characteristics commonly considered include fundamental frequency, spectral variation, duration, and wavelet- and intensity-based features [8,9]. In this paper, linear feature extraction techniques and their extraction algorithms are explained. These features are used to identify the proper language state. The production of a speech signal is modeled as the convolution of the excitation with the vocal tract response [10,11].

A. Feature Extraction
The fundamental frequency (F0) is the main cue for pitch. However, it is difficult to build a reliable statistical model involving F0 because of pitch estimation errors and the discontinuity of the F0 space. Thus, a reliable pitch detection algorithm (PDA) is a very important component of many speech processing systems.
By analyzing the power spectral density (PSD) spectra of the sound, the formant frequencies of a particular uttered sound can be extracted. The formants are the frequencies corresponding to the peaks in the PSD spectra. To obtain the PSD of an utterance, the Yule-Walker AR method is used in this work [12].

B. Autocorrelation Method and AMDF
Generally, pitch detection algorithms use short-term analysis techniques. For every frame x_m, we obtain a score f(T | x_m) that is a function of the candidate pitch periods T. The algorithm determines the optimal pitch period by maximizing this score:

T_m = argmax_T f(T | x_m)   (1)

A commonly used method to estimate pitch is based on detecting the highest value of the autocorrelation function in the region of interest. Given a discrete-time signal x(n), defined for all n, the autocorrelation function is generally defined as

R(τ) = Σ_n x(n) x(n + τ)   (2)
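The maximization in Eq. (1) with the autocorrelation score of Eq. (2) can be sketched as follows. This is a minimal Python illustration, not the paper's MATLAB implementation; the frame length, search range (60-400 Hz), and test signal are assumptions.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch of one frame by locating the highest
    autocorrelation peak inside the plausible pitch-period range."""
    frame = frame - np.mean(frame)
    # R(tau) = sum_n x(n) x(n + tau), for tau >= 0.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    tau_min = int(fs / fmax)                    # shortest candidate period
    tau_max = min(int(fs / fmin), len(r) - 1)   # longest candidate period
    tau = tau_min + np.argmax(r[tau_min:tau_max + 1])
    return fs / tau                             # pitch estimate in Hz

# Example: a synthetic voiced frame with a 150 Hz fundamental at 8 kHz.
fs = 8000
t = np.arange(0, 0.03, 1 / fs)
frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
pitch = autocorr_pitch(frame, fs)   # close to 150 Hz
```

The search is restricted to lags corresponding to plausible pitch periods so that formant-related peaks at very short lags are not mistaken for the pitch.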

C. Modified Autocorrelation Method
According to the discussion above, a modified autocorrelation pitch detector based on center clipping and infinite clipping is used in our implementation. Fig. 4 shows a block diagram of the pitch detection algorithm. The method requires that the speech be low-pass filtered to 900 Hz. The low-pass filtered speech signal is digitized at a 10-kHz sampling rate and sectioned into overlapping 30-ms (300-sample) sections for processing.
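The center-clipping step mentioned above can be sketched in a few lines of Python. The clipping factor of 0.3 is an assumed value for illustration; the paper does not state the threshold it used.

```python
import numpy as np

def center_clip(frame, alpha=0.3):
    """Center-clip a frame: samples whose magnitude is below
    alpha * max|x| are zeroed, the rest are shifted toward zero.
    This suppresses formant structure before autocorrelation
    pitch detection."""
    c = alpha * np.max(np.abs(frame))
    out = np.zeros_like(frame)
    out[frame > c] = frame[frame > c] - c
    out[frame < -c] = frame[frame < -c] + c
    return out
```

Applying `center_clip` before the autocorrelation makes the pitch peak stand out more clearly relative to peaks caused by vocal tract resonances.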
Fig. 2 shows the block diagram of the PDA [14].

D. Yule-Walker AR Method
Assuming a given zero-mean discrete time series x(n) is an AR process of appropriate order p, the AR(p) model is

x(n) = Σ_{k=1}^{p} a_k x(n − k) + e(n)   (3)

where e(n) is a white-noise driving term and {a_k} are the corresponding AR coefficients, obtained by solving the Yule-Walker equations formed from the sample autocorrelation.
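A minimal sketch of Yule-Walker AR spectral estimation follows, using plain NumPy rather than the paper's MATLAB toolchain; the model order of 12, the FFT grid size, and the test signal are assumptions for illustration.

```python
import numpy as np

def yule_walker_psd(x, order=12, nfft=512):
    """Fit AR(p) coefficients by solving the Yule-Walker equations from
    the sample autocorrelation, then evaluate the AR model PSD.
    Spectral peaks approximate the formant frequencies."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)
    # Toeplitz system R a = r[1..p] from r(k) = sum_j a_j r(k - j).
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])        # AR coefficients a_k
    sigma2 = r[0] - np.dot(a, r[1:order + 1])     # driving-noise variance
    w = np.linspace(0, np.pi, nfft)
    k = np.arange(1, order + 1)
    denom = np.abs(1 - np.sum(a[:, None] * np.exp(-1j * np.outer(k, w)),
                              axis=0)) ** 2
    return w / np.pi, sigma2 / denom              # normalized freq, PSD

# Example: locate the spectral peak of a noisy 0.2 cycles/sample sinusoid.
rng = np.random.default_rng(0)
n = np.arange(400)
x = np.sin(2 * np.pi * 0.2 * n) + 0.1 * rng.standard_normal(400)
freqs, psd = yule_walker_psd(x)
peak = freqs[np.argmax(psd)]   # near 0.4 in normalized frequency (x pi rad/sample)
```

The PSD expression sigma2 / |1 − Σ a_k e^{−jωk}|² follows directly from the AR(p) model in Eq. (3).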

E. Speech Recognition Techniques
1) Data Gathering
Various speech samples were taken from 10 female and 10 male subjects. Recordings were made in typical Bangladeshi indoor environments; the audio templates for training were recorded in typical living rooms and classrooms.

2) Data Pre-processing
After collection, data pre-processing is the initial step of the recognition process. Here, speech commands are taken as inputs through a microphone, which converts the speech into an analog electrical signal. The speech commands are recorded in MATLAB at a sampling frequency of 8000 Hz. After sampling, the resulting discrete speech signals pass through filters and windows for noise reduction before the subsequent steps.
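The windowing part of this pre-processing stage can be sketched as below. This is a generic Python illustration, not the paper's MATLAB code; the pre-emphasis coefficient (0.97), 30-ms frame length, and 10-ms hop are common assumed defaults.

```python
import numpy as np

def preprocess(signal, fs=8000, frame_ms=30, hop_ms=10):
    """Sketch of a pre-processing stage: pre-emphasis to flatten the
    spectrum, then Hamming-windowed overlapping frames."""
    # Pre-emphasis high-pass filter: y(n) = x(n) - 0.97 x(n-1).
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([emphasized[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

# Example: one second of audio at 8 kHz yields 98 frames of 240 samples.
frames = preprocess(np.random.default_rng(1).standard_normal(8000))
```

Each row of the returned array is one windowed frame, ready for the pitch and formant extraction steps that follow.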

3) Pitch and Formant Extraction
After pre-processing the data, the feature extraction step begins. The pitch of male and female speakers lies in two different ranges, and formants also differ between them; however, the formant difference alone is not sharp enough to distinguish the same utterance spoken by a male and a female speaker. Therefore, the pitch is extracted first to detect whether the utterance comes from a male or a female speaker, and the formants are then extracted and interpreted according to that pitch class.

4) Preparing Sample Templates
In this paper, data for four specific speech commands, 'Go', 'Right', 'Left' and 'Halt', have been collected from 10 male and 10 female speakers. Each command was uttered twice. The pre-processing and feature extraction steps are then applied to the collected data. The obtained feature values are analyzed, and templates are prepared by determining two things: the pitch ranges indicating male or female utterances, and two sets of formant ranges for each specific speech command, one set for male utterances and one for female utterances.
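The two-stage template lookup described above can be sketched as a pair of range checks. The pitch ranges below follow Tables 1 and 2; the per-command 2nd-formant ranges are hypothetical placeholders, since the paper's exact template values come from its training data.

```python
# Pitch ranges from Tables 1-2; formant ranges are illustrative only.
MALE_PITCH = (100, 170)    # Hz
FEMALE_PITCH = (180, 290)  # Hz
F2_RANGES = {              # hypothetical (low, high) 2nd-formant ranges, Hz
    "male":   {"Go": (800, 1100), "Right": (1200, 1500),
               "Left": (1600, 1900), "Halt": (1000, 1300)},
    "female": {"Go": (900, 1200), "Right": (1400, 1700),
               "Left": (1800, 2100), "Halt": (1100, 1400)},
}

def recognize(pitch, f2):
    """Classify gender from the pitch range, then match the 2nd formant
    against that gender's per-command template ranges."""
    if MALE_PITCH[0] <= pitch <= MALE_PITCH[1]:
        gender = "male"
    elif FEMALE_PITCH[0] <= pitch <= FEMALE_PITCH[1]:
        gender = "female"
    else:
        return None            # pitch outside both trained ranges
    for command, (lo, hi) in F2_RANGES[gender].items():
        if lo <= f2 <= hi:
            return gender, command
    return gender, None        # gender known, command unmatched
```

For example, a 130 Hz pitch with a 900 Hz second formant would be matched as a male utterance of 'Go' under these placeholder ranges.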

5) Analyzing
The whole speech recognition system must be trained with enough feature data to make it capable of reliable recognition. The system has been trained with the pitch and formant ranges obtained in the previous steps. After training, the system is ready to recognize the specific input speech commands spoken through the microphone by random speakers, both male and female. Since the testing results determine the system's performance rate, the trained system has been tested many times by many speakers in order to find out its accuracy.

III. RESULT AND DISCUSSION
By gathering and analyzing the pitch and formant values of different speakers, a clear distinction between male and female voices is observed.

A. Pitch Calculation for Male and Female Utterances
Pitch readings of 10 male and 10 female voices for the utterances Go, Right, Left and Halt were collected. Table 1 summarizes the pitch readings for the 10 male voices, and Table 2 presents the pitch values for the 10 female voices.
It is clearly seen from Table 1 and Table 2 that most of the pitch values for male utterances lie within the range of 100-170 Hz, while the pitch for women varies from 180 Hz to 290 Hz. It can therefore be concluded that the male and female pitch ranges lie far from one another.

B. Formant Calculation for Male and Female Utterances
Formant analysis assists in distinguishing between isolated words containing different vowels. Among the first three formants, the 2nd formant is the most effective in the recognition process. Table 3 presents the normalized-frequency values of the 2nd formant for 10 male speakers, and Table 4 presents the corresponding values for 10 female speakers. After analyzing the 2nd formant values of male and female speakers for the utterances 'Go', 'Right', 'Left' and 'Halt', it is seen that for a specific person, such as Male 2, the 2nd formant values differ between words. Thus, the system can easily recognize words with dissimilar vowels. Tables 3 and 4 also indicate that the 2nd formant of a specific utterance such as 'Left' remains almost identical across speakers, varying only slightly due to differences in accent and recording environment. Hence, we determine the range of variation for each specific utterance regardless of speaker.

C. Simulation Results
In the simulation experiment, the power spectral density (PSD) spectra were obtained for different utterances spoken by random speakers using the Yule-Walker AR method. The peak locations in those spectra correspond to the desired formant frequencies.
Figs. 4 (a), (b) and 5 (a), (b) demonstrate the PSD spectra of two male speakers, Male 1 and Male 2, for the words "Go" and "Halt". The locations of the 2nd formants are indicated on the PSD curves. From the graphs, it is clear that the 2nd formant locations differ between utterances.

D. Simulation Environment
• Subjects' age: 18-50 years
• Speech resolution: 16 bits per sample
• Sampling frequency: 8000 Hz
• Simulation platform: MATLAB

E. Some Important Observations
The recognition accuracy of this automatic speech recognition system was evaluated in the following cases:
• Recognition of a single speech command uttered by different speakers.
• Recognition of a single speech command uttered by one speaker at different times.
• Recognition of speech by one or different speakers in different environments.
For performance evaluation, the recognizer was tested many times with the same speech commands uttered by different speakers, both male and female, and sometimes by a single speaker at different times. From the various tests, it was found that the recognizer correctly recognized almost 18 out of 20 inputs. It can therefore be said that the system achieves about 90% accuracy in recognizing the specific speech commands.

IV. CONCLUSION
The values of pitch and formants of individual voice samples can determine the gender of the speaker, and the values of the 2nd formant can distinguish different words. Variation in accent and in the environment of the lab or workstation also affects the observations of this speech recognition experiment. People from different areas have different accents, which makes a considerable difference in the second-formant values of their speech. Moreover, the 2nd formant readings for male and female utterances are sometimes slightly different, which explains the slight overlap between the formant ranges of different utterances.
One limitation of this work is that the speech recognition system cannot differentiate between words with the same vowel sounds, which is why words with different vowel sounds were chosen for better results. If two words share the same vowel sound, such as "Right" and "Light", the system will get confused, and the result may not be satisfactory. In the future, it is hoped that this speech recognition system can handle a larger vocabulary in a more advanced way.