WO2011039764A2 - Word recognition system - Google Patents

Word recognition system

Info

Publication number
WO2011039764A2
WO2011039764A2 (PCT/IN2010/000588)
Authority
WO
WIPO (PCT)
Prior art keywords
user
utterances
parameters
means adapted
parameter
Prior art date
Application number
PCT/IN2010/000588
Other languages
French (fr)
Other versions
WO2011039764A3 (en)
Inventor
Tanushyam Chattopadhyay
Ruchir Gulati
Original Assignee
Tata Consultancy Services Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Ltd filed Critical Tata Consultancy Services Ltd
Publication of WO2011039764A2 publication Critical patent/WO2011039764A2/en
Publication of WO2011039764A3 publication Critical patent/WO2011039764A3/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A system for word recognition, said system comprising: means for accepting a user's utterances corresponding to a voice command selected from a plurality of pre-defined voice commands; a parameter determination module adapted to determine a set of predefined parameters of the user's utterances, said processing module comprising: pattern determination means adapted to determine the consonant-vowel pattern in the user's utterances; length determination means co-operating with said pattern determination means adapted to determine the length of the consonant-vowel pattern in the user's utterances; comparing means co-operating with said processing module adapted to compare the parameters determined by said parameter determination module with pre-stored parameters of the pre-defined voice commands; score providing means adapted to provide parameter specific scores for every parameter to the user's utterances depending upon the degree of closeness of the determined parameters to the parameters of said pre-defined voice commands; and decision means co-operating with said score providing means adapted to calculate the cumulative score of the parameter specific scores for the user's utterances and decide the voice command uttered by the user.

Description

WORD RECOGNITION SYSTEM
FIELD OF THE INVENTION
The present invention relates to the field of telecommunications.
Particularly, the present invention relates to the field of word recognition.
BACKGROUND OF THE INVENTION
Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to machine-readable inputs. The term "voice recognition" is sometimes used to refer to speech recognition. Word recognition refers to the identification of each of the words in a spoken word string. Speech recognition finds its main application in interactive voice response (IVR) systems, audio command-and-control systems and the like. Using audio commands makes the external control of electronic devices very easy and user friendly, since there is no need to press a sequence of keys to carry out a task. It is also a blessing for physically challenged and visually impaired persons.
Early speech recognition systems tried to apply a set of grammatical and syntactical rules to recognize the speech. If the words spoken fitted into a certain set of rules, the system could determine what the words were. However, human language has numerous exceptions to its own rules, even when it is spoken consistently. Accents, dialects and mannerisms can vastly change the way certain words or phrases are spoken.
Most people do not pronounce their words very carefully. For example, considering the sentence "I'm going to see the ocean", the spoken result might come out as "I'm goin' da see tha ocean". Many people speak with no noticeable break, such as "I'm goin'" and "tha ocean". Rules-based speech recognition systems proved to be unsuccessful because they could not handle accent and dialect variations. This also explains why earlier systems could not handle continuous speech. For them to work properly, the speaker had to speak each word separately, with a brief pause in between.
Today's speech recognition systems use powerful and complicated statistical modeling systems. These systems use probability and mathematical functions to determine the most likely outcome. The two models that dominate the field of speech recognition today are the Hidden Markov Model and neural networks. These methods involve complex mathematical functions, but essentially, they take the information known to the system to figure out the information hidden from it.
The Hidden Markov Model is the most common speech recognition method. In this model, each phoneme (a phoneme is the smallest segmental unit of sound used to form meaningful contrasts between utterances) is like a link in a chain, and the completed chain is a word. However, the chain branches off in different directions as the system attempts to match the digital sound with the phoneme that is most likely to come next. During this process, the system assigns a probability score to each phoneme, based on its built-in dictionary and user training.
This process is even more complicated for phrases and sentences. In such scenarios, the system has to figure out where each word stops and starts. The classic example is the phrase "recognize speech," which sounds a lot like "wreck a nice beach" when it is said very quickly. The system has to analyze the phonemes using the phrase that came before it in order to get it right. Another problem of speech recognition systems lies with detecting the start of the whole speech activity. A technique named voice activity detection (VAD) is implemented for this purpose. Voice activity detection (also known as speech activity detection) is a technique used in speech processing, wherein the presence or absence of human speech is detected in regions of audio (which may also contain music, noise, or other sound). Implementing VAD in speech recognition systems makes them more complex and less responsive.
Implementation of such complex speech recognition systems to detect and analyze the audio commands used to control electronic devices will make the purpose of using audio commands ineffective. Also, the whole electronic device control system will become very costly.
Therefore, there is felt a need for a word recognition system for controlling electronic devices which:
• is less complex;
• is highly responsive;
• does not require the incorporation of VAD;
• makes use of cheaper components; and
• reduces the development time and effort.
PRIOR ART:
US20020071577 discloses a remote control with speech recognition, which uses templates to perform speech recognition. These templates are received by the remote from an external database. It further uses a database to allow the user to customize the voice commands.
However, there is a need to eliminate the dependency on external databases. Also, there is a need for a light-weight speech recognition algorithm (with a voice controlled remote control unit for a STB being presented as an example application).
WO 2008/084476 refers to a method which deals with Consonant and Vowel (CV) detection. This states its applications in the area of Continuous Speech Recognition, or Large Vocabulary Continuous Speech Recognition (LVCSR).
However, this application does not concentrate on fixed-vocabulary isolated word recognition.
Indian Patent Application No. 1028/MUM/2008 discloses 'Methods and Apparatus for implementation of a set of Interactive Applications using a flexible Framework i.e. Methods and systems for wireless and wired transmission'.
Indian Patent Application No. 2035/MUM/2008 discloses an 'Inputting Device'.
Indian Patent Application No. 63 MUM/2008 discloses 'Input Mechanisms' which are methods and systems for wireless and wired transmission. Indian Patent Application No 400/MUM/2008 discloses 'Remote Controlling Using Gestures'.
Although these Indian Patent Applications relate to various inputting mechanisms, systems, apparatus, and methods, none of them relates to utilising speech as an input parameter for control.
OBJECTS OF THE INVENTION
It is an object of the present invention to provide an isolated word recognition system for controlling electronic devices which is less complex, and operates on a pre-determined, fixed vocabulary.
It is another object of the present invention to provide a word recognition system for controlling electronic devices which is highly responsive.
It is yet another object of the present invention to provide a word recognition system for controlling electronic devices which does not require the incorporation of VAD.
It is still another object of the present invention to provide a word recognition system for controlling electronic devices which makes use of cheaper components.
One more object of the present invention is to provide a word recognition system for controlling electronic devices which reduces the development time and effort. It is still another object of the present invention to provide a novel user interface to users with certain handicaps who require enhanced accessibility.
SUMMARY OF THE INVENTION:
According to this invention, there is provided a system for word recognition, said system comprising:
- means for accepting a user's utterances corresponding to a voice command selected from a plurality of pre-defined voice commands;
- a parameter determination module adapted to determine a set of predefined parameters of the user's utterances, said processing module comprising:
• pattern determination means adapted to determine the consonant- vowel pattern in the user's utterances;
• length determination means co-operating with said pattern determination means adapted to determine the length of consonant vowel pattern in the user's utterances;
- comparing means co-operating with said processing module adapted to compare the parameters determined by said parameter determination module with pre-stored parameters of the pre-defined voice commands (vocabulary);
- score providing means adapted to provide parameter specific scores for every parameter to the user's utterances depending upon the degree of closeness of the determined parameters to the parameters of said pre-defined voice commands; and
- decision means co-operating with said score providing means adapted to calculate the cumulative score of the parameter specific scores for the user's utterances and decide the voice command uttered by the user.
Typically, said system includes TDA (Time Domain Analysis) means adapted to calculate frame-based signal energy of the captured audio commands in order to extract the basic units of said accepted user's utterances.
Typically, said system includes a first identification means adapted to identify the occurrence of fricatives and plosives in the case of consonants.
Typically, said system includes a second identification means adapted to identify the occurrence of back, mid, and front vowels in the case of vowels.
Typically, said first identification means is a frequency transformation means.
Typically, said second identification means is a frequency transformation means.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The invention will now be described with reference to the accompanying drawings, in which:
Figure 1 illustrates the block diagram of the word recognition system in accordance with the present invention; and
Figure 2 illustrates the process overview of the word recognition system in accordance with the present invention.
DETAILED DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The drawings and the description thereto are merely illustrative of a word recognition system and only exemplify the system of the invention and in no way limit the scope thereof.
In accordance with the present invention, a system is envisaged for implementing isolated word recognition for controlling electronic devices using audio commands. The system requires only low end target processors, thereby bringing down the component price and development effort, thus making it very well suited for embedded systems. In accordance with one embodiment of the present invention, the system is implemented for controlling interactive Set Top Boxes (iSTBs). A Set Top Box or a Set Top Unit (STU) is an electronic device that is connected to a communication channel, such as a phone or a cable television line, and produces output on a conventional television screen.
Referring to the accompanying drawings, Figure 1 illustrates the block diagram of the word recognition system in accordance with the present invention, indicated generally by the reference numeral 1000. The core component of the system is the speech recognizer module 100 embedded inside the iSTB. The accessory components are a microphone 200 and a remote controller 300. The speech recognizer module 100 has a remote control driver 102, an audio driver 104, a speech processing module 106, a decision module 108, a control module 110, a selector module 112 and an application module 114.
The user presses a button on the remote controller 300 to generate a starting signal for initiating the audio command controlling process for the iSTB. The remote control driver 102 of the speech recognizer module 100 receives the starting signal from the remote controller 300 and transfers the starting signal to the decision module 108.
The decision module 108 filters the different input commands received from the different input drivers, including the remote control driver 102. After filtration, the decision module 108 decides to which module the input command has to be passed. For the input command obtained from the remote control driver 102 corresponding to the initiation of audio command controlling (the starting signal), the command is transferred to the selector module 112. The selector module 112 triggers the control module 110 to initiate the speech processing methods implemented in the speech processing module 106. The control module 110 also suppresses the audio generated by the electronic device to make the work of the speech processing module 106 easier. The speech processing module 106 has a digital signal processor (DSP) which processes the audio commands from the user.
The audio driver 104 receives the audio commands from the user and the audio commands are then passed to the speech processing module 106. The DSP of the speech processing module 106 processes the audio signal as described below. The speech processing module 106 has a means for selecting the phonetically mutually exclusive audio commands. Use of such commands reduces the chance of incorrect classification. The term mutually-exclusive audio command refers to unique commands isolated from one another in terms of the following parameters:
• length, or total number of basic units (consonants and vowels);
• consonant-vowel (CV) pattern; and
• further sub-classification in basic units such as;
a) fricatives and plosives in case of consonants; and
b) back, mid, and front positions in case of vowels.
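As an illustration only (this sketch and its CV patterns are assumptions, not data from the patent), a candidate vocabulary can be screened for mutual exclusivity by comparing the commands' CV patterns and lengths:

```python
def clashing_pairs(vocab):
    """Return pairs of commands whose CV patterns are identical or of
    equal length, and which are therefore candidates for confusion."""
    items = list(vocab.items())
    clashes = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (c1, p1), (c2, p2) = items[i], items[j]
            if p1 == p2 or len(p1) == len(p2):
                clashes.append((c1, c2))
    return clashes

# Hypothetical CV patterns for four STB commands (illustrative only).
vocab = {
    "MUTE": "CVC",
    "VOLUME UP": "CVCVCVC",
    "FAST FORWARD": "CVCCCVCVC",
    "PREVIOUS CHANNEL": "CVCVCVCVC",
}
print(clashing_pairs(vocab))  # [('FAST FORWARD', 'PREVIOUS CHANNEL')]
```

With these assumed patterns, only "FAST FORWARD" and "PREVIOUS CHANNEL" share a length (nine basic units), which is exactly the confusion risk the description later notes for length-based partitioning.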
Considering four functions which are to be controlled for an STB, as shown in Table 1 given below:

Table 1 (the table image is not reproduced in this text; it lists the four supported command functions)

The parameters to calculate the distance between the audio commands are calculated as given below:
i. Levenshtein distance between utterance and vocabulary

The minimum Levenshtein distance, MinLev, is calculated as:

MinLev = Min_i ( candidate_levenshtein_i / candidate_length_i ), for i = 1 to N

where N is the number of commands. A fuzzy factor is then assigned to the distance on a scale of 0 to 1.

ii. Longest Common Subsequence in Utterance and Vocabulary

The minimum LCS, MinLCS, is calculated as:

MinLCS = Min_i ( (candidate_lcs_i)^2 / (candidate_length_i x template_length_i) ), for i = 1 to N

where N is the number of commands. A fuzzy factor is assigned to the longest common subsequence on a scale of 0 to 1.

iii. Length based factor

length_factor_i = |candidate_length_i - template_length_i| / max_i ( |candidate_length_i - template_length_i| ), for i = 1 to N

where N is the number of commands. A fuzzy factor is also assigned to the length based factor.
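These three time-domain factors can be sketched as follows. This is a hedged illustration: the normalising divisors are a plausible reading of the formulas above, since the source rendering is partly illegible, and the CV-pattern strings are assumed data.

```python
def levenshtein(a, b):
    """Edit distance via the classic single-row dynamic programme."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[len(b)]

def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def length_factors(candidate, templates):
    """Length-based factor per template, normalised by the largest difference."""
    diffs = [abs(len(candidate) - len(t)) for t in templates]
    top = max(diffs) or 1
    return [d / top for d in diffs]

# Illustrative CV-pattern strings (assumed, not from the patent's tables).
cand = "CVCVC"
templates = ["CVCVC", "CVCVCVCVC"]
print([levenshtein(cand, t) / len(cand) for t in templates])  # [0.0, 0.8]
print([lcs_len(cand, t) for t in templates])                  # [5, 5]
print(length_factors(cand, templates))                        # [0.0, 1.0]
```

Each factor lies in [0, 1], so the three can be fuzzily combined into a single per-command score, as the text describes.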
A TDA (Time Domain Analysis) means is provided in the speech processing module 106 to calculate frame-based signal energy of the captured audio commands. Using this means, it is possible to extract the basic units, namely the consonants and vowels in the utterance. This process is fairly light in terms of CPU (Central Processing Unit) and memory requirement. To make it further simpler, it is possible to detect a change from a consonant to vowel and a vowel to consonant. For example, the utterance "PREVIOUS CHANNEL" is made up of the following consonants and vowels as given in Table 2:
Table 2 (the table image is not reproduced in this text; it lists the consonant and vowel segments of the utterance "PREVIOUS CHANNEL")
These parts of utterances total up to nine consonants and vowels, and the pattern thus formed by "PREVIOUS CHANNEL" is "CVCVCVCVC". Patterns for all the audio commands in the supported vocabulary are found offline and stored as constants. This reduces the search time significantly. In this example, some commands can be easily isolated based on length alone. "PREVIOUS CHANNEL", for instance, is the longest in the set of utterances chosen. Since feature extraction may not work accurately at all times, it is not recommended that the decision be made entirely on length. Instead, if a length greater than about 7 is detected, it is safe to say that the utterance cannot be "MUTE". However, there is still a chance that "FAST FORWARD" is misunderstood as "PREVIOUS CHANNEL" by the system. Thus, the commands can be partitioned to a certain extent based on this feature, but a confirmed decision cannot be made. This feature is useful for improving the search speed, particularly when a larger set of commands is in use. The other time-domain metrics, namely the Levenshtein distance and the LCS, are used to independently compute the correlation between the user's utterance and the parameters of the supported commands. The higher the value of this correlation with a supported command, the higher the probability that the user's utterance is that command.
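The frame-based energy extraction of the CV pattern can be sketched as below. The sample rate, frame size, energy threshold, and synthetic waveform are all illustrative assumptions, not values from the patent.

```python
import math

def frame_energies(samples, frame_len=160):
    """Short-time energy per non-overlapping frame (assumed 20 ms at 8 kHz)."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def cv_pattern(energies, threshold):
    """Label each frame V (high energy) or C (low energy) and keep only
    the consonant-to-vowel and vowel-to-consonant changes."""
    labels = ["V" if e > threshold else "C" for e in energies]
    pattern = labels[0]
    for label in labels[1:]:
        if label != pattern[-1]:
            pattern += label
    return pattern

# Synthetic C-V-C utterance: quiet noise-like tone, loud tone, quiet tone.
sig = ([0.05 * math.sin(0.9 * n) for n in range(800)] +    # "consonant"
       [0.8 * math.sin(0.2 * n) for n in range(1600)] +    # "vowel"
       [0.05 * math.sin(0.9 * n) for n in range(800)])     # "consonant"
e = frame_energies(sig)
print(cv_pattern(e, threshold=0.05))  # CVC
```

Vowels carry far more energy than most consonants, so a single threshold on frame energy recovers the coarse CV pattern cheaply, which is why the text calls this step light on CPU and memory.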
Then, an LPC (Linear Predictive Coding) means in the speech processing module 106 computes the linear predictive coefficients. The information regarding the further classification of consonants and vowels is also extracted by the LPC means. These linear predictive coefficients can be effectively used to improve the accuracy of the system severalfold.
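As a minimal sketch of this step (the patent does not specify the algorithm; the Levinson-Durbin recursion over the autocorrelation sequence shown here is one standard way to obtain LPC coefficients, and the model order and test signal are assumptions):

```python
import math

def autocorr(x, order):
    """Autocorrelation r[0..order] of a signal."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(order + 1)]

def lpc(x, order):
    """Predictor coefficients a_1..a_p of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p
    computed by the Levinson-Durbin recursion."""
    r = autocorr(x, order)
    a, err = [1.0] + [0.0] * order, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a, err = new_a, err * (1.0 - k * k)
    return a[1:]

# For a pure cosine, the ideal order-2 predictor satisfies
# x[n] = 2*cos(w)*x[n-1] - x[n-2], i.e. coefficients near [-2*cos(w), 1].
coeffs = lpc([math.cos(1.0 * n) for n in range(2000)], 2)
print(coeffs)
```

The spectral envelope implied by such coefficients is what allows the further classification of consonants and vowels mentioned above.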
Further, if frequency transformation is applied, it is possible to distinguish and classify the consonants as fricatives and plosives, and the vowels as back, mid, and front vowels. This information increases the resolution of the system by adding to it the capability of isolating one vowel from the others and one consonant from the others. While this processing increases the CPU load slightly, it improves the accuracy of the recognizer, thus making the system very robust. The correlation of the user's utterance with the parameters of the supported commands is translated into confidence scores, which are scanned for their maximum. The command with the maximum score is declared to be the winning or "recognized" candidate. However, there is a further advantage in going for confidence scoring: the maximum score can be compared to pre-determined thresholds to improve the system's accuracy.
If the maximum score exceeds the higher threshold, the winning candidate can be decided and the recognition process terminated. If this is not the case, then the score is checked against a lower threshold value. If the score exceeds the lower threshold value, then the system asks the user to confirm the candidate it has selected. In other words, the user is prompted to say "YES" or "NO", which is again detected by the isolated word recognition method, and the corresponding action is taken accordingly. If the score fails to exceed both the higher and the lower threshold values, then the system informs the user that the recognition process has failed and restarts listening to the user's utterance.
The same confirmation prompt is employed if two or more candidates tie in their scores. In this case, the user is prompted with the tied candidates one at a time until a confirmation is obtained.
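The two-threshold decision logic of the last two paragraphs can be summarised as a small function. The numeric threshold values below are invented for illustration; the specification gives none.

```python
def decide(scores, high=0.85, low=0.60):
    # Map candidate confidence scores to one of three actions,
    # mirroring the two-threshold scheme described above.
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    if best >= high and len(winners) == 1:
        return ("ACCEPT", winners)     # confident: act immediately
    if best >= low:
        # Ask the user to confirm ("YES"/"NO"), one candidate at a
        # time when several are tied.
        return ("CONFIRM", winners)
    return ("REJECT", [])              # recognition failed; listen again
```

A tie at or above the higher threshold deliberately falls through to the CONFIRM branch, matching the tie-breaking behaviour described above.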
After recognizing the audio command, the speech processing module 106 passes the command to the control module 110, from which it is received by the selector module 112. The selector module 112 passes the command to the relevant application in the application module 114 so that the application can operate according to the user's command.
In accordance with another aspect of the present invention, a push-to-talk mechanism is implemented. To activate the audio command controlling system, the user simply pushes a button (activating a switch) on the electronic device. The activation can also be done using a key press event of the remote controller 300. This provides several advantages, given below.
• It eliminates the need for a start of speech detection or Voice Activity Detection (VAD), thereby making the system less complex, and more responsive. The system starts listening for the user's commands only when the user activates the switch.
• This is particularly of use when the appliance being controlled (a television or a music player) is sourcing audio signals. When the user activates the abovementioned switch, the audio signal can be attenuated, thus making the work of the word recognition system easier and reducing the chances of false detection. In the absence of this functionality, the electronic device may need to implement a noise suppressor or an equivalent to solve this problem.
• The word recognition system requires no initial training to operate and is thus user-independent. Its accuracy can, however, be improved further by introducing initial training: when the user trains the system in a low-noise environment, user-specific patterns for the commands can be stored for look-up by the speech processing module 106.

Figure 2 illustrates the process overview of the word recognition system in accordance with the present invention. The major process steps involved are: i) the capturing of the audio commands by the audio driver (104 of Figure 1) and the parallel suppression of the audio signals generated by the electronic device; ii) the preprocessing by the control module (110 of Figure 1) for invoking the speech processing module (106 of Figure 1); and iii) the decision making by the speech processing module (106 of Figure 1), involving time domain analysis, linear predictive coding, frequency transformation analysis, confidence scoring and the decision made on the basis of the confidence scores.
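The time domain analysis step (frame-based signal energy, as in claim 2) can be sketched as follows. The frame and hop sizes, the single energy threshold, and the collapsing of frame labels into a consonant-vowel pattern are all simplifying assumptions made for illustration.

```python
def frame_energy(samples, frame_len=256, hop=128):
    # Short-time energy per frame; frame and hop sizes are arbitrary.
    return [sum(s * s for s in samples[i:i + frame_len])
            for i in range(0, max(len(samples) - frame_len + 1, 1), hop)]

def cv_pattern(energies, threshold):
    # Label high-energy frames vowel-like ('V') and the rest
    # consonant-like ('C'), then collapse runs of equal labels into a
    # pattern string. A real system would also need a silence floor
    # below which frames are discarded entirely.
    labels = ['V' if e >= threshold else 'C' for e in energies]
    out = []
    for lab in labels:
        if not out or out[-1] != lab:
            out.append(lab)
    return ''.join(out)
```

Feeding the per-frame energies of an utterance through `cv_pattern` yields strings such as "CVCVCVCVC", which are then matched against the stored command patterns.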
TECHNICAL ADVANCEMENTS
The technical advancements of the present invention include realization of a word recognition system for controlling electronic devices which:
• is less complex;
• is highly responsive;
• does not require the incorporation of VAD;
• makes use of cheaper components; and
• reduces the development time and effort.
While considerable emphasis has been placed herein on the particular features of this invention, it will be appreciated that various modifications can be made, and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other modifications in the nature of the invention or the preferred embodiments will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.

Claims:
1. A system for word recognition, said system comprising:
• means for accepting a user's utterances corresponding to a voice command selected from a plurality of pre-defined voice commands;
• a parameter determination module adapted to determine a set of predefined parameters of the user's utterances, said parameter determination module comprising:
• pattern determination means adapted to determine the consonant-vowel pattern in the user's utterances;
• length determination means co-operating with said pattern determination means adapted to determine the length of the consonant-vowel pattern in the user's utterances;
• comparing means co-operating with said parameter determination module adapted to compare the parameters determined by said parameter determination module with pre-stored parameters of the pre-defined voice commands;
• score providing means adapted to provide parameter specific scores for every parameter to the user's utterances depending upon the degree of closeness of the determined parameters to the parameters of said pre-defined voice commands; and
• decision means co-operating with said score providing means adapted to calculate the cumulative score of the parameter specific scores for the user's utterances and decide the voice command uttered by the user.
2. A system as claimed in claim 1 wherein, said system includes TDA (Time Domain Analysis) means adapted to calculate frame-based signal energy of the captured audio commands in order to extract the basic units of said accepted user's utterances.
3. A system as claimed in claim 1 wherein, said system includes a first identification means adapted to identify the occurrence of fricatives and plosives in the case of consonants.
4. A system as claimed in claim 1 wherein, said system includes a second identification means adapted to identify the occurrence of back, mid, and front vowels in the case of vowels.
5. A system as claimed in claim 3 wherein, said first identification means is a frequency transformation means.
6. A system as claimed in claim 4 wherein, said second identification means is a frequency transformation means.
PCT/IN2010/000588 2009-09-08 2010-09-03 Word recognition system WO2011039764A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2062/MUM/2009 2009-09-08
IN2062MU2009 2009-09-08

Publications (2)

Publication Number Publication Date
WO2011039764A2 true WO2011039764A2 (en) 2011-04-07
WO2011039764A3 WO2011039764A3 (en) 2011-06-16

Family

ID=43826735



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0103258A1 (en) * 1982-09-06 1984-03-21 Nec Corporation Pattern matching apparatus
JP2001083978A (en) * 1999-07-15 2001-03-30 Matsushita Electric Ind Co Ltd Speech recognition device
US20030167171A1 (en) * 2002-01-08 2003-09-04 Theodore Calderone Method and apparatus for voice control of a television control device
CN101394466A (en) * 2008-10-24 2009-03-25 天津三星电子有限公司 Sound controlled digital multifunctional set-top box


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077822A (en) * 2021-03-24 2021-07-06 北京儒博科技有限公司 Method, device and equipment for evaluating plosive and storage medium
CN113077822B (en) * 2021-03-24 2022-09-27 北京如布科技有限公司 Method, device and equipment for evaluating plosive and storage medium



Legal Events

Date Code Title Description
NENP: Non-entry into the national phase (ref country code: DE)
122: EP: PCT application non-entry in the European phase (ref document number: 10820014; country of ref document: EP; kind code of ref document: A2)