WO2011039764A2 - Word recognition system - Google Patents

Word recognition system

Info

Publication number
WO2011039764A2
WO2011039764A2 (PCT/IN2010/000588)
Authority
WO
WIPO (PCT)
Prior art keywords
user
utterances
parameters
means adapted
parameter
Prior art date
Application number
PCT/IN2010/000588
Other languages
French (fr)
Other versions
WO2011039764A3 (en)
Inventor
Tanushyam Chattopadhyay
Ruchir Gulati
Original Assignee
Tata Consultancy Services Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Ltd filed Critical Tata Consultancy Services Ltd
Publication of WO2011039764A2 publication Critical patent/WO2011039764A2/en
Publication of WO2011039764A3 publication Critical patent/WO2011039764A3/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A system for word recognition, said system comprising: means for accepting a user's utterances corresponding to a voice command selected from a plurality of pre-defined voice commands; a parameter determination module adapted to determine a set of predefined parameters of the user's utterances, said processing module comprising: pattern determination means adapted to determine the consonant-vowel pattern in the user's utterances; length determination means co-operating with said pattern determination means adapted to determine the length of the consonant-vowel pattern in the user's utterances; comparing means co-operating with said processing module adapted to compare the parameters determined by said parameter determination module with pre-stored parameters of the pre-defined voice commands; score providing means adapted to provide parameter specific scores for every parameter to the user's utterances depending upon the degree of closeness of the determined parameters to the parameters of said pre-defined voice commands; and decision means co-operating with said score providing means adapted to calculate the cumulative score of the parameter specific scores for the user's utterances and decide the voice command uttered by the user.

Description

WORD RECOGNITION SYSTEM
FIELD OF THE INVENTION
The present invention relates to the field of telecommunications.
Particularly, the present invention relates to the field of word recognition.
BACKGROUND OF THE INVENTION
Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to machine-readable inputs. The term "voice recognition" is sometimes used to refer to speech recognition. Word recognition refers to the identification of each of the words in a spoken word string. Speech recognition finds its main application in interactive voice response (IVR) systems, audio command-and-control systems and the like. Using audio commands makes the external control of electronic devices very easy and user friendly, since there is no need to press a sequence of keys to carry out a task. It is also a blessing for physically challenged and visually impaired persons.
Early speech recognition systems tried to apply a set of grammatical and syntactical rules to recognize the speech. If the words spoken fitted into a certain set of rules, the system could determine what the words were. However, human language has numerous exceptions to its own rules, even when it is spoken consistently. Accents, dialects and mannerisms can vastly change the way certain words or phrases are spoken.
Most people do not pronounce their words very carefully. For example, considering the sentence "I'm going to see the ocean", the spoken result might come out as "I'm goin' da see tha ocean". Many people speak with no noticeable break, such as "I'm goin'" and "tha ocean". Rules-based speech recognition systems proved to be unsuccessful because they could not handle accent and dialect variations. This also explains why earlier systems could not handle continuous speech. For them to work properly, the speaker had to speak each word separately, with a brief pause in between.
Today's speech recognition systems use powerful and complicated statistical modeling systems. These systems use probability and mathematical functions to determine the most likely outcome. The two models that dominate the field of speech recognition today are the Hidden Markov Model and neural networks. These methods involve complex mathematical functions, but essentially, they take the information known to the system to figure out the information hidden from it.
The Hidden Markov Model is the most common speech recognition method. In this model, each phoneme (a phoneme is the smallest segmental unit of sound used to form meaningful contrasts between utterances) is like a link in a chain, and the completed chain is a word. However, the chain branches off in different directions as the system attempts to match the digital sound with the phoneme that is most likely to come next. During this process, the system assigns a probability score to each phoneme, based on its built-in dictionary and user training.
This process is even more complicated for phrases and sentences. In such scenarios, the system has to figure out where each word stops and starts. The classic example is the phrase "recognize speech," which sounds a lot like "wreck a nice beach" when it is said very quickly. The system has to analyze the phonemes using the phrase that came before it in order to get it right. Another problem of speech recognition systems lies with detecting the start of the whole speech activity. A technique named voice activity detection (VAD) is implemented for this purpose. Voice activity detection (also known as speech activity detection) is a technique used in speech processing, wherein the presence or absence of human speech is detected in regions of audio (which may also contain music, noise, or other sound). Implementing VAD in speech recognition systems makes them more complex and less responsive.
Implementation of such complex speech recognition systems to detect and analyze the audio commands used to control electronic devices will make the purpose of using audio commands ineffective. Also, the whole electronic device control system will become very costly.
Therefore, there is felt a need for a word recognition system for controlling electronic devices which:
• is less complex;
• is highly responsive;
• does not require the incorporation of VAD;
• makes use of cheaper components; and
• reduces the development time and effort.
PRIOR ART:
US20020071577 discloses a remote control with speech recognition, which uses templates to perform speech recognition. These templates are received by the remote from an external database. It further uses a database to allow the user to customize the voice commands.
However, there is a need to eliminate the dependency on external databases. Also, there is a need for a light-weight speech recognition algorithm (with a voice controlled remote control unit for a STB being presented as an example application).
WO 2008/084476 refers to a method which deals with Consonant and Vowel (CV) detection. This states its applications in the area of Continuous Speech Recognition, or Large Vocabulary Continuous Speech Recognition (LVCSR).
However, this application does not concentrate on fixed-vocabulary isolated word recognition.
Indian Patent Application No. 1028/MUM/2008 discloses 'Methods and Apparatus for implementation of a set of Interactive Applications using a flexible Framework i.e. Methods and systems for wireless and wired transmission'.
Indian Patent Application No. 2035/MUM/2008 discloses an 'Inputting Device'.
Indian Patent Application No. 63 MUM/2008 discloses 'Input Mechanisms' which are methods and systems for wireless and wired transmission. Indian Patent Application No 400/MUM/2008 discloses 'Remote Controlling Using Gestures'.
Although these Indian Patent Applications relate to various inputting mechanisms, systems, apparatus, and methods, none of them relates to utilising speech as an input parameter for control.
OBJECTS OF THE INVENTION
It is an object of the present invention to provide an isolated word recognition system for controlling electronic devices which is less complex, and operates on a pre-determined, fixed vocabulary.
It is another object of the present invention to provide a word recognition system for controlling electronic devices which is highly responsive.
It is yet another object of the present invention to provide a word recognition system for controlling electronic devices which does not require the incorporation of VAD.
It is still another object of the present invention to provide a word recognition system for controlling electronic devices which makes use of cheaper components.
One more object of the present invention is to provide a word recognition system for controlling electronic devices which reduces the development time and effort. It is still another object of the present invention to provide a novel user interface to users with certain handicaps who require enhanced accessibility.
SUMMARY OF THE INVENTION:
According to this invention, there is provided a system for word recognition, said system comprising:
- means for accepting a user's utterances corresponding to a voice command selected from a plurality of pre-defined voice commands;
- a parameter determination module adapted to determine a set of predefined parameters of the user's utterances, said processing module comprising:
• pattern determination means adapted to determine the consonant- vowel pattern in the user's utterances;
• length determination means co-operating with said pattern determination means adapted to determine the length of consonant vowel pattern in the user's utterances;
- comparing means co-operating with said processing module adapted to compare the parameters determined by said parameter determination module with pre-stored parameters of the pre-defined voice commands (vocabulary);
- score providing means adapted to provide parameter specific scores for every parameter to the user's utterances depending upon the degree of closeness of the determined parameters to the parameters of said pre-defined voice commands; and
- decision means co-operating with said score providing means adapted to calculate the cumulative score of the parameter specific scores for the user's utterances and decide the voice command uttered by the user.
Typically, said system includes TDA (Time Domain Analysis) means adapted to calculate frame-based signal energy of the captured audio commands in order to extract the basic units of said accepted user's utterances.
Typically, said system includes a first identification means adapted to identify the occurrence of fricatives and plosives in the case of consonants.
Typically, said system includes a second identification means adapted to identify the occurrence of back, mid, and front vowels in the case of vowels.
Typically, said first identification means is a frequency transformation means.
Typically, said second identification means is a frequency transformation means.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The invention will now be described with reference to the accompanying drawings, in which:
Figure 1 illustrates the block diagram of the word recognition system in accordance with the present invention; and
Figure 2 illustrates the process overview of the word recognition system in accordance with the present invention.
DETAILED DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The drawings and the description thereto are merely illustrative of a word recognition system and only exemplify the system of the invention and in no way limit the scope thereof.
In accordance with the present invention, a system is envisaged for implementing isolated word recognition for controlling electronic devices using audio commands. The system requires only low end target processors, thereby bringing down the component price and development effort, thus making it very well suited for embedded systems. In accordance with one embodiment of the present invention, the system is implemented for controlling interactive Set Top Boxes (iSTBs). A Set Top Box or a Set Top Unit (STU) is an electronic device that is connected to a communication channel, such as a phone or a cable television line, and produces output on a conventional television screen.
Referring to the accompanying drawings, Figure 1 illustrates the block diagram of the word recognition system in accordance with the present invention, indicated generally by the reference numeral 1000. The core component of the system is the speech recognizer module 100 embedded inside the iSTB. The accessory components are a microphone 200 and a remote controller 300. The speech recognizer module 100 has a remote control driver 102, an audio driver 104, a speech processing module 106, a decision module 108, a control module 110, a selector module 112 and an application module 114.
The user presses a button on the remote controller 300 to generate a starting signal for initiating the audio command controlling process for the iSTB. The remote control driver 102 of the speech recognizer module 100 receives the starting signal from the remote controller 300 and transfers the starting signal to the decision module 108.
The decision module 108 filters the different input commands received from the different input drivers, including the remote control driver 102. After filtration, the decision module 108 decides to which module the input command has to be passed. For the input command obtained from the remote control driver 102 corresponding to the initiation of audio command controlling (the starting signal), the command is transferred to the selector module 112. The selector module 112 triggers the control module 110 to initiate the speech processing methods implemented in the speech processing module 106. The control module 110 also suppresses the audio generated by the electronic device to make the work of the speech processing module 106 easier. The speech processing module 106 has a digital signal processor (DSP) which processes the audio commands from the user.
The audio driver 104 receives the audio commands from the user and the audio commands are then passed to the speech processing module 106. The DSP of the speech processing module 106 processes the audio signal as described below. The speech processing module 106 has a means for selecting the phonetically mutually exclusive audio commands. Use of such commands reduces the chance of incorrect classification. The term mutually-exclusive audio command refers to unique commands isolated from one another in terms of the following parameters:
• length, or total number of basic units (consonants and vowels);
• consonant-vowel (CV) pattern; and
• further sub-classification in basic units such as;
a) fricatives and plosives in case of consonants; and
b) back, mid, and front positions in case of vowels.
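As an illustration only (this sketch and its CV patterns are assumptions, not data from the patent), a candidate vocabulary can be screened for mutual exclusivity by comparing the commands' CV patterns and lengths:

```python
def clashing_pairs(vocab):
    """Return pairs of commands whose CV patterns are identical or of
    equal length, and which are therefore candidates for confusion."""
    items = list(vocab.items())
    clashes = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (c1, p1), (c2, p2) = items[i], items[j]
            if p1 == p2 or len(p1) == len(p2):
                clashes.append((c1, c2))
    return clashes

# Hypothetical CV patterns for four STB commands (illustrative only).
vocab = {
    "MUTE": "CVC",
    "VOLUME UP": "CVCVCVC",
    "FAST FORWARD": "CVCCCVCVC",
    "PREVIOUS CHANNEL": "CVCVCVCVC",
}
print(clashing_pairs(vocab))  # [('FAST FORWARD', 'PREVIOUS CHANNEL')]
```

With these assumed patterns, only "FAST FORWARD" and "PREVIOUS CHANNEL" share a length (nine basic units), which is exactly the confusion risk the description later notes for length-based partitioning.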
Considering four functions which are to be controlled for an STB, as shown in Table 1 given below:

Table 1 (the table image is not reproduced in this text; it lists the four supported command functions)

The parameters to calculate the distance between the audio commands are calculated as given below:
i. Levenshtein distance between utterance and vocabulary

The minimum Levenshtein distance, MinLev, is calculated as:

MinLev = Min_i ( candidate_levenshtein_i / candidate_length_i ), for i = 1 to N

where N is the number of commands. A fuzzy factor is then assigned to the distance on a scale of 0 to 1.

ii. Longest Common Subsequence in Utterance and Vocabulary

The minimum LCS, MinLCS, is calculated as:

MinLCS = Min_i ( (candidate_lcs_i)^2 / (candidate_length_i x template_length_i) ), for i = 1 to N

where N is the number of commands. A fuzzy factor is assigned to the longest common subsequence on a scale of 0 to 1.

iii. Length based factor

length_factor_i = |candidate_length_i - template_length_i| / max_i ( |candidate_length_i - template_length_i| ), for i = 1 to N

where N is the number of commands. A fuzzy factor is also assigned to the length based factor.
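These three time-domain factors can be sketched as follows. This is a hedged illustration: the normalising divisors are a plausible reading of the formulas above, since the source rendering is partly illegible, and the CV-pattern strings are assumed data.

```python
def levenshtein(a, b):
    """Edit distance via the classic single-row dynamic programme."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[len(b)]

def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def length_factors(candidate, templates):
    """Length-based factor per template, normalised by the largest difference."""
    diffs = [abs(len(candidate) - len(t)) for t in templates]
    top = max(diffs) or 1
    return [d / top for d in diffs]

# Illustrative CV-pattern strings (assumed, not from the patent's tables).
cand = "CVCVC"
templates = ["CVCVC", "CVCVCVCVC"]
print([levenshtein(cand, t) / len(cand) for t in templates])  # [0.0, 0.8]
print([lcs_len(cand, t) for t in templates])                  # [5, 5]
print(length_factors(cand, templates))                        # [0.0, 1.0]
```

Each factor lies in [0, 1], so the three can be fuzzily combined into a single per-command score, as the text describes.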
A TDA (Time Domain Analysis) means is provided in the speech processing module 106 to calculate frame-based signal energy of the captured audio commands. Using this means, it is possible to extract the basic units, namely the consonants and vowels in the utterance. This process is fairly light in terms of CPU (Central Processing Unit) and memory requirement. To make it further simpler, it is possible to detect a change from a consonant to vowel and a vowel to consonant. For example, the utterance "PREVIOUS CHANNEL" is made up of the following consonants and vowels as given in Table 2:
Table 2 (the table image is not reproduced in this text; it lists the consonant and vowel segments of the utterance "PREVIOUS CHANNEL")
These parts of utterances total up to nine consonants and vowels, and the pattern thus formed by "PREVIOUS CHANNEL" is "CVCVCVCVC". Patterns for all the audio commands in the supported vocabulary are found offline and stored as constants. This reduces the search time significantly. In this example, some commands can be easily isolated based on length alone. "PREVIOUS CHANNEL", for instance, is the longest in the set of utterances chosen. Since feature extraction may not work accurately at all times, it is not recommended that the decision be made entirely on length. Instead, if a length greater than about 7 is detected, it is safe to say that the utterance cannot be "MUTE". However, there is still a chance that "FAST FORWARD" is misunderstood as "PREVIOUS CHANNEL" by the system. Thus, the commands can be partitioned to a certain extent based on this feature, but a confirmed decision cannot be made. This feature is useful for improving the search speed, particularly when a larger set of commands is in use. The other time-domain metrics, namely the Levenshtein distance and the LCS, are used to independently compute the correlation between the user's utterance and the parameters of the supported commands. The higher the value of this correlation with a supported command, the higher the probability that the user's utterance is that command.
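The frame-based energy extraction of the CV pattern can be sketched as below. The sample rate, frame size, energy threshold, and synthetic waveform are all illustrative assumptions, not values from the patent.

```python
import math

def frame_energies(samples, frame_len=160):
    """Short-time energy per non-overlapping frame (assumed 20 ms at 8 kHz)."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def cv_pattern(energies, threshold):
    """Label each frame V (high energy) or C (low energy) and keep only
    the consonant-to-vowel and vowel-to-consonant changes."""
    labels = ["V" if e > threshold else "C" for e in energies]
    pattern = labels[0]
    for label in labels[1:]:
        if label != pattern[-1]:
            pattern += label
    return pattern

# Synthetic C-V-C utterance: quiet noise-like tone, loud tone, quiet tone.
sig = ([0.05 * math.sin(0.9 * n) for n in range(800)] +    # "consonant"
       [0.8 * math.sin(0.2 * n) for n in range(1600)] +    # "vowel"
       [0.05 * math.sin(0.9 * n) for n in range(800)])     # "consonant"
e = frame_energies(sig)
print(cv_pattern(e, threshold=0.05))  # CVC
```

Vowels carry far more energy than most consonants, so a single threshold on frame energy recovers the coarse CV pattern cheaply, which is why the text calls this step light on CPU and memory.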
Then, an LPC (Linear Predictive Coding) means in the speech processing module 106 computes the linear predictive coefficients. The information regarding the further classification of consonants and vowels is also extracted by the LPC means. These linear predictive coefficients can be effectively used to improve the accuracy of the system severalfold.
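As a minimal sketch of this step (the patent does not specify the algorithm; the Levinson-Durbin recursion over the autocorrelation sequence shown here is one standard way to obtain LPC coefficients, and the model order and test signal are assumptions):

```python
import math

def autocorr(x, order):
    """Autocorrelation r[0..order] of a signal."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(order + 1)]

def lpc(x, order):
    """Predictor coefficients a_1..a_p of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p
    computed by the Levinson-Durbin recursion."""
    r = autocorr(x, order)
    a, err = [1.0] + [0.0] * order, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a, err = new_a, err * (1.0 - k * k)
    return a[1:]

# For a pure cosine, the ideal order-2 predictor satisfies
# x[n] = 2*cos(w)*x[n-1] - x[n-2], i.e. coefficients near [-2*cos(w), 1].
coeffs = lpc([math.cos(1.0 * n) for n in range(2000)], 2)
print(coeffs)
```

The spectral envelope implied by such coefficients is what allows the further classification of consonants and vowels mentioned above.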
Further, if frequency transformation is applied, it is possible to distinguish and classify the consonants as fricatives and plosives, and the vowels as back, mid, and front vowels. This information increases the resolution of the system by adding to it the capability of isolating one vowel from the others and one consonant from the others. While this processing increases the CPU load slightly, it improves the accuracy of the recognizer, thus making the system very robust. The correlation of the user's utterance with the parameters of the supported commands is translated into confidence scores, which are scanned for their maximum. The command with the maximum score is declared to be the winning or "recognized" candidate. However, there is a further advantage in going for confidence scoring: the maximum score can be compared to pre-determined thresholds to improve the system's accuracy.
If the maximum score exceeds the higher threshold, the winning candidate can be decided and the recognition process terminated. If this is not the case, then the score is checked against a lower threshold value. If the score exceeds the lower threshold value, then the system asks the user to confirm the candidate it has selected. In other words, the user is prompted to say "YES" or "NO", which is again detected by the isolated word recognition method, and the corresponding action is taken accordingly. If the score fails to exceed both the higher and the lower threshold values, then the system informs the user that the recognition process has failed and restarts listening to the user's utterance.
The same confirmation prompt is employed if two or more candidates tie in their scores. In this case, the user is prompted with the tied candidates one at a time until a confirmation is obtained.
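The two-threshold decision logic of the last two paragraphs can be summarised as a small function. The numeric threshold values below are invented for illustration; the specification gives none.

```python
def decide(scores, high=0.85, low=0.60):
    # Map candidate confidence scores to one of three actions,
    # mirroring the two-threshold scheme described above.
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    if best >= high and len(winners) == 1:
        return ("ACCEPT", winners)     # confident: act immediately
    if best >= low:
        # Ask the user to confirm ("YES"/"NO"), one candidate at a
        # time when several are tied.
        return ("CONFIRM", winners)
    return ("REJECT", [])              # recognition failed; listen again
```

A tie at or above the higher threshold deliberately falls through to the CONFIRM branch, matching the tie-breaking behaviour described above.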
After recognizing the audio command, the speech processing module 106 passes the command to the control module 110, from which it is received by the selector module 112. The selector module 112 passes the command to the relevant application in the application module 114 so that the application can operate according to the user's command.
In accordance with another aspect of the present invention, a push-to-talk mechanism is implemented. To activate the audio command controlling system, the user simply pushes a button (activating a switch) on the electronic device. The activation can also be done using a key press event of the remote controller 300. This provides several advantages, given below.
• It eliminates the need for a start of speech detection or Voice Activity Detection (VAD), thereby making the system less complex, and more responsive. The system starts listening for the user's commands only when the user activates the switch.
• This is particularly of use when the appliance being controlled (a television or a music player) is sourcing audio signals. When the user activates the abovementioned switch, the audio signal can be attenuated, thus making the work of the word recognition system easier and reducing the chances of false detection. In the absence of this functionality, the electronic device may need to implement a noise suppressor or an equivalent to solve this problem.
• The word recognition system requires no initial training to operate and is thus user-independent. Its accuracy can, however, be improved further by introducing initial training: when the user trains the system in a low-noise environment, user-specific patterns for the commands can be stored for look-up by the speech processing module 106.

Figure 2 illustrates the process overview of the word recognition system in accordance with the present invention. The major process steps involved are: i) the capturing of the audio commands by the audio driver (104 of Figure 1) and the parallel suppression of the audio signals generated by the electronic device; ii) the preprocessing by the control module (110 of Figure 1) for invoking the speech processing module (106 of Figure 1); and iii) the decision making by the speech processing module (106 of Figure 1), involving time domain analysis, linear predictive coding, frequency transformation analysis, confidence scoring and the decision made on the basis of the confidence scores.
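The time domain analysis step (frame-based signal energy, as in claim 2) can be sketched as follows. The frame and hop sizes, the single energy threshold, and the collapsing of frame labels into a consonant-vowel pattern are all simplifying assumptions made for illustration.

```python
def frame_energy(samples, frame_len=256, hop=128):
    # Short-time energy per frame; frame and hop sizes are arbitrary.
    return [sum(s * s for s in samples[i:i + frame_len])
            for i in range(0, max(len(samples) - frame_len + 1, 1), hop)]

def cv_pattern(energies, threshold):
    # Label high-energy frames vowel-like ('V') and the rest
    # consonant-like ('C'), then collapse runs of equal labels into a
    # pattern string. A real system would also need a silence floor
    # below which frames are discarded entirely.
    labels = ['V' if e >= threshold else 'C' for e in energies]
    out = []
    for lab in labels:
        if not out or out[-1] != lab:
            out.append(lab)
    return ''.join(out)
```

Feeding the per-frame energies of an utterance through `cv_pattern` yields strings such as "CVCVCVCVC", which are then matched against the stored command patterns.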
TECHNICAL ADVANCEMENTS
The technical advancements of the present invention include realization of a word recognition system for controlling electronic devices which:
• is less complex;
• is highly responsive;
• does not require the incorporation of VAD;
• makes use of cheaper components; and
• reduces the development time and effort.
While considerable emphasis has been placed herein on the particular features of this invention, it will be appreciated that various modifications can be made, and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other modifications in the nature of the invention or the preferred embodiments will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.

Claims:
1. A system for word recognition, said system comprising:
• means for accepting a user's utterances corresponding to a voice command selected from a plurality of pre-defined voice commands;
• a parameter determination module adapted to determine a set of predefined parameters of the user's utterances, said parameter determination module comprising:
• pattern determination means adapted to determine the consonant-vowel pattern in the user's utterances;
• length determination means co-operating with said pattern determination means adapted to determine the length of the consonant-vowel pattern in the user's utterances;
• comparing means co-operating with said parameter determination module adapted to compare the parameters determined by said parameter determination module with pre-stored parameters of the pre-defined voice commands;
• score providing means adapted to provide parameter specific scores for every parameter to the user's utterances depending upon the degree of closeness of the determined parameters to the parameters of said pre-defined voice commands; and
• decision means co-operating with said score providing means adapted to calculate the cumulative score of the parameter specific scores for the user's utterances and decide the voice command uttered by the user.
2. A system as claimed in claim 1 wherein, said system includes TDA (Time Domain Analysis) means adapted to calculate frame-based signal energy of the captured audio commands in order to extract the basic units of said accepted user's utterances.
3. A system as claimed in claim 1 wherein, said system includes a first identification means adapted to identify the occurrence of fricatives and plosives in the case of consonants.
4. A system as claimed in claim 1 wherein, said system includes a second identification means adapted to identify the occurrence of back, mid, and front vowels in the case of vowels.
5. A system as claimed in claim 3 wherein, said first identification means is a frequency transformation means.
6. A system as claimed in claim 4 wherein, said second identification means is a frequency transformation means.
PCT/IN2010/000588 2009-09-08 2010-09-03 Word recognition system WO2011039764A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2062/MUM/2009 2009-09-08
IN2062MU2009 2009-09-08

Publications (2)

Publication Number Publication Date
WO2011039764A2 true WO2011039764A2 (en) 2011-04-07
WO2011039764A3 WO2011039764A3 (en) 2011-06-16

Family

ID=43826735



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0103258A1 (en) * 1982-09-06 1984-03-21 Nec Corporation Pattern matching apparatus
JP2001083978A (en) * 1999-07-15 2001-03-30 Matsushita Electric Ind Co Ltd Speech recognition device
US20030167171A1 (en) * 2002-01-08 2003-09-04 Theodore Calderone Method and apparatus for voice control of a television control device
CN101394466A (en) * 2008-10-24 2009-03-25 天津三星电子有限公司 Sound controlled digital multifunctional set-top box


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077822A (en) * 2021-03-24 2021-07-06 北京儒博科技有限公司 Method, device and equipment for evaluating plosive and storage medium
CN113077822B (en) * 2021-03-24 2022-09-27 北京如布科技有限公司 Method, device and equipment for evaluating plosive and storage medium



Legal Events

Date Code Title Description
NENP: Non-entry into the national phase (ref country code: DE)
122: EP: PCT application non-entry in the European phase (ref document number: 10820014; country of ref document: EP; kind code of ref document: A2)