WO2004036939A1 - Portable digital mobile communication apparatus, voice control method and system - Google Patents

Portable digital mobile communication apparatus, voice control method and system

Info

Publication number
WO2004036939A1
WO2004036939A1 (PCT/CN2003/000870)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
recognition
module
input
Prior art date
Application number
PCT/CN2003/000870
Other languages
English (en)
Chinese (zh)
Inventor
Jian Liu
Yonghong Yan
Lingyun Tuo
Baohai Sun
Jielin Pan
Jiang Han
Luguang Miao
Original Assignee
Institute Of Acoustics Chinese Academy Of Sciences
Beijing Kexin Technology Co.,Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Acoustics Chinese Academy Of Sciences, Beijing Kexin Technology Co.,Ltd. filed Critical Institute Of Acoustics Chinese Academy Of Sciences
Priority to CNB200380101122XA priority Critical patent/CN100403828C/zh
Priority to AU2003272871A priority patent/AU2003272871A1/en
Publication of WO2004036939A1 publication Critical patent/WO2004036939A1/fr

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/26Devices for calling a subscriber
    • H04M1/27Devices whereby a plurality of signals may be stored simultaneously
    • H04M1/271Devices whereby a plurality of signals may be stored simultaneously controlled by voice recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/027Syllables being the recognition units

Definitions

  • Portable digital mobile communication device and voice control method and system thereof
  • The present invention relates to a digital communication device and a control method and system thereof, and more particularly to a portable digital mobile device that can be controlled by voice, and a voice control method and system thereof. Background Art
  • Speech recognition and synthesis technology has made rapid progress in the past ten years, especially in practical systems.
  • The mainstream algorithms of speech recognition systems have converged in modeling, training, search, and adaptation: most use statistical continuous-probability-density hidden Markov models and Viterbi search algorithms.
  • To obtain a high degree of naturalness, speech synthesis systems mostly use waveform concatenation over a large-scale sound database, improving naturalness by finding the longest matching segments.
  • Running these methods poses no problem in terms of the CPU processing capacity, memory, and other resources of current desktop personal computers.
  • Some voice applications can also rely on the support of a back-end server.
  • There are also application systems based on the Distributed Speech Recognition (DSR) mode, but voice control technology that runs directly on handheld devices is not yet mature.
  • Acoustic models usually use hidden Markov models with continuous density distributions (CDHMM).
  • In an existing speaker-independent Chinese single-syllable speech recognition system based on CDHMM, the acoustic model occupies 4 MB of space; even in the latest embedded speech synthesis systems, the sound library still needs about 1.2 MB to 2 MB.
  • In a CDHMM, the probability distribution of a feature vector in a given state is described by a weighted sum of multiple Gaussian distribution functions. If a CDHMM is used in a large-vocabulary speech recognition system, Gaussian probabilities must be calculated many times during decoding; the computational load is heavy and makes the speech recognition system respond slowly.
  • Existing embedded recognition systems that can only recognize imperative entries need to run on a CPU of roughly 100 MIPS or more, which is almost impossible to achieve on a resource-limited embedded hardware platform, so such methods cannot meet the needs of practical use.
  • The voice operating systems on today's mobile phones use speaker-dependent command recognition, and the number of recognizable entries or commands is very limited: the user must train every term or command one or more times before use, cannot freely rephrase a command, and must retrain to change one.
  • Speaker-independent speech recognition with unrestricted pronunciation content has thus not been available on mobile phones or other handheld devices.
  • Compared with the pinyin and stroke input methods most commonly used on existing handheld devices, the phonetic Chinese character input method according to the present invention has an obvious advantage: it requires very few keystrokes, about 1.6 per Chinese character on average, whereas other input methods require at least 3 keystrokes per character, and on average at least 5 or 6.
  • A speech recognition system is disclosed. It consists of a feature extraction unit, a feature codebook, a quantization encoding unit, a decoding operation unit, a probability table, and a language model.
  • The analog input signal is converted into a digital signal by an analog-to-digital conversion unit, and the digital signal undergoes frame processing in the feature extraction unit.
  • The feature parameters of each frame of speech are extracted to obtain the feature vector sequence of the input speech; the feature codebook is then used to quantize and encode the feature vector sequence into the corresponding feature codeword sequence; finally, a decoding operation finds, from the dictionary tree, the recognition result with the largest matching probability for the feature codeword sequence. During this calculation, for each codeword in the feature codeword sequence, its matching probability with the Gaussian codewords in the Gaussian codebook is simply looked up directly in the probability table.
  • the feature codebook and Gaussian codebook preferably use the improved K-means clustering algorithm disclosed in this application to compress the feature vector set and the acoustic model.
  • The clustering method can be described for an M-dimensional subspace and the codebook for that space.
  • Suppose there are N vectors in this space. To cluster these N vectors into a K-bit codebook, the value of K must be set in advance.
  • The final codebook contains 2^K codewords, each consisting of an M-dimensional center vector.
  • Step 120: let k = k + 1 and split every subset in two. The splitting method is to first calculate the average variance of all vectors in a subset relative to that subset's center vector, then add and subtract half of this average variance to and from the center vector to generate two new center vectors; combining all the center vectors yields a k-bit codebook. Step 130: for each vector in the subspace, find the center vector with the smallest distance metric and assign the vector to the subset corresponding to that center vector.
  • Step 140: calculate the total distance-metric change rate of all the vectors in the subspace (an initial value of the total distance metric is given first). Step 150: compare the change rate with a preset total distance-metric change-rate threshold; if the threshold is not exceeded, replace the original total distance metric with the newly calculated one and return to step 110; if the threshold is exceeded, go to the next step. Step 160: based on the number of vectors in each subset and the total distance metric of each subset, if the number of vectors is less than a predefined vector-count threshold, merge that subset and delete its center vector from the codebook.
  • Whenever a subset is merged, a subset must be selected for splitting: divide the total distance metric of all vectors in each subset (relative to its center vector) by the number of vectors it contains to obtain each subset's average distance metric, then split the subset with the largest average distance metric using the same method as in step 120.
  • The newly split center vectors form a new codebook together with the original center vectors; the original total distance metric is replaced by the newly calculated one, and the process returns to step 130.
  • Compared with the original K-means method, this adds a step that dynamically merges and splits subsets according to the number of vectors in each subset and the total distance metric of those vectors, reducing the sum of distances between each clustered vector and its corresponding codeword and improving clustering accuracy.
  • the codebook compressed by this method is applied to speech recognition, which can ensure the recognition performance of the speech system and greatly reduce the storage capacity of the system.
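The merge-and-split clustering described in steps 120-160 above can be sketched roughly as follows. This is an illustrative toy implementation, not the patent's exact formulation: the stopping rule and iteration caps are assumptions, and the split perturbation uses a per-dimension standard deviation for numerical stability where the patent text perturbs by half the subset's average variance.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vec(vs):
    return [sum(v[d] for v in vs) / len(vs) for d in range(len(vs[0]))]

def split_center(center, subset):
    # Split rule in the spirit of step 120: perturb the center up and down.
    # (The patent perturbs by half the subset's average variance; a
    # per-dimension standard deviation is used here for stability.)
    dim = len(center)
    if len(subset) > 1:
        spread = [0.5 * (sum((v[d] - center[d]) ** 2 for v in subset)
                         / len(subset)) ** 0.5 or 0.5 for d in range(dim)]
    else:
        spread = [0.5] * dim
    return ([c + s for c, s in zip(center, spread)],
            [c - s for c, s in zip(center, spread)])

def assign(vectors, centers):
    # Step 130: place each vector in the subset of its nearest center.
    subsets = [[] for _ in centers]
    for v in vectors:
        best = min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
        subsets[best].append(v)
    return subsets

def improved_kmeans(vectors, target_bits, min_count=2, rate_thresh=1e-3):
    centers = [mean_vec(vectors)]                     # k = 0: a single codeword
    for _ in range(target_bits):
        # Step 120: double the codebook by splitting every center.
        centers = [c for ctr, sub in zip(centers, assign(vectors, centers))
                   for c in split_center(ctr, sub)]
        prev = None
        while True:
            subsets = assign(vectors, centers)
            # Step 160: merge underpopulated subsets, then re-split the subset
            # with the largest average distortion (bounded passes, so the
            # codebook size 2^k is preserved).
            for _ in range(len(centers)):
                if len(centers) <= 1 or min(len(s) for s in subsets) >= min_count:
                    break
                drop = min(range(len(subsets)), key=lambda i: len(subsets[i]))
                del centers[drop]
                subsets = assign(vectors, centers)
                worst = max(range(len(subsets)),
                            key=lambda i: sum(sq_dist(v, centers[i])
                                              for v in subsets[i])
                            / max(len(subsets[i]), 1))
                centers[worst:worst + 1] = list(
                    split_center(centers[worst], subsets[worst]))
                subsets = assign(vectors, centers)
            centers = [mean_vec(s) if s else centers[i]
                       for i, s in enumerate(subsets)]
            total = sum(sq_dist(v, centers[i])
                        for i, s in enumerate(subsets) for v in s)
            # Step 150: stop refining when the distortion change rate is small.
            if prev is not None and (prev - total) / max(total, 1e-12) < rate_thresh:
                break
            prev = total
    return centers
```

A 2-bit codebook built this way over four well-separated point clouds yields four center vectors whose total quantization distortion is well below that of the single-codeword (global-mean) codebook.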
  • The quantization-encoding step that converts the feature vector sequence of the input speech into a feature codeword sequence is: divide the feature vector sequence into the same number of subspaces as the feature codebook, each subspace corresponding to one codebook; calculate the distance metric between every feature vector in each subspace and all the codewords in the corresponding codebook, taking the codeword with the smallest distance metric to a feature vector as that vector's codeword in the feature codeword sequence; then combine the codewords of all vectors in each subspace, in the original vector order, to obtain the codeword sequence for the corresponding feature codebook.
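A minimal sketch of this quantization-encoding step, assuming equal-width subspaces per frame and squared Euclidean distance as the distance metric (both illustrative choices, not stated in the patent):

```python
def quantize_sequence(frames, codebooks):
    """Quantize a sequence of feature vectors into codeword indices.

    Each frame is split into len(codebooks) equal-width sub-vectors; each
    sub-vector is replaced by the index of its nearest codeword (smallest
    squared Euclidean distance) in the matching sub-codebook."""
    n_sub = len(codebooks)
    out = []
    for frame in frames:
        step = len(frame) // n_sub
        codes = []
        for s, cb in enumerate(codebooks):
            sub = frame[s * step:(s + 1) * step]
            codes.append(min(range(len(cb)),
                             key=lambda i: sum((a - b) ** 2
                                               for a, b in zip(sub, cb[i]))))
        out.append(tuple(codes))
    return out
```

For example, with two 2-dimensional sub-codebooks, a 4-dimensional frame is encoded as a pair of codeword indices.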
  • The probability table is generated by the following steps: calculate the mean vector and variance vector corresponding to each codeword in the Gaussian codebook; using these mean and variance vectors, calculate the logarithmic probability of each codeword in the feature codebook matching each codeword in the Gaussian codebook; match all codewords in the feature codebook against all codewords in the Gaussian codebook and store the resulting probability values to obtain the probability table.
  • This speech recognition system can replace the 4 MB acoustic model of the existing system with a 412 KB feature codebook and probability table, thereby greatly reducing the requirements on device storage space.
  • the Gaussian probability need not be calculated in the decoding operation, the amount of decoding operation is greatly reduced, and the recognition speed can be increased by more than 50%.
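The probability table can be pictured as a precomputed matrix of log-likelihoods of each feature codeword under each Gaussian codeword, so that decoding replaces per-frame Gaussian evaluation with an index lookup. A toy version follows; the diagonal-covariance Gaussian form is an assumption consistent with common CDHMM practice, not a quote from the patent:

```python
import math

def build_prob_table(feature_codebook, gauss_means, gauss_vars):
    """Precompute log N(x; mu, diag(var)) for every pair of
    (feature codeword x, Gaussian codeword (mu, var))."""
    table = []
    for x in feature_codebook:
        row = []
        for mu, var in zip(gauss_means, gauss_vars):
            logp = 0.0
            for xi, mi, vi in zip(x, mu, var):
                logp += -0.5 * (math.log(2 * math.pi * vi)
                                + (xi - mi) ** 2 / vi)
            row.append(logp)
        table.append(row)
    return table

def observation_logprob(table, feat_idx, gauss_idx):
    # Decoding-time "computation": a single table lookup,
    # no Gaussian evaluation.
    return table[feat_idx][gauss_idx]
```

A table over a 256-codeword feature codebook and a few thousand Gaussian codewords is built once offline, which is how the decoding-time Gaussian computation is eliminated.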
  • The speech synthesis system includes, connected in order: an input module, a text analysis and polyphonic word processing module, a codeword sequence generation module, a speech decoding module, a waveform splicing and synthesis module, and a digital speech signal output module; as well as a polyphonic word list module connected to the text analysis and polyphonic word processing module, and a compressed speech library connected to the codeword sequence generation module.
  • the compressed speech database stores the encoded and compressed speech data (codewords) packed according to certain rules.
  • The compressed speech library is obtained by compressing, with a certain compression algorithm (such as the Code-Excited Linear Prediction algorithm), the original speech database that stores all Chinese syllables and some special characters, numbers, and symbols, and adding index marks to the resulting codewords to form a file.
  • Its speech synthesis method is: the input module generates the text information for speech output; the text analysis and polyphonic word processing module receives the text input, analyzes its format and content, and converts the input Chinese characters into the corresponding pinyin symbols; the codeword sequence generation module retrieves the compressed speech codeword sequences by searching the compressed speech library according to these pinyin symbol sequences; the speech decoding module receives the compressed codeword sequences and, using the decompression algorithm matching the compression method, restores the uncompressed digital signal of the original voice and outputs it to the waveform splicing and synthesis module for splicing; finally, the digital voice signal output module converts the resulting digital signal into the sound signal heard by the user.
  • the text analysis and polyphonic word processing module will find whether there are polyphonic characters in the input text to be analyzed according to the information provided by the polyphonic wordlist module, and determine a Chinese character with polyphonic characters in the input text according to the result of the text analysis Pronounce it correctly.
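The synthesis flow above (polyphonic word resolution, pinyin lookup, codeword retrieval, decoding, splicing) can be sketched as a toy pipeline. All tables here are illustrative stand-ins, the longest-match word lookup is an assumed disambiguation strategy, and `decode` stands in for the CELP-style decompressor:

```python
def synthesize(text, polyphone_words, char_pinyin, speech_library, decode):
    """Toy concatenative synthesis sketch: resolve polyphonic words by
    longest match against the word list, map remaining characters to
    pinyin, look up each syllable's compressed codewords in the library,
    decode them, and splice the results together."""
    pinyins, i = [], 0
    words = sorted(polyphone_words, key=len, reverse=True)
    while i < len(text):
        for w in words:                      # word list decides the reading
            if text.startswith(w, i):
                pinyins += polyphone_words[w]
                i += len(w)
                break
        else:                                # fall back to per-character pinyin
            pinyins.append(char_pinyin[text[i]])
            i += 1
    samples = []
    for p in pinyins:                        # waveform splicing: concatenation
        samples += decode(speech_library[p])
    return samples
```

For instance, a word-list entry for 银行 makes 行 read as "hang2" inside that word while the per-character table keeps its default reading "xing2" elsewhere.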
  • The speech synthesis system uses a high-compression-rate, low-distortion speech codec algorithm, which greatly reduces the storage space required for the sound library. Its total resources (including the sound library and system dynamic memory) are less than 500 KB, it uses word (fixed-point) operations, and its computational complexity is less than 5 MIPS, so it can run on current mainstream mobile phones.
  • The technical problem to be solved by the present invention is to provide a portable digital communication device with a voice operating system, which can realize speaker-independent speech recognition and implement control operations with voice commands.
  • Another technical problem to be solved by the present invention is to provide a portable digital communication device with a voice operating system, which can realize full-syllable voice input of Chinese characters.
  • Yet another technical problem to be solved by the present invention is to provide a portable digital communication device with a voice operating system, which can implement voice prompts for arbitrarily combined text information in the device.
  • the present invention provides a portable digital mobile communication device with a voice operating system.
  • the operating system further includes an embedded voice recognition device, and the device further includes:
  • a voice input module, used to convert the input analog voice signal of an Eastern language into a digital signal;
  • a feature extraction/compression module, used to perform frame processing on the digital signal and extract the feature parameters of each frame of voice to obtain a feature vector sequence, then use a feature codebook to quantize and encode the feature vector sequence to obtain the corresponding feature codeword sequence;
  • a speech recognition module configured to receive the above-mentioned characteristic codeword sequence, perform a decoding operation to find the best matching speech model, and then output a recognition result corresponding to the model;
  • an intent analysis module, used to analyze the intent of the input information, express it as a semantic symbol internal to the program, and output it to the dialog management/control module;
  • a dialog management / control module is configured to receive the semantic symbols output by the intent analysis module, determine the control action to be taken by the device, and execute the control action in combination with the current state of the device.
  • The above device is characterized in that it further includes a phonetic character conversion and input selection module for converting the recognition result in pinyin form into candidate Chinese characters, which are displayed on the device; the required Chinese character is selected according to user input.
  • The device is further characterized in that it also includes:
  • a language generation module configured to receive the information output by the dialog management / control module, automatically call text information to be presented to the user in the prompt word list, and output to the speech synthesis module;
  • a speech synthesis module configured to obtain and output a digital representation of a speech signal corresponding to the text by processing and receiving the text information
  • a speech output module is used for receiving a digital representation of a speech signal output by the speech synthesis module, and transmitting the sound to a user through a speech output device on the device.
  • the above device is characterized in that the embedded speech recognition and synthesis device is implemented within 30MIPS and 1M memory space.
  • the voice recognition module further includes:
  • a recognition unit, which finds the speech model most closely matching the codeword sequence through the decoding operation and finally outputs the recognition result most similar to the input speech; during the decoding operation, the observation probability of each codeword of the valid speech feature codeword vector sequence on its search path is looked up directly from the probability table;
  • characterized in that the compression algorithm used for the feature codebook and Gaussian codebook is a K-means clustering algorithm augmented with a step that merges any subset containing fewer vectors than a preset value and deletes its center vector.
  • the decoding operation component further includes:
  • a codeword sequence filtering component, used to remove redundant codewords from the codeword sequence input to the decoding operation, thereby speeding up decoding;
  • a search-path adaptive pruning component, used to dynamically adjust the pruning threshold according to the maximum likelihood probability of the local search path, effectively removing search paths that are useless to the decoding operation and thereby accelerating decoding.
  • The dictionary tree includes: a dictionary tree mainly composed of single words of the Eastern language, used in the decoding operation when speech is used for text input in that language; and a dictionary tree mainly composed of command entries, stored information, and/or common short words, used in the decoding operation for non-text input.
  • the above device is characterized in that the candidate characters displayed on the screen of the device by the phonetic character conversion and input selection module are Chinese characters, and the display order of the Chinese characters is arranged according to the frequency of use of the Chinese characters; or
  • or the candidate characters are Japanese kana, and the display order of the kana is arranged according to the matching probability of the recognition result.
  • the above-mentioned device is further characterized by a tone recognition module for extracting the fundamental frequency of the voice signal from the voice data from the voice input module, identifying the tone according to the fundamental frequency change of the entire voice, and The tone is output to the phonetic word conversion and input selection module.
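The tone recognition module above classifies a Mandarin tone from the fundamental-frequency (F0) contour of the whole utterance. A very rough sketch, comparing the average F0 of the start, middle, and end of the voiced region: the contour shapes follow the standard four-tone description, but the segmentation and the 5% flatness threshold are illustrative assumptions, not the patent's method:

```python
def classify_tone(f0_track):
    """Rough Mandarin tone guess from a per-frame F0 track (Hz; 0 for
    unvoiced frames). Returns 1 (level), 2 (rising), 3 (dipping) or
    4 (falling). Real systems model the full contour statistically."""
    voiced = [f for f in f0_track if f > 0]
    n = len(voiced)
    third = max(n // 3, 1)
    start = sum(voiced[:third]) / third
    end = sum(voiced[-third:]) / third
    mid = sum(voiced[third:n - third]) / max(n - 2 * third, 1)
    if abs(end - start) < 0.05 * start and abs(mid - start) < 0.05 * start:
        return 1                              # tone 1: high and level
    if end > start and mid <= end:
        return 2 if mid >= start else 3       # rising vs fall-then-rise
    if end < start:
        return 4                              # tone 4: falling
    return 3
```

The returned tone can then be combined with the pinyin recognition result to re-rank candidate characters, as the sorting method below describes.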
  • the above-mentioned device is characterized in that the semantic analysis module is connected to a recognition vocabulary, and the language generation module is connected to a prompt vocabulary, and both of the vocabularies can be personalized and customized by an entry customization module.
  • the above-mentioned device is characterized in that the speech synthesis module uses a text and polyphonic word processing unit and a sentence pause technique to improve the intelligibility and naturalness of the synthesized speech.
  • the present invention also provides a voice operation control method for a digital mobile communication device, including the following steps:
  • a voice input step: converting the input analog voice signal of an Eastern language into a digital signal;
  • the speech recognition step performs frame processing on the digital signal, extracts feature parameters of each frame of speech to obtain a feature vector sequence, and uses the feature codebook to quantize and encode the feature vector sequence to obtain a corresponding feature code word sequence. Perform a decoding operation to find the best matching speech model, and output the recognition result corresponding to the model;
  • a semantic analysis step analyzing the intent of the recognition result, expressing it as a kind of semantic symbol inside the program and outputting it;
  • a dialog management/control step: receiving the semantic symbol, determining the control action to be taken by the device based on its current state, and executing it.
  • The method is further characterized by a phonetic character conversion and input selection step: when the judgment result of the dialog management/control step is text input, the recognition result in pinyin form is converted into candidate Chinese characters and displayed by the device, and the required Chinese characters are selected according to user input.
  • The above method is characterized in that in the dialog management/control step, the system further generates an information text to prompt the user, synthesizes the generated text into a digital representation of a voice signal, and plays it to the user through a voice output device.
  • The above method is characterized in that in the speech recognition step, when searching for the speech model that most closely matches the feature codeword sequence, the observation probability of each codeword of the valid speech feature codeword sequence on its search path is looked up directly from the probability table.
  • The candidate Chinese characters displayed on the screen are arranged according to their frequency of use; or the candidate Japanese kana are displayed on the screen in order of the matching probability of the recognition result.
  • The above method is characterized in that in the voice recognition step, the fundamental frequency of the voice signal is further extracted from the digital signal and the tone is identified according to the fundamental frequency change of the entire voice; the candidates displayed on the screen are then sorted by a method that combines the tone recognition results with the frequency of use of Chinese characters or common words.
  • The above method is characterized in that waveform splicing and speech compression algorithms are used in speech synthesis, together with text analysis, Chinese polyphonic word processing, and sentence pause techniques.
  • The above method is characterized in that when text is input through the device, the voice operating system adopts the single-syllable recognition mode and enables the single-word dictionary tree during decoding; the pinyin recognition result is converted to characters and displayed on the device screen for the user to choose;
  • otherwise the voice operating system adopts the entry recognition mode and enables the entry dictionary tree during decoding, and the device is controlled to complete the corresponding control action based on the semantics of the recognition result and the current state of the system.
  • the above method is characterized in that a continuous recognition engine is started in the voice recognition step, and if there is an unrecognized voice frame after completing the recognition, the recognition process is restarted.
  • the above method is characterized in that the voice operation and the key operation can coexist, and the key operation has priority.
  • The communication device and voice operation control method of the above scheme can apply speaker-independent speech recognition technology under strict system-resource constraints without limiting the content or number of vocabulary entries. Users can add, delete, modify, and customize command content, which is recognized without training; and the user can control all function menus of the device by voice, so the user interface is more friendly and flexible.
  • the system has modules for semantic analysis, dialogue management, and language generation, which can handle complex dialogue processes and generate flexible prompt information to the user.
  • The communication device using the above scheme not only has a speaker-independent voice input function, a voice prompt function, and a Chinese syllable input function; the voice input method provided by the present invention is also the fastest and most economical input method.
  • FIG. 1 is a schematic diagram of function modules of a mobile phone voice operating system according to a first embodiment of the present invention
  • Figure 2a is a flowchart of a process for enabling a voice phone by a mobile phone voice operating system according to the first embodiment of the present invention
  • 2b is a flowchart of a state transition control process of a mobile phone voice operating system according to the first embodiment of the present invention
  • FIG. 3 is a flowchart of the process of sending a new short message by the mobile phone voice operating system according to the first embodiment of the present invention
  • FIG. 4 is a flowchart of the operation of the mobile phone voice operating system after receiving a new short message according to the first embodiment of the present invention
  • FIG. 5 is a flowchart of the operation of the mobile phone voice operating system completing the modification function in the phone book according to the first embodiment of the present invention;
  • FIG. 6 is a flowchart of a mobile phone voice operating system setting a mobile phone to a vibration mode according to a first embodiment of the present invention
  • FIG. 7 is a flowchart of a mobile phone voice operating system writing a memo according to the first embodiment of the present invention
  • FIG. 8 is a flowchart of the operation of viewing a memo by the mobile phone voice operating system according to the first embodiment of the present invention.

Detailed Description
  • Although the speech recognition and synthesis technology mentioned in the background provides the possibility of real speech interaction on a mobile phone, to realize the above functions in a mobile phone the speech recognition and synthesis systems described above must be adaptively adjusted, a voice operation management part must be added, and the three parts must be organically combined and appropriately installed in existing mobile phones. In addition, a voice operation management method must be provided for the new system that can be effectively combined with existing mobile phone operation methods. The present invention is described in detail below with reference to the embodiments and the accompanying drawings.
  • Example 1
  • the operating system with a voice interaction function (hereinafter referred to as a voice operating system) in this embodiment is composed of a voice recognition part, a voice operation management part, and a voice synthesis part.
  • The embedded speech recognition and synthesis modules are generally kept within 30 MIPS and 1 MB of memory, as shown in Figure 1.
  • the voice recognition part of the voice operating system of this embodiment includes: a voice input module 10, a feature extraction / compression module 20, and a voice recognition module 30.
  • the speech input module 10 corresponds to the analog-to-digital conversion unit in the application, and is configured to convert the input speech analog signal into a digital signal;
  • the feature extraction/compression module 20 corresponds to the feature extraction unit, feature codebook, and quantization encoding unit in the application; it is used to perform frame processing on the digital signal, extract the feature parameters of each frame of speech, and use the feature codebook to quantize and encode the feature vector sequence.
  • the speech recognition module 30 corresponds to the decoding operation unit, dictionary tree and probability table in the application, and is used to receive the above feature codeword sequence, perform a decoding operation to find the best matching speech model, and then output The recognition result corresponding to the model.
  • The feature codebook, and the Gaussian codebook corresponding to the probability table, in this embodiment are obtained by compressing the speech feature set and the general acoustic model, respectively, with the improved K-means compression method of Chinese patent application 02148683.2.
  • The decoding operation unit preferably uses the decoding method disclosed in a Chinese patent application.
  • The decoding method includes the following steps: (1) initialize the decoding operation unit in the speech recognition system; (2) sequentially extract the feature codeword vector of the next speech frame from the speech feature codeword sequence of length T input to the decoding operation unit, and set it as the current speech frame; (3) filter the current speech frame: if it is filtered out, go to step (2); otherwise set it as the current valid speech frame; (4) based on the current valid speech frame, examine each active node token in every layer l of the dictionary-tree token resource table at time t, expand any token judged expandable, and chain each newly generated token into the token resource table of its target node, where l is an index variable; (5) process the tokens in the dictionary tree nodes; (6) adaptively adjust the pruning threshold according to the maximum local-path probability at time t and at the time of the previous valid speech frame; (7) repeat steps (2)-(6) to obtain the recognition result of the input speech.
  • In step (4), when the score of a token moving from one state phoneme to another is calculated, the observation probability of the latter state phoneme (corresponding to a Gaussian codeword sequence) for the current speech frame (corresponding to a feature codeword sequence) is obtained by a table lookup in the probability table described in patent application No. 02148683.2.
  • The above decoding operation method has the following improvements: an adaptive pruning strategy based on the maximum probability of the local path is added, and a speech frame filtering strategy based on the feature codeword vector is added.
  • The former effectively reduces the average number of tokens during decoding by 10%-20%, and the latter can remove 20%-30% of the invalid speech frames in the speech feature codeword vector sequence; after this decoding algorithm is adopted, the speed of speech recognition is further improved.
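The two speed-ups, codeword-based frame filtering and adaptive pruning, can be sketched in a toy token-passing decoder. The beam-adaptation rule, the two-state model, and all score tables below are illustrative assumptions, not the patent's exact method:

```python
def decode(frames, trans_logp, obs_logp, states, beam0=8.0):
    """Toy token-passing decoder with (a) frame filtering: skip a frame
    whose codeword repeats the previous valid frame, and (b) adaptive
    pruning: widen the beam when the best local-path score drops fast,
    keep it narrow otherwise. Scores are log-probabilities."""
    tokens = {s: 0.0 for s in states}      # state -> best log score so far
    prev_cw, prev_best = None, 0.0
    for cw in frames:
        if cw == prev_cw:                  # frame filtering: redundant codeword
            continue
        prev_cw = cw
        new = {}
        for s, score in tokens.items():    # expand every surviving token
            for t in states:
                cand = score + trans_logp[(s, t)] + obs_logp[(t, cw)]
                if cand > new.get(t, float("-inf")):
                    new[t] = cand
        best = max(new.values())
        # Adaptive beam: proportional to the drop of the best local-path
        # score since the previous valid frame (an illustrative rule).
        beam = max(2.0, min(beam0, prev_best - best))
        tokens = {t: v for t, v in new.items() if v >= best - beam}
        prev_best = best
    return max(tokens, key=tokens.get)
```

In the real system the expansion step runs over dictionary-tree node tokens and `obs_logp` is the precomputed probability table, so each frame costs only lookups, additions, and comparisons.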
  • the dictionary tree of the speech recognition part includes a dictionary tree of vocabulary entries and a dictionary tree of Chinese characters, corresponding to two recognition modes used at runtime: entry recognition mode and single-syllable recognition mode.
•   In single-syllable recognition mode, the vocabulary of the dictionary tree used in the decoding operation is composed of all Chinese characters and is used for recognizing Chinese character voice input; in entry recognition mode, the vocabulary of the dictionary tree used in the decoding operation is composed of the system's preset entries, the entries customized by user 2, and some stored information (such as contact names), and also includes some single-character words.
•   The voice operating system of an existing mobile phone may also have a limited voice recognition function for imperative entries; for example, user 2 can directly call up the short-message menu by saying "send a short message".
•   That kind of recognition is very different from the speech recognition system of the present invention.
•   The recognition part of this embodiment can not only recognize any single Chinese character spoken at will by the user, but also recognizes the speech of non-specific speakers without prior training, and allows free customization of imperative entries. This will be explained in detail in the voice operation management section.
  • the speech synthesis part of this embodiment includes: a speech synthesis module 70 and a speech output module 80.
•   The speech synthesis module 70 corresponds to the text analysis and polyphonic word processing module, codeword sequence generation module, speech decoding module, waveform splicing and synthesis module, polyphone word list module, and compressed speech library of the referenced application.
•   The module is configured to obtain the pinyin corresponding to the received text through text analysis and polyphonic word processing, obtain the codeword sequence corresponding to the pinyin sequence by searching the compressed speech library, restore the speech digital signal corresponding to each pinyin from the codeword sequence using a decompression algorithm, and perform waveform splicing to obtain the speech digital signal of the entire text.
•   The speech output module 80 corresponds to the digital speech signal output module in the referenced application and is used to convert the digital speech signal output by the speech synthesis module into a sound signal transmitted to the user.
•   The compressed speech library is obtained offline by compressing an original speech library containing all Chinese syllables with code-excited linear prediction (CELP) or another high-compression-rate, low-distortion speech codec, so that its computational and storage requirements allow it to run on a mobile phone.
•   The compressed speech library also includes compressed pronunciations for some special symbols, such as punctuation marks corresponding to speech pauses, questions, and other meanings.
  • the text analysis and polyphonic word processing module further includes processing of numbers and symbols in the input text.
•   The voice prompt function of an existing mobile phone is implemented by simple playback of recorded prompts; it is not a true synthesis system, and its voice output capability is very limited.
•   The speech synthesis part of the present invention is very different from it.
•   The most fundamental difference is the compressed speech library covering all Chinese syllables: any word or sentence composed of Chinese characters can be decomposed into a pinyin sequence, the corresponding codeword sequences are found in the compressed speech library, and the corresponding speech signal is obtained after decompression, restoration, and splicing. Therefore, in the present invention, any information stored in the mobile phone in text form can be read out by voice, such as the content of a received short message, a name in the phone book, or the content of a memo.
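The synthesis pipeline just described (text analysis → pinyin → codeword lookup in the compressed speech library → decompression → waveform splicing) can be sketched as below. All data here are toy stand-ins: run-length coding stands in for CELP, and the two-entry lexicon and speech library are hypothetical.

```python
# Hypothetical compressed speech library mapping pinyin syllables to
# "codeword" sequences; each codeword is (sample_value, repeat_count).
SPEECH_LIB = {
    "ni3":  [(3, 2), (1, 4)],
    "hao3": [(2, 3), (0, 2)],
}
LEXICON = {"你": "ni3", "好": "hao3"}   # text-analysis stand-in

def decompress(codewords):
    """Restore a waveform block from its codeword sequence (the patent
    uses CELP; run-length decoding stands in for it here)."""
    samples = []
    for value, count in codewords:
        samples.extend([value] * count)
    return samples

def synthesize(text):
    """Text -> pinyin -> codeword lookup -> decompress -> splice."""
    pinyin = [LEXICON[ch] for ch in text]          # text analysis
    wave = []
    for syllable in pinyin:
        wave.extend(decompress(SPEECH_LIB[syllable]))  # lookup + restore + splice
    return wave
```

Because the library covers every syllable, any text stored on the phone can be spliced into speech this way, which is the property the passage above relies on.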
  • the above-mentioned speech recognition and synthesis module greatly expands the possibility of speech interaction between the mobile phone and the user.
•   The voice operation management section includes the intent analysis module 40 connected to the recognition vocabulary 100, the dialog management/control module 50, the language generation module 60 connected to the prompt word list 110, and the phonetic-word conversion and input selection module 90.
•   The intent analysis module 40 is configured to analyze the input information and express it as an internal semantic symbol representing the user's intent, which is output to the dialog management/control module.
•   Information input mainly includes keys, voice, and incoming calls.
•   For voice input, this module performs semantic analysis, for which the recognition vocabulary 100 is needed; key input is processed in the same way as in the existing mobile phone system.
•   The recognition vocabulary 100 is a list containing internal semantic symbols and entry-related information; each semantic symbol in the list is associated with a number of entries. The semantic symbols referred to here are not related only to words obtained by speech recognition, but should be understood in a broad sense: input information in other forms, such as keystrokes and handwriting, can also be understood as a language of the user.
  • the entry may include a command entry (can be customized), stored information such as a contact name in a phone book, some common phrases, and the like.
•   If the recognition result is a command entry, such as "vibration mode", the semantic symbol corresponding to the "vibration mode" entry is found and output to the dialog management/control module to trigger the operation of switching to vibration mode.
•   If the recognition result is an informational entry that does not directly correspond to an operation (for example, when sending a text message the phone prompts the user to say the recipient's name; after the user says "XXX", this is not a command word but a piece of information required to perform the operation), its corresponding semantic symbol is mainly determined by the current state of the system.
•   In the text input state, the semantic symbols corresponding to the speech recognition results are likewise determined by the current state of the system: the recognized Chinese characters trigger the process of sending them to the phonetic-word conversion module, and so on.
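The mapping just described, from a recognition result to an internal semantic symbol, with informational entries interpreted via the current state, can be sketched as follows. The vocabulary entries, symbol names, and state names are all hypothetical, not taken from the patent.

```python
# Hypothetical recognition vocabulary: several entries (including a
# user-customized synonym) may map to one internal semantic symbol.
RECOGNITION_VOCAB = {
    "vibration mode": "SET_VIBRATE",
    "silent please":  "SET_VIBRATE",   # customized synonym entry
    "send a message": "SMS_COMPOSE",
}

def analyze_intent(recognized, state):
    """Map a recognition result to a semantic symbol; informational
    entries (e.g. a contact name) are interpreted from the current state."""
    if recognized in RECOGNITION_VOCAB:
        return RECOGNITION_VOCAB[recognized]
    if state == "AWAIT_RECIPIENT":
        return ("RECIPIENT_NAME", recognized)
    return ("UNKNOWN", recognized)
```

Note how "Alice" is not a command word; it only becomes meaningful because the system is currently waiting for a recipient name.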
•   This embodiment also includes a vocabulary customization module (not shown in the figure) associated with the recognition vocabulary 100 and the prompt word list, composed of a customization menu and a customization processing unit, and used to add customized entries to the two vocabularies. Because no model retraining is needed, the present invention allows customization of any number of entries, making the user's mobile phone more personal.
•   The customization processing unit decomposes the entry into a series of phonemes according to its Chinese characters, such as the triphone phonemes corresponding to the Hidden Markov Model (HMM), and adds them to the corresponding position of the imperative-entry dictionary tree.
•   The recognition vocabulary 100 associates this entry with the semantic symbol corresponding to the "vibration mode" entry, so that the mobile phone can recognize the customized voice command, determine its true meaning, and perform the corresponding operation.
•   The user can also use the entry customization module to personalize the prompt information, for example changing the new-message prompt to "please read the letter from XX" or "you have a letter", or replacing some longer prompts with phrases the user likes.
•   The operation of its customization menu is similar to the customization of imperative entries, but the customization processing unit only needs to store the information without processing the dictionary tree; sometimes, however, the customization processing unit is needed to set a certain entry as the default entry.
  • the dialog management / control module 50 is configured to receive the semantic symbols output by the intent analysis module, determine the control action to be taken by the device according to the status and / or instruction information of the device, and execute the control action. For example: Call up the user's desired menu or the next menu; use voice to prompt the next operation; directly execute the user's voice command to make a call, send a text message, switch the voice function, and so on.
•   The input of an existing mobile phone is generally completed by key presses, and the control action to be taken is determined by the pressed key together with the current and previous states of the mobile phone (in program terms, the current running position).
•   On that basis, the present invention adds the management of voice input and output.
  • the dialog management / control module 50 receives the semantic symbols corresponding to the voice input output by the intent analysis module 40, combines the current and previous states of the mobile phone, determines the control action that the current mobile phone should take, and executes it .
•   Take the save operation after short-message editing as an example. On an existing mobile phone, after the editing is completed and the number is entered, a menu pops up listing "Send / Save / Save and Send"; when the user presses a specific key to select "Save", the save operation is performed. That operation is triggered by the user pressing a specific key in that state.
•   In the present invention, the dialog management/control module 50 can also trigger the save operation on the short message from the voice command "Save".
•   A semantic symbol may correspond to different operations in different states. For example, the "Save" voice command can be used both after a text message is edited and after a new contact name and phone number are entered; the former saves the text message in the outbox, while the latter saves the new contact in the phone book. For such cases, the dialog management/control module 50 determines the operation to perform in combination with the current state: in the example above, after receiving the semantic symbol corresponding to "Save" and combining it with the current state of the mobile phone, namely editing a short message, it performs the operation of saving the short message. To be precise, the operation that a voice command actually performs is hereinafter referred to as the true semantics of the voice command.
•   This type of voice command input brings great convenience to the user: it allows jumping directly between lower-level menus, without using the keys to return from a lower-level menu to the main menu level by level and then descend level by level into the bottom menu of another branch.
•   Through the semantic symbols (some combined with the current state), the system learns the user's true semantics and performs the corresponding operation. For example, while in the vibration-mode setting menu, the user can say "check phone book" and directly enter the phone book query menu, greatly improving the efficiency and convenience of operation.
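The state-dependent interpretation of semantic symbols described above, where one symbol such as "SAVE" triggers different operations in different states, amounts to a dispatch table keyed on (symbol, state). The sketch below is illustrative; the symbol names, state names, and handler names are hypothetical.

```python
# Dispatch table for the dialog management/control logic. A (symbol, None)
# key marks a state-independent command that works from any menu.
DISPATCH = {
    ("SAVE", "SMS_EDITING"):  "save_sms_to_outbox",
    ("SAVE", "CONTACT_EDIT"): "save_contact_to_phonebook",
    ("SET_VIBRATE", None):    "switch_to_vibrate",
}

def dispatch(symbol, state):
    """Resolve a semantic symbol to an operation: try the state-specific
    entry first, then a state-independent one, else ask the user again."""
    return (DISPATCH.get((symbol, state))
            or DISPATCH.get((symbol, None))
            or "prompt_unrecognized")
```

The state-independent entries are what let the user say "vibration mode" from any menu and jump straight to the operation, without level-by-level menu navigation.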
•   The current state of the mobile phone 1 can be divided into two types, a text input state and a non-text input state, which correspond respectively to the single-syllable recognition mode and the entry recognition mode.
•   In the text input state, the dialog management/control module 50 outputs the speech recognition result obtained by the speech recognition part to the phonetic-word conversion and input selection module 90; in the non-text input state, the speech recognition result is used to control the corresponding part of the device to complete the user's instruction or provide required information, and, if necessary, prompt information about the completion of the instruction and/or the current state of the mobile phone is output to the language generation module 60.
•   The phonetic-word conversion and input selection module 90 receives the single-syllable recognition result from the dialog management/control module 50 when the mobile phone is in the text input state, uses the phonetic-word conversion unit to sort the candidate Chinese characters by frequency of use, and outputs them in turn to the display screen of mobile phone 1; user 2 can then select the intended Chinese character through the keys or other accessories specified by the mobile phone.
•   The input selection unit records and saves each Chinese character selected by the user; after all of the text information has been input (for example, the user presses the "OK" key), the entire content is sent to the intent analysis module 40.
•   The Chinese character input selection control logic also has an "associative Chinese characters" function that displays a candidate list for the next character according to word frequencies obtained from statistics on common vocabulary, making Chinese character input faster and more accurate.
•   For syllables that the speaker may not pronounce accurately, the system also displays the Chinese characters corresponding to similar sounds.
•   The subsequent processing is the same as the existing method of inputting pinyin with a keyboard, except that here the pinyin is obtained by recognizing the voice input.
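The candidate ordering just described, exact-syllable characters first sorted by frequency of use, then characters of similar-sounding syllables for speakers with imperfect pronunciation, can be sketched as below. The frequency table, character lists, and the sh/s confusion pair are illustrative data, not from the patent.

```python
# Illustrative frequency-of-use table and syllable-to-character lists.
FREQ = {"是": 900, "事": 500, "市": 300, "式": 100, "四": 450, "死": 200}
CHARS = {"shi4": ["式", "市", "事", "是"], "si4": ["四", "死"]}
SIMILAR = {"shi4": ["si4"], "si4": ["shi4"]}   # retroflex/dental confusion

def candidates(pinyin):
    """Return candidate characters: the exact syllable's characters first,
    then characters of similar-sounding syllables; each group is sorted
    by descending frequency of use."""
    out = sorted(CHARS.get(pinyin, []), key=lambda c: -FREQ[c])
    for alt in SIMILAR.get(pinyin, []):
        out += sorted(CHARS.get(alt, []), key=lambda c: -FREQ[c])
    return out
```

Appending rather than interleaving the similar-sound group keeps the most likely interpretation at the front while still covering mispronunciations.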
•   The language generation module 60 is configured to automatically generate the text to be presented to the user based on information from the dialog management/control module 50 and output it to the speech synthesis module; the prompt words are stored in the prompt word list 110.
•   When the system state requires it, the language generation module 60 calls up the prompt words in the prompt word list 110 and outputs them to the speech synthesis module, and the resulting voice signal is played to the user through the voice output module 80.
•   For example, after a short message is sent, the dialog management/control module 50 determines from the system settings that a voice prompt is required, finds the corresponding prompt entry in the prompt word list 110, for example one containing "XXX" (the recipient's name, extracted by the system in real time), and outputs it to the language generation module 60; the speech synthesis module 70 and the voice output module 80 then play the prompt voice.
  • the voice operating system of this embodiment is designed and operated according to the human-computer dialogue mode.
•   During a dialog, the system controls the process through the intent analysis module 40, the dialog management/control module 50, the phonetic-word conversion and input selection module 90, and the language generation module 60.
•   The intent analysis module 40 judges the user's intent based on the result given by the speech recognition module 30 or other input from the mobile phone; the dialog management/control module 50 takes the corresponding action based on the determined intent combined with the current state of the mobile phone; when necessary, the language generation module 60 generates the text of the information to be prompted to the user, and the speech synthesis module 70 generates the prompt sound and plays it to the user through the voice output module; when the user performs Chinese character voice input, the phonetic-word conversion and input selection module completes the Chinese character editing.
•   All the methods and usage modes described in this embodiment can directly use the resources of an existing mobile phone, adding a powerful "voice operation" function by integration on top of the existing device, which greatly expands the original phone's modes and methods of use. This is clearly different from other methods or devices that must define a new mobile phone in order to realize similar usage patterns, and its advantages are obvious.
•   All of the above modules exist in the mobile phone's operating system as embedded software, for example under the Palm OS, Microsoft Windows CE, EPOC, Symbian, or Linux operating systems.
•   The adopted speech decompression algorithm can be implemented in software or in hardware; in other words, the existing DSP chip on the phone can be used. New chips could of course also be used, but this would increase the manufacturing cost of the mobile phone.
•   The present invention also provides a method for controlling the operation of a voice mobile phone, an application mode of the speech recognition and synthesis method on the mobile phone: a method of inputting Chinese characters by voice, controlling the operation and use of the mobile phone by voice, and using voice output so that the user can know the current status and information of the mobile phone without looking at its display screen.
  • the overall process of the voice mobile phone operation control method of the present invention can be divided into a voice mobile phone enabling process and a state control transfer process.
  • the following description is mainly from the perspective of operation and state control.
  • the specific method in the process of speech recognition and synthesis has been introduced in the previous system, and will not be repeated.
  • FIG. 2a is a flowchart of enabling a voice handset part in FIG. 1.
•   After power-on, the mobile phone operating system automatically loads the corresponding modules of the voice recognition and synthesis sections, the names in the phone book, the control commands for the menus, and so on, to prepare for voice recognition and synthesis.
•   Step 200: wait until the phone enters the power-on standby screen. Step 210: enable "Voice Phone" through a predefined button or menu.
•   As another implementation, a special voice command can also be used to enable the system, where the voice command can be set by the user, such as "enable voice phone / voice secretary" or "turn off voice phone".
•   Step 220: after enabling, the system prompts the user "Welcome to the voice phone, please speak your instruction". Step 230: the phone enters the waiting-for-input state A, waiting for information input; if the voice phone is not enabled, the waiting state is denoted A*.
•   The user can control the mobile phone by voice, keys, or other input methods; when the mobile phone receives a new text message or an incoming call, this is also regarded as a form of input.
  • FIG. 2b is a flowchart of the state control transition section in FIG. 2.
•   If there is information input (step 300), the system recognizes the input, whether voice, keys, or another method, to obtain a recognition result (step 310); it then determines the operation intent, that is, performs semantic analysis on the recognition result to determine the user's intent (step 320); the system then judges whether the user wants to disable the voice phone (step 330); if not, go to step 340; if the user does want to turn off the "voice phone", the system prompts "Voice phone has been turned off; to enable it again, press the key or select it from the menu" (step 340a) and returns to state A*, the waiting-for-input state with the voice phone disabled.
•   Step 340: determine whether the system is in Chinese character (including digit) input, for example entering the content of a short message, adding a memo, or adding a new name to the phone book; if yes, go to step 350a, if no, go to step 350.
•   In step 350a, the phonetic-word conversion unit displays the candidate Chinese characters corresponding to the recognized pinyin on the mobile phone screen, then returns to the waiting-for-input state A, after which the process of steps 1020 to 1023 is executed. In this process, the user selects the character just spoken from the mobile phone screen; the selection can be performed with different key combinations on the mobile phone.
•   After a character is selected, the system returns to state A; the user then speaks the next character, and speech recognition, semantic analysis, phonetic-word conversion, and input selection are performed in turn. This cycle repeats until all of the information has been input, which the user indicates through a pre-specified special key, such as the "Confirm" key; when the system performs semantic analysis on that key input, the Chinese character input process ends and the next operation is executed.
•   One implementation of controlling Chinese character input with the mobile phone keys uses six keys: up, down, left, right, confirm, and cancel.
•   Pressing the up key selects the Chinese character at the current cursor and inputs it at the top of the phone screen; pressing the down key scrolls through the full-screen candidate list; pressing the left key switches between different pinyin candidates; pressing the right key moves the cursor backward (to the right) one position along the candidate characters; pressing the confirm key indicates that editing is finished, that is, when the short-message content has been completely entered, pressing this key makes the system exit the single-syllable recognition state and continue the dialog process; pressing the cancel key removes one entered Chinese character each time it is pressed when a wrong character has been selected, or, if the intended character is not among the recognition results, the user can press this key and speak again.
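The six-key selection logic above can be sketched as a small state-machine handler. The key names and the editor-state dictionary are illustrative; only the behaviors described in the passage (commit, move cursor, delete, finish) are modeled, and the down/left keys are omitted for brevity.

```python
def handle_key(key, state):
    """Minimal sketch of the candidate-selection keys.
    state: {'candidates': [...], 'cursor': int, 'text': [...], 'done': bool}"""
    if key == "up":                        # commit character under cursor
        state["text"].append(state["candidates"][state["cursor"]])
        state["cursor"] = 0                # reset for the next syllable
    elif key == "right":                   # move cursor along candidates
        state["cursor"] = (state["cursor"] + 1) % len(state["candidates"])
    elif key == "cancel" and state["text"]:
        state["text"].pop()                # remove last entered character
    elif key == "confirm":
        state["done"] = True               # leave single-syllable mode
    return state
```

A real implementation would also handle the down key (page through candidates) and the left key (switch pinyin alternatives), but the commit/cursor/cancel/confirm skeleton is the core of the loop described in steps 350a and onward.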
  • the system performs the corresponding operation, step 350.
•   The operations are various, such as sending a text message, modifying a setting, or calling up the next menu; some involve a state transition.
•   For example, when the recognition mode is switched to entry mode, the system correspondingly switches to the non-text input state; when the system enters the Chinese character input state, the recognition mode is switched to single-syllable mode and the system is triggered to accept text input. This can be seen in the examples later.
•   Step 360: if required, generate the prompt message text, synthesize the prompt voice, and output it to the user (step 370); if not required, process in the system's original way, that is, notify the user through the mobile phone screen, ringtone/music, or vibration, or directly return to state A (step 360a).
  • the voice prompt function of this embodiment can also be set to be disabled.
  • the voice function of the present invention can be enabled or disabled on the mobile phone at any time.
  • the voice function is not enabled, the functions of the original phone will not be affected.
•   After enabling, the user can interact with the mobile phone through voice, without prior training, completing the switching between the various function menus and the management of lower-level menus, such as managing the electronic phone book (add/modify/delete) and all the other functions supported by the mobile phone menus, such as voice prompts for incoming calls and new text messages, sending/receiving text messages, recording and reading memos, setting alarms, and controlling the status of the phone. The user can also input Chinese characters on the phone by voice.
  • FIGS. 3-9 the application of a voice operating system to a mobile phone according to the present invention is described.
•   An application mode of the speech recognition and synthesis method is given: the human-machine interface based on spoken-language technology is a new type of interaction for mobile phones, but the original process of human-machine interaction is still retained.
  • the input interface or selection interface of the mobile phone screen is the same as that of existing mobile phones. You can use voice mode, key mode, or other modes for input. All voice input steps can be performed by the mobile phone keys.
  • the voice function does not conflict with the operation of the original mobile phone, but naturally coexists.
  • FIG. 3 shows a flowchart of a Chinese short message sending process on a voice mobile phone according to the present invention.
  • the mobile phone is in state A.
•   The mobile phone user first says "send a text message" (step 400); after the system understands, through voice recognition and semantic analysis, that the user wants to send a new text message, it gives the voice prompt "Who do you want to send it to?"
•   and enters the menu interface that asks for the recipient's phone number (step 410).
•   The user can say the recipient's name (if that person's number is stored in the phone book) or speak the phone number, and of course
•   can also enter the phone number directly by key (step 420). After completing the recognition step, the system automatically switches to the SMS editing interface, prompts "please speak the content", and switches to single-syllable recognition mode (step 430); it then determines whether the user has pressed the "Confirm" key (step 440); if yes, it indicates that all the content of the short message has been entered.
  • FIG. 4 shows a flowchart of a Chinese short message receiving process on a voice mobile phone according to the present invention.
•   At the beginning the mobile phone is in state A. When the mobile phone receives a new text message (step 500), the system prompts "Received a new message from XXX" (XXX is the name or phone number in the phone book) (step 510); the user says "view content" (step 520); after the system recognizes and analyzes the instruction, it reads the text message aloud using the speech synthesis function.
•   Step 530: after listening to the message, the user can perform different operations: delete, save, or forward. If the user says the command "delete" (step 540a), the system prompts "Do you want to delete the short message?" (step 550a); if the user says "yes", the system performs the operation of deleting the short message (step 560a), prompts "Short message deleted" (step 570a), and then returns to state A.
•   If the user chooses to forward, the user says the name or phone number of the recipient (step 550b), and the system performs the operation of forwarding the short message (step 560b) and prompts that the short message has been forwarded (step 570b).
•   If the user chooses to save, the system performs the operation of saving the short message (step 540c), prompts "Short message saved" (step 550c), and then enters state A.
•   In this way the user can complete the whole message-receiving process entirely by voice. Compared with the original interactive mode, the system communicates with the user through voice prompts, keeping the user's hands and eyes free. This function is very convenient and useful when the user is driving, for example.
  • FIG. 5 shows an example of modifying a phone book in a mobile phone by using voice interaction.
•   At the beginning the mobile phone is in state A. The user speaks the "modify phone number" instruction (step 600); the system prompts "please say the name" (step 610); the user says "XX" (step 620); the system switches to single-syllable recognition mode, finds that person's phone number, and prompts "The original phone number of XX is XXXXXXXX, please say the new number" (step 630); the user speaks (or enters by key) the new phone number "xxxxxxxx" (step 640); after recognizing it, the system switches back to the command-entry recognition mode and asks "Do you want to change XX's phone number to xxxxxxxx?" (step 650); if the user says "yes" (step 660), the system performs the modification and prompts "XX's phone number has been modified" (step 670), then enters state A; if the user says "no", step
•   At the beginning the mobile phone is in state A, and the user says "vibration mode" or "ring mode" (step 700); after recognizing the instruction, the system performs the mode-switching operation (step 710), displays the switched mode on the mobile phone screen, and prompts the user by voice that "the mobile phone has switched to vibration/ring mode" (step 720).
•   The user can speak the desired mode in any state of the voice phone; the spoken-language system understands the request, determines whether ring or vibration is meant, and sets the corresponding mode.
  • FIG. 7 shows an operation example of writing a memo by using voice interaction.
  • the mobile phone is in the state A, the user says the "open calendar” instruction, step 800; the system displays the corresponding menu of the calendar function, step 810; the user says the "write memo” instruction, step 820.
  • Figure 8 shows an example of viewing memos using voice interaction.
  • the user says “view memo”, step 900; the system looks for the current record, step 910; if there is no record, the system prompts "no event today", step 920a; if there is a record, the system reads "there are events today: 1.
  • This operation is actually a user request to check the schedule of the day, and the system reads the stored information aloud and displays it.
  • the other control operations are not detailed one by one, and the present invention is suitable for controlling all operations of the mobile phone.
•   A key-first method is adopted in this embodiment.
•   A voice prompt can be interrupted by a subsequent key operation, because when the mobile phone screen is visible, an overly long voice prompt may annoy the user.
  • voice input will greatly simplify the process of mobile phone control operations.
  • the system can automatically transition from one state to another state without the need for layer-by-layer menu transitions.
  • the user can directly say "vibration mode", and the system will automatically understand the command and directly perform the operation.
•   For Chinese character input, the present invention adopts full-syllable Chinese recognition technology and controls the input of Chinese characters in combination with the phonetic-word conversion and input selection module, the whole process being completed serially. This differs from methods that use other modalities to aid recognition at the same time: those methods generally run other input methods, such as handwriting, in parallel and then apply some information-fusion method to output the final Chinese character. Obviously, using more system resources adds extra cost.
  • the voice phone of the present invention has the convenience of being directly applicable under existing resources.
  • with speech recognition, users can directly speak the syllables of the Chinese characters they want to input, without any other auxiliary input method.
  • the mobile phone voice operating system in this embodiment does not have the function of speech synthesis, but can realize full-syllable speech recognition, accept voice commands of non-specific persons, and can realize full-syllable input of Eastern languages by voice.
  • accordingly, the controls related to voice prompts are removed.
  • the voice recognition part of this embodiment also includes a tone recognition module, which receives data from the voice input module 10 in single-syllable recognition mode, extracts the fundamental frequency of the speech signal after framing, recognizes the tone from the fundamental-frequency contour over the whole utterance, and outputs the tone recognition result to the phonetic-to-character conversion and selection input module 90.
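The fundamental-frequency extraction step can be sketched as follows — a simplified per-frame autocorrelation estimator; the sample rate, search range and frame length are illustrative assumptions, not values from the patent:

```python
# Hypothetical sketch of frame-wise F0 extraction by autocorrelation,
# as a tone recognizer might use on each speech frame.
import math

def frame_f0(samples, sample_rate=8000, f0_min=80, f0_max=400):
    """Estimate the F0 of one frame from the strongest autocorrelation lag."""
    lag_min = sample_rate // f0_max
    lag_max = sample_rate // f0_min
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, min(lag_max, len(samples) - 1)):
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0

# A synthetic 200 Hz voiced frame should yield an estimate near 200 Hz;
# tracking this value frame by frame gives the contour used for tone decisions.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
f0 = frame_f0(frame, sr)
```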
  • Chinese characters with the recognized tone are still ordered on the mobile phone screen by character usage frequency. Characters of the other tones, i.e. the tones that were not recognized, are sorted uniformly by usage frequency and listed after all the characters corresponding to the recognized tone.
  • for example, for the syllable "cai", the Chinese characters displayed on the mobile phone screen are its homophones, such as 才, 材, 财, 菜, 彩, 踩 ("step on") and 猜 ("guess"), sorted by usage frequency. If the tone recognition result given in this embodiment is tone 1, the tone-1 character 猜 moves to the head of the list, followed by the remaining candidates.
  • characters sharing the same tone are still sorted by the frequency of use of the Chinese characters.
  • if the ranked tone candidates given by the recognizer are tone 1, tone 4, tone 3 and then tone 2, the characters on the screen are grouped in that tone order; that is, the candidates are sorted first by the tone recognition result and, within each tone, by the frequency of appearance of the Chinese characters.
  • the introduction of the tone recognition module can greatly improve the accuracy of the appearance order of Chinese characters in the candidate list, thereby making the input of Chinese characters faster.
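The tone-then-frequency candidate ordering described above can be sketched as follows (the character/tone/frequency table is illustrative, not data from the patent):

```python
# Hypothetical sketch: order homophone candidates by the recognizer's ranked
# tone hypotheses first, then by usage frequency within each tone group.

# (character, tone, frequency_rank) -- lower rank = more frequently used
CAI = [("才", 2, 1), ("菜", 4, 2), ("财", 2, 3),
       ("猜", 1, 4), ("踩", 3, 5)]

def order_candidates(chars, tone_ranking):
    def key(item):
        _, tone, freq = item
        # Position of this tone in the ranked hypotheses; unranked tones last.
        pos = tone_ranking.index(tone) if tone in tone_ranking else len(tone_ranking)
        return (pos, freq)
    return [c for c, _, _ in sorted(chars, key=key)]

# Recognizer ranks the tone hypotheses as 1, 4, 3, then 2:
ordered = order_candidates(CAI, [1, 4, 3, 2])
print(ordered)  # ['猜', '菜', '踩', '才', '财']
```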
  • a voice adaptation module is added to the voice recognition part to continuously and automatically learn the mobile phone user's voice, so that the performance of the voice operating system keeps improving.
  • the speech adaptation module may use the maximum likelihood linear regression (MLLR) method, the maximum a posteriori (MAP) method, or any other speech model adaptation method.
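As a hedged illustration (this is the standard MLLR formulation from the speech literature, not text from the patent): MLLR adapts the Gaussian mean vectors $\mu$ of the acoustic model through a shared affine transform estimated from the user's adaptation speech,

```latex
\hat{\mu} = A\mu + b, \qquad
W = [\,b \;\; A\,] = \arg\max_{W}\; p(O \mid \lambda, W)
```

where $O$ is the user's adaptation data, $\lambda$ the current model set, and the transform $W$ is shared across a regression class of Gaussians, which is what allows the module to keep learning from small amounts of the user's speech.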
  • the voice operating system and control method proposed by the present invention are not limited to Chinese voice mobile phones, and are applicable to any Oriental language, such as Japanese, Korean, Mongolian, etc.
  • the characteristics of such languages are: they cannot be directly represented by the ASCII characters on the keyboard, i.e. they cannot be typed directly on the existing QWERTY keyboard and must be converted through some input method; in addition, they are all monosyllabic pronunciations, each corresponding to one or several written symbols, and the number of pronunciation types is limited.
  • This embodiment is a mobile phone implementation using Japanese voice as the operating system.
  • the composition of the modules involved is exactly the same as that described in FIG. 1, except that the phonetic conversion and input selection module 90 is simpler, which is reflected in the control process.
  • the main differences are the Japanese voice input steps and the Japanese character display steps. Since Japanese pronunciation has a syllabary and each pronunciation corresponds to one Japanese kana, the Japanese kana input method needs no word selection; that is, phonetic symbols map to characters one to one.
  • Japanese kana comes in two forms, "hiragana" and "katakana". Distinguishing them can be handled in a way similar to distinguishing uppercase and lowercase letters; that is, switching between the two forms of kana can be realized with a single key.
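A minimal sketch of this one-to-one kana mapping with a single-key hiragana/katakana toggle (the mapping table is a small illustrative subset; the offset trick relies on the fixed 0x60 gap between the hiragana and katakana Unicode blocks):

```python
# Hypothetical sketch: one pronunciation -> one kana, with a case-like toggle.

HIRAGANA = {"a": "あ", "ka": "か", "sa": "さ", "na": "な"}

def to_katakana(hira):
    # Katakana code points sit at a fixed offset of 0x60 above hiragana.
    return "".join(chr(ord(ch) + 0x60) for ch in hira)

def input_kana(syllable, katakana_mode=False):
    hira = HIRAGANA[syllable]          # one pronunciation maps to one kana
    return to_katakana(hira) if katakana_mode else hira

print(input_kana("ka"))                      # か
print(input_kana("ka", katakana_mode=True))  # カ
```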
  • some simple terms such as "next screen" and "second" are added to the dictionary tree of Chinese characters, and the continuous recognition engine is enabled in the single-syllable recognition mode: if voice frames remain after the first result is recognized, a new recognition pass is started, so that terms such as "next screen" and "second" can be recognized and executed accordingly.
  • the user can use these instructions to select Chinese characters on the screen or scroll the screen without pressing keys, realizing hands-free Chinese character input, which is especially convenient for disabled users who cannot easily hold the phone. In addition, all the digits of a phone number can be spoken at once and recognized by the system.
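The dictionary-tree-plus-restart mechanism above can be sketched as follows. This is a hypothetical simplification: tokenized pinyin "frames" stand in for real acoustic frames, and the trie holds both single syllables and multi-token command terms, with recognition restarted on whatever frames remain after each match:

```python
# Hypothetical sketch: dictionary tree (trie) with command entries, and a
# continuous-recognition loop that restarts on the remaining frames.

class Trie:
    def __init__(self):
        self.root = {}

    def add(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})
        node["$"] = True  # end-of-entry marker

    def longest_match(self, tokens):
        """Length of the longest dictionary entry prefixing `tokens`."""
        node, best = self.root, 0
        for i, t in enumerate(tokens):
            if t not in node:
                break
            node = node[t]
            if "$" in node:
                best = i + 1
        return best

tree = Trie()
for entry in (["cai"], ["xia", "yi", "ping"]):  # a syllable and "next screen"
    tree.add(entry)

def recognize_all(frames):
    results = []
    while frames:
        n = tree.longest_match(frames)
        if n == 0:
            frames = frames[1:]      # skip an unmatchable frame
            continue
        results.append(frames[:n])
        frames = frames[n:]          # frames remain -> restart recognition
    return results

out = recognize_all(["cai", "xia", "yi", "ping"])
print(out)  # [['cai'], ['xia', 'yi', 'ping']]
```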
  • the mobile phone voice operating system in this embodiment does not have an Oriental-language voice input function, but can implement voice recognition and voice prompts for the command entries of non-specific persons.
  • the phonetic conversion and input selection module in Fig. 1 is changed to a module that implements Eastern language input using a keyboard or other methods.
  • the relevant control regarding the input of Chinese characters and the switching of the recognition mode may be cancelled.
  • the composition and control of other parts are the same as those of the first embodiment.
  • This embodiment can also use a voice to send a short message, and the content of the short message directly uses pinyin instead of being converted into Chinese (or other oriental languages) characters.
  • although the other party cannot see the content on the screen, it can be understood through the prompt voice, just as two people need not write while speaking; this feature is very useful for the blind.
  • the mobile phone voice operating system in this embodiment does not have an Oriental language voice input function and a voice prompt function, but can realize the recognition of full syllables of Chinese characters, and recognize voice commands for non-specific persons.
  • the phonetic conversion and input selection module in Fig. 1 is changed to a module that implements Oriental language input by using a keyboard or other methods, and the language generation and speech synthesis modules are eliminated.
  • compared with the control method of Fig. 2b, the controls for Eastern-language voice input and voice prompts are removed.
  • the composition and control of other parts are the same as those of the first embodiment.
  • Example 5: the mobile phone voice operating system in this embodiment has neither the Oriental-language voice input function nor the full-syllable recognition function for Chinese characters, but can perform full-syllable voice prompts.
  • the voice recognition part in FIG. 1 is cancelled, and the phonetic conversion and input selection module is changed to a module that implements Eastern language input by using a keyboard or other methods.
  • compared with the control method of FIG. 2b, the controls related to speech recognition and Chinese character voice input are removed.
  • the composition and control of other parts are the same as those of the first embodiment.
  • the application of the present invention is not limited to this. According to the main idea of the present invention, those skilled in the art can make many similar or equivalent changes.
  • the present invention can be applied not only to mobile phones but also to all portable digital mobile communication devices, such as PDAs, with outstanding application effects.
  • the dictionary tree may be divided into several types as needed, to improve recognition speed and accuracy.
  • the combinations of the various modules of the present invention are not limited to the above. Therefore, the protection scope of the present invention should be based on the content shown in the claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The invention relates to a portable digital mobile communication apparatus with a voice control system, and to a voice control method. When speech signals are recognized, their characteristic vector sequences are vector-quantized, and in the decoding operation each effective speech codeword supplies an observation probability for the search path from the decoder's probability plane. Full-syllable speech recognition can be achieved in a mobile phone without training, together with voice input of Chinese characters and full-syllable voice prompting. The system comprises a semantic analysis, dialogue management and language generation module, can handle complicated dialogue procedures, and returns flexible prompt messages to the user. Voice commands and prompt content can be personalized by the user.
PCT/CN2003/000870 2002-10-18 2003-10-17 Appareil de communication mobile numerique portable, procede de commande vocale et systeme WO2004036939A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CNB200380101122XA CN100403828C (zh) 2002-10-18 2003-10-17 一种便携式数字移动通讯设备及其语音控制方法和***
AU2003272871A AU2003272871A1 (en) 2002-10-18 2003-10-17 Portable digital mobile communication apparatus, method for controlling speech and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN02146276.3 2002-10-18
CN02146276 2002-10-18

Publications (1)

Publication Number Publication Date
WO2004036939A1 true WO2004036939A1 (fr) 2004-04-29

Family

ID=32098107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2003/000870 WO2004036939A1 (fr) 2002-10-18 2003-10-17 Appareil de communication mobile numerique portable, procede de commande vocale et systeme

Country Status (3)

Country Link
CN (1) CN100403828C (fr)
AU (1) AU2003272871A1 (fr)
WO (1) WO2004036939A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103137125A (zh) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 基于声控的智能电子设备和声控方法
CN108028043A (zh) * 2015-09-24 2018-05-11 微软技术许可有限责任公司 在参与者之间的对话中检测可行动项
WO2019136675A1 (fr) * 2018-01-11 2019-07-18 华为技术有限公司 Dispositif terminal, circuit et procédé de lecture audio dsd
CN111862985A (zh) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 一种语音识别装置、方法、电子设备及存储介质

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100359445C (zh) * 2006-01-27 2008-01-02 南京联慧通信技术有限公司 移动信息终端使用词组联想和语音提示的汉字输入方法
CN101132435B (zh) * 2006-08-21 2012-08-08 杭州小尔科技有限公司 盲人手机及通话方式
CN101094445B (zh) * 2007-06-29 2010-12-01 中兴通讯股份有限公司 一种实现文本短信语音播放的***及方法
WO2010098130A1 (fr) * 2009-02-27 2010-09-02 パナソニック株式会社 Dispositif de détermination de tonalité et procédé de détermination de tonalité
CN102237082B (zh) * 2010-05-05 2015-04-01 三星电子株式会社 语音识别***的自适应方法
CN102280106A (zh) * 2010-06-12 2011-12-14 三星电子株式会社 用于移动通信终端的语音网络搜索方法及其装置
CN102543076A (zh) * 2011-01-04 2012-07-04 ***通信集团公司 用于语音输入法的语音训练方法及相应的***
US8768707B2 (en) * 2011-09-27 2014-07-01 Sensory Incorporated Background speech recognition assistant using speaker verification
CN103092887B (zh) * 2011-11-07 2016-10-05 联想(北京)有限公司 电子设备及其语音信息提供方法
CN103543905B (zh) * 2012-07-16 2017-07-25 百度在线网络技术(北京)有限公司 语音控制终端的界面的方法及装置
CN102880649B (zh) * 2012-08-27 2016-03-02 北京搜狗信息服务有限公司 一种个性化信息处理方法和***
TW201409462A (zh) * 2012-08-31 2014-03-01 Chung Han Interlingua Knowledge Co Ltd 語意辨識方法
CN103699530A (zh) * 2012-09-27 2014-04-02 百度在线网络技术(北京)有限公司 根据语音输入信息在目标应用中输入文本的方法与设备
CN103731545B (zh) * 2012-10-16 2016-09-21 南京中兴软件有限责任公司 转屏控制方法、装置及电子设备
US8589164B1 (en) * 2012-10-18 2013-11-19 Google Inc. Methods and systems for speech recognition processing using search query information
CN103903618B (zh) * 2012-12-28 2017-08-29 联想(北京)有限公司 一种语音输入方法及电子设备
JP6059253B2 (ja) * 2012-12-28 2017-01-11 株式会社レイトロン 音声認識デバイス
CN103077716A (zh) * 2012-12-31 2013-05-01 威盛电子股份有限公司 辅助启动装置、语音操控***及其方法
CN104050962B (zh) * 2013-03-16 2019-02-12 广东恒电信息科技股份有限公司 基于语音合成技术的多功能阅读器
CN104375884B (zh) * 2013-08-15 2018-03-23 联想(北京)有限公司 一种信息处理方法和电子设备
US20150370787A1 (en) * 2014-06-18 2015-12-24 Microsoft Corporation Session Context Modeling For Conversational Understanding Systems
CN104599670B (zh) * 2015-01-30 2017-12-26 泰顺县福田园艺玩具厂 点读笔的语音识别方法
CN105141919A (zh) * 2015-09-01 2015-12-09 武汉同迅智能科技有限公司 一种语音远程控制的监控终端装置
CN105845139B (zh) * 2016-05-20 2020-06-16 北方民族大学 一种离线语音控制方法和装置
CN106205611B (zh) * 2016-06-29 2020-03-27 北京儒博科技有限公司 一种基于多模态历史响应结果的人机交互方法及***
CN106782517A (zh) * 2016-12-15 2017-05-31 咪咕数字传媒有限公司 一种语音音频关键词过滤方法及装置
CN106933809A (zh) * 2017-03-27 2017-07-07 三角兽(北京)科技有限公司 信息处理装置及信息处理方法
CN108198566B (zh) * 2018-01-24 2021-07-20 咪咕文化科技有限公司 信息处理方法及装置、电子设备及存储介质
CN109670185B (zh) * 2018-12-27 2023-06-23 北京百度网讯科技有限公司 基于人工智能的文本生成方法和装置
CN112153213A (zh) * 2019-06-28 2020-12-29 青岛海信移动通信技术股份有限公司 一种确定语音信息的方法和设备
CN110781305B (zh) * 2019-10-30 2023-06-06 北京小米智能科技有限公司 基于分类模型的文本分类方法及装置,以及模型训练方法
CN110827827A (zh) * 2019-11-27 2020-02-21 维沃移动通信有限公司 一种语音播报方法及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1142647A (zh) * 1995-03-01 1997-02-12 精工爱普生株式会社 语音识别对话装置
CN1162365A (zh) * 1994-11-01 1997-10-15 英国电讯公司 语音识别
US6249759B1 (en) * 1998-01-16 2001-06-19 Nec Corporation Communication apparatus using speech vector comparison and recognition
CN1316863A (zh) * 2000-04-04 2001-10-10 李秀星 语音识别操作便携电话机的方法和***
WO2002005263A1 (fr) * 2000-07-07 2002-01-17 Siemens Aktiengesellschaft Procede d'entree et de reconnaissance vocale

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020076009A1 (en) * 2000-12-15 2002-06-20 Denenberg Lawrence A. International dialing using spoken commands
US6937983B2 (en) * 2000-12-20 2005-08-30 International Business Machines Corporation Method and system for semantic speech recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1162365A (zh) * 1994-11-01 1997-10-15 英国电讯公司 语音识别
CN1142647A (zh) * 1995-03-01 1997-02-12 精工爱普生株式会社 语音识别对话装置
US6249759B1 (en) * 1998-01-16 2001-06-19 Nec Corporation Communication apparatus using speech vector comparison and recognition
CN1316863A (zh) * 2000-04-04 2001-10-10 李秀星 语音识别操作便携电话机的方法和***
WO2002005263A1 (fr) * 2000-07-07 2002-01-17 Siemens Aktiengesellschaft Procede d'entree et de reconnaissance vocale

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103137125A (zh) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 基于声控的智能电子设备和声控方法
CN108028043A (zh) * 2015-09-24 2018-05-11 微软技术许可有限责任公司 在参与者之间的对话中检测可行动项
CN108028043B (zh) * 2015-09-24 2021-11-19 微软技术许可有限责任公司 在参与者之间的对话中检测可行动项
WO2019136675A1 (fr) * 2018-01-11 2019-07-18 华为技术有限公司 Dispositif terminal, circuit et procédé de lecture audio dsd
US11606459B2 (en) 2018-01-11 2023-03-14 Honor Device Co., Ltd. Terminal device, and DSD audio playback circuit and method
CN111862985A (zh) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 一种语音识别装置、方法、电子设备及存储介质
CN111862985B (zh) * 2019-05-17 2024-05-31 北京嘀嘀无限科技发展有限公司 一种语音识别装置、方法、电子设备及存储介质

Also Published As

Publication number Publication date
CN100403828C (zh) 2008-07-16
AU2003272871A8 (en) 2004-05-04
CN1703923A (zh) 2005-11-30
AU2003272871A1 (en) 2004-05-04

Similar Documents

Publication Publication Date Title
WO2004036939A1 (fr) Appareil de communication mobile numerique portable, procede de commande vocale et systeme
Rudnicky et al. Survey of current speech technology
US6463413B1 (en) Speech recognition training for small hardware devices
KR100769029B1 (ko) 다언어의 이름들의 음성 인식을 위한 방법 및 시스템
CA2466652C (fr) Technique de compression de donnees de dictionnaire
TWI281146B (en) Apparatus and method for synthesized audible response to an utterance in speaker-independent voice recognition
US20080126093A1 (en) Method, Apparatus and Computer Program Product for Providing a Language Based Interactive Multimedia System
WO2000058943A1 (fr) Systeme et procede de synthese de la parole
WO2009006081A2 (fr) Correction de prononciation de synthétiseurs texte-parole entre différentes langues parlées
Cohen Embedded speech recognition applications in mobile phones: Status, trends, and challenges
WO2022057759A1 (fr) Procédé de conversion de voix et dispositif associé
CN112131359A (zh) 一种基于图形化编排智能策略的意图识别方法及电子设备
CN110634466A (zh) 具有高感染力的tts处理技术
CN112669815A (zh) 歌曲定制生成方法及其相应的装置、设备、介质
Kurian et al. Continuous speech recognition system for Malayalam language using PLP cepstral coefficient
CN114242093A (zh) 语音音色转换方法、装置、计算机设备和存储介质
JP4230142B2 (ja) 悪環境下でのキーパッド/音声を用いたハイブリッドな東洋文字認識技術
WO2008118038A1 (fr) Procédé d'échange de messages et dispositif permettant sa mise en oeuvre
Gilbert et al. Intelligent virtual agents for contact center automation
JP2004252121A (ja) 言語処理装置および言語処理方法、並びにプログラムおよび記録媒体
CA2597826C (fr) Methode, logiciel et dispositif pour identifiant unique d'un contact desire dans une base de donnees de contact base sur un seul enonce
Gardner-Bonneau et al. Spoken language interfaces for embedded applications
JP7511623B2 (ja) 情報処理装置、情報処理システム、情報処理方法及びプログラム
KR102574311B1 (ko) 음성 합성 서비스를 제공하는 장치, 단말기 및 방법
KR102392992B1 (ko) 음성 인식 기능을 활성화시키는 호출 명령어 설정에 관한 사용자 인터페이싱 장치 및 방법

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 20038A1122X

Country of ref document: CN

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP