WO2001048738A1 - A global approach for segmenting characters into words - Google Patents

A global approach for segmenting characters into words

Info

Publication number
WO2001048738A1
WO2001048738A1 PCT/CN1999/000213
Authority
WO
WIPO (PCT)
Prior art keywords
probability
segmentation
path
word
segmentation path
Prior art date
Application number
PCT/CN1999/000213
Other languages
English (en)
Inventor
Yonghong Yan
Lingyun Tuo
Zhiwei Lin
Xiangdong Zhang
Robert Yung
Original Assignee
Intel Corporation
Intel Architecture Development Shanghai Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation and Intel Architecture Development Shanghai Co., Ltd.
Priority to AU17672/00A priority Critical patent/AU1767200A/en
Priority to CNB998170828A priority patent/CN1192354C/zh
Priority to PCT/CN1999/000213 priority patent/WO2001048738A1/fr
Publication of WO2001048738A1 publication Critical patent/WO2001048738A1/fr

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/083 Recognition networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams

Definitions

  • The present invention relates to speech recognition systems and, more particularly, to segmenting characters into words in a speech recognition system.
  • A popular way to capture the syntactic structure of a given language is to use conditional probability to capture the sequential information embedded in the word strings in sentences.
  • A language model can be constructed indicating the probabilities that certain other words W2, W3, ... Wn will follow W1.
  • P31 is the probability that word W3 will follow word W1.
  • P41 is the probability that word W4 will follow word W1.
  • Pn1 is the probability that word Wn will follow word W1.
  • The maximum of P21, P31, ... Pn1 can be identified and used in the language model.
  • The preceding examples are for bi-gram probabilities, although tri-gram conditional probabilities may also be computed.
  • Language models are often created by examining written literature (such as newspapers) and determining the conditional probabilities of the vocabulary words with respect to others of the vocabulary words.
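As an illustrative sketch (the function name and the toy corpus are assumptions, not from the patent), bi-gram conditional probabilities of this kind can be estimated from already-segmented text by maximum likelihood:

```python
from collections import Counter

def bigram_probs(sentences):
    """Estimate P(w2 | w1) by maximum likelihood from tokenized sentences."""
    pair_counts = Counter()
    word_counts = Counter()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            word_counts[w1] += 1
    # divide each pair count by the count of its first word
    return {(w1, w2): c / word_counts[w1] for (w1, w2), c in pair_counts.items()}

probs = bigram_probs([["the", "cat", "sat"], ["the", "cat", "ran"]])
print(probs[("the", "cat")])  # 1.0: "cat" followed "the" in every observed case
```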
  • Words can be written as one or more symbolic characters, for example, Hanzi (Chinese) and Kanji (Japanese).
  • Sentences are composed of character strings, in which words are implicit because there are no spaces between contiguous words.
  • A particular character may be a word all by itself or may join with a character before it or after it (or possibly both before and after it) to form a word.
  • The meaning of words can change depending on how the characters are joined or separated in creating the words. In written form, however, there are no spaces between the characters, so it is not visually evident whether a particular character is a word all by itself or joins with another character or characters to form a word. Rather, the word a particular character belongs to is understood from the context.
  • The words are explicitly extracted by putting spaces on the word boundaries.
  • The greedy algorithm involves the following: (1) start from the beginning of the given sentence being processed and exhaust all the possible words that match the initial part of the character string in the sentence.
  • The greedy algorithm does not always make the best choice from a global perspective. In fact, it may choose combinations that are not only not optimal but also not syntactically correct. As stated in T. Cormen et al., "Introduction to Algorithms" (The MIT Press, 1990), p. 329: "A greedy algorithm always makes the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution."
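A minimal sketch of the longest-match (greedy) procedure, with illustrative names and a toy vocabulary that are assumptions rather than the patent's data; it also shows the locally optimal trap described above:

```python
def greedy_segment(chars, vocabulary, max_word_len=4):
    """Longest-match (greedy) segmentation: at each position, take the
    longest vocabulary word starting there, falling back to one character."""
    words = []
    i = 0
    while i < len(chars):
        # try the longest candidate first
        for n in range(min(max_word_len, len(chars) - i), 0, -1):
            candidate = chars[i:i + n]
            if n == 1 or candidate in vocabulary:
                words.append(candidate)
                i += n
                break
    return words

# "ABC" looks best locally, but it strands "D" as a one-character word,
# blocking the alternative segmentation AB + CD
print(greedy_segment("ABCD", {"ABC", "AB", "CD"}))  # ['ABC', 'D']
```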
  • The invention includes a method.
  • The method involves creating a path list of segmentation paths of characters using a vocabulary.
  • A probability of a first segmentation path is determined and designated as the best segmentation path.
  • The probability of an additional one of the segmentation paths is determined and compared with the probability of the best segmentation path. If the probability of the additional segmentation path exceeds that of the best segmentation path, the additional segmentation path is designated the best segmentation path. This is repeated until the probabilities of all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
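The comparison loop described above can be sketched as follows; the scoring function here is a toy stand-in for a real path probability, and all names are illustrative:

```python
def best_segmentation(paths, path_probability):
    """Designate the first path as best, then compare each remaining
    path's probability against the current best, keeping the winner."""
    best_path = paths[0]
    best_prob = path_probability(best_path)
    for path in paths[1:]:
        prob = path_probability(path)
        if prob > best_prob:
            best_path, best_prob = path, prob
    return best_path, best_prob

# toy scoring: prefer segmentations with fewer words
score = lambda path: 1.0 / len(path)
paths = [["A", "B", "CD"], ["AB", "CD"], ["A", "BCD"]]
print(best_segmentation(paths, score))  # (['AB', 'CD'], 0.5)
```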
  • In other embodiments, the invention is an apparatus including a computer-readable medium that performs such a method. In still other embodiments, the invention is a computer system.
  • Figure 1 is a high-level schematic block diagram representation of a computer system that may be used in connection with some embodiments of the invention.
  • Figure 2 is a high-level schematic representation of a hand-held computer system that may be used in connection with some embodiments of the invention.
  • The invention involves a system and method to segment words from characters; that is, the invention involves deciding which word a character should belong to.
  • The invention has particular application in connection with languages, such as Chinese and Japanese, that do not have spaces between characters indicating word segmentation, but it is not limited to such use.
  • The disclosed invention is designed to make a better segmentation of words given any sentence. This leads to a language model better than one obtained by the traditional method, which uses the greedy algorithm described above. A better language model will lead to better recognition accuracy, since it describes the language better in terms of word strings.
  • The invention performs segmentation by using a dynamic programming algorithm equipped with a statistical language model.
  • The dynamic algorithm can be implemented in various ways.
  • One example of a dynamic algorithm is as follows. First, the corpus (i.e., the characters to be segmented into words) is processed through the traditional greedy algorithm to calculate an n-gram language model. Then, a Viterbi algorithm is used to re-segment the sentence.
  • The Viterbi algorithm is a type of dynamic programming, which may be used in global optimization. See T. Cormen et al., "Introduction to Algorithms" (The MIT Press, 1990), pp. 301-28.
  • The Viterbi algorithm we used can be described by equation (1) as follows:
  • Equation (1) involves finding the words wi that lead to the maximum in equation (1).
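The equation itself did not survive extraction here. A standard bi-gram Viterbi criterion consistent with the surrounding description (a reconstruction, not the patent's literal equation (1); w0 denotes a sentence-start symbol) would take the form:

```latex
W^{*} = \arg\max_{w_1, \ldots, w_n} \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```

Here the maximization runs over all ways of segmenting the character string into vocabulary words w1 ... wn.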
  • Equation (1) is in a bi-gram format, but other formats, such as tri-gram or uni-gram formats, may be used if they are available in the language model. Backoff weights and other techniques may also be used.
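A dynamic-programming re-segmentation in the spirit of the Viterbi step can be sketched as follows; `bigram_logprob` is a hypothetical scoring function standing in for the language model, and all names and the toy model are illustrative assumptions:

```python
import math

def viterbi_segment(chars, vocabulary, bigram_logprob, max_word_len=4):
    """best[i] holds (score, words) for the best-scoring segmentation of
    chars[:i]; each step extends a shorter prefix by one more word."""
    best = {0: (0.0, [])}
    for i in range(1, len(chars) + 1):
        for n in range(1, min(max_word_len, i) + 1):
            word = chars[i - n:i]
            if n > 1 and word not in vocabulary:
                continue  # multi-character candidates must be known words
            if i - n not in best:
                continue
            prev_score, prev_words = best[i - n]
            prev = prev_words[-1] if prev_words else "<s>"
            score = prev_score + bigram_logprob(prev, word)
            if i not in best or score > best[i][0]:
                best[i] = (score, prev_words + [word])
    return best[len(chars)][1]

# toy bi-gram model: the pair ("AB", "CD") is far more likely than any other
def lp(prev, word):
    return math.log(0.9) if (prev, word) == ("AB", "CD") else math.log(0.1)

print(viterbi_segment("ABCD", {"AB", "CD", "ABC"}, lp))  # ['AB', 'CD']
```

Where a greedy matcher would commit to "ABC" and strand "D", the global search keeps the higher-probability path AB + CD.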
  • Initially, each character may be a word all by itself. However, the invention involves determining whether the characters may be joined with other characters to make additional words or are better left alone.
  • A word consisting of multiple characters may also be called a term or phrase.
  • The greedy algorithm may be expressed in pseudocode as follows:

        // a line buffer is a group of memory, and is not restricted
        // to any particular form
        while (line buffer is not empty) {
            find the longest word in vocabulary which matches the head of the line buffer;
            output this word and a word separator character;
            remove the matched head from the line buffer;
        }

  • A segmentation algorithm using a language model includes the following:

        Read language model;            // load the language model into memory
        Create path list;               // a segmentation path is a possible segmentation of
                                        // characters; different formats may be used to store
                                        // the paths, e.g., a list or a tree structure
        Calculate path probabilities;   // Equation (1) or another equation may be
                                        // used to calculate the probability
  • When correctly segmented, the example sentence means "has ways and strength to solve the problem."
  • The present invention successfully segments the sentence, while the traditional way fails to do so.
  • In Example 1, consider the original text as consisting of the following eight characters in order: C1, C2, C3, C4, C5, C6, C7, and C8. From the original text, it is not visually clear how to group the characters to form the words. Table 1 below gives two possible ways of grouping the characters into five words W1 - W5.
  • A greedy algorithm is used to create a greedy segmentation path as follows.
  • The longest word in the vocabulary of consecutive characters in the corpus that starts with character C1 is one consisting of only character C1.
  • C1C2 is not a word in the vocabulary. Therefore, word W1 is character C1.
  • Word W1 leaves the line buffer and the next character becomes the head of the line, although that is an implementation detail that is not required.
  • The next character is C2.
  • The longest word in the vocabulary of consecutive characters in the corpus that starts with character C2 is a word consisting of characters C2C3.
  • C2C3 is in the vocabulary, but C2C3C4 is not. Therefore, word W2 is characters C2C3.
  • The longest word in the vocabulary of consecutive characters in the corpus that starts with character C4 is one consisting of characters C4C5. Therefore, word W3 is C4C5.
  • The longest word in the vocabulary of consecutive characters in the corpus that starts with character C6 is a word consisting of C6. Therefore, word W4 is C6.
  • The longest word in the vocabulary of consecutive characters in the corpus that starts with character C7 is a word consisting of C7C8. Therefore, word W5 is C7C8.
  • The probability of this greedy segmentation path is calculated. With respect to words W1 and W2 and characters C1, C2, and C3, the only segmentation path that is included in the vocabulary is that already chosen by the greedy algorithm.
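Replaying Example 1 with symbolic character tokens; the vocabulary below is hypothetical, chosen only to reproduce the walkthrough above:

```python
def greedy_match(tokens, vocabulary):
    """Longest-match segmentation over symbolic character tokens;
    words are represented as tuples of tokens."""
    words, i = [], 0
    while i < len(tokens):
        # try the longest candidate starting at the head, fall back to one token
        for n in range(len(tokens) - i, 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n == 1 or candidate in vocabulary:
                words.append(candidate)
                i += n
                break
    return words

vocab = {("C2", "C3"), ("C4", "C5"), ("C7", "C8")}
chars = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]
# W1..W5 come out exactly as in the walkthrough above
print(greedy_match(chars, vocab))
# [('C1',), ('C2', 'C3'), ('C4', 'C5'), ('C6',), ('C7', 'C8')]
```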
  • The line may be a sentence.
  • The term "sentence" refers to a group of successive words ending with a symbol, such as a period.
  • Different groups of characters may be considered in the segmentation path.
  • The segmentation path could consider all characters in a sentence.
  • Alternatively, the segmentation path could consider a moving window of characters without regard for sentence endings, except to note that the language model will not allow a character at the end of a sentence to join with the first character of the next sentence.
  • The window may be a set number of characters.
  • For example, the segmentation path could include X characters, with a new segmentation path starting with the last character of the previous path if it is not in a word. Other possibilities exist.
  • Figure 1 illustrates a highly schematic representation of a computer system 10 which includes a processor 14, memory 16, and input/output and control block 18.
  • Memory 16 may include a line buffer 22.
  • The line buffer is merely a group of memory and does not have to have any particular characteristics. For example, it does not have to have contiguous memory cells.
  • A line buffer 24 is shown in processor 14; however, a line buffer does not need to be in processor 14.
  • Not every embodiment of the invention has a line buffer.
  • The segmentation paths do not need to be stored in a line buffer.
  • At least some of the input/output and control block 18 could be on the same chip as processor 14, or be on a separate chip.
  • Figure 1 is merely exemplary, and the invention is not limited to use with such a computer system.
  • Computer system 10 and other computer systems used to carry out the invention may be in a variety of forms, such as desktop, mainframe, and portable computers.
  • Figure 2 illustrates a handheld device 60, with a display 62, which may incorporate some or all of the features of Figure 1.
  • The handheld device may at times interface with another computer system, such as that of Figure 1.
  • The shapes and relative sizes of the objects in Figures 1 and 2 are not intended to suggest actual shapes and relative sizes.
  • Perplexity is an entropy-based measure of the complexity of a language.
  • A model with lower perplexity is better than one with higher perplexity.
  • Evaluations were conducted using People's Daily data from the years 1994 to 1998, with trigram models estimated with different segmentation methods.
  • The traditional (greedy) method has a perplexity of 182, while an embodiment of the invention resulted in 143. This is a significant improvement in modeling accuracy compared with the previous technique.
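Perplexity as used in such evaluations can be computed as the exponential of the average negative log-probability the model assigns to a held-out word sequence (a standard formulation; this sketch is not from the patent):

```python
import math

def perplexity(word_probs):
    """Perplexity over a test sequence, given the model's probability
    for each word in that sequence; lower values indicate a better model."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# a model assigning probability 1/182 to every word has perplexity 182
print(round(perplexity([1 / 182] * 10)))  # 182
```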

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

In some embodiments, the invention includes a method that involves creating a path list of segmentation paths of characters using a vocabulary, determining a probability of a first segmentation path and designating it as the best segmentation path, and then determining the probability of an additional one of the segmentation paths and comparing it with the probability of the best segmentation path. If the probability of the additional segmentation path exceeds that of the best segmentation path, the additional segmentation path is designated the best segmentation path. This is repeated until the probability of every remaining segmentation path has been determined and compared with the probability of the best segmentation path. In some embodiments, the invention is an apparatus including a computer-readable medium that carries out this method. In other embodiments, the invention is a computer system. Additional embodiments are also part of the invention.
PCT/CN1999/000213 1999-12-23 1999-12-23 Approche globale de segmentation de caracteres en mots WO2001048738A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU17672/00A AU1767200A (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words
CNB998170828A CN1192354C (zh) 1999-12-23 1999-12-23 划分字为词的全局方法
PCT/CN1999/000213 WO2001048738A1 (fr) 1999-12-23 1999-12-23 Approche globale de segmentation de caracteres en mots

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN1999/000213 WO2001048738A1 (fr) 1999-12-23 1999-12-23 Approche globale de segmentation de caracteres en mots

Publications (1)

Publication Number Publication Date
WO2001048738A1 true WO2001048738A1 (fr) 2001-07-05

Family

ID=4575157

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN1999/000213 WO2001048738A1 (fr) 1999-12-23 1999-12-23 Approche globale de segmentation de caracteres en mots

Country Status (3)

Country Link
CN (1) CN1192354C (fr)
AU (1) AU1767200A (fr)
WO (1) WO2001048738A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609671B (zh) * 2009-07-21 2011-09-07 北京邮电大学 一种连续语音识别结果评价的方法和装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4059725A (en) * 1975-03-12 1977-11-22 Nippon Electric Company, Ltd. Automatic continuous speech recognition system employing dynamic programming
US4667341A (en) * 1982-02-01 1987-05-19 Masao Watari Continuous speech recognition system
EP0380297A2 (fr) * 1989-01-24 1990-08-01 Canon Kabushiki Kaisha Procédé et dispositif pour la reconnaissance de la parole
US5706397A (en) * 1995-10-05 1998-01-06 Apple Computer, Inc. Speech recognition system with multi-level pruning for acoustic matching
US5799276A (en) * 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
US5862519A (en) * 1996-04-02 1999-01-19 T-Netix, Inc. Blind clustering of data with application to speech processing systems
WO1999014741A1 (fr) * 1997-09-18 1999-03-25 Siemens Aktiengesellschaft Procede pour la reconnaissance d'un mot de passe dans un message oral
JP2000075895A (ja) * 1998-08-05 2000-03-14 Texas Instr Inc <Ti> 連続音声認識用n最良検索方法


Also Published As

Publication number Publication date
CN1398395A (zh) 2003-02-19
CN1192354C (zh) 2005-03-09
AU1767200A (en) 2001-07-09

Similar Documents

Publication Publication Date Title
JP6827548B2 (ja) 音声認識システム及び音声認識の方法
JP4302326B2 (ja) テキストの自動区分
Siivola et al. Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner
US7634406B2 (en) System and method for identifying semantic intent from acoustic information
US6983239B1 (en) Method and apparatus for embedding grammars in a natural language understanding (NLU) statistical parser
US6738741B2 (en) Segmentation technique increasing the active vocabulary of speech recognizers
US7480612B2 (en) Word predicting method, voice recognition method, and voice recognition apparatus and program using the same methods
US8849668B2 (en) Speech recognition apparatus and method
Riley et al. Automatic generation of detailed pronunciation lexicons
JP2008262279A (ja) 音声検索装置
US8255220B2 (en) Device, method, and medium for establishing language model for expanding finite state grammar using a general grammar database
Hakkinen et al. N-gram and decision tree based language identification for written words
WO2007097208A1 (fr) Dispositif de traitement de langue, procede de traitement de langue et programme de traitement de langue
WO2019014183A1 (fr) Reconnaissance vocale automatique basée sur des syllabes
EP3598321A1 (fr) Procédé d&#39;analyse de texte en langue naturelle ayant des liaisons de construction constituantes
US11869491B2 (en) Abstract generation device, method, program, and recording medium
JP4878220B2 (ja) モデル学習方法、情報抽出方法、モデル学習装置、情報抽出装置、モデル学習プログラム、情報抽出プログラム、およびそれらプログラムを記録した記録媒体
Siivola et al. Morfessor and VariKN machine learning tools for speech and language technology
WO2001048738A1 (fr) Approche globale de segmentation de caracteres en mots
WO2001048737A2 (fr) Systeme de reconnaissance vocale dote d&#34;un arbre lexical utilisant le modele de langage de type n-gram
KR100277690B1 (ko) 화행 정보를 이용한 음성 인식 방법
JP5137588B2 (ja) 言語モデル生成装置及び音声認識装置
CN115188365B (zh) 一种停顿预测方法、装置、电子设备及存储介质
JP2000075885A (ja) 音声認識装置
JP5046902B2 (ja) 音声検索装置

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 998170828

Country of ref document: CN

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase