WO2001048738A1 - A global approach for segmenting characters into words - Google Patents

A global approach for segmenting characters into words

Info

Publication number
WO2001048738A1
WO2001048738A1 (PCT/CN1999/000213)
Authority
WO
WIPO (PCT)
Prior art keywords
probability
segmentation
path
word
segmentation path
Prior art date
Application number
PCT/CN1999/000213
Other languages
French (fr)
Inventor
Yonghong Yan
Lingyun Tuo
Zhiwei Lin
Xiangdong Zhang
Robert Yung
Original Assignee
Intel Corporation
Intel Architecture Development Shanghai Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation and Intel Architecture Development Shanghai Co., Ltd.
Priority to CNB998170828A (CN1192354C)
Priority to PCT/CN1999/000213 (WO2001048738A1)
Priority to AU17672/00A (AU1767200A)
Publication of WO2001048738A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/083 Recognition networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

In some embodiments, the invention includes a method. The method involves creating a path list of segmentation paths of characters using a vocabulary. A probability of a first segmentation path is determined and designated as the best segmentation path. The probability of an additional one of the segmentation paths is determined and compared with the probability of the best segmentation path. If the probability of the additional segmentation path exceeds that of the best segmentation path, the additional segmentation path is designated the best segmentation path. This is repeated until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path. In some embodiments, the invention is an apparatus including a computer readable medium that performs such a method. In still other embodiments, the invention is a computer system. Additional embodiments are described and claimed.

Description

A GLOBAL APPROACH FOR SEGMENTING CHARACTERS INTO WORDS
Background of the Invention
Technical Field of the Invention: The present invention relates to speech recognition systems and, more particularly, to segmenting characters into words in a speech recognition system.
Background Art: One component in a speech recognizer is the language model. A popular way to capture the syntactic structure of a given language is to use conditional probability to capture the sequential information embedded in the word strings in sentences. For example, if the current word is W1, a language model can be constructed indicating the probabilities that certain other words W2, W3, ..., Wn will follow W1. The probabilities of the words can be expressed such that P21 is the probability that word W2 will follow word W1, where P21 = P(W2|W1). In this notation, P31 is the probability that word W3 will follow word W1, P41 is the probability that word W4 will follow word W1, and so forth, with Pn1 being the probability that Wn will follow word W1. The maximum of P21, P31, ..., Pn1 can be identified and used in the language model. The preceding examples are for bi-gram probabilities, although tri-gram conditional probabilities may also be computed.
Language models are often created through looking at written literature (such as newspapers) and determining the conditional probabilities of the vocabulary words with respect to others of the vocabulary words.
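As an illustration of how such conditional probabilities are estimated (the sketch below is not from the patent; the function name, the corpus format, and the relative-frequency estimate are assumptions), bi-gram probabilities can be computed by counting over an already word-segmented corpus:

    from collections import Counter

    def bigram_probs(sentences):
        # P(W2|W1) estimated as count(W1 W2) / count(W1)
        unigram, bigram = Counter(), Counter()
        for words in sentences:        # each sentence is a list of words
            unigram.update(words)
            bigram.update(zip(words, words[1:]))
        return {(w1, w2): c / unigram[w1] for (w1, w2), c in bigram.items()}

    # example: P("model" | "language") from two tiny "sentences"
    probs = bigram_probs([["a", "language", "model"], ["a", "language", "test"]])
    print(probs[("language", "model")])  # 0.5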
In some languages, such as Chinese and Japanese, words can be written as one or more symbolic characters, for example, Han zi (Chinese) and Kanji (Japanese). Sentences are composed of character strings, where words are implicit because there are no spaces between contiguous words. A particular character may be a word all by itself or may join with a character before it or after it (or possibly both before and after it) to form the word. The meaning of words can change depending on how the characters are joined or separated in creating the words. In written form, however, there are no spaces between the characters so it is not visually evident whether a particular character is a word all by itself or joins with another character or characters to form the word. Rather, the word a particular character belongs to is understood from the context. In order to apply a statistical method for language modeling, the words are explicitly extracted by putting spaces on the word boundaries.
Traditionally, character segmentation into words is done by a "greedy algorithm." The greedy algorithm involves the following: (1) Start from the beginning of the given sentence being processed and exhaust all the possible words that match the initial part of the character string in the sentence.
(2) Pick the longest word (i.e., the word that has the largest number of characters) and put a space at the end of the matched substring in the sentence; treat the remaining character string as a new sentence and repeat step (1) until all the characters in the sentence are processed.
The greedy algorithm does not always make the best choice from a global perspective. In fact, it may choose combinations that are not only not optimal but also not syntactically correct. As stated in T. Cormen et al., "Introduction to Algorithms" (The MIT Press, 1990), p. 329: "A greedy algorithm always makes the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution."
Summary
In some embodiments, the invention includes a method. The method involves creating a path list of segmentation paths of characters using a vocabulary. A probability of a first segmentation path is determined and designated as the best segmentation path. The probability of an additional one of the segmentation paths is determined and compared with the probability of the best segmentation path. If the probability of the additional segmentation path exceeds that of the best segmentation path, the additional segmentation path is designated the best segmentation path. This is repeated until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
In some embodiments, the invention is an apparatus including a computer readable medium that performs such a method. In still other embodiments, the invention is a computer system.
Additional embodiments are described and claimed.
Brief Description of the Drawings
The invention will be understood more fully from the detailed description given below and from the accompanying drawings of embodiments of the invention which, however, should not be taken to limit the invention to the specific embodiments described, but are for explanation and understanding only.
Figure 1 is a high level schematic block diagram representation of a computer system that may be used in connection with some embodiments of the invention. Figure 2 is a high level schematic representation of a hand-held computer system that may be used in connection with some embodiments of the invention.
Detailed Description
The invention involves a system and method to segment words from characters. That is, the invention involves deciding which word a character should belong to. The invention has particular application in connection with languages such as Chinese and Japanese that do not have spaces between characters indicating word segmentation, but is not limited to such use. The disclosed invention is designed to make a better segmentation of words given any sentence. This leads to a language model better than the one obtained by the traditional method, which uses a greedy algorithm, described above. A better language model will lead to better recognition accuracy since it describes the language better in terms of word strings. In some embodiments, the invention performs segmentation by using a dynamic programming algorithm equipped with a statistical language model. There are various ways in which the dynamic algorithm can be implemented. One example of a dynamic algorithm is as follows. First, the corpus (i.e., the characters to be segmented into words) is processed through the traditional greedy algorithm to calculate an n-gram language model. Then, a Viterbi algorithm is used to re-segment the sentence. The Viterbi algorithm is a type of dynamic programming, which may be used in global optimization. See T. Cormen et al., "Introduction to Algorithms" (The MIT Press, 1990), pp. 301-28. The Viterbi algorithm we used can be described as in equation (1) as follows:
    P_{w_i} = max(P_{w_{i-1}} + prob(w_i | w_{i-1}))        (1)

In equation (1), P is probability and "prob" involves the language model. In equation (1), w_i is the ith word, w_{i-1} is the word immediately preceding w_i, P_{w_{i-1}} is the probability of the (i-1)th word occurring, and prob(w_i | w_{i-1}) is the conditional probability of word w_i occurring if word w_{i-1} occurs. Equation (1) involves finding the word w_i that leads to the maximum in equation (1). By solving equation (1), the resulting word sequence (w_0, w_1, ..., w_N) is guaranteed to be the best partition in the maximum likelihood sense. In some embodiments, when i = N, such that the end of a sentence is reached, there is a global maximization.
Equation (1) is in a bi-gram format, but other formats, such as tri-gram or uni-gram formats, may be used if they are available in the language model. Backoff weights and other techniques may also be used.
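The patent gives no code for this search; the following minimal Python sketch shows one way to implement the recursion of equation (1) over a character string. It assumes log probabilities (so that products of conditional probabilities become the sums shown in equation (1)), a set-based vocabulary with single characters always allowed as fallback words, unigram/bigram score dictionaries, and a 1e-9 floor for unseen events; all of these names and choices are assumptions, not the patent's:

    import math

    def viterbi_segment(line, vocabulary, unigram, bigram, max_word_len=8):
        # best[i] holds (score, path) for the best segmentation of line[:i],
        # following the recursion in equation (1)
        n = len(line)
        best = [(float("-inf"), [])] * (n + 1)
        best[0] = (0.0, [])
        for i in range(n):
            score_i, path_i = best[i]
            if score_i == float("-inf"):
                continue  # no segmentation reaches position i
            for j in range(i + 1, min(n, i + max_word_len) + 1):
                word = line[i:j]
                # single characters are always allowed as fallback words
                if word not in vocabulary and j - i > 1:
                    continue
                prev = path_i[-1] if path_i else None
                p = bigram.get((prev, word)) if prev else unigram.get(word)
                step = math.log(p) if p else math.log(1e-9)
                if score_i + step > best[j][0]:
                    best[j] = (score_i + step, path_i + [word])
        return best[n][1]  # the maximum-likelihood word sequence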
As noted, in some languages, each character may be a word all by itself. However, the invention involves determining whether the characters may be joined with other characters to make additional words or are better left alone. The word consisting of multiple characters may also be called a term or phrase.
A version of the greedy algorithm is provided in pseudo code below:

    Read vocabulary;        // the vocabulary is the list of possible words
    Open language corpus;   // the language corpus includes the characters
                            // to segment into words
    while (not end of language corpus)
    {
        read a line from language corpus and put it into a line buffer;
        // a line buffer is a group of memory, and is not restricted
        // to any particular form
        while (line buffer is not empty)
        {
            find the longest word in vocabulary which matches the head of the line buffer;
            output this word and a word separator character;
            remove the matched head from the line buffer;
        }
        output a line separator character;
    }
    Close language corpus;
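For concreteness, here is a minimal runnable sketch of the same longest-match-first procedure in Python; the function name, the set-based vocabulary, the max_word_len bound, and the single-character fallback are assumptions made for illustration, not part of the patent text:

    def greedy_segment(line, vocabulary, max_word_len=8):
        # longest-match-first: repeatedly take the longest vocabulary word
        # that matches the head of the remaining character string
        words = []
        i = 0
        while i < len(line):
            for j in range(min(len(line), i + max_word_len), i, -1):
                # fall back to a single character when nothing longer matches
                if line[i:j] in vocabulary or j == i + 1:
                    words.append(line[i:j])
                    i = j
                    break
        return words

    # example: with this vocabulary the greedy result is ['ABC', 'D']
    print(greedy_segment("ABCD", {"AB", "BC", "ABC", "D"}))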
In some embodiments, a segmentation algorithm using a language model according to the present invention includes the following:

    Read language model;    // load the language model into memory or
                            // otherwise make it available
    Read vocabulary;
    Open language corpus;
    while (not end of language corpus)
    {
        Read a line from language corpus and put it into line buffer;
        // the number of characters in a line can vary depending
        // on the implementation; a line can be a sentence
        Create a path list containing all the possible segmentation paths using the vocabulary;
        // a segmentation path is a possible segmentation of characters;
        // different formats may be used to store the paths, e.g., a list or
        // a tree structure
        Find the greedy segmentation path and save it as the best path;
        // various greedy algorithms may be used, such as the one provided above;
        // in this implementation of the invention, the greedy segmentation path is
        // initially considered the best path, but other initial paths
        // could be used
        Calculate the probability of this path using the language model and set this value as the maximum probability;
        // the language model specifies the probability of a word
        // occurring and a probability of a word following
        // another word; equation (1) or another equation may be
        // used to calculate the probability
        while (path list is not empty)
        {
            Select a path from the path list and set it as current path;
            Calculate the probability of current path using language model;
            if (the probability of current path > the maximum probability)
            {
                the maximum probability = the probability of current path;
                Save current path as the best path;
            }
            Remove current path from the path list;
        }
        Output the best path;
    }
    Close language corpus;
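A literal rendering of this path-list loop in Python might look like the sketch below, which enumerates every segmentation allowed by the vocabulary and keeps the highest-scoring one. The recursive enumeration, the log-probability scoring, and the 1e-9 floor for unseen events are illustrative assumptions; a Viterbi search, as in the earlier sketch, computes the same maximum without enumerating all paths:

    import math

    def all_paths(line, vocabulary, max_word_len=8):
        # enumerate every segmentation path; single characters are
        # always allowed so that every string has at least one path
        if not line:
            return [[]]
        paths = []
        for j in range(1, min(len(line), max_word_len) + 1):
            head = line[:j]
            if head in vocabulary or j == 1:
                for rest in all_paths(line[j:], vocabulary, max_word_len):
                    paths.append([head] + rest)
        return paths

    def path_score(path, unigram, bigram):
        # sum of log probabilities, mirroring the additive form of equation (1)
        score = math.log(unigram.get(path[0], 1e-9))
        for prev, cur in zip(path, path[1:]):
            score += math.log(bigram.get((prev, cur), 1e-9))
        return score

    def best_segmentation(line, vocabulary, unigram, bigram):
        best, best_score = None, float("-inf")
        for path in all_paths(line, vocabulary):
            score = path_score(path, unigram, bigram)
            if score > best_score:   # keep the most probable path seen so far
                best, best_score = path, score
        return best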
An example of the algorithm is given with Chinese characters in the following sentence. The original text, the segmentation result using the greedy method, and the segmentation result using the language model are reproduced as images in the published application (figures imgf000008_0001 and imgf000008_0002).
Example 1.
When correctly segmented, the sentence means "has ways and strength to solve the problem." The present invention successfully segments the sentence, while the traditional method fails to do so.
In Example 1, consider the original text as consisting of the following eight characters in order: C1, C2, C3, C4, C5, C6, C7, and C8. From the original text, it is not visually clear how to group the characters to form the words. Table 1 below gives two possible ways of grouping the characters into five words W1 to W5.
Table 1:

                    W1    W2      W3      W4      W5
    Greedy path     C1    C2C3    C4C5    C6      C7C8
    Alternative     C1    C2C3    C4      C5C6    C7C8

    (reproduced as figure imgf000008_0003 in the published application)
A greedy algorithm is used to create a greedy segmentation path as follows. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C1 is one consisting of only character C1. In other words, C1C2 is not a word in the vocabulary. Therefore, word W1 is character C1. In some embodiments, word W1 leaves the line buffer and the next character becomes the head of the line, although that is an implementation detail that is not required. In this example, the next character is C2. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C2 is a word consisting of characters C2C3. In other words, C2C3 is in the vocabulary, but C2C3C4 is not. Therefore, word W2 is characters C2C3. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C4 is one consisting of characters C4C5. Therefore, word W3 is C4C5. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C6 is a word consisting of C6. Therefore, word W4 is C6. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C7 is a word consisting of C7C8. Therefore, word W5 is C7C8. The probability of this greedy segmentation path is calculated. With respect to words W1 and W2 and characters C1, C2, and C3, the only segmentation path that is included in the vocabulary is the one already chosen by the greedy algorithm. One way to deal with this situation is not to recompute probabilities, that is, not to calculate another probability unless there is an alternative path that is allowed by the vocabulary. Another way is to recalculate the probabilities of the same path, only to determine that they are the same, so that the current path does not replace the maximum probability.
However, with words W3 and W4, there are two paths. The first is that chosen by the greedy algorithm, in which W3 is C4C5 and W4 is C6. An alternative segmentation path that is allowed by the vocabulary is one in which W3 is C4 and W4 is C5C6. In the example, assume that C4 followed by the combination C5C6 is more probable than the combination C4C5 followed by C6. (Word W5 is the same in each case.) Then, in equation (1), the probability of the current path would be greater than the probability of the greedy segmentation path, and it would replace that of the greedy segmentation path. Note the following interesting possibility. Assume that the combination C4C5 is more probable than C4 alone. From that single bit of information, the greedy segmentation path would be selected. However, that would not lead to the better global solution, because C4 followed by C5C6 is more probable than C4C5 followed by C6.
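To make the point concrete, here are invented numbers (they do not come from the patent) for the two competing paths; W1, W2, and W5 are shared, so only the factors that differ between the paths are compared:

    # invented conditional probabilities for the differing middle segments
    # greedy path:      prob(C4C5|C2C3)=0.6, prob(C6|C4C5)=0.1,  prob(C7C8|C6)=0.3
    # alternative path: prob(C4|C2C3)=0.4,   prob(C5C6|C4)=0.7,  prob(C7C8|C5C6)=0.3
    greedy_score = 0.6 * 0.1 * 0.3       # 0.018
    alternative_score = 0.4 * 0.7 * 0.3  # 0.084
    # C4C5 beats C4 locally (0.6 > 0.4), yet the alternative path is
    # more probable globally (0.084 > 0.018), as the text describes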
The line may be a sentence. As used herein, the term "sentence" refers to a group of successive words ending with a symbol, such as a period. In different embodiments, different groups of characters may be considered in the segmentation path. For example, the segmentation path could consider all characters in a sentence. The segmentation path could consider a moving window of characters without consideration for sentence endings, except to note that the language model will not allow a character at the end of a sentence to join with the first character in the next sentence. The window may be a set number of characters. The segmentation path could include X characters, with a new segmentation path starting with the last character of the previous path if it is not in a word. Other possibilities exist.
There are a variety of computer systems that may be used in training and in a speech recognition system. Merely as an example, Figure 1 illustrates a highly schematic representation of a computer system 10 which includes a processor 14, memory 16, and input/output and control block 18. Memory 16 may include a line buffer 22. The line buffer is merely a group of memory and does not have to have any particular characteristics. For example, it does not have to have contiguous memory cells. There may be a substantial amount of memory in processor 14, and memory 16 may represent memory that is off the chip of processor 14 or memory that is partially on and partially off the chip of processor 14. (Or memory 16 could be completely on the chip of processor 14.) In some embodiments, a line buffer 24 is in processor 14; however, a line buffer does not need to be in processor 14. Further, not every embodiment of the invention has a line buffer. The segmentation paths do not need to be stored in a line buffer. At least some of the input/output and control block 18 could be on the same chip as processor 14, or be on a separate chip. A microphone
26, monitor 30, additional memory 34, input devices (such as a keyboard and mouse 38), a network connection 42, and speaker(s) 44 may interface with input/output and control block 18. Memory 34 represents a variety of memory, such as a hard drive and CD ROM or DVD discs. These include computer readable media that can hold instructions to be executed causing some embodiments of the invention to occur. It is emphasized that Figure 1 is merely exemplary and the invention is not limited to use with such a computer system. Computer system 10 and other computer systems used to carry out the invention may be in a variety of forms, such as desktop, mainframe, and portable computers. For example, Figure 2 illustrates a handheld device 60, with a display 62, which may incorporate some or all of the features of Figure 1. The handheld device may at times interface with another computer system, such as that of Figure 1. The shapes and relative sizes of the objects in Figures 1 and 2 are not intended to suggest actual shapes and relative sizes.
Other Information and Embodiments
Traditionally, the quality of a language model is measured by perplexity, which is an entropy-based measure of the complexity of the language. For the same training and evaluation text corpora, a model with lower perplexity is better than one with higher perplexity. As an experiment, evaluations were conducted using People's Daily data from 1994 to 1998, with trigram models estimated with different segmentation methods. The traditional (greedy) way has a perplexity of 182, while an embodiment of the invention resulted in 143. This is a significant improvement in modeling accuracy compared with the previous technique.
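For reference, perplexity is conventionally defined from the average per-word log probability of the evaluation text (this standard formula is not spelled out in the patent); for a trigram model over words w_1 ... w_N:

    PP(w_1 \ldots w_N) = P(w_1 \ldots w_N)^{-1/N}
                       = \exp\Big( -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_{i-2}, w_{i-1}) \Big)

Lower perplexity means the model assigns higher probability to the evaluation text, which is why 143 indicates a better model than 182.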
Reference in the specification to "an embodiment," "one embodiment," "some embodiments," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the invention. The various appearances "an embodiment," "one embodiment," or "some embodiments" are not necessarily all referring to the same embodiments.
If the specification states a component, feature, structure, or characteristic "may", "might", or "could" be included, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to "a" or
"an" element, that does not mean there is only one of the element. If the specification or claims refer to "an additional" element, that does not preclude there being more than one of the additional element.
Those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present invention. Accordingly, it is the following claims including any amendments thereto that define the scope of the invention.

Claims

CLAIMS
What is claimed is:
1. A method comprising:
(a) creating a path list of segmentation paths of characters using a vocabulary; (b) determining a probability of a first segmentation path and designating it as the best segmentation path;
(c) determining a probability of an additional one of the segmentation paths and determining whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designating the additional segmentation path as the best segmentation path, and repeating (c) until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
2. The method of claim 1, wherein the first segmentation path is obtained through a greedy algorithm.
3. The method of claim 1, wherein the segmentation paths are stored in a line buffer and removed from the line buffer after the corresponding probabilities have been compared.
4. The method of claim 1, wherein the characters included in a segmentation path are those in a single sentence.
5. The method of claim 1, wherein the characters included in a segmentation path are in a sliding window.
6. The method of claim 1, wherein the probability is determined through use of a language model.
7. The method of claim 1, wherein the probability is determined through a calculation involving the following equation: P_{w_i} = max(P_{w_{i-1}} + prob(w_i | w_{i-1})), where w_i is the ith word, w_{i-1} is the word immediately preceding w_i, P_{w_{i-1}} is the probability of the (i-1)th word occurring, and prob(w_i | w_{i-1}) is the conditional probability of word w_i occurring if word w_{i-1} occurs.
8. An apparatus comprising: a computer readable medium having instructions thereon which when executed cause a computer system to:
(a) create a path list of segmentation paths of characters using a vocabulary;
(b) determine a probability of a first segmentation path and designate it as the best segmentation path;
(c) determine a probability of an additional one of the segmentation paths and determine whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designate the additional segmentation path as the best segmentation path, and repeat (c) until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
9. The apparatus of claim 8, wherein the first segmentation path is obtained through a greedy algorithm.
10. The apparatus of claim 8, wherein the segmentation paths are stored in a line buffer and removed from the line buffer after the corresponding probabilities have been compared.
11. The apparatus of claim 8, wherein the characters included in a segmentation path are those in a single sentence.
12. The apparatus of claim 8, wherein the characters included in a segmentation path are in a sliding window.
13. The apparatus of claim 8, wherein the probability is determined through use of a language model.
14. The apparatus of claim 8, wherein the probability is determined through a calculation involving the following equation: P_{w_i} = max(P_{w_{i-1}} + prob(w_i | w_{i-1})), where w_i is the ith word, w_{i-1} is the word immediately preceding w_i, P_{w_{i-1}} is the probability of the (i-1)th word occurring, and prob(w_i | w_{i-1}) is the conditional probability of word w_i occurring if word w_{i-1} occurs.
15. The apparatus of claim 8, wherein the apparatus is a disc.
16. A computer system comprising: memory holding a list of segmentation paths of characters forming words in a vocabulary; a processor that
(a) determines a probability of a first segmentation path and designate it as the best segmentation path;
(b) determines a probability of an additional one of the segmentation paths and determines whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designates the additional segmentation path as the best segmentation path, and repeats (b) until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
17. The computer system of claim 16, wherein the first segmentation path is obtained through a greedy algorithm.
18. The computer system of claim 16, wherein the segmentation paths are stored in a line buffer and removed from the line buffer after the corresponding probabilities have been compared.
19. The computer system of claim 16, wherein the characters included in a segmentation path are those in a single sentence.
20. The computer system of claim 16, wherein the characters included in a segmentation path are in a sliding window.
21. The computer system of claim 16, wherein the probability is determined through use of a language model.
22. The computer system of claim 16, wherein the probability is determined through a calculation involving the following equation: P_{w_i} = max(P_{w_{i-1}} + prob(w_i | w_{i-1})), where w_i is the ith word, w_{i-1} is the word immediately preceding w_i, P_{w_{i-1}} is the probability of the (i-1)th word occurring, and prob(w_i | w_{i-1}) is the conditional probability of word w_i occurring if word w_{i-1} occurs.
PCT/CN1999/000213 1999-12-23 1999-12-23 A global approach for segmenting characters into words WO2001048738A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CNB998170828A CN1192354C (en) 1999-12-23 1999-12-23 Global approach for segmenting characters into words
PCT/CN1999/000213 WO2001048738A1 (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words
AU17672/00A AU1767200A (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN1999/000213 WO2001048738A1 (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words

Publications (1)

Publication Number Publication Date
WO2001048738A1 (en) 2001-07-05

Family

ID=4575157

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN1999/000213 WO2001048738A1 (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words

Country Status (3)

Country Link
CN (1) CN1192354C (en)
AU (1) AU1767200A (en)
WO (1) WO2001048738A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609671B (en) * 2009-07-21 2011-09-07 北京邮电大学 Method and device for continuous speech recognition result evaluation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4059725A (en) * 1975-03-12 1977-11-22 Nippon Electric Company, Ltd. Automatic continuous speech recognition system employing dynamic programming
US4667341A (en) * 1982-02-01 1987-05-19 Masao Watari Continuous speech recognition system
EP0380297A2 (en) * 1989-01-24 1990-08-01 Canon Kabushiki Kaisha Method and apparatus for speech recognition
US5706397A (en) * 1995-10-05 1998-01-06 Apple Computer, Inc. Speech recognition system with multi-level pruning for acoustic matching
US5799276A (en) * 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
US5862519A (en) * 1996-04-02 1999-01-19 T-Netix, Inc. Blind clustering of data with application to speech processing systems
WO1999014741A1 (en) * 1997-09-18 1999-03-25 Siemens Aktiengesellschaft Method for recognising a keyword in speech
JP2000075895A (en) * 1998-08-05 2000-03-14 Texas Instr Inc <Ti> N best retrieval method for continuous speech recognition

Also Published As

Publication number Publication date
AU1767200A (en) 2001-07-09
CN1192354C (en) 2005-03-09
CN1398395A (en) 2003-02-19

Similar Documents

Publication Publication Date Title
JP6827548B2 (en) Speech recognition system and speech recognition method
Siivola et al. Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner
US7634406B2 (en) System and method for identifying semantic intent from acoustic information
US6983239B1 (en) Method and apparatus for embedding grammars in a natural language understanding (NLU) statistical parser
US6738741B2 (en) Segmentation technique increasing the active vocabulary of speech recognizers
US7480612B2 (en) Word predicting method, voice recognition method, and voice recognition apparatus and program using the same methods
US8849668B2 (en) Speech recognition apparatus and method
JP2002531892A (en) Automatic segmentation of text
Riley et al. Automatic generation of detailed pronunciation lexicons
JP2008262279A (en) Speech retrieval device
US8255220B2 (en) Device, method, and medium for establishing language model for expanding finite state grammar using a general grammar database
WO2007005884A2 (en) Generating chinese language couplets
Hakkinen et al. N-gram and decision tree based language identification for written words
WO2007097208A1 (en) Language processing device, language processing method, and language processing program
WO2019014183A1 (en) Syllable based automatic speech recognition
US11869491B2 (en) Abstract generation device, method, program, and recording medium
JP4878220B2 (en) Model learning method, information extraction method, model learning device, information extraction device, model learning program, information extraction program, and recording medium recording these programs
Siivola et al. Morfessor and VariKN machine learning tools for speech and language technology
WO2001048738A1 (en) A global approach for segmenting characters into words
WO2001048737A2 (en) Speech recognizer with a lexical tree based n-gram language model
KR100277690B1 (en) Speech Recognition Using Speech Act Information
JP5137588B2 (en) Language model generation apparatus and speech recognition apparatus
CN115188365B (en) Pause prediction method and device, electronic equipment and storage medium
Gao et al. Long distance dependency in language modeling: an empirical study
JP2000075885A (en) Voice recognition device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 998170828

Country of ref document: CN

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase