WO2001048738A1 - A global approach for segmenting characters into words - Google Patents

A global approach for segmenting characters into words

Info

Publication number
WO2001048738A1
WO2001048738A1 (PCT/CN1999/000213)
Authority
WO
WIPO (PCT)
Prior art keywords
probability
segmentation
path
word
segmentation path
Prior art date
Application number
PCT/CN1999/000213
Other languages
French (fr)
Inventor
Yonghong Yan
Lingyun Tuo
Zhiwei Lin
Xiangdong Zhang
Robert Yung
Original Assignee
Intel Corporation
Intel Architecture Development Shanghai Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation and Intel Architecture Development Shanghai Co., Ltd.
Priority to CNB998170828A (CN1192354C)
Priority to PCT/CN1999/000213 (WO2001048738A1)
Priority to AU17672/00A (AU1767200A)
Publication of WO2001048738A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/083 Recognition networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

In some embodiments, the invention includes a method. The method involves creating a path list of segmentation paths of characters using a vocabulary. A probability of a first segmentation path is determined and designated as the best segmentation path. The probability of an additional one of the segmentation paths is determined and compared with the probability of the best segmentation path. If the probability of the additional segmentation path exceeds that of the best segmentation path, the additional segmentation path is designated the best segmentation path. This is repeated until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path. In some embodiments, the invention is an apparatus including a computer readable medium that performs such a method. In still other embodiments, the invention is a computer system. Additional embodiments are described and claimed.

Description

A GLOBAL APPROACH FOR SEGMENTING CHARACTERS INTO WORDS
Background of the Invention
Technical Field of the Invention: The present invention relates to speech recognition systems and, more particularly, to segmenting characters into words in a speech recognition system.
Background Art: One component in a speech recognizer is the language model. A popular way to capture the syntactic structure of a given language is to use conditional probability to capture the sequential information embedded in the word strings in sentences. For example, if the current word is W1, a language model can be constructed indicating the probabilities that certain other words W2, W3, ..., Wn will follow W1. The probabilities of the words can be expressed such that P21 is the probability that word W2 will follow word W1, where P21 = P(W2|W1). In this notation, P31 is the probability that word W3 will follow word W1, P41 is the probability that word W4 will follow word W1, and so forth, with Pn1 being the probability that Wn will follow word W1. The maximum of P21, P31, ..., Pn1 can be identified and used in the language model. The preceding examples are for bi-gram probabilities, although tri-gram conditional probabilities may also be computed.
Language models are often created through looking at written literature (such as newspapers) and determining the conditional probabilities of the vocabulary words with respect to others of the vocabulary words.
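As an illustration of how such conditional probabilities are estimated (the sketch below is not from the patent; the function name, the corpus format, and the relative-frequency estimate are assumptions), bi-gram probabilities can be computed by counting over an already word-segmented corpus:

    from collections import Counter

    def bigram_probs(sentences):
        # P(W2|W1) estimated as count(W1 W2) / count(W1)
        unigram, bigram = Counter(), Counter()
        for words in sentences:        # each sentence is a list of words
            unigram.update(words)
            bigram.update(zip(words, words[1:]))
        return {(w1, w2): c / unigram[w1] for (w1, w2), c in bigram.items()}

    # example: P("model" | "language") from two tiny "sentences"
    probs = bigram_probs([["a", "language", "model"], ["a", "language", "test"]])
    print(probs[("language", "model")])  # 0.5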
In some languages, such as Chinese and Japanese, words can be written as one or more symbolic characters, for example, Han zi (Chinese) and Kanji (Japanese). Sentences are composed of character strings, where words are implicit because there are no spaces between contiguous words. A particular character may be a word all by itself or may join with a character before it or after it (or possibly both before and after it) to form the word. The meaning of words can change depending on how the characters are joined or separated in creating the words. In written form, however, there are no spaces between the characters so it is not visually evident whether a particular character is a word all by itself or joins with another character or characters to form the word. Rather, the word a particular character belongs to is understood from the context. In order to apply a statistical method for language modeling, the words are explicitly extracted by putting spaces on the word boundaries.
Traditionally, character segmentation into words is done by a "greedy algorithm." The greedy algorithm involves the following: (1) Start from the beginning of the given sentence being processed and exhaust all the possible words that match the initial part of the character string in the sentence.
(2) Pick the longest word (i.e., the word that has the largest number of characters) and put a space at the end of the matched substring in the sentence; treat the remaining character string as a new sentence and repeat step (1) until all the characters in the sentence are processed.
The greedy algorithm does not always make the best choice from a global perspective. In fact, it may choose combinations that are not only not optimal but also not syntactically correct. As stated in T. Cormen et al., "Introduction to Algorithms" (The MIT Press, 1990), p. 329: "A greedy algorithm always makes the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution."
Summary
In some embodiments, the invention includes a method. The method involves creating a path list of segmentation paths of characters using a vocabulary. A probability of a first segmentation path is determined and designated as the best segmentation path. The probability of an additional one of the segmentation paths is determined and compared with the probability of the best segmentation path. If the probability of the additional segmentation path exceeds that of the best segmentation path, the additional segmentation path is designated the best segmentation path. This is repeated until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
In some embodiments, the invention is an apparatus including a computer readable medium that performs such a method. In still other embodiments, the invention is a computer system.
Additional embodiments are described and claimed.
Brief Description of the Drawings
The invention will be understood more fully from the detailed description given below and from the accompanying drawings of embodiments of the invention which, however, should not be taken to limit the invention to the specific embodiments described, but are for explanation and understanding only.
Figure 1 is a high level schematic block diagram representation of a computer system that may be used in connection with some embodiments of the invention. Figure 2 is a high level schematic representation of a hand-held computer system that may be used in connection with some embodiments of the invention.
Detailed Description
The invention involves a system and method to segment words from characters. That is, the invention involves deciding which word a character should belong to. The invention has particular application in connection with languages such as Chinese and Japanese that do not have spaces between characters indicating word segmentation, but is not limited to such use. The disclosed invention is designed to make a better segmentation of words given any sentence. This leads to a language model better than the one obtained by the traditional method, which uses a greedy algorithm, described above. A better language model will lead to better recognition accuracy since it describes the language better in terms of word strings. In some embodiments, the invention performs segmentation by using a dynamic programming algorithm equipped with a statistical language model. There are various ways in which the dynamic algorithm can be implemented. One example of a dynamic algorithm is as follows. First, the corpus (i.e., the characters to be segmented into words) is processed through the traditional greedy algorithm to calculate an n-gram language model. Then, a Viterbi algorithm is used to re-segment the sentence. The Viterbi algorithm is a type of dynamic programming, which may be used in global optimization. See T. Cormen et al., "Introduction to Algorithms" (The MIT Press, 1990), pp. 301-28. The Viterbi algorithm we used can be described as in equation (1) as follows:
    P_{w_i} = max(P_{w_{i-1}} + prob(w_i | w_{i-1}))        (1)

In equation (1), P is probability and "prob" involves the language model. In equation (1), w_i is the ith word, w_{i-1} is the word immediately preceding w_i, P_{w_{i-1}} is the probability of the (i-1)th word occurring, and prob(w_i | w_{i-1}) is the conditional probability of word w_i occurring if word w_{i-1} occurs. Equation (1) involves finding the word w_i that leads to the maximum in equation (1). By solving equation (1), the resulting word sequence (w_0, w_1, ..., w_N) is guaranteed to be the best partition in the maximum likelihood sense. In some embodiments, when i = N, such that the end of a sentence is reached, there is a global maximization.
Equation (1) is in a bi-gram format, but other formats, such as tri-gram or uni-gram formats, may be used if they are available in the language model. Backoff weights and other techniques may also be used.
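The patent gives no code for this search; the following minimal Python sketch shows one way to implement the recursion of equation (1) over a character string. It assumes log probabilities (so that products of conditional probabilities become the sums shown in equation (1)), a set-based vocabulary with single characters always allowed as fallback words, unigram/bigram score dictionaries, and a 1e-9 floor for unseen events; all of these names and choices are assumptions, not the patent's:

    import math

    def viterbi_segment(line, vocabulary, unigram, bigram, max_word_len=8):
        # best[i] holds (score, path) for the best segmentation of line[:i],
        # following the recursion in equation (1)
        n = len(line)
        best = [(float("-inf"), [])] * (n + 1)
        best[0] = (0.0, [])
        for i in range(n):
            score_i, path_i = best[i]
            if score_i == float("-inf"):
                continue  # no segmentation reaches position i
            for j in range(i + 1, min(n, i + max_word_len) + 1):
                word = line[i:j]
                # single characters are always allowed as fallback words
                if word not in vocabulary and j - i > 1:
                    continue
                prev = path_i[-1] if path_i else None
                p = bigram.get((prev, word)) if prev else unigram.get(word)
                step = math.log(p) if p else math.log(1e-9)
                if score_i + step > best[j][0]:
                    best[j] = (score_i + step, path_i + [word])
        return best[n][1]  # the maximum-likelihood word sequence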
As noted, in some languages, each character may be a word all by itself. However, the invention involves determining whether the characters may be joined with other characters to make additional words or are better left alone. The word consisting of multiple characters may also be called a term or phrase.
A version of the greedy algorithm is provided in pseudo code below:

    Read vocabulary;        // the vocabulary is the list of possible words
    Open language corpus;   // the language corpus includes the characters
                            // to segment into words
    while (not end of language corpus)
    {
        read a line from language corpus and put it into a line buffer;
        // a line buffer is a group of memory, and is not restricted
        // to any particular form
        while (line buffer is not empty)
        {
            find the longest word in vocabulary which matches the head of the line buffer;
            output this word and a word separator character;
            remove the matched head from the line buffer;
        }
        output a line separator character;
    }
    Close language corpus;
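For concreteness, here is a minimal runnable sketch of the same longest-match-first procedure in Python; the function name, the set-based vocabulary, the max_word_len bound, and the single-character fallback are assumptions made for illustration, not part of the patent text:

    def greedy_segment(line, vocabulary, max_word_len=8):
        # longest-match-first: repeatedly take the longest vocabulary word
        # that matches the head of the remaining character string
        words = []
        i = 0
        while i < len(line):
            for j in range(min(len(line), i + max_word_len), i, -1):
                # fall back to a single character when nothing longer matches
                if line[i:j] in vocabulary or j == i + 1:
                    words.append(line[i:j])
                    i = j
                    break
        return words

    # example: with this vocabulary the greedy result is ['ABC', 'D']
    print(greedy_segment("ABCD", {"AB", "BC", "ABC", "D"}))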
In some embodiments, a segmentation algorithm using a language model according to the present invention includes the following:

    Read language model;    // load the language model into memory or
                            // otherwise make it available
    Read vocabulary;
    Open language corpus;
    while (not end of language corpus)
    {
        Read a line from language corpus and put it into line buffer;
        // the number of characters in a line can vary depending
        // on the implementation; a line can be a sentence
        Create a path list containing all the possible segmentation paths using the vocabulary;
        // a segmentation path is a possible segmentation of characters;
        // different formats may be used to store the paths, e.g., a list or
        // a tree structure
        Find the greedy segmentation path and save it as the best path;
        // various greedy algorithms may be used, such as the one provided above;
        // in this implementation of the invention, the greedy segmentation path is
        // initially considered the best path, but other initial paths
        // could be used
        Calculate the probability of this path using the language model and set this value as the maximum probability;
        // the language model specifies the probability of a word
        // occurring and a probability of a word following
        // another word; equation (1) or another equation may be
        // used to calculate the probability
        while (path list is not empty)
        {
            Select a path from the path list and set it as current path;
            Calculate the probability of current path using language model;
            if (the probability of current path > the maximum probability)
            {
                the maximum probability = the probability of current path;
                Save current path as the best path;
            }
            Remove current path from the path list;
        }
        Output the best path;
    }
    Close language corpus;
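A literal rendering of this path-list loop in Python might look like the sketch below, which enumerates every segmentation allowed by the vocabulary and keeps the highest-scoring one. The recursive enumeration, the log-probability scoring, and the 1e-9 floor for unseen events are illustrative assumptions; a Viterbi search, as in the earlier sketch, computes the same maximum without enumerating all paths:

    import math

    def all_paths(line, vocabulary, max_word_len=8):
        # enumerate every segmentation path; single characters are
        # always allowed so that every string has at least one path
        if not line:
            return [[]]
        paths = []
        for j in range(1, min(len(line), max_word_len) + 1):
            head = line[:j]
            if head in vocabulary or j == 1:
                for rest in all_paths(line[j:], vocabulary, max_word_len):
                    paths.append([head] + rest)
        return paths

    def path_score(path, unigram, bigram):
        # sum of log probabilities, mirroring the additive form of equation (1)
        score = math.log(unigram.get(path[0], 1e-9))
        for prev, cur in zip(path, path[1:]):
            score += math.log(bigram.get((prev, cur), 1e-9))
        return score

    def best_segmentation(line, vocabulary, unigram, bigram):
        best, best_score = None, float("-inf")
        for path in all_paths(line, vocabulary):
            score = path_score(path, unigram, bigram)
            if score > best_score:   # keep the most probable path seen so far
                best, best_score = path, score
        return best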
An example of the algorithm is given with Chinese characters in the following sentence. The original text, the segmentation result using the greedy method, and the segmentation result using the language model are reproduced as images in the published application (figures imgf000008_0001 and imgf000008_0002).
Example 1.
When correctly segmented, the sentence means "has ways and strength to solve the problem." The present invention successfully segments the sentence, while the traditional method fails to do so.
In Example 1, consider the original text as consisting of the following eight characters in order: C1, C2, C3, C4, C5, C6, C7, and C8. From the original text, it is not visually clear how to group the characters to form the words. Table 1 below gives two possible ways of grouping the characters into five words W1 to W5.
Table 1:

                    W1    W2      W3      W4      W5
    Greedy path     C1    C2C3    C4C5    C6      C7C8
    Alternative     C1    C2C3    C4      C5C6    C7C8

    (reproduced as figure imgf000008_0003 in the published application)
A greedy algorithm is used to create a greedy segmentation path as follows. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C1 is one consisting of only character C1. In other words, C1C2 is not a word in the vocabulary. Therefore, word W1 is character C1. In some embodiments, word W1 leaves the line buffer and the next character becomes the head of the line, although that is an implementation detail that is not required. In this example, the next character is C2. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C2 is a word consisting of characters C2C3. In other words, C2C3 is in the vocabulary, but C2C3C4 is not. Therefore, word W2 is characters C2C3. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C4 is one consisting of characters C4C5. Therefore, word W3 is C4C5. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C6 is a word consisting of C6. Therefore, word W4 is C6. The longest word in the vocabulary of consecutive characters in the corpus that starts with character C7 is a word consisting of C7C8. Therefore, word W5 is C7C8. The probability of this greedy segmentation path is calculated. With respect to words W1 and W2 and characters C1, C2, and C3, the only segmentation path that is included in the vocabulary is the one already chosen by the greedy algorithm. One way to deal with this situation is not to recompute probabilities, that is, not to calculate another probability unless there is an alternative path that is allowed by the vocabulary. Another way is to recalculate the probabilities of the same path, only to determine that they are the same, so that the current path does not replace the maximum probability.
However, with words W3 and W4, there are two paths. The first is that chosen by the greedy algorithm, in which W3 is C4C5 and W4 is C6. An alternative segmentation path that is allowed by the vocabulary is one in which W3 is C4 and W4 is C5C6. In the example, assume that C4 followed by the combination C5C6 is more probable than the combination C4C5 followed by C6. (Word W5 is the same in each case.) Then, in equation (1), the probability of the current path would be greater than the probability of the greedy segmentation path, and it would replace that of the greedy segmentation path. Note the following interesting possibility. Assume that the combination C4C5 is more probable than C4 alone. From that single bit of information, the greedy segmentation path would be selected. However, that would not lead to the better global solution, because C4 followed by C5C6 is more probable than C4C5 followed by C6.
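To make the point concrete, here are invented numbers (they do not come from the patent) for the two competing paths; W1, W2, and W5 are shared, so only the factors that differ between the paths are compared:

    # invented conditional probabilities for the differing middle segments
    # greedy path:      prob(C4C5|C2C3)=0.6, prob(C6|C4C5)=0.1,  prob(C7C8|C6)=0.3
    # alternative path: prob(C4|C2C3)=0.4,   prob(C5C6|C4)=0.7,  prob(C7C8|C5C6)=0.3
    greedy_score = 0.6 * 0.1 * 0.3       # 0.018
    alternative_score = 0.4 * 0.7 * 0.3  # 0.084
    # C4C5 beats C4 locally (0.6 > 0.4), yet the alternative path is
    # more probable globally (0.084 > 0.018), as the text describes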
The line may be a sentence. As used herein, the term "sentence" refers to a group of successive words ending with a symbol, such as a period. In different embodiments, different groups of characters may be considered in the segmentation path. For example, the segmentation path could consider all characters in a sentence. The segmentation path could consider a moving window of characters without consideration for sentence endings, except to note that the language model will not allow a character at the end of a sentence to join with the first character in the next sentence. The window may be a set number of characters. The segmentation path could include X characters, with a new segmentation path starting with the last character of the previous path if it is not in a word. Other possibilities exist.
There are a variety of computer systems that may be used in training and in a speech recognition system. Merely as an example, Figure 1 illustrates a highly schematic representation of a computer system 10 which includes a processor 14, memory 16, and input/output and control block 18. Memory 16 may include a line buffer 22. The line buffer is merely a group of memory and does not have to have any particular characteristics. For example, it does not have to have contiguous memory cells. There may be a substantial amount of memory in processor 14, and memory 16 may represent memory that is off the chip of processor 14 or memory that is partially on and partially off the chip of processor 14. (Or memory 16 could be completely on the chip of processor 14.) In some embodiments, a line buffer 24 is in processor 14; however, a line buffer does not need to be in processor 14. Further, not every embodiment of the invention has a line buffer. The segmentation paths do not need to be stored in a line buffer. At least some of the input/output and control block 18 could be on the same chip as processor 14, or be on a separate chip. A microphone
26, monitor 30, additional memory 34, input devices (such as a keyboard and mouse 38), a network connection 42, and speaker(s) 44 may interface with input/output and control block 18. Memory 34 represents a variety of memory, such as a hard drive and CD ROM or DVD discs. These include computer readable media that can hold instructions to be executed causing some embodiments of the invention to occur. It is emphasized that Figure 1 is merely exemplary and the invention is not limited to use with such a computer system. Computer system 10 and other computer systems used to carry out the invention may be in a variety of forms, such as desktop, mainframe, and portable computers. For example, Figure 2 illustrates a handheld device 60, with a display 62, which may incorporate some or all of the features of Figure 1. The handheld device may at times interface with another computer system, such as that of Figure 1. The shapes and relative sizes of the objects in Figures 1 and 2 are not intended to suggest actual shapes and relative sizes.
Other Information and Embodiments
Traditionally, the quality of a language model is measured by perplexity, which is an entropy-based measure of the complexity of the language. For the same training and evaluation text corpora, a model with lower perplexity is better than one with higher perplexity. As an experiment, evaluations were conducted using People's Daily data from 1994 to 1998, with trigram models estimated with different segmentation methods. The traditional (greedy) way has a perplexity of 182, while an embodiment of the invention resulted in 143. This is a significant improvement in modeling accuracy compared with the previous technique.
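For reference, perplexity is conventionally defined from the average per-word log probability of the evaluation text (this standard formula is not spelled out in the patent); for a trigram model over words w_1 ... w_N:

    PP(w_1 \ldots w_N) = P(w_1 \ldots w_N)^{-1/N}
                       = \exp\Big( -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_{i-2}, w_{i-1}) \Big)

Lower perplexity means the model assigns higher probability to the evaluation text, which is why 143 indicates a better model than 182.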
Reference in the specification to "an embodiment," "one embodiment," "some embodiments," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the invention. The various appearances "an embodiment," "one embodiment," or "some embodiments" are not necessarily all referring to the same embodiments.
If the specification states a component, feature, structure, or characteristic "may", "might", or "could" be included, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to "a" or
"an" element, that does not mean there is only one of the element. If the specification or claims refer to "an additional" element, that does not preclude there being more than one of the additional element.
Those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present invention. Accordingly, it is the following claims including any amendments thereto that define the scope of the invention.

Claims

CLAIMS
What is claimed is:
1. A method comprising:
(a) creating a path list of segmentation paths of characters using a vocabulary; (b) determining a probability of a first segmentation path and designating it as the best segmentation path;
(c) determining a probability of an additional one of the segmentation paths and determining whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designating the additional segmentation path as the best segmentation path, and repeating (c) until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
2. The method of claim 1, wherein the first segmentation path is obtained through a greedy algorithm.
3. The method of claim 1, wherein the segmentation paths are stored in a line buffer and removed from the line buffer after the corresponding probabilities have been compared.
4. The method of claim 1, wherein the characters included in a segmentation path are those in a single sentence.
5. The method of claim 1, wherein the characters included in a segmentation path are in a sliding window.
6. The method of claim 1, wherein the probability is determined through use of a language model.
7. The method of claim 1, wherein the probability is determined through a calculation involving the following equation: P_{w_i} = max(P_{w_{i-1}} + prob(w_i | w_{i-1})), where w_i is the ith word, w_{i-1} is the word immediately preceding w_i, P_{w_{i-1}} is the probability of the (i-1)th word occurring, and prob(w_i | w_{i-1}) is the conditional probability of word w_i occurring if word w_{i-1} occurs.
8. An apparatus comprising: a computer readable medium having instructions thereon which when executed cause a computer system to:
(a) create a path list of segmentation paths of characters using a vocabulary;
(b) determine a probability of a first segmentation path and designate it as the best segmentation path;
(c) determine a probability of an additional one of the segmentation paths and determine whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designate the additional segmentation path as the best segmentation path, and repeat (c) until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
9. The apparatus of claim 8, wherein the first segmentation path is obtained through a greedy algorithm.
10. The apparatus of claim 8, wherein the segmentation paths are stored in a line buffer and removed from the line buffer after the corresponding probabilities have been compared.
11. The apparatus of claim 8, wherein the characters included in a segmentation path are those in a single sentence.
12. The apparatus of claim 8, wherein the characters included in a segmentation path are in a sliding window.
13. The apparatus of claim 8, wherein the probability is determined through use of a language model.
14. The apparatus of claim 8, wherein the probability is determined through a calculation involving the following equation: P_{w_i} = max(P_{w_{i-1}} + prob(w_i | w_{i-1})), where w_i is the ith word, w_{i-1} is the word immediately preceding w_i, P_{w_{i-1}} is the probability of the (i-1)th word occurring, and prob(w_i | w_{i-1}) is the conditional probability of word w_i occurring if word w_{i-1} occurs.
15. The apparatus of claim 8, wherein the apparatus is a disc.
16. A computer system comprising: memory holding a list of segmentation paths of characters forming words in a vocabulary; a processor that
(a) determines a probability of a first segmentation path and designate it as the best segmentation path;
(b) determines a probability of an additional one of the segmentation paths and determines whether the probability of the additional segmentation path exceeds the probability of the best segmentation path, and if so, designates the additional segmentation path as the best segmentation path, and repeats (b) until the probabilities for all remaining segmentation paths have been determined and compared with the probability of the best segmentation path.
17. The computer system of claim 16, wherein the first segmentation path is obtained through a greedy algorithm.
18. The computer system of claim 16, wherein the segmentation paths are stored in a line buffer and removed from the line buffer after the corresponding probabilities have been compared.
19. The computer system of claim 16, wherein the characters included in a segmentation path are those in a single sentence.
20. The computer system of claim 16, wherein the characters included in a segmentation path are in a sliding window.
21. The computer system of claim 16, wherein the probability is determined through use of a language model.
22. The computer system of claim 16, wherein the probability is determined through a calculation involving the following equation: P_{w_i} = max(P_{w_{i-1}} + prob(w_i | w_{i-1})), where w_i is the ith word, w_{i-1} is the word immediately preceding w_i, P_{w_{i-1}} is the probability of the (i-1)th word occurring, and prob(w_i | w_{i-1}) is the conditional probability of word w_i occurring if word w_{i-1} occurs.
PCT/CN1999/000213 1999-12-23 1999-12-23 A global approach for segmenting characters into words WO2001048738A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CNB998170828A CN1192354C (en) 1999-12-23 1999-12-23 Global approach for segmenting characters into words
PCT/CN1999/000213 WO2001048738A1 (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words
AU17672/00A AU1767200A (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN1999/000213 WO2001048738A1 (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words

Publications (1)

Publication Number Publication Date
WO2001048738A1 (en) 2001-07-05

Family

ID=4575157

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN1999/000213 WO2001048738A1 (en) 1999-12-23 1999-12-23 A global approach for segmenting characters into words

Country Status (3)

Country Link
CN (1) CN1192354C (en)
AU (1) AU1767200A (en)
WO (1) WO2001048738A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609671B (en) * 2009-07-21 2011-09-07 北京邮电大学 Method and device for continuous speech recognition result evaluation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4059725A (en) * 1975-03-12 1977-11-22 Nippon Electric Company, Ltd. Automatic continuous speech recognition system employing dynamic programming
US4667341A (en) * 1982-02-01 1987-05-19 Masao Watari Continuous speech recognition system
EP0380297A2 (en) * 1989-01-24 1990-08-01 Canon Kabushiki Kaisha Method and apparatus for speech recognition
US5706397A (en) * 1995-10-05 1998-01-06 Apple Computer, Inc. Speech recognition system with multi-level pruning for acoustic matching
US5799276A (en) * 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
US5862519A (en) * 1996-04-02 1999-01-19 T-Netix, Inc. Blind clustering of data with application to speech processing systems
WO1999014741A1 (en) * 1997-09-18 1999-03-25 Siemens Aktiengesellschaft Method for recognising a keyword in speech
JP2000075895A (en) * 1998-08-05 2000-03-14 Texas Instr Inc <Ti> N best retrieval method for continuous speech recognition

Also Published As

Publication number Publication date
AU1767200A (en) 2001-07-09
CN1192354C (en) 2005-03-09
CN1398395A (en) 2003-02-19

Similar Documents

Publication Publication Date Title
JP6827548B2 (en) Speech recognition system and speech recognition method
Siivola et al. Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner
US7634406B2 (en) System and method for identifying semantic intent from acoustic information
US6983239B1 (en) Method and apparatus for embedding grammars in a natural language understanding (NLU) statistical parser
US6738741B2 (en) Segmentation technique increasing the active vocabulary of speech recognizers
US7480612B2 (en) Word predicting method, voice recognition method, and voice recognition apparatus and program using the same methods
US8849668B2 (en) Speech recognition apparatus and method
JP2002531892A (en) Automatic segmentation of text
Riley et al. Automatic generation of detailed pronunciation lexicons
JP2008262279A (en) Speech retrieval device
US8255220B2 (en) Device, method, and medium for establishing language model for expanding finite state grammar using a general grammar database
WO2007005884A2 (en) Generating chinese language couplets
Hakkinen et al. N-gram and decision tree based language identification for written words
WO2007097208A1 (en) Language processing device, language processing method, and language processing program
WO2019014183A1 (en) Syllable based automatic speech recognition
US11869491B2 (en) Abstract generation device, method, program, and recording medium
JP4878220B2 (en) Model learning method, information extraction method, model learning device, information extraction device, model learning program, information extraction program, and recording medium recording these programs
Siivola et al. Morfessor and VariKN machine learning tools for speech and language technology
WO2001048738A1 (en) A global approach for segmenting characters into words
WO2001048737A2 (en) Speech recognizer with a lexical tree based n-gram language model
KR100277690B1 (en) Speech Recognition Using Speech Act Information
JP5137588B2 (en) Language model generation apparatus and speech recognition apparatus
CN115188365B (en) Pause prediction method and device, electronic equipment and storage medium
Gao et al. Long distance dependency in language modeling: an empirical study
JP2000075885A (en) Voice recognition device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 998170828

Country of ref document: CN

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase