CN102314415A

CN102314415A - Discriminant word segmentation system and method using idiom knowledge

Info

Publication number: CN102314415A
Application number: CN2010102216290A
Authority: CN
Inventors: 毛新年
Original assignee: Shengle Information Technology Shanghai Co Ltd
Current assignee: Shanghai Guoke Electronic Co., Ltd.
Priority date: 2010-07-08
Filing date: 2010-07-08
Publication date: 2012-01-11

Abstract

The invention discloses a discriminant word segmentation method using idiom knowledge, comprising the following steps: 1, training a word segmentation knowledge base, namely (1) extracting basic features from manually segmented text, (2) extracting idiom features from the manually segmented text, and (3) training on the extracted features to obtain the knowledge base for word segmentation; 2, obtaining basic features from the original text to be segmented; 3, obtaining idiom features from the original text to be segmented; and 4, segmenting the original text using the word segmentation knowledge base trained in step 1. The invention also discloses a discriminant word segmentation system using idiom knowledge. Without increasing computational complexity, the invention significantly improves segmentation performance on long words and thereby the accuracy of long-word segmentation in the word segmentation algorithm.

Description

Discriminant word segmentation system and method using idiom knowledge

Technical field

The present invention relates to a Chinese word segmentation system, in particular to a discriminant word segmentation system, and specifically to a discriminant word segmentation system that uses idiom knowledge. The invention further relates to a discriminant word segmentation method that uses idiom knowledge.

Background art

Discriminant word segmentation systems are commonly used in current Chinese word segmentation. However, segmentation techniques based on discriminative machine learning perform poorly at recognizing long words, which are mainly idioms and formulaic expressions (times, dates, etc.). Existing discriminant word segmentation systems give special handling only to formulaic expressions with fixed patterns (times, dates, etc.) and none to idioms, even though idioms are a common kind of long word. Existing methods segment text using the character features within a window of several characters and cannot capture long-distance information, so segmentation precision for long words such as idioms is often low.

Existing discriminant segmentation methods rely mainly on the character features within a window of a specified size. This approach performs poorly on long words. The only way for it to improve long-word segmentation is to enlarge the window, but enlarging the window greatly increases the amount of computation.

Therefore, a new method is needed to improve the accuracy of long-word segmentation.

Summary of the invention

The technical problem to be solved by the present invention is to provide a discriminant word segmentation system and method using idiom knowledge that can significantly improve segmentation performance on long words without increasing computational complexity, thereby improving the accuracy of long-word segmentation in the word segmentation algorithm.

To solve the above technical problem, the present invention provides a discriminant word segmentation method using idiom knowledge, comprising the following steps:

First step: training the word segmentation knowledge base, comprising:

Step 1: extracting basic features from manually segmented text;

Step 2: extracting idiom features from the manually segmented text;

Step 3: training on the extracted features to obtain the knowledge base used for word segmentation;

Second step: obtaining basic features from the original text to be segmented;

Third step: obtaining idiom features from the original text to be segmented;

Fourth step: segmenting the original text to be segmented using the word segmentation knowledge base obtained by training in the first step.

In step 1 of the first step, extracting basic features from the manually segmented text mainly means extracting the character features within a window of several characters.

In step 2 of the first step, the idiom features are extracted using an idiom dictionary.

In the third step, the idiom features are obtained using an idiom dictionary.

The idiom features are obtained by matching an idiom dictionary against the original text to be segmented. If a run of characters is successfully matched, those characters are assigned the following features:

The character is the first character of a successfully matched idiom: B-Idiom;

The character is a middle character of a successfully matched idiom: I-Idiom;

The character is the last character of a successfully matched idiom: E-Idiom;

All other characters, which are not successfully matched, are assigned Other.

The idiom dictionary is matched against the original text to be segmented using forward maximum matching or reverse maximum matching.

In addition, the present invention provides a discriminant word segmentation system using idiom knowledge, comprising a word segmentation knowledge base training module, a basic feature obtaining module, an idiom feature obtaining module and a word segmentation module. The word segmentation knowledge base training module comprises a basic feature extraction module, an idiom feature extraction module and a training module.

The basic feature extraction module extracts basic features from manually segmented text; the idiom feature extraction module extracts idiom features from the manually segmented text; the training module trains on the extracted features to obtain the word segmentation knowledge base; the basic feature obtaining module obtains basic features from the original text to be segmented; the idiom feature obtaining module obtains idiom features from the original text to be segmented; and the word segmentation module segments the original text to be segmented using the word segmentation knowledge base obtained by the word segmentation knowledge base training module.

The idiom feature extraction module and the idiom feature obtaining module assign the idiom features by matching an idiom dictionary against the original text to be segmented.

The beneficial effects of the present invention are as follows. The method uses an idiom dictionary as a knowledge source and takes the results of automatically matching the idiom dictionary against the original text to be segmented as segmentation features in a discriminative machine learning algorithm; these additional features improve the accuracy of idiom segmentation in the word segmentation algorithm. The method uses idiom dictionary knowledge as an enhancing feature in combination with the original character features, and can significantly improve segmentation performance on long words without increasing computational complexity.

Description of drawings

Fig. 1 is a schematic diagram of the training process of the word segmentation knowledge base in the method of the invention;

Fig. 2 is a schematic diagram of the process of segmenting text using the word segmentation knowledge base in the method of the invention;

Fig. 3 is a schematic diagram of the module structure of the system of the invention.

Embodiment

The method of the invention uses an idiom dictionary as a knowledge source and takes the results of automatically matching the idiom dictionary against the original text to be segmented as segmentation features in a discriminative machine learning algorithm; these additional features improve the accuracy of idiom segmentation in the word segmentation algorithm.

The detailed flow of the discriminant word segmentation method using idiom knowledge is shown in Fig. 1 and Fig. 2. Fig. 1 shows the training process of the word segmentation knowledge base, and Fig. 2 shows the process of segmenting text using that knowledge base.

As shown in Fig. 1, the training process of the word segmentation knowledge base in the method of the invention comprises the following steps:

Step 1: extract basic features from manually segmented text (a manually segmented corpus), mainly the character features within a window of several characters; existing word segmentation training modules also include this step;

Step 2: extract idiom features from the manually segmented text; no existing word segmentation training module has this step. Other forms of idiom feature can be obtained by changing how the idiom features are acquired: for example, they can be obtained from an idiom dictionary (see Fig. 1) or from an online idiom compendium, and so on;

Step 3: train on the extracted features to obtain the knowledge base used for word segmentation.
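The feature extraction in steps 1 and 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the patent does not specify the output tag set, the window size or the feature templates, so the B/M/E/S tags, the window of two characters on each side and the function names `bmes_tags`, `char_features` and `training_instances` are all hypothetical. The idiom labels are taken as given input here.

```python
def bmes_tags(words):
    """Per-character tags for a manually segmented sentence (assumed B/M/E/S scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def char_features(chars, idiom_labels, i, window=2):
    """Basic window features plus the idiom feature for the character at index i."""
    def at(k):
        # Positions outside the sentence are padded.
        return chars[k] if 0 <= k < len(chars) else "<PAD>"

    # Single characters in the window, then adjacent character pairs.
    feats = [f"C{d}={at(i + d)}" for d in range(-window, window + 1)]
    feats += [f"C{d}C{d + 1}={at(i + d)}{at(i + d + 1)}" for d in range(-window, window)]
    # The idiom feature is one extra feature per character (assumed encoding).
    feats.append(f"IDIOM={idiom_labels[i]}")
    return feats


def training_instances(words, idiom_labels, window=2):
    """One (features, tag) pair per character of the segmented sentence."""
    chars = "".join(words)
    tags = bmes_tags(words)
    return [(char_features(chars, idiom_labels, i, window), tags[i])
            for i in range(len(chars))]


# Hypothetical manually segmented sentence and its idiom labels.
words = ["祝", "你", "一帆风顺"]
idiom_labels = ["Other", "Other", "B-Idiom", "I-Idiom", "I-Idiom", "E-Idiom"]
for feats, tag in training_instances(words, idiom_labels):
    print(tag, feats)
```

In this sketch the idiom label adds a single feature per character, so the window does not need to grow to cover a four-character idiom.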

As shown in Fig. 2, the process of segmenting the original text using the word segmentation knowledge base in the method of the invention comprises the following steps:

Step 1: obtain basic features from the original text to be segmented;

Step 2: obtain idiom features from the original text to be segmented. Other forms of idiom feature can be obtained by changing how the idiom features are acquired: for example, they can be obtained from an idiom dictionary (see Fig. 2) or from an online idiom compendium, and so on;

Step 3: segment the original text to be segmented using the word segmentation knowledge base obtained by the training steps of Fig. 1.
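To make the train-then-segment flow of Fig. 1 and Fig. 2 concrete, the following is a minimal sketch. The patent does not name a learning algorithm, so a simple per-character perceptron is swapped in here purely for illustration; the B/M/E/S tags, the reduced feature set and the names `feats`, `predict`, `train` and `segment` are assumptions, and the "knowledge base" is represented simply as a weight table.

```python
from collections import defaultdict

TAGS = ["B", "M", "E", "S"]  # assumed output tag set


def feats(chars, idiom, i):
    """Small feature set: neighbouring characters plus the idiom label."""
    def at(k):
        return chars[k] if 0 <= k < len(chars) else "<PAD>"
    return [f"C-1={at(i - 1)}", f"C0={at(i)}", f"C1={at(i + 1)}", f"IDIOM={idiom[i]}"]


def predict(weights, f):
    """Pick the tag with the highest summed feature weight."""
    scores = {t: sum(weights.get((x, t), 0.0) for x in f) for t in TAGS}
    return max(TAGS, key=lambda t: scores[t])


def train(samples, epochs=10):
    """Perceptron training; the returned weights play the role of the knowledge base."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for chars, idiom, gold in samples:
            for i, gold_tag in enumerate(gold):
                f = feats(chars, idiom, i)
                guess = predict(weights, f)
                if guess != gold_tag:
                    # Reward the correct tag and penalise the wrong guess.
                    for x in f:
                        weights[(x, gold_tag)] += 1.0
                        weights[(x, guess)] -= 1.0
    return weights


def segment(text, idiom, weights):
    """Cut the text after every character tagged E or S."""
    words, current = [], ""
    for i, ch in enumerate(text):
        current += ch
        if predict(weights, feats(text, idiom, i)) in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Hypothetical one-sentence corpus: text, idiom labels, gold tags.
labels = ["Other", "Other", "B-Idiom", "I-Idiom", "I-Idiom", "E-Idiom"]
samples = [("祝你一帆风顺", labels, ["S", "S", "B", "M", "M", "E"])]
knowledge_base = train(samples)
print(segment("祝你一帆风顺", labels, knowledge_base))
```

A real system would more likely use a sequence model trained on a large corpus. The point of the sketch is only that the idiom label enters the model as one more feature alongside the character features.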

The idiom features in Fig. 1 and Fig. 2 above can be obtained as follows:

An idiom dictionary is matched against the original text to be segmented (by forward maximum matching or reverse maximum matching). If a run of characters is successfully matched, those characters are assigned the following features:

The character is the first character of a successfully matched idiom: B-Idiom;

The character is a middle character of a successfully matched idiom: I-Idiom;

The character is the last character of a successfully matched idiom: E-Idiom;

All other characters, which are not successfully matched, are assigned Other.
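The labelling scheme above can be sketched as follows, assuming forward maximum matching against a small in-memory idiom set. The function name `idiom_labels` and the sample idioms are hypothetical. Single-character dictionary entries are skipped in this sketch, because one character cannot be both the first and the last character of an idiom under this label set.

```python
def idiom_labels(text, idioms):
    """Label each character B-Idiom, I-Idiom, E-Idiom or Other."""
    max_len = max((len(w) for w in idioms), default=0)
    labels = ["Other"] * len(text)
    i = 0
    while i < len(text):
        matched = 0
        # Try the longest candidate first, then shrink it from the right.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in idioms:
                matched = n
                break
        if matched:
            labels[i] = "B-Idiom"
            for j in range(i + 1, i + matched - 1):
                labels[j] = "I-Idiom"
            labels[i + matched - 1] = "E-Idiom"
            i += matched  # continue after the matched idiom
        else:
            i += 1        # no idiom starts here; move on one character
    return labels


# Hypothetical idiom dictionary and input text.
idioms = {"一帆风顺", "胡说八道"}
text = "祝你一帆风顺"
print(list(zip(text, idiom_labels(text, idioms))))
```

For a four-character idiom this yields one B-Idiom, two I-Idiom and one E-Idiom label, with every character outside the match labelled Other.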

Forward (reverse) maximum matching works as follows. Let N be the number of characters in the longest word in the dictionary. Scanning the sentence from left to right (forward) or from right to left (reverse), take N characters and look them up in the dictionary. If the match fails, drop the last character of the candidate and try again, until a word of M characters is found in the dictionary. Then move past those M characters, take N characters again and repeat the matching, until the end of the sentence is reached. For example:

Suppose the dictionary is:

Netanyahu

Talk nonsense

Of

Really

Reasonable

and suppose the longest word has 5 characters.

The result of forward matching on "Netanyahu says really reason really" is "Netanyahu says really reason really";

The result of reverse matching on "Netanyahu says really reason really" is "Netanyahu says really reason really".
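The two matching directions can be sketched as follows. The Chinese strings are an assumed back-translation of the English glosses in the example above (a five-entry dictionary whose longest word has 5 characters) and may not be the exact wording of the original filing. The function names are hypothetical, and emitting an unmatched character as a single-character segment is also an assumption.

```python
def forward_max_match(sentence, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word available."""
    words, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            # n == 1 is the fallback: an unmatched character stands alone.
            if n == 1 or sentence[i:i + n] in dictionary:
                words.append(sentence[i:i + n])
                i += n
                break
    return words


def reverse_max_match(sentence, dictionary, max_len):
    """Scan right to left; candidates shrink by dropping their leftmost character."""
    words, j = [], len(sentence)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if n == 1 or sentence[j - n:j] in dictionary:
                words.append(sentence[j - n:j])
                j -= n
                break
    return words[::-1]  # restore left-to-right order


# Assumed back-translation of the dictionary and sentence in the example.
dictionary = {"内塔尼亚胡", "胡说", "的", "确实", "在理"}
sentence = "内塔尼亚胡说的确实在理"
print("/".join(forward_max_match(sentence, dictionary, 5)))
print("/".join(reverse_max_match(sentence, dictionary, 5)))
```

Under these assumptions, forward matching keeps the five-character name intact, while reverse matching consumes its final character as part of the two-character word that follows, so the two directions can disagree on the same sentence.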

As shown in Fig. 3, the discriminant word segmentation system using idiom knowledge of the present invention comprises a word segmentation knowledge base training module, a basic feature obtaining module, an idiom feature obtaining module and a word segmentation module. The word segmentation knowledge base training module comprises a basic feature extraction module, an idiom feature extraction module and a training module. The basic feature extraction module extracts basic features from manually segmented text; the idiom feature extraction module extracts idiom features from the manually segmented text; the training module trains on the features extracted by the basic feature extraction module and the idiom feature extraction module to obtain the word segmentation knowledge base; the basic feature obtaining module obtains basic features from the original text to be segmented; the idiom feature obtaining module obtains idiom features from the original text to be segmented; and the word segmentation module segments the original text to be segmented using the word segmentation knowledge base obtained by the word segmentation knowledge base training module.

The idiom feature extraction module and the idiom feature obtaining module assign the idiom features by matching an idiom dictionary against the original text to be segmented.

Claims

1. A discriminant word segmentation method using idiom knowledge, characterized in that it comprises the following steps:

First step: training the word segmentation knowledge base, comprising:

Second step: obtaining basic features from the original text to be segmented;

2. The discriminant word segmentation method using idiom knowledge according to claim 1, characterized in that, in step 1 of the first step, extracting basic features from the manually segmented text mainly means extracting the character features within a window of several characters.

3. The discriminant word segmentation method using idiom knowledge according to claim 1, characterized in that, in step 2 of the first step, the idiom features are extracted using an idiom dictionary.

4. The discriminant word segmentation method using idiom knowledge according to claim 1, characterized in that, in the third step, the idiom features are obtained using an idiom dictionary.

5. The discriminant word segmentation method using idiom knowledge according to claim 3 or 4, characterized in that the idiom features are obtained by matching an idiom dictionary against the original text to be segmented, and if a run of characters is successfully matched, those characters are assigned the following features:

The character is the first character of a successfully matched idiom: B-Idiom;

The character is a middle character of a successfully matched idiom: I-Idiom;

The character is the last character of a successfully matched idiom: E-Idiom;

All other characters, which are not successfully matched, are assigned Other.

6. The discriminant word segmentation method using idiom knowledge according to claim 5, characterized in that the idiom dictionary is matched against the original text to be segmented using forward maximum matching or reverse maximum matching.

7. A discriminant word segmentation system using idiom knowledge, characterized in that it comprises a word segmentation knowledge base training module, a basic feature obtaining module, an idiom feature obtaining module and a word segmentation module; the word segmentation knowledge base training module comprises a basic feature extraction module, an idiom feature extraction module and a training module;

8. The discriminant word segmentation system using idiom knowledge according to claim 7, characterized in that the idiom feature extraction module and the idiom feature obtaining module assign the idiom features by matching an idiom dictionary against the original text to be segmented.