CN102314415A - Discriminant word segmentation system and method using idiom knowledge - Google Patents

Discriminant word segmentation system and method using idiom knowledge

Info

Publication number
CN102314415A
CN102314415A CN2010102216290A CN201010221629A CN102314415A CN 102314415 A CN102314415 A CN 102314415A CN 2010102216290 A CN2010102216290 A CN 2010102216290A CN 201010221629 A CN201010221629 A CN 201010221629A CN 102314415 A CN102314415 A CN 102314415A
Authority
CN
China
Prior art keywords
feature
Chinese idiom
word segmentation
idiom
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102216290A
Other languages
Chinese (zh)
Inventor
毛新年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Guoke Electronic Co., Ltd.
Original Assignee
Shengle Information Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengle Information Technology Shanghai Co Ltd
Priority to CN2010102216290A
Publication of CN102314415A

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a discriminant word segmentation method using idiom knowledge, comprising the following steps: 1, training a word segmentation knowledge base, namely (1) extracting basic features from manually segmented text, (2) extracting idiom features from the manually segmented text, and (3) training on the extracted features to obtain the knowledge base used for word segmentation; 2, obtaining basic features from the original text to be segmented; 3, obtaining idiom features from the original text to be segmented; and 4, segmenting the original text using the word segmentation knowledge base obtained by training in step 1. The invention also discloses a discriminant word segmentation system using idiom knowledge. Without increasing computational complexity, the invention can significantly improve segmentation performance on long words and the accuracy of long-word segmentation in the segmentation algorithm.

Description

Discriminant word segmentation system and method using idiom knowledge
Technical field
The present invention relates to a Chinese word segmentation system, in particular to a discriminant word segmentation system, and specifically to a discriminant word segmentation system that uses idiom knowledge. The invention also relates to a discriminant word segmentation method that uses idiom knowledge.
Background art
Discriminant word segmentation systems are commonly used in current Chinese word segmentation. Segmentation techniques based on discriminant machine learning perform poorly at recognizing long words, which are mainly idioms and patterned expressions (times, dates, etc.). Existing discriminant word segmentation systems give special handling only to patterned expressions with fixed formats (times, dates, etc.); they give no special handling to idioms, even though idioms are a common kind of long word. Existing methods segment text using character features within a window of limited size, so they cannot capture long-distance dependencies, and their segmentation precision on long words such as idioms is often low.
Existing discriminant segmentation methods rely mainly on character features within a window of a specified size. Such methods perform poorly on long words. The only way for them to improve long-word segmentation is to enlarge the window, but enlarging the window greatly increases the amount of computation.
A new method is therefore needed to improve the accuracy of long-word segmentation.
Summary of the invention
The technical problem to be solved by the present invention is to provide a discriminant word segmentation system and method using idiom knowledge that can significantly improve segmentation performance on long words without increasing computational complexity, and thereby improve the accuracy of long-word segmentation in the segmentation algorithm.
To solve this problem, the present invention provides a discriminant word segmentation method using idiom knowledge, comprising the following steps:
First step: train the word segmentation knowledge base, comprising:
Step 1: extract basic features from manually segmented text;
Step 2: extract idiom features from the manually segmented text;
Step 3: train on the extracted features to obtain the knowledge base used for word segmentation;
Second step: obtain basic features from the original text to be segmented;
Third step: obtain idiom features from the original text to be segmented;
Fourth step: segment the original text to be segmented using the word segmentation knowledge base obtained by training in the first step.
In step 1 of the first step, extracting basic features from the manually segmented text mainly means extracting character features within a window of a given size.
In step 2 of the first step, the idiom features are extracted using an idiom dictionary.
In the third step, the idiom features are obtained using an idiom dictionary.
The idiom features are obtained by matching the idiom dictionary against the original text to be segmented. If a run of characters is matched successfully, each of those characters is assigned one of the following features:
The character is the first character of a matched idiom: B-Idiom;
The character is a middle character of a matched idiom: I-Idiom;
The character is the last character of a matched idiom: E-Idiom;
All characters that are not matched are assigned Other.
The idiom dictionary is matched against the original text to be segmented using forward maximum matching or reverse maximum matching.
In addition, the present invention provides a discriminant word segmentation system using idiom knowledge, comprising a word segmentation knowledge base training module, a basic feature acquisition module, an idiom feature acquisition module, and a word segmentation module. The word segmentation knowledge base training module comprises a basic feature extraction module, an idiom feature extraction module, and a training module.
The basic feature extraction module extracts basic features from manually segmented text. The idiom feature extraction module extracts idiom features from the manually segmented text. The training module trains on the extracted features to obtain the word segmentation knowledge base. The basic feature acquisition module obtains basic features from the original text to be segmented. The idiom feature acquisition module obtains idiom features from the original text to be segmented. The word segmentation module segments the original text to be segmented using the word segmentation knowledge base obtained by the word segmentation knowledge base training module.
The idiom feature extraction module and the idiom feature acquisition module match the idiom dictionary against the original text to be segmented and assign the idiom features.
The beneficial effects of the invention are as follows. The method uses an idiom dictionary as a knowledge source and uses the results of automatically matching the idiom dictionary against the original text to be segmented as segmentation features in a discriminant machine learning algorithm; these added features improve the accuracy with which the segmentation algorithm segments idioms. The method uses idiom dictionary knowledge as an enhancing feature in combination with the original character features, and can significantly improve segmentation performance on long words without increasing computational complexity.
Description of the drawings
Fig. 1 is a schematic diagram of the process of training the word segmentation knowledge base in the method of the invention;
Fig. 2 is a schematic diagram of the process of segmenting text with the word segmentation knowledge base in the method of the invention;
Fig. 3 is a schematic diagram of the module structure of the system of the invention.
Detailed description
The method of the invention uses an idiom dictionary as a knowledge source and uses the results of automatically matching the idiom dictionary against the original text to be segmented as segmentation features in a discriminant machine learning algorithm; these added features improve the accuracy with which the segmentation algorithm segments idioms.
The specific flow of the discriminant word segmentation method using idiom knowledge is shown in Fig. 1 and Fig. 2. Fig. 1 shows the process of training the word segmentation knowledge base, and Fig. 2 shows the process of segmenting text with the word segmentation knowledge base.
As shown in Fig. 1, training the word segmentation knowledge base in the method of the invention comprises the following steps:
Step 1: extract basic features from manually segmented text (a manually segmented corpus), mainly character features within a window of a given size; existing segmentation training modules also have this step;
Step 2: extract idiom features from the manually segmented text; no existing segmentation training module has this step. Other forms of idiom features can be obtained by changing how the idiom features are acquired: for example, they can be obtained from an idiom dictionary (see Fig. 1), or from a "complete collection of idioms" sourced from the web, and so on;
Step 3: train on the extracted features to obtain the knowledge base used for word segmentation.
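As an illustration of Steps 1 and 3 above, a minimal Python sketch of window-based character features and per-character training labels might look as follows. The patent does not specify the feature templates or the label scheme; the unigram and bigram templates, the B/M/E/S labels, the padding symbols, the sample sentence, and the function names are all assumptions made for this example.

```python
def words_to_labels(words):
    """Turn manually segmented words into one label per character.

    Assumed scheme (not stated in the patent): B = first, M = middle,
    E = last character of a multi-character word; S = single-character word.
    """
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


def window_features(chars, i, window=2):
    """Character unigrams and adjacent bigrams within +/-window of position i."""
    def at(k):
        # Pad positions outside the sentence with boundary symbols.
        if k < 0:
            return "<s>"
        if k >= len(chars):
            return "</s>"
        return chars[k]

    feats = {}
    for off in range(-window, window + 1):
        feats[f"C{off}"] = at(i + off)
    for off in range(-window, window):
        feats[f"C{off}C{off + 1}"] = at(i + off) + at(i + off + 1)
    return feats


# Hypothetical manually segmented sentence used only for illustration.
words = ["我", "爱", "北京"]
chars = list("".join(words))
labels = words_to_labels(words)
# One (features, label) training row per character.
rows = [(window_features(chars, i, window=1), labels[i]) for i in range(len(chars))]
```

Rows of this kind, extended with the idiom feature of each character, are what a discriminant learner would be trained on in Step 3.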
As shown in Fig. 2, segmenting the original text with the word segmentation knowledge base in the method of the invention comprises the following steps:
Step 1: obtain basic features from the original text to be segmented;
Step 2: obtain idiom features from the original text to be segmented. Other forms of idiom features can be obtained by changing how the idiom features are acquired: for example, they can be obtained from an idiom dictionary (see Fig. 2), or from a "complete collection of idioms" sourced from the web, and so on;
Step 3: segment the original text to be segmented using the word segmentation knowledge base obtained by the training steps of Fig. 1.
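Once a trained model has predicted a label for each character in Step 3, the labels still have to be converted back into words. The patent does not name a label scheme or a decoder, so the sketch below assumes B/M/E/S labels (begin, middle, end, single) and a hypothetical helper `labels_to_words`; it only illustrates the final conversion, not the model itself.

```python
def labels_to_words(chars, labels):
    """Rebuild words from per-character labels (assumed B/M/E/S scheme)."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        # A new word starts at B or S; flush any unfinished word first so
        # that ill-formed label sequences still yield a segmentation.
        if label in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # A word ends at E or S.
        if label in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Hypothetical model output for an illustrative sentence.
segmented = labels_to_words("祝你一帆风顺", ["S", "S", "B", "M", "M", "E"])
```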
The idiom features in Fig. 1 and Fig. 2 can be obtained as follows:
The idiom features are obtained by matching an idiom dictionary against the original text to be segmented (forward maximum matching or reverse maximum matching). If a run of characters is matched successfully, each of those characters is assigned one of the following features:
The character is the first character of a matched idiom: B-Idiom;
The character is a middle character of a matched idiom: I-Idiom;
The character is the last character of a matched idiom: E-Idiom;
All characters that are not matched are assigned Other.
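The tagging scheme above can be sketched in Python as follows. This is a minimal illustration that assumes forward maximum matching, an in-memory set as the idiom dictionary, and idioms of at least two characters; the function name `idiom_features` and the sample idiom are hypothetical and not taken from the patent.

```python
def idiom_features(text, idioms):
    """Assign B-Idiom / I-Idiom / E-Idiom / Other to every character of text.

    idioms: a set of idiom strings (assumed to have at least 2 characters).
    Matching is forward maximum matching: at each position the longest
    dictionary idiom starting there wins.
    """
    max_len = max((len(w) for w in idioms), default=0)
    tags = ["Other"] * len(text)
    i = 0
    while i < len(text):
        matched = 0
        # Try the longest candidate first, then shorten it one character at a time.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in idioms:
                matched = n
                break
        if matched:
            tags[i] = "B-Idiom"
            for k in range(i + 1, i + matched - 1):
                tags[k] = "I-Idiom"
            tags[i + matched - 1] = "E-Idiom"
            i += matched
        else:
            i += 1
    return tags


# Illustrative input: one four-character idiom inside a six-character sentence.
tags = idiom_features("祝你一帆风顺", {"一帆风顺"})
# tags == ["Other", "Other", "B-Idiom", "I-Idiom", "I-Idiom", "E-Idiom"]
```

Each tag is a single extra feature per character, so the feature window of the learner does not have to grow to cover a whole idiom.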
Forward (reverse) maximum matching works as follows. Let N be the number of characters in the longest word in the dictionary. Scanning the sentence from left to right (forward) or from right to left (reverse), take N characters and look them up in the dictionary. If the lookup fails, remove the last character of the candidate and try again, until a word of M characters is found in the dictionary. Once those M characters are matched, take the next N characters and repeat, until the end of the sentence. For example:
Suppose that dictionary is:
Netanyahu
Talk nonsense
Of
Really
Really
Really
Reasonable
Suppose also that the longest word in the dictionary has 5 characters.
The result of forward matching on "Netanyahu says really reason really" is "Netanyahu says really reason really";
The result of reverse matching on "Netanyahu says really reason really" is "Netanyahu says really reason really".
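To make the two matching directions concrete, here is a small Python sketch of forward and reverse maximum matching. It is an illustration only: the function names, the fallback of emitting a single character when nothing matches, and the toy dictionary and sentence are assumptions, and the toy data is not the patent's own example. In the reverse direction the candidate is shortened from its left end, which mirrors dropping the last character in the forward scan.

```python
def forward_max_match(sentence, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    out, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            # Fall back to a single character if no dictionary word matches.
            if n == 1 or piece in dictionary:
                out.append(piece)
                i += n
                break
    return out


def reverse_max_match(sentence, dictionary, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    out, j = [], len(sentence)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            piece = sentence[j - n:j]
            if n == 1 or piece in dictionary:
                out.append(piece)
                j -= n
                break
    out.reverse()
    return out


# Toy dictionary and sentence chosen so that the two directions disagree.
toy_dict = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
fwd = forward_max_match(sentence, toy_dict, max_len=3)   # ["研究生", "命", "起源"]
rev = reverse_max_match(sentence, toy_dict, max_len=3)   # ["研究", "生命", "起源"]
```

The two scans can produce different segmentations of the same sentence, which is why the direction is stated explicitly when the idiom dictionary is matched.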
As shown in Fig. 3, the discriminant word segmentation system using idiom knowledge comprises a word segmentation knowledge base training module, a basic feature acquisition module, an idiom feature acquisition module, and a word segmentation module. The word segmentation knowledge base training module comprises a basic feature extraction module, an idiom feature extraction module, and a training module. The basic feature extraction module extracts basic features from manually segmented text. The idiom feature extraction module extracts idiom features from the manually segmented text. The training module trains on the features extracted by the basic feature extraction module and the idiom feature extraction module to obtain the word segmentation knowledge base. The basic feature acquisition module obtains basic features from the original text to be segmented. The idiom feature acquisition module obtains idiom features from the original text to be segmented. The word segmentation module segments the original text to be segmented using the word segmentation knowledge base obtained by the word segmentation knowledge base training module.
The idiom feature extraction module and the idiom feature acquisition module match the idiom dictionary against the original text to be segmented and assign the idiom features.

Claims (8)

1. A discriminant word segmentation method using idiom knowledge, characterized in that it comprises the following steps:
First step: train the word segmentation knowledge base, comprising:
Step 1: extract basic features from manually segmented text;
Step 2: extract idiom features from the manually segmented text;
Step 3: train on the extracted features to obtain the knowledge base used for word segmentation;
Second step: obtain basic features from the original text to be segmented;
Third step: obtain idiom features from the original text to be segmented;
Fourth step: segment the original text to be segmented using the word segmentation knowledge base obtained by training in the first step.
2. The discriminant word segmentation method using idiom knowledge according to claim 1, characterized in that, in step 1 of the first step, extracting basic features from the manually segmented text mainly means extracting character features within a window of a given size.
3. The discriminant word segmentation method using idiom knowledge according to claim 1, characterized in that, in step 2 of the first step, the idiom features are extracted using an idiom dictionary.
4. The discriminant word segmentation method using idiom knowledge according to claim 1, characterized in that, in the third step, the idiom features are obtained using an idiom dictionary.
5. The discriminant word segmentation method using idiom knowledge according to claim 3 or 4, characterized in that the idiom features are obtained by matching the idiom dictionary against the original text to be segmented, and if a run of characters is matched successfully, each of those characters is assigned one of the following features:
The character is the first character of a matched idiom: B-Idiom;
The character is a middle character of a matched idiom: I-Idiom;
The character is the last character of a matched idiom: E-Idiom;
All characters that are not matched are assigned Other.
6. The discriminant word segmentation method using idiom knowledge according to claim 5, characterized in that the idiom dictionary is matched against the original text to be segmented using forward maximum matching or reverse maximum matching.
7. A discriminant word segmentation system using idiom knowledge, characterized in that it comprises a word segmentation knowledge base training module, a basic feature acquisition module, an idiom feature acquisition module, and a word segmentation module; the word segmentation knowledge base training module comprises a basic feature extraction module, an idiom feature extraction module, and a training module;
The basic feature extraction module extracts basic features from manually segmented text; the idiom feature extraction module extracts idiom features from the manually segmented text; the training module trains on the extracted features to obtain the word segmentation knowledge base; the basic feature acquisition module obtains basic features from the original text to be segmented; the idiom feature acquisition module obtains idiom features from the original text to be segmented; the word segmentation module segments the original text to be segmented using the word segmentation knowledge base obtained by the word segmentation knowledge base training module.
8. The discriminant word segmentation system using idiom knowledge according to claim 7, characterized in that the idiom feature extraction module and the idiom feature acquisition module match the idiom dictionary against the original text to be segmented and assign the idiom features.
CN2010102216290A 2010-07-08 2010-07-08 Discriminant word segmentation system and method using idiom knowledge Pending CN102314415A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102216290A CN102314415A (en) 2010-07-08 2010-07-08 Discriminant word segmentation system and method using idiom knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102216290A CN102314415A (en) 2010-07-08 2010-07-08 Discriminant word segmentation system and method using idiom knowledge

Publications (1)

Publication Number Publication Date
CN102314415A true CN102314415A (en) 2012-01-11

Family

ID=45427598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102216290A Pending CN102314415A (en) 2010-07-08 2010-07-08 Discriminant word segmentation system and method using idiom knowledge

Country Status (1)

Country Link
CN (1) CN102314415A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013420A (en) * 2006-12-31 2007-08-08 中国科学院计算技术研究所 Method for identifying coding form of Chinese text
CN101082908A (en) * 2007-06-26 2007-12-05 腾讯科技(深圳)有限公司 Method and system for dividing Chinese sentences
CN101510221A (en) * 2009-02-17 2009-08-19 北京大学 Enquiry statement analytical method and system for information retrieval

Similar Documents

Publication Publication Date Title
Pouget-Abadie et al. Overcoming the curse of sentence length for neural machine translation using automatic segmentation
CN109255113B (en) Intelligent proofreading system
CN107945805B (en) A kind of across language voice identification method for transformation of intelligence
CN103971686B (en) Method and system for automatically recognizing voice
CN105426539B (en) A kind of lucene Chinese word cutting method based on dictionary
CN106569995B (en) Chinese ancient poetry word automatic generation method based on corpus and rules and forms rule
CN103123618B (en) Text similarity acquisition methods and device
CN103956162A (en) Voice recognition method and device oriented towards child
CN107578769A (en) Speech data mask method and device
CN104750687A (en) Method for improving bilingual corpus, device for improving bilingual corpus, machine translation method and machine translation device
CN103324626B (en) A kind of set up the method for many granularities dictionary, the method for participle and device thereof
CN110377724A (en) A kind of corpus keyword Automatic algorithm based on data mining
CN101645269A (en) Language recognition system and method
CN109065032A (en) A kind of external corpus audio recognition method based on depth convolutional neural networks
WO2017177809A1 (en) Word segmentation method and system for language text
CN103778243A (en) Domain term extraction method
CN108804608A (en) A kind of microblogging rumour position detection method based on level attention
CN102708147A (en) Recognition method for new words of scientific and technical terminology
CN105095196A (en) Method and device for finding new word in text
CN110853629A (en) Speech recognition digital method based on deep learning
CN111160003B (en) Sentence breaking method and sentence breaking device
CN104156349A (en) Unlisted word discovering and segmenting system and method based on statistical dictionary model
CN103559181A (en) Establishment method and system for bilingual semantic relation classification model
CN109086266A (en) A kind of error detection of text nearly word form and proofreading method
CN105869622B (en) Chinese hot word detection method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHANGHAI GUOKE ELECTRONIC CO., LTD.

Free format text: FORMER OWNER: SHENGYUE INFORMATION TECHNOLOGY (SHANGHAI) CO., LTD.

Effective date: 20140728

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20140728

Address after: 201203 Pudong New Area Huaxia Road, Lane No. 958, No. 60, Shanghai

Applicant after: Shanghai Guoke Electronic Co., Ltd.

Address before: 201203 Shanghai Guo Shou Jing Road, Pudong New Area Zhangjiang hi tech Park No. 356

Applicant before: Shengle Information Technology (Shanghai) Co., Ltd.

CB02 Change of applicant information

Address after: 201203 Shanghai Zhang Heng Road, Lane 666, No. 8, building 1, Pudong New Area

Applicant after: SHANGHAI GEAK ELECTRONICS CO., LTD.

Address before: 201203 Pudong New Area Huaxia Road, Lane No. 958, No. 60, Shanghai

Applicant before: Shanghai Guoke Electronic Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20120111