CN108845982A - Chinese word segmentation method based on word association features - Google Patents
Chinese word segmentation method based on word association features
- Publication number
- CN108845982A (application CN201711293044.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention relates to a Chinese word segmentation method based on word association features, belonging to the technical field of information processing. Text to be processed is selected from a text library, and the text library is preprocessed: symbols are removed and the remaining text is formed into sentences, which are used to build a corpus. Using the front-and-back splicing segmentation method, the corpus is segmented into fragments. Binary, ternary, and quaternary front-and-back splicing form a binary candidate dictionary, a ternary candidate dictionary, and a quaternary candidate dictionary. A word-frequency threshold is then set for the counted candidate words and applied as a decision: candidates satisfying the threshold are retained, forming a new corpus.
Description
Technical field
The present invention relates to a Chinese word segmentation method based on word association features, and belongs to the technical field of information processing.
Background art
Chinese word segmentation belongs to the field of natural language processing. In short, people can use their own knowledge to recognize which character strings are words and which are not, but how can a computer be made to do the same? That process is the task of a segmentation algorithm. Existing segmentation algorithms fall into three categories: segmentation methods based on understanding, segmentation methods based on string matching, and traditional segmentation methods based on statistics.
The understanding-based segmentation method has the computer simulate human comprehension of sentences in order to recognize words. Its basic idea is to perform syntactic and semantic analysis during segmentation, using syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a segmentation subsystem, a syntax-semantics subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity; that is, it simulates the human process of understanding a sentence. This segmentation method requires a large amount of linguistic knowledge and information. Because knowledge of the Chinese language is general and complex, it is difficult to organize the various kinds of linguistic information into a form a machine can read directly, so understanding-based segmentation systems are still at the experimental stage.
The string-matching segmentation method, also called the mechanical segmentation method, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by length-priority strategy, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation with tagging. The common mechanical segmentation methods are: (1) the forward maximum matching method (left to right); (2) the reverse maximum matching method (right to left); and (3) minimum cutting (minimizing the number of words cut from each sentence). These three mechanical methods can also be combined with one another; for example, the forward maximum matching method and the reverse maximum matching method can be combined into a bidirectional matching method. Because of the characteristics of Chinese word formation, forward minimum matching and reverse minimum matching are rarely used. In general, the cutting precision of reverse matching is slightly higher than that of forward matching, and it encounters fewer ambiguities. Practical segmentation systems all use mechanical segmentation as an initial step and then further improve cutting accuracy with various other kinds of linguistic information. One improvement refines the scanning method, known as feature scanning or feature cutting: words with obvious features are identified and cut out of the string to be analyzed first, and with these words as breakpoints the original string is divided into smaller strings before mechanical segmentation, reducing the matching error rate. Another improvement combines segmentation with part-of-speech tagging, using rich part-of-speech information to aid segmentation decisions, and in turn checking and adjusting the segmentation result during tagging, which greatly improves cutting accuracy.
Among the above string-matching (mechanical) segmentation methods, whether the forward maximum matching method, the reverse maximum matching method, or minimum cutting, the goal of maximum matching is to make each cut word match the longest possible dictionary entry. The advantage of maximum matching is that the principle is simple and easy to implement; the disadvantage is that the maximum matching length is hard to choose: if it is too large, time complexity rises, and if it is too small, words longer than that length cannot be matched, reducing segmentation accuracy. The guiding principle of maximum matching is "long words first". However, whether forward or reverse, adding or removing characters, existing maximum matching methods perform maximum matching only within a local range: each match considers only the first i or the last i characters, which does not fully embody the "long words first" principle.
The principle of the traditional statistics-based segmentation method is that, formally, a word is a stable combination of characters; therefore, in context, the more often adjacent characters co-occur, the more likely they are to constitute a word. The frequency or probability of adjacent character co-occurrence thus reflects the credibility of word formation. The frequency of every adjacent character combination co-occurring in a corpus can be counted and their mutual information computed. The mutual information of two characters is defined from the adjacent co-occurrence probability of the two Chinese characters X and Y, and it embodies the tightness of the bond between the characters. When the tightness exceeds some threshold, the character group may be judged to constitute a word. This method only needs to count character-group frequencies in the corpus and requires no cutting dictionary, so it is also called dictionary-free segmentation or statistical word extraction. But this method also has limitations: it often extracts character groups that co-occur frequently but are not words, such as "this", "one of", "having", "my", and "many"; its recognition accuracy for common words is poor; and its time and space overhead is large.
Summary of the invention
The technical problem to be solved by the present invention is to provide a Chinese word segmentation method based on word association features, in order to overcome the defect of the prior art that words cannot be effectively identified and extracted from a large-scale corpus, and to enable a computer system to effectively identify and extract words from a large-scale corpus.
The technical scheme of the invention is a Chinese word segmentation method based on word association features:
A. Select the text to be processed from a text library and preprocess the text library: remove symbols and form the remaining text into sentences, and use the symbol-stripped sentences to build a corpus;
B. Segment the corpus of step A with the front-and-back splicing segmentation method to form segmentation fragments;
C. Apply the binary, ternary, and quaternary front-and-back splicing methods to form a binary candidate dictionary, a ternary candidate dictionary, and a quaternary candidate dictionary;
D. Perform word-frequency statistics on the binary, ternary, and quaternary candidate words in the three candidate dictionaries;
E. Set a word-frequency threshold for the counted candidate words and apply it as a decision: candidate words that meet the threshold are retained, forming a new corpus, and candidate words that do not meet it are deleted;
F. Compute the freedom degree and coagulation degree of the candidate words in the corpus processed in step E, give all candidate words a uniform freedom-degree threshold and coagulation-degree threshold, and apply the decision: candidate words that satisfy it are retained, and those that do not are deleted;
G. Apply the segmentation filtering method to further filter the screened ternary and quaternary candidate words, forming the new dictionary.
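Steps A through E can be sketched as a minimal pipeline. The sample corpus, the punctuation pattern, and the threshold value below are illustrative assumptions, not taken from the patent.

```python
import re
from collections import Counter

def preprocess(text):
    # Step A: strip symbols and split the raw text into sentences.
    return [s for s in re.split(r"[，。！？；,.!?;\s]+", text) if s]

def ngram_candidates(sentences, n):
    # Steps B-D: slide a window of width n over each sentence to form
    # the n-gram candidate fragments and count their frequencies.
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

def frequency_filter(counts, threshold):
    # Step E: keep only candidates whose frequency meets the threshold.
    return {w: c for w, c in counts.items() if c >= threshold}

sentences = preprocess("我们学习中文分词。中文分词很有用。学习很有用。")
bigram_counts = ngram_candidates(sentences, 2)
kept = frequency_filter(bigram_counts, 2)
```

The same `ngram_candidates` call with n = 3 and n = 4 would populate the ternary and quaternary candidate dictionaries of step C.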
The front-and-back splicing method cuts a Chinese text continuously starting from its first character and splices adjacent characters into fragments. Specifically:
Assume the content of a Chinese text is the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a single character in the text and n ∈ N;
Applying the binary front-and-back splicing method to the text yields the binary text-fragment set:
{(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Applying the ternary front-and-back splicing method yields the ternary text-fragment set:
{(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Applying the quaternary front-and-back splicing method yields the quaternary text-fragment set:
{(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
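The three fragment sets above can be produced with one sliding-window helper; the sample string is an arbitrary stand-in for the character sequence {a_i, ..., a_{i+n}}.

```python
def splice(chars, n):
    # Front-and-back splicing: every run of n consecutive characters
    # becomes one candidate fragment.
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

chars = "abcdef"                # stands in for a_i ... a_{i+5}
binary = splice(chars, 2)       # (a_i a_{i+1}), (a_{i+1} a_{i+2}), ...
ternary = splice(chars, 3)      # (a_i a_{i+1} a_{i+2}), ...
quaternary = splice(chars, 4)   # (a_i a_{i+1} a_{i+2} a_{i+3}), ...
```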
The freedom degree refers to the following: a text fragment can appear in many different contexts, and thus has a left-adjacent character set and a right-adjacent character set. The left-adjacent set is the set of characters that appear immediately to the left of the text fragment, and the right-adjacent set is the set of characters that appear immediately to its right. The information entropies of the left-adjacent set and the right-adjacent set are computed, and the freedom degree of the fragment is the smaller of the two entropies:
H = min{s', s''}
where H denotes the freedom degree of the candidate word, s' denotes the right entropy of the candidate word, and s'' denotes the left entropy of the candidate word.
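The freedom-degree definition above can be sketched as follows. Natural-log entropy is an assumption here (the patent does not fix the logarithm base), and the corpus string is invented for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    # Information entropy of a neighbor-character frequency table.
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def freedom_degree(corpus, fragment):
    # H = min{s', s''}: the smaller of the right- and left-neighbor entropies.
    left, right = Counter(), Counter()
    start = corpus.find(fragment)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(fragment)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(fragment, start + 1)
    if not left or not right:
        return 0.0  # a fragment pinned to a text edge has no neighbor entropy
    return min(entropy(left), entropy(right))
```

A fragment whose neighbors are varied (high entropy on both sides) is "free" and likely a word boundary-respecting unit; a fragment always preceded or followed by the same character scores low.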
The coagulation degree refers to the fact that, within a text, the probability of a new word occurring as a whole is higher than the product of the probabilities of its component parts, i.e. P(AB) > P(A)P(B). Let
M = P(AB) / (P(A) P(B))
and take the smallest M as the coagulation degree, where AB denotes a new word, P(AB) denotes the probability that the new word occurs in the text, A and B denote the parts whose combination produces the word, and P(A) and P(B) denote the probabilities that those parts occur in the text.
The coagulation degree of a candidate word is obtained by computing, in the corpus, the ratio of the candidate word's probability to the product of the probabilities of its component parts. The specific steps are:
(1) The coagulation degree M_2 of a binary candidate word is the ratio of the candidate word's probability to the product of its characters' probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N
where M_2 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the binary candidate word occurs in the corpus, N_i denotes the number of times the first character of the binary candidate word occurs in the corpus, N_{i+1} denotes the number of times its second character occurs in the corpus, N_{i,i+1} denotes the number of times the binary candidate word occurs in the corpus, N denotes the total number of characters in the corpus, s_{i+1} denotes the probability that the second character of the binary candidate word occurs in the corpus, and p(i, i+1) denotes the probability that the binary candidate word occurs in the corpus;
(2) The coagulation degree M_3 of a ternary candidate word is the smaller of the ratios over its two split points:
M_3 = min{ p(i, i+1, i+2) / (s_i · s_{i+1,i+2}), p(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) }
where M_3 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the ternary candidate word occurs in the corpus, s_{i+1,i+2} denotes the probability that the last two characters of the ternary candidate word occur together in the corpus, s_{i,i+1} denotes the probability that its first two characters occur together in the corpus, s_{i+2} denotes the probability that its last character occurs in the corpus, N_i denotes the number of times the first character of the ternary candidate word occurs in the corpus, N_{i+2} denotes the number of times its third character occurs in the corpus, N_{i+1,i+2} denotes the number of times its last two characters occur in the corpus, N_{i,i+1} denotes the number of times its first two characters occur in the corpus, N_{i,i+1,i+2} denotes the number of times the ternary candidate word occurs in the corpus, and p(i, i+1, i+2) denotes the probability that the ternary candidate word occurs in the corpus;
(3) The coagulation degree M_4 of a quaternary candidate word is the smallest of the ratios over its three split points:
M_4 = min{ p(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), p(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), p(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) }
where M_4 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the quaternary candidate word occurs in the corpus, s_{i+1,i+2,i+3} denotes the probability that its last three characters occur together in the corpus, s_{i,i+1,i+2} denotes the probability that its first three characters occur together in the corpus, s_{i,i+1} denotes the probability that its first two characters occur together in the corpus, s_{i+2,i+3} denotes the probability that its last two characters occur together in the corpus, s_{i+3} denotes the probability that its last character occurs in the corpus, N_i denotes the number of times the first character of the quaternary candidate word occurs in the corpus, N_{i+3} denotes the number of times its fourth character occurs in the corpus, N_{i,i+1} denotes the number of times its first two characters occur in the corpus, N_{i+2,i+3} denotes the number of times its last two characters occur in the corpus, N_{i+1,i+2,i+3} denotes the number of times its last three characters occur in the corpus, N_{i,i+1,i+2} denotes the number of times its first three characters occur in the corpus, and p(i, i+1, i+2, i+3) denotes the probability that the quaternary candidate word occurs in the corpus.
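The M_2-M_4 computation can be sketched with one function that takes the minimum ratio over every split point; the corpus and words below are toy stand-ins.

```python
def occurrences(text, s):
    # Count (possibly overlapping) occurrences of substring s.
    return sum(1 for i in range(len(text) - len(s) + 1) if text[i:i + len(s)] == s)

def coagulation(corpus, word):
    # M = p(word) / (p(left part) * p(right part)), minimised over all splits;
    # each probability is an occurrence count divided by the corpus length N.
    n = len(corpus)
    p = lambda s: occurrences(corpus, s) / n
    p_word = p(word)
    if p_word == 0.0:
        return 0.0
    return min(p_word / (p(word[:k]) * p(word[k:])) for k in range(1, len(word)))
```

For a two-character word there is a single split, giving M_2; three- and four-character words take the minimum over two and three splits, matching M_3 and M_4.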
The filtering method for ternary candidate words is: for a ternary candidate word, if its last two characters exist in the binary candidate dictionary, judge whether its first character forms a word with the left-adjacent character; if its first two characters exist in the binary candidate dictionary, judge whether its last character forms a word with the right-adjacent character; this determines whether the ternary candidate is a word.
If (A_{i-2} A_{i-1}) ∉ {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} and (A_{i+1} A_{i+2}) ∉ {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})},
then (A_{i-1} A_i A_{i+1}) belongs to the ternary words;
where (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-adjacent character of the ternary candidate word, A_{i+2} is its right-adjacent character, {A_0 ... A_i ... A_N} is the character set of the corpus, and {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} is the binary candidate word set.
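The edge-binding test can be sketched as follows. The decision rule is a reconstruction of the condition described above (the original formula image is absent), and the dictionary contents are invented for illustration.

```python
def keep_ternary(candidate, left_char, right_char, binary_dict):
    # Reconstructed filter: the ternary candidate survives only when its first
    # character does not form a binary word with the left neighbor and its
    # last character does not form one with the right neighbor.
    binds_left = (left_char + candidate[0]) in binary_dict
    binds_right = (candidate[-1] + right_char) in binary_dict
    return not (binds_left or binds_right)
```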
The filtering method for quaternary candidate words is: for a quaternary candidate word that may be a word, first split it so that its first two characters form one segmentation fragment and its last two characters form another, and match each fragment against the binary dictionary already built; if they do not match, the candidate is taken as a pre-selected word. Then split out the middle two characters of the quaternary word and match them against the binary dictionary; if they do not match, the candidate is again taken as a pre-selected word. If both conditions are satisfied, the candidate is taken as a segmentation result:
{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊄ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} and (A_{i-1}, A_i) ∉ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})}
where {A_{i-2}, (A_{i-1} A_i A_{i+1}), A_{i+2}} denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} denotes the front and back character pairs split from the quaternary candidate word, {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} denotes the binary dictionary, and (A_{i-1}, A_i) denotes the middle pair of the quaternary candidate word.
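One plausible reading of the two-condition filter can be sketched as follows; the original condition images are absent, so this interpretation (a candidate is kept only when no binary decomposition explains it) is an assumption, as is the toy dictionary.

```python
def keep_quaternary(candidate, binary_dict):
    # Assumed reading: a four-character candidate is kept only when neither
    # its front pair, its back pair, nor its middle pair is already a word
    # in the binary dictionary.
    front, back, middle = candidate[:2], candidate[2:], candidate[1:3]
    return (front not in binary_dict
            and back not in binary_dict
            and middle not in binary_dict)
```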
The beneficial effects of the invention are: the method provided by the present invention has comparatively high correctness and validity, and the system can segment words efficiently. The coagulation degree, freedom degree, and the ternary and quaternary segmentation methods designed in the present invention solve well the problems of traditional statistics-based segmentation methods.
Brief description of the drawings
Fig. 1 is a flow diagram of the present invention.
Detailed description of the embodiments
The invention will be further described below with reference to the accompanying drawings and specific embodiments.
Embodiment 1: As shown in Fig. 1, a Chinese word segmentation method based on word association features:
A. Select the text to be processed from a text library and preprocess the text library: remove symbols and form the remaining text into sentences, and use the symbol-stripped sentences to build a corpus;
B. Segment the corpus of step A with the front-and-back splicing segmentation method to form segmentation fragments;
C. Apply the binary, ternary, and quaternary front-and-back splicing methods to form a binary candidate dictionary, a ternary candidate dictionary, and a quaternary candidate dictionary;
D. Perform word-frequency statistics on the binary, ternary, and quaternary candidate words in the three candidate dictionaries;
E. Set a word-frequency threshold for the counted candidate words and apply it as a decision: candidate words that meet the threshold are retained, forming a new corpus, and candidate words that do not meet it are deleted;
F. Compute the freedom degree and coagulation degree of the candidate words in the corpus processed in step E, give all candidate words a uniform freedom-degree threshold and coagulation-degree threshold, and apply the decision: candidate words that satisfy it are retained, and those that do not are deleted;
G. Apply the segmentation filtering method to further filter the screened ternary and quaternary candidate words, forming the new dictionary.
The front-and-back splicing method cuts a Chinese text continuously starting from its first character and splices adjacent characters into fragments. Specifically:
Assume the content of a Chinese text is the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a single character in the text and n ∈ N;
Applying the binary front-and-back splicing method to the text yields the binary text-fragment set:
{(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Applying the ternary front-and-back splicing method yields the ternary text-fragment set:
{(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Applying the quaternary front-and-back splicing method yields the quaternary text-fragment set:
{(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
The freedom degree refers to the following: a text fragment can appear in many different contexts, and thus has a left-adjacent character set and a right-adjacent character set. The left-adjacent set is the set of characters that appear immediately to the left of the text fragment, and the right-adjacent set is the set of characters that appear immediately to its right. The information entropies of the two sets are computed, and the smaller entropy is taken as the freedom degree.
In the obtained text-fragment set, the left-adjacent character set of a text fragment is the set of characters appearing immediately to its left; for example, the fragment (a_{i+1} a_{i+2}) in the text {a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}} has the left-adjacent set {a_i}. The right-adjacent character set of a text fragment is the set of characters appearing immediately to its right; for the same fragment, the right-adjacent set is {a_{i+3}}.
The freedom degree of a candidate word is obtained by computing the information entropies of its left-adjacent and right-adjacent character sets and taking the smaller of the two:
H = min{s', s''}
where H denotes the freedom degree of the candidate word, and s' denotes the right entropy of the candidate word:
s' = -Σ_{i=1}^{k} p(b_i) log p(b_i), with p(b_i) = n_{b_i} / Σ_{j=1}^{k} n_{b_j}
where b_i belongs to the right-adjacent character set of the candidate word, n_{b_i} denotes the frequency with which b_i appears to the right of the candidate word, and k denotes the number of distinct characters in the right-adjacent set; s'' denotes the left entropy of the candidate word:
s'' = -Σ_{i=1}^{M} p(m_i) log p(m_i), with p(m_i) = n_{m_i} / Σ_{j=1}^{M} n_{m_j}
where m_i belongs to the left-adjacent character set of the candidate word, n_{m_i} denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of distinct characters in the left-adjacent set.
The coagulation degree refers to the fact that, within a text, the probability of a new word occurring as a whole is higher than the product of the probabilities of its component parts, i.e. P(AB) > P(A)P(B). Let
M = P(AB) / (P(A) P(B))
and take the smallest M as the coagulation degree, where AB denotes a new word, P(AB) denotes the probability that the new word occurs in the text, A and B denote the parts whose combination produces the word, and P(A) and P(B) denote the probabilities that those parts occur in the text.
The coagulation degree of a candidate word is obtained by computing, in the corpus, the ratio of the candidate word's probability to the product of the probabilities of its component parts. The specific steps are:
(1) The coagulation degree M_2 of a binary candidate word is the ratio of the candidate word's probability to the product of its characters' probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N
where M_2 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the binary candidate word occurs in the corpus, N_i denotes the number of times the first character of the binary candidate word occurs in the corpus, N_{i+1} denotes the number of times its second character occurs in the corpus, N_{i,i+1} denotes the number of times the binary candidate word occurs in the corpus, N denotes the total number of characters in the corpus, s_{i+1} denotes the probability that the second character of the binary candidate word occurs in the corpus, and p(i, i+1) denotes the probability that the binary candidate word occurs in the corpus;
(2) The coagulation degree M_3 of a ternary candidate word is the smaller of the ratios over its two split points:
M_3 = min{ p(i, i+1, i+2) / (s_i · s_{i+1,i+2}), p(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) }
where M_3 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the ternary candidate word occurs in the corpus, s_{i+1,i+2} denotes the probability that the last two characters of the ternary candidate word occur together in the corpus, s_{i,i+1} denotes the probability that its first two characters occur together in the corpus, s_{i+2} denotes the probability that its last character occurs in the corpus, N_i denotes the number of times the first character of the ternary candidate word occurs in the corpus, N_{i+2} denotes the number of times its third character occurs in the corpus, N_{i+1,i+2} denotes the number of times its last two characters occur in the corpus, N_{i,i+1} denotes the number of times its first two characters occur in the corpus, N_{i,i+1,i+2} denotes the number of times the ternary candidate word occurs in the corpus, and p(i, i+1, i+2) denotes the probability that the ternary candidate word occurs in the corpus;
(3) The coagulation degree M_4 of a quaternary candidate word is the smallest of the ratios over its three split points:
M_4 = min{ p(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), p(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), p(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) }
where M_4 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the quaternary candidate word occurs in the corpus, s_{i+1,i+2,i+3} denotes the probability that its last three characters occur together in the corpus, s_{i,i+1,i+2} denotes the probability that its first three characters occur together in the corpus, s_{i,i+1} denotes the probability that its first two characters occur together in the corpus, s_{i+2,i+3} denotes the probability that its last two characters occur together in the corpus, s_{i+3} denotes the probability that its last character occurs in the corpus, N_i denotes the number of times the first character of the quaternary candidate word occurs in the corpus, N_{i+3} denotes the number of times its fourth character occurs in the corpus, N_{i,i+1} denotes the number of times its first two characters occur in the corpus, N_{i+2,i+3} denotes the number of times its last two characters occur in the corpus, N_{i+1,i+2,i+3} denotes the number of times its last three characters occur in the corpus, N_{i,i+1,i+2} denotes the number of times its first three characters occur in the corpus, and p(i, i+1, i+2, i+3) denotes the probability that the quaternary candidate word occurs in the corpus.
The filter method for ternary candidate words is: for a ternary candidate word, if its last two characters are present in the binary candidate dictionary, judge whether its first character forms a word with the left-adjacent character; if its first two characters are present in the binary candidate dictionary, judge whether its last character forms a word with the right-adjacent character; this determines whether the ternary candidate is a word.

If (A_{i-2}, A_{i-1}) ∉ {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)} and (A_{i+1}, A_{i+2}) ∉ {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)},

then (A_{i-1}A_iA_{i+1}) belongs to the ternary words;

where (A_{i-1}A_iA_{i+1}) is the ternary candidate word, A_{i-2} is a left-adjacent character of the ternary candidate word, A_{i+2} is a right-adjacent character of the ternary candidate word, {A_0, ..., A_i, ..., A_N} is the character set of the corpus, and {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)} is the binary candidate word set.
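A minimal Python sketch of this ternary filter, under the reading that a trigram survives only when neither flanking character pair is itself a binary candidate word (function and variable names are illustrative, not from the patent):

```python
def filter_trigram(left_char, trigram, right_char, bigram_dict):
    """Keep a trigram A[i-1]A[i]A[i+1] only if its first character does not
    bind to the left neighbour and its last character does not bind to the
    right neighbour, i.e. neither flanking pair is a known bigram."""
    binds_left = (left_char + trigram[0]) in bigram_dict
    binds_right = (trigram[2] + right_char) in bigram_dict
    return not binds_left and not binds_right
```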
The filter method for quaternary candidate words is: a quaternary candidate that may form a word is first split, the first two characters forming one segmentation fragment and the last two characters forming another; each fragment is matched against the binary dictionary already built, and a match qualifies the candidate as a pre-selected word. The middle two characters of the quaternary word are then split off and matched against the binary dictionary, and the absence of a match qualifies the candidate as a pre-selected word. If both conditions are met, the candidate is taken as a segmentation result:

{(A_{i-2},A_{i-1}), (A_i,A_{i+1})} ⊆ {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})} and (A_{i-1},A_i) ∉ {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})}

where (A_{i-2}A_{i-1}A_iA_{i+1}) denotes the quaternary candidate word, {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} denotes the front and back two-character fragments split from the quaternary candidate word, {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})} denotes the binary dictionary, and (A_{i-1},A_i) denotes the middle pair of the quaternary candidate word.
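The quaternary filter can be sketched the same way: both outer halves must match the binary dictionary while the middle pair must not (names are illustrative, not from the patent):

```python
def filter_quadgram(quad, bigram_dict):
    """Keep a 4-gram only when both its outer two-character halves are known
    bigrams but the middle pair is not, so the 4-gram cannot be re-segmented
    through the middle."""
    front, back = quad[:2], quad[2:]
    middle = quad[1:3]
    return front in bigram_dict and back in bigram_dict and middle not in bigram_dict
```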
Embodiment 2: As shown in Figure 1, a text to be processed is selected from the text library, and the text library is pre-processed, including removing symbols and forming the text into sentences; the sentences with symbols removed are used to build the corpus.
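A minimal sketch of this pre-processing step in Python, assuming that "symbols" are anything outside the CJK Unified Ideographs range and that common Chinese sentence-final punctuation delimits sentences (both assumptions are mine, not stated in the patent):

```python
import re

def build_corpus(text):
    """Split the raw text into sentences at Chinese sentence-final
    punctuation, then strip every non-CJK character from each sentence."""
    sentences = re.split(r"[。！？；!?;\n]+", text)
    cleaned = []
    for s in sentences:
        s = re.sub(r"[^\u4e00-\u9fff]", "", s)  # keep CJK characters only
        if s:
            cleaned.append(s)
    return cleaned
```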
Using the front-and-back splicing segmentation method, the corpus from step a1 is segmented to form segmentation fragments.
Using the binary front-and-back cut-and-splice method, the ternary front-and-back cut-and-splice method and the quaternary front-and-back cut-and-splice method, a binary candidate dictionary, a ternary candidate dictionary and a quaternary candidate dictionary are formed.
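The cut-and-splice step amounts to sliding an n-character window over each sentence; a sketch with illustrative names:

```python
def cut_and_splice(sentence, n):
    """Slide an n-character window over the sentence to produce the
    front-and-back spliced n-character fragments."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

def candidate_fragments(corpus, n):
    """Collect the n-character fragments of every sentence in the corpus;
    n = 2, 3, 4 gives the binary, ternary and quaternary candidates."""
    frags = []
    for sentence in corpus:
        frags.extend(cut_and_splice(sentence, n))
    return frags
```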
A word-frequency threshold is set for the candidate words whose frequencies have been counted, and a decision is made: candidates satisfying the threshold are retained, forming a new corpus.
The cohesion degree and degree of freedom of the candidate words are then computed. In this embodiment, the cohesion degree of a candidate word can be obtained by calculating the ratio of the candidate word's probability in the corpus to the product of the independent probabilities of its parts; the degree of freedom of a candidate word can be obtained by calculating the information entropies of its left- and right-adjacent character sets and taking the smaller as the degree of freedom.
The cohesion degree and degree of freedom of each candidate word are compared with the set thresholds, and the candidate words exceeding the thresholds are extracted as the candidate dictionary.
In this embodiment, the novel Journey to the West, one of the Four Great Classical Novels, was collected as the corpus. In the counted dictionary, if a word-forming text fragment is well distributed, its cohesion degree is higher than that of fragments which do not form words, and its degree of freedom is larger. If the characters adjacent to a word are regarded as random variables, the information entropy of the left- and right-adjacent character sets of a word reflects the randomness of its neighbours: the larger the entropy, the richer the word's left- or right-adjacent character set. We take the smaller of the two entropies as the degree of freedom.
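The degree of freedom described here, the smaller of the left- and right-neighbour entropies, can be sketched in Python (function names and the toy corpus are illustrative):

```python
import math
from collections import Counter

def entropy(neighbours):
    """Shannon entropy of a list of neighbouring characters."""
    counts = Counter(neighbours)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def freedom_degree(word, corpus):
    """Degree of freedom = min(left-neighbour entropy, right-neighbour entropy)."""
    left, right = [], []
    for sent in corpus:
        start = sent.find(word)
        while start != -1:
            if start > 0:
                left.append(sent[start - 1])
            end = start + len(word)
            if end < len(sent):
                right.append(sent[end])
            start = sent.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(entropy(left), entropy(right))
```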
In this embodiment, a word-forming fragment has a higher cohesion degree, indicating a stronger degree of association between its characters; when calculating the cohesion degree, we take the smallest value over the possible splits as the final cohesion degree.
We take the dictionary obtained from the degree-of-freedom and cohesion statistics as the candidate dictionary, then process the ternary candidate dictionary and the quaternary candidate dictionary with the ternary segment filter method and the quaternary segment filter method, and take the result as the final dictionary. The ternary and quaternary segment filter methods resolve the problem of fragments that subjectively appear to be words but actually are not, improving the validity of the ternary and quaternary dictionaries.
The embodiments of the present invention have been explained in detail above in conjunction with the accompanying drawings, but the present invention is not limited to the above embodiments; various changes may be made within the knowledge of a person skilled in the art without departing from the inventive concept.
Claims (8)
1. A Chinese word segmentation method based on word association features, characterized in that:
a. a text to be processed is selected from a text library, and the text library is pre-processed, including removing symbols and forming the text into sentences; the sentences with symbols removed are used to build a corpus;
b. using the front-and-back splicing segmentation method, the corpus from step a is segmented to form segmentation fragments;
c. using the binary front-and-back cut-and-splice method, the ternary front-and-back cut-and-splice method and the quaternary front-and-back cut-and-splice method, a binary candidate dictionary, a ternary candidate dictionary and a quaternary candidate dictionary are formed;
d. word-frequency statistics are taken over the binary candidate words, ternary candidate words and quaternary candidate words in the binary, ternary and quaternary candidate dictionaries;
e. a word-frequency threshold is set for the counted candidate words and a decision is made: candidate words meeting the threshold are retained, forming a new corpus, and candidate words not meeting the threshold are deleted;
f. the degree of freedom and the cohesion degree of the candidate words in the corpus processed in step e are calculated; a unified degree-of-freedom threshold and cohesion threshold are applied to all candidate words and a decision is made: candidate words meeting the decision are retained, and candidate words not meeting it are deleted;
g. the screened ternary candidate words and quaternary candidate words are further filtered using the segment filter methods, forming a new dictionary.
2. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the front-and-back splicing method refers to continuously cutting a Chinese text into fragments starting from the first character, so that all of its possible word fragments are cut out, specifically:
the text content contained in a Chinese text is assumed to be {a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
carrying out binary cut-and-splice on the text set using the binary front-and-back cut-and-splice method gives the binary text fragment set {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
carrying out ternary cut-and-splice on the text set using the ternary front-and-back cut-and-splice method gives the ternary text fragment set {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
carrying out quaternary cut-and-splice on the text set using the quaternary front-and-back cut-and-splice method gives the quaternary text fragment set {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
3. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the degree of freedom refers to: when a text fragment appears in a variety of different texts, it has a left-adjacent character set and a right-adjacent character set; the left-adjacent character set is the set of characters that appear immediately to the left of the text fragment, and the right-adjacent character set is the set of characters that appear immediately to its right; the information entropy of a text fragment is obtained by computing the information entropies of the left- and right-adjacent character sets, and the smaller of the two entropies is taken as the degree of freedom.
4. The Chinese word segmentation method based on word association features according to claim 3, characterized in that: the degree of freedom is: in the obtained text fragment set, when a text fragment appears in a variety of different texts, it has a left-adjacent character set and a right-adjacent character set; the information entropy H of the text fragment is obtained from the information entropies of the left- and right-adjacent character sets, i.e. H = min{s', s''}, where H denotes the degree of freedom of the candidate word, s' denotes the right entropy of the candidate word and s'' denotes its left entropy; the smaller entropy of the left- and right-adjacent character sets is taken as the degree of freedom.
5. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the cohesion degree means that, in a text, the probability that a new word occurs as a whole is higher than the product of the probabilities of its component words, i.e. P(AB) > P(A)P(B); let M = P(AB) / (P(A)P(B)); the smallest M is taken as the cohesion degree, where AB denotes a new word, P(AB) denotes the probability that the new word occurs in the text, A and B denote the component words, and P(A) and P(B) denote the probabilities that the component words occur in the text.
6. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the cohesion degree of a candidate word is obtained by calculating the ratio of the probability of the candidate word in the corpus to the product of the independent probabilities of its parts, with the specific steps:
(1) the cohesion degree M2 of a binary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its components:

M2 = P(i,i+1) / (s_i · s_{i+1})

where M2 denotes the cohesion degree of the candidate word, s_i denotes the probability that the first character of the binary candidate word occurs in the corpus, s_{i+1} the probability that its second character occurs in the corpus, N_i the number of times its first character occurs in the corpus, N_{i+1} the number of times its second character occurs in the corpus, N_{i,i+1} the number of times the binary candidate word occurs in the corpus, N the total character count of the corpus, and P(i,i+1) the probability that the binary candidate word occurs in the corpus;
(2) the cohesion degree M3 of a ternary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its components:

M3 = min( P(i,i+1,i+2) / (s_i · s_{i+1,i+2}), P(i,i+1,i+2) / (s_{i,i+1} · s_{i+2}) )

where M3 denotes the cohesion degree of the candidate word, s_i denotes the probability that the first character of the ternary candidate word occurs in the corpus, s_{i+1,i+2} the probability that its last two characters occur together in the corpus, s_{i,i+1} the probability that its first two characters occur together in the corpus, s_{i+2} the probability that its last character occurs in the corpus, N_i the number of times its first character occurs in the corpus, N_{i+2} the number of times its third character occurs in the corpus, N_{i+1,i+2} the number of times its last two characters occur in the corpus, N_{i,i+1} the number of times its first two characters occur in the corpus, N_{i,i+1,i+2} the number of times the ternary candidate word occurs in the corpus, and P(i,i+1,i+2) the probability that the ternary candidate word occurs in the corpus;
(3) the cohesion degree M4 of a quaternary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its components:

M4 = min( P(i,i+1,i+2,i+3) / (S_i · S_{i+1,i+2,i+3}), P(i,i+1,i+2,i+3) / (S_{i,i+1} · S_{i+2,i+3}), P(i,i+1,i+2,i+3) / (S_{i,i+1,i+2} · S_{i+3}) )

where M4 denotes the cohesion degree of the candidate word, S_i denotes the probability that the first character of the quaternary candidate word occurs in the corpus, S_{i+1,i+2,i+3} the probability that its last three characters occur together in the corpus, S_{i,i+1,i+2} the probability that its first three characters occur together in the corpus, S_{i,i+1} the probability that its first two characters occur in the corpus, S_{i+2,i+3} the probability that its last two characters occur in the corpus, S_{i+3} the probability that its fourth character occurs in the corpus, N_i the number of times its first character occurs in the corpus, N_{i+3} the number of times its fourth character occurs in the corpus, N_{i,i+1} the number of times its first two characters occur in the corpus, N_{i+2,i+3} the number of times its last two characters occur in the corpus, N_{i+1,i+2,i+3} the number of times its last three characters occur in the corpus, N_{i,i+1,i+2} the number of times its first three characters occur in the corpus, and P(i,i+1,i+2,i+3) the probability that the quaternary candidate word occurs in the corpus.
7. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the filter method for ternary candidate words is: for a ternary candidate word, if its last two characters are present in the binary candidate dictionary, judge whether its first character forms a word with the left-adjacent character; if its first two characters are present in the binary candidate dictionary, judge whether its last character forms a word with the right-adjacent character; this determines whether the ternary candidate is a word;
if (A_{i-2}, A_{i-1}) ∉ {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)} and (A_{i+1}, A_{i+2}) ∉ {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)},
then (A_{i-1}A_iA_{i+1}) belongs to the ternary words;
where (A_{i-1}A_iA_{i+1}) is the ternary candidate word, A_{i-2} is a left-adjacent character of the ternary candidate word, A_{i+2} is a right-adjacent character of the ternary candidate word, {A_0, ..., A_i, ..., A_N} is the character set of the corpus, and {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)} is the binary candidate word set.
8. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the filter method for quaternary candidate words is: a quaternary candidate that may form a word is first split, the first two characters forming one segmentation fragment and the last two characters forming another; each fragment is matched against the binary dictionary already built, and a match qualifies the candidate as a pre-selected word; the middle two characters of the quaternary word are then split off and matched against the binary dictionary, and the absence of a match qualifies the candidate as a pre-selected word; if both conditions are met, the candidate is taken as a segmentation result:

{(A_{i-2},A_{i-1}), (A_i,A_{i+1})} ⊆ {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})} and (A_{i-1},A_i) ∉ {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})}

where (A_{i-2}A_{i-1}A_iA_{i+1}) denotes the quaternary candidate word, {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} denotes the front and back two-character fragments split from the quaternary candidate word, {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})} denotes the binary dictionary, and (A_{i-1},A_i) denotes the middle pair of the quaternary candidate word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711293044.8A CN108845982B (en) | 2017-12-08 | 2017-12-08 | Chinese word segmentation method based on word association characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108845982A true CN108845982A (en) | 2018-11-20 |
CN108845982B CN108845982B (en) | 2021-08-20 |
Family
ID=64211732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711293044.8A Active CN108845982B (en) | 2017-12-08 | 2017-12-08 | Chinese word segmentation method based on word association characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108845982B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622341A (en) * | 2012-04-20 | 2012-08-01 | 北京邮电大学 | Domain ontology concept automatic-acquisition method based on Bootstrapping technology |
CN105488098A (en) * | 2015-10-28 | 2016-04-13 | 北京理工大学 | Field difference based new word extraction method |
CN105955950A (en) * | 2016-04-29 | 2016-09-21 | 乐视控股(北京)有限公司 | New word discovery method and device |
CN106126495A (en) * | 2016-06-16 | 2016-11-16 | 北京捷通华声科技股份有限公司 | A kind of based on large-scale corpus prompter method and apparatus |
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device |
2017-12-08: application CN201711293044.8A filed; granted as CN108845982B (status: Active)
Non-Patent Citations (1)
Title |
---|
Wang Huixian et al., "Research on an improved forward maximum matching algorithm for Chinese word segmentation", Journal of Guizhou University (Natural Science Edition) * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN110334345A (en) * | 2019-06-17 | 2019-10-15 | 首都师范大学 | New word discovery method |
CN110287488A (en) * | 2019-06-18 | 2019-09-27 | 上海晏鼠计算机技术股份有限公司 | A kind of Chinese text segmenting method based on big data and Chinese feature |
CN110287493A (en) * | 2019-06-28 | 2019-09-27 | 中国科学技术信息研究所 | Risk phrase chunking method, apparatus, electronic equipment and storage medium |
CN110442861B (en) * | 2019-07-08 | 2023-04-07 | 万达信息股份有限公司 | Chinese professional term and new word discovery method based on real world statistics |
CN110442861A (en) * | 2019-07-08 | 2019-11-12 | 万达信息股份有限公司 | A method of Chinese technical term and new word discovery based on real world statistics |
CN111125329A (en) * | 2019-12-18 | 2020-05-08 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN111125329B (en) * | 2019-12-18 | 2023-07-21 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN112711944A (en) * | 2021-01-13 | 2021-04-27 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system and word segmentation device generation method and system |
CN112711944B (en) * | 2021-01-13 | 2023-03-10 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system, and word segmentation device generation method and system |
CN116431930A (en) * | 2023-06-13 | 2023-07-14 | 天津联创科技发展有限公司 | Technological achievement conversion data query method, system, terminal and storage medium |
CN116541527A (en) * | 2023-07-05 | 2023-08-04 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
CN116541527B (en) * | 2023-07-05 | 2023-09-29 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
Also Published As
Publication number | Publication date |
---|---|
CN108845982B (en) | 2021-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108845982A (en) | Chinese word segmentation method based on word association features | |
CN104572622B (en) | A kind of screening technique of term | |
CN104765769B (en) | The short text query expansion and search method of a kind of word-based vector | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN108287922B (en) | Text data viewpoint abstract mining method fusing topic attributes and emotional information | |
CN105786991B (en) | In conjunction with the Chinese emotion new word identification method and system of user feeling expression way | |
TWI518528B (en) | Method, apparatus and system for identifying target words | |
CN105975625A (en) | Chinglish inquiring correcting method and system oriented to English search engine | |
CN108509425A (en) | A kind of Chinese new word discovery method based on novel degree | |
CN104268160A (en) | Evaluation object extraction method based on domain dictionary and semantic roles | |
CN111190900B (en) | JSON data visualization optimization method in cloud computing mode | |
CN105069080B (en) | A kind of document retrieval method and system | |
CN108073571B (en) | Multi-language text quality evaluation method and system and intelligent text processing system | |
CN109947951A (en) | A kind of automatically updated emotion dictionary construction method for financial text analyzing | |
CN108763348A (en) | A kind of classification improved method of extension short text word feature vector | |
CN102662936A (en) | Chinese-English unknown words translating method blending Web excavation, multi-feature and supervised learning | |
CN113407842B (en) | Model training method, theme recommendation reason acquisition method and system and electronic equipment | |
CN108491512A (en) | The method of abstracting and device of headline | |
CN107688630A (en) | A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme | |
CN105956158B (en) | The method that network neologisms based on massive micro-blog text and user information automatically extract | |
CN109299248A (en) | A kind of business intelligence collection method based on natural language processing | |
US10970489B2 (en) | System for real-time expression of semantic mind map, and operation method therefor | |
CN116362243A (en) | Text key phrase extraction method, storage medium and device integrating incidence relation among sentences | |
CN109299463B (en) | Emotion score calculation method and related equipment | |
CN114048310A (en) | Dynamic intelligence event timeline extraction method based on LDA theme AP clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||