CN108845982A - Chinese word segmentation method based on word association features - Google Patents
Chinese word segmentation method based on word association features
- Publication number
- CN108845982A (application CN201711293044.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention relates to a Chinese word segmentation method based on word association features, belonging to the technical field of information processing. Text to be processed is selected from a text library, and the text library is preprocessed: symbols are removed and the remaining text is formed into sentences, which are used to build a corpus. Using the front-and-back splicing segmentation method, the corpus is segmented into fragments. Binary, ternary, and quaternary front-and-back splicing form a binary candidate dictionary, a ternary candidate dictionary, and a quaternary candidate dictionary. A word-frequency threshold is then set for the counted candidate words and applied as a decision: candidates satisfying the threshold are retained, forming a new corpus.
Description
Technical field
The present invention relates to a Chinese word segmentation method based on word association features, and belongs to the technical field of information processing.
Background art
Chinese word segmentation belongs to the field of natural language processing. In short, people can use their own knowledge to recognize which character strings are words and which are not, but how can a computer be made to do the same? That process is the task of a segmentation algorithm. Existing segmentation algorithms fall into three categories: segmentation methods based on understanding, segmentation methods based on string matching, and traditional segmentation methods based on statistics.
The understanding-based segmentation method has the computer simulate human comprehension of sentences in order to recognize words. Its basic idea is to perform syntactic and semantic analysis during segmentation, using syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a segmentation subsystem, a syntax-semantics subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity; that is, it simulates the human process of understanding a sentence. This segmentation method requires a large amount of linguistic knowledge and information. Because knowledge of the Chinese language is general and complex, it is difficult to organize the various kinds of linguistic information into a form a machine can read directly, so understanding-based segmentation systems are still at the experimental stage.
The string-matching segmentation method, also called the mechanical segmentation method, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by length-priority strategy, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation with tagging. The common mechanical segmentation methods are: (1) the forward maximum matching method (left to right); (2) the reverse maximum matching method (right to left); and (3) minimum cutting (minimizing the number of words cut from each sentence). These three mechanical methods can also be combined with one another; for example, the forward maximum matching method and the reverse maximum matching method can be combined into a bidirectional matching method. Because of the characteristics of Chinese word formation, forward minimum matching and reverse minimum matching are rarely used. In general, the cutting precision of reverse matching is slightly higher than that of forward matching, and it encounters fewer ambiguities. Practical segmentation systems all use mechanical segmentation as an initial step and then further improve cutting accuracy with various other kinds of linguistic information. One improvement refines the scanning method, known as feature scanning or feature cutting: words with obvious features are identified and cut out of the string to be analyzed first, and with these words as breakpoints the original string is divided into smaller strings before mechanical segmentation, reducing the matching error rate. Another improvement combines segmentation with part-of-speech tagging, using rich part-of-speech information to aid segmentation decisions, and in turn checking and adjusting the segmentation result during tagging, which greatly improves cutting accuracy.
Among the above string-matching (mechanical) segmentation methods, whether the forward maximum matching method, the reverse maximum matching method, or minimum cutting, the goal of maximum matching is to make each cut word match the longest possible dictionary entry. The advantage of maximum matching is that the principle is simple and easy to implement; the disadvantage is that the maximum matching length is hard to choose: if it is too large, time complexity rises, and if it is too small, words longer than that length cannot be matched, reducing segmentation accuracy. The guiding principle of maximum matching is "long words first". However, whether forward or reverse, adding or removing characters, existing maximum matching methods perform maximum matching only within a local range: each match considers only the first i or the last i characters, which does not fully embody the "long words first" principle.
The principle of the traditional statistics-based segmentation method is that, formally, a word is a stable combination of characters; therefore, in context, the more often adjacent characters co-occur, the more likely they are to constitute a word. The frequency or probability of adjacent character co-occurrence thus reflects the credibility of word formation. The frequency of every adjacent character combination co-occurring in a corpus can be counted and their mutual information computed. The mutual information of two characters is defined from the adjacent co-occurrence probability of the two Chinese characters X and Y, and it embodies the tightness of the bond between the characters. When the tightness exceeds some threshold, the character group may be judged to constitute a word. This method only needs to count character-group frequencies in the corpus and requires no cutting dictionary, so it is also called dictionary-free segmentation or statistical word extraction. But this method also has limitations: it often extracts character groups that co-occur frequently but are not words, such as "this", "one of", "having", "my", and "many"; its recognition accuracy for common words is poor; and its time and space overhead is large.
Summary of the invention
The technical problem to be solved by the present invention is to provide a Chinese word segmentation method based on word association features, in order to overcome the defect of the prior art that words cannot be effectively identified and extracted from a large-scale corpus, and to enable a computer system to effectively identify and extract words from a large-scale corpus.
The technical scheme of the invention is a Chinese word segmentation method based on word association features:
A. Select the text to be processed from a text library and preprocess the text library: remove symbols and form the remaining text into sentences, and use the symbol-stripped sentences to build a corpus;
B. Segment the corpus of step A with the front-and-back splicing segmentation method to form segmentation fragments;
C. Apply the binary, ternary, and quaternary front-and-back splicing methods to form a binary candidate dictionary, a ternary candidate dictionary, and a quaternary candidate dictionary;
D. Perform word-frequency statistics on the binary, ternary, and quaternary candidate words in the three candidate dictionaries;
E. Set a word-frequency threshold for the counted candidate words and apply it as a decision: candidate words that meet the threshold are retained, forming a new corpus, and candidate words that do not meet it are deleted;
F. Compute the freedom degree and coagulation degree of the candidate words in the corpus processed in step E, give all candidate words a uniform freedom-degree threshold and coagulation-degree threshold, and apply the decision: candidate words that satisfy it are retained, and those that do not are deleted;
G. Apply the segmentation filtering method to further filter the screened ternary and quaternary candidate words, forming the new dictionary.
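Steps A through E can be sketched as a minimal pipeline. The sample corpus, the punctuation pattern, and the threshold value below are illustrative assumptions, not taken from the patent.

```python
import re
from collections import Counter

def preprocess(text):
    # Step A: strip symbols and split the raw text into sentences.
    return [s for s in re.split(r"[，。！？；,.!?;\s]+", text) if s]

def ngram_candidates(sentences, n):
    # Steps B-D: slide a window of width n over each sentence to form
    # the n-gram candidate fragments and count their frequencies.
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

def frequency_filter(counts, threshold):
    # Step E: keep only candidates whose frequency meets the threshold.
    return {w: c for w, c in counts.items() if c >= threshold}

sentences = preprocess("我们学习中文分词。中文分词很有用。学习很有用。")
bigram_counts = ngram_candidates(sentences, 2)
kept = frequency_filter(bigram_counts, 2)
```

The same `ngram_candidates` call with n = 3 and n = 4 would populate the ternary and quaternary candidate dictionaries of step C.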
The front-and-back splicing method cuts a Chinese text continuously starting from its first character and splices adjacent characters into fragments. Specifically:
Assume the content of a Chinese text is the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a single character in the text and n ∈ N;
Applying the binary front-and-back splicing method to the text yields the binary text-fragment set:
{(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Applying the ternary front-and-back splicing method yields the ternary text-fragment set:
{(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Applying the quaternary front-and-back splicing method yields the quaternary text-fragment set:
{(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
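The three fragment sets above can be produced with one sliding-window helper; the sample string is an arbitrary stand-in for the character sequence {a_i, ..., a_{i+n}}.

```python
def splice(chars, n):
    # Front-and-back splicing: every run of n consecutive characters
    # becomes one candidate fragment.
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

chars = "abcdef"                # stands in for a_i ... a_{i+5}
binary = splice(chars, 2)       # (a_i a_{i+1}), (a_{i+1} a_{i+2}), ...
ternary = splice(chars, 3)      # (a_i a_{i+1} a_{i+2}), ...
quaternary = splice(chars, 4)   # (a_i a_{i+1} a_{i+2} a_{i+3}), ...
```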
The freedom degree refers to the following: a text fragment can appear in many different contexts, and thus has a left-adjacent character set and a right-adjacent character set. The left-adjacent set is the set of characters that appear immediately to the left of the text fragment, and the right-adjacent set is the set of characters that appear immediately to its right. The information entropies of the left-adjacent set and the right-adjacent set are computed, and the freedom degree of the fragment is the smaller of the two entropies:
H = min{s', s''}
where H denotes the freedom degree of the candidate word, s' denotes the right entropy of the candidate word, and s'' denotes the left entropy of the candidate word.
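The freedom-degree definition above can be sketched as follows. Natural-log entropy is an assumption here (the patent does not fix the logarithm base), and the corpus string is invented for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    # Information entropy of a neighbor-character frequency table.
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def freedom_degree(corpus, fragment):
    # H = min{s', s''}: the smaller of the right- and left-neighbor entropies.
    left, right = Counter(), Counter()
    start = corpus.find(fragment)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(fragment)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(fragment, start + 1)
    if not left or not right:
        return 0.0  # a fragment pinned to a text edge has no neighbor entropy
    return min(entropy(left), entropy(right))
```

A fragment whose neighbors are varied (high entropy on both sides) is "free" and likely a word boundary-respecting unit; a fragment always preceded or followed by the same character scores low.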
The coagulation degree refers to the fact that, within a text, the probability of a new word occurring as a whole is higher than the product of the probabilities of its component parts, i.e. P(AB) > P(A)P(B). Let
M = P(AB) / (P(A) P(B))
and take the smallest M as the coagulation degree, where AB denotes a new word, P(AB) denotes the probability that the new word occurs in the text, A and B denote the parts whose combination produces the word, and P(A) and P(B) denote the probabilities that those parts occur in the text.
The coagulation degree of a candidate word is obtained by computing, in the corpus, the ratio of the candidate word's probability to the product of the probabilities of its component parts. The specific steps are:
(1) The coagulation degree M_2 of a binary candidate word is the ratio of the candidate word's probability to the product of its characters' probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N
where M_2 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the binary candidate word occurs in the corpus, N_i denotes the number of times the first character of the binary candidate word occurs in the corpus, N_{i+1} denotes the number of times its second character occurs in the corpus, N_{i,i+1} denotes the number of times the binary candidate word occurs in the corpus, N denotes the total number of characters in the corpus, s_{i+1} denotes the probability that the second character of the binary candidate word occurs in the corpus, and p(i, i+1) denotes the probability that the binary candidate word occurs in the corpus;
(2) The coagulation degree M_3 of a ternary candidate word is the smaller of the ratios over its two split points:
M_3 = min{ p(i, i+1, i+2) / (s_i · s_{i+1,i+2}), p(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) }
where M_3 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the ternary candidate word occurs in the corpus, s_{i+1,i+2} denotes the probability that the last two characters of the ternary candidate word occur together in the corpus, s_{i,i+1} denotes the probability that its first two characters occur together in the corpus, s_{i+2} denotes the probability that its last character occurs in the corpus, N_i denotes the number of times the first character of the ternary candidate word occurs in the corpus, N_{i+2} denotes the number of times its third character occurs in the corpus, N_{i+1,i+2} denotes the number of times its last two characters occur in the corpus, N_{i,i+1} denotes the number of times its first two characters occur in the corpus, N_{i,i+1,i+2} denotes the number of times the ternary candidate word occurs in the corpus, and p(i, i+1, i+2) denotes the probability that the ternary candidate word occurs in the corpus;
(3) The coagulation degree M_4 of a quaternary candidate word is the smallest of the ratios over its three split points:
M_4 = min{ p(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), p(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), p(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) }
where M_4 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the quaternary candidate word occurs in the corpus, s_{i+1,i+2,i+3} denotes the probability that its last three characters occur together in the corpus, s_{i,i+1,i+2} denotes the probability that its first three characters occur together in the corpus, s_{i,i+1} denotes the probability that its first two characters occur together in the corpus, s_{i+2,i+3} denotes the probability that its last two characters occur together in the corpus, s_{i+3} denotes the probability that its last character occurs in the corpus, N_i denotes the number of times the first character of the quaternary candidate word occurs in the corpus, N_{i+3} denotes the number of times its fourth character occurs in the corpus, N_{i,i+1} denotes the number of times its first two characters occur in the corpus, N_{i+2,i+3} denotes the number of times its last two characters occur in the corpus, N_{i+1,i+2,i+3} denotes the number of times its last three characters occur in the corpus, N_{i,i+1,i+2} denotes the number of times its first three characters occur in the corpus, and p(i, i+1, i+2, i+3) denotes the probability that the quaternary candidate word occurs in the corpus.
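The M_2-M_4 computation can be sketched with one function that takes the minimum ratio over every split point; the corpus and words below are toy stand-ins.

```python
def occurrences(text, s):
    # Count (possibly overlapping) occurrences of substring s.
    return sum(1 for i in range(len(text) - len(s) + 1) if text[i:i + len(s)] == s)

def coagulation(corpus, word):
    # M = p(word) / (p(left part) * p(right part)), minimised over all splits;
    # each probability is an occurrence count divided by the corpus length N.
    n = len(corpus)
    p = lambda s: occurrences(corpus, s) / n
    p_word = p(word)
    if p_word == 0.0:
        return 0.0
    return min(p_word / (p(word[:k]) * p(word[k:])) for k in range(1, len(word)))
```

For a two-character word there is a single split, giving M_2; three- and four-character words take the minimum over two and three splits, matching M_3 and M_4.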
The filtering method for ternary candidate words is: for a ternary candidate word, if its last two characters exist in the binary candidate dictionary, judge whether its first character forms a word with the left-adjacent character; if its first two characters exist in the binary candidate dictionary, judge whether its last character forms a word with the right-adjacent character; this determines whether the ternary candidate is a word.
If (A_{i-2} A_{i-1}) ∉ {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} and (A_{i+1} A_{i+2}) ∉ {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})},
then (A_{i-1} A_i A_{i+1}) belongs to the ternary words;
where (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-adjacent character of the ternary candidate word, A_{i+2} is its right-adjacent character, {A_0 ... A_i ... A_N} is the character set of the corpus, and {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} is the binary candidate word set.
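The edge-binding test can be sketched as follows. The decision rule is a reconstruction of the condition described above (the original formula image is absent), and the dictionary contents are invented for illustration.

```python
def keep_ternary(candidate, left_char, right_char, binary_dict):
    # Reconstructed filter: the ternary candidate survives only when its first
    # character does not form a binary word with the left neighbor and its
    # last character does not form one with the right neighbor.
    binds_left = (left_char + candidate[0]) in binary_dict
    binds_right = (candidate[-1] + right_char) in binary_dict
    return not (binds_left or binds_right)
```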
The filtering method for quaternary candidate words is: for a quaternary candidate word that may be a word, first split it so that its first two characters form one segmentation fragment and its last two characters form another, and match each fragment against the binary dictionary already built; if they do not match, the candidate is taken as a pre-selected word. Then split out the middle two characters of the quaternary word and match them against the binary dictionary; if they do not match, the candidate is again taken as a pre-selected word. If both conditions are satisfied, the candidate is taken as a segmentation result:
{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊄ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} and (A_{i-1}, A_i) ∉ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})}
where {A_{i-2}, (A_{i-1} A_i A_{i+1}), A_{i+2}} denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} denotes the front and back character pairs split from the quaternary candidate word, {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} denotes the binary dictionary, and (A_{i-1}, A_i) denotes the middle pair of the quaternary candidate word.
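One plausible reading of the two-condition filter can be sketched as follows; the original condition images are absent, so this interpretation (a candidate is kept only when no binary decomposition explains it) is an assumption, as is the toy dictionary.

```python
def keep_quaternary(candidate, binary_dict):
    # Assumed reading: a four-character candidate is kept only when neither
    # its front pair, its back pair, nor its middle pair is already a word
    # in the binary dictionary.
    front, back, middle = candidate[:2], candidate[2:], candidate[1:3]
    return (front not in binary_dict
            and back not in binary_dict
            and middle not in binary_dict)
```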
The beneficial effects of the invention are: the method provided by the present invention has comparatively high correctness and validity, and the system can segment words efficiently. The coagulation degree, freedom degree, and the ternary and quaternary segmentation methods designed in the present invention solve well the problems of traditional statistics-based segmentation methods.
Brief description of the drawings
Fig. 1 is a flow diagram of the present invention.
Detailed description of the embodiments
The invention will be further described below with reference to the accompanying drawings and specific embodiments.
Embodiment 1: As shown in Fig. 1, a Chinese word segmentation method based on word association features:
A. Select the text to be processed from a text library and preprocess the text library: remove symbols and form the remaining text into sentences, and use the symbol-stripped sentences to build a corpus;
B. Segment the corpus of step A with the front-and-back splicing segmentation method to form segmentation fragments;
C. Apply the binary, ternary, and quaternary front-and-back splicing methods to form a binary candidate dictionary, a ternary candidate dictionary, and a quaternary candidate dictionary;
D. Perform word-frequency statistics on the binary, ternary, and quaternary candidate words in the three candidate dictionaries;
E. Set a word-frequency threshold for the counted candidate words and apply it as a decision: candidate words that meet the threshold are retained, forming a new corpus, and candidate words that do not meet it are deleted;
F. Compute the freedom degree and coagulation degree of the candidate words in the corpus processed in step E, give all candidate words a uniform freedom-degree threshold and coagulation-degree threshold, and apply the decision: candidate words that satisfy it are retained, and those that do not are deleted;
G. Apply the segmentation filtering method to further filter the screened ternary and quaternary candidate words, forming the new dictionary.
The front-and-back splicing method cuts a Chinese text continuously starting from its first character and splices adjacent characters into fragments. Specifically:
Assume the content of a Chinese text is the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a single character in the text and n ∈ N;
Applying the binary front-and-back splicing method to the text yields the binary text-fragment set:
{(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Applying the ternary front-and-back splicing method yields the ternary text-fragment set:
{(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Applying the quaternary front-and-back splicing method yields the quaternary text-fragment set:
{(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
The freedom degree refers to the following: a text fragment can appear in many different contexts, and thus has a left-adjacent character set and a right-adjacent character set. The left-adjacent set is the set of characters that appear immediately to the left of the text fragment, and the right-adjacent set is the set of characters that appear immediately to its right. The information entropies of the two sets are computed, and the smaller entropy is taken as the freedom degree.
In the obtained text-fragment set, the left-adjacent character set of a text fragment is the set of characters appearing immediately to its left; for example, the fragment (a_{i+1} a_{i+2}) in the text {a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}} has the left-adjacent set {a_i}. The right-adjacent character set of a text fragment is the set of characters appearing immediately to its right; for the same fragment, the right-adjacent set is {a_{i+3}}.
The freedom degree of a candidate word is obtained by computing the information entropies of its left-adjacent and right-adjacent character sets and taking the smaller of the two:
H = min{s', s''}
where H denotes the freedom degree of the candidate word, and s' denotes the right entropy of the candidate word:
s' = -Σ_{i=1}^{k} p(b_i) log p(b_i), with p(b_i) = n_{b_i} / Σ_{j=1}^{k} n_{b_j}
where b_i belongs to the right-adjacent character set of the candidate word, n_{b_i} denotes the frequency with which b_i appears to the right of the candidate word, and k denotes the number of distinct characters in the right-adjacent set; s'' denotes the left entropy of the candidate word:
s'' = -Σ_{i=1}^{M} p(m_i) log p(m_i), with p(m_i) = n_{m_i} / Σ_{j=1}^{M} n_{m_j}
where m_i belongs to the left-adjacent character set of the candidate word, n_{m_i} denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of distinct characters in the left-adjacent set.
The coagulation degree refers to the fact that, within a text, the probability of a new word occurring as a whole is higher than the product of the probabilities of its component parts, i.e. P(AB) > P(A)P(B). Let
M = P(AB) / (P(A) P(B))
and take the smallest M as the coagulation degree, where AB denotes a new word, P(AB) denotes the probability that the new word occurs in the text, A and B denote the parts whose combination produces the word, and P(A) and P(B) denote the probabilities that those parts occur in the text.
The coagulation degree of a candidate word is obtained by computing, in the corpus, the ratio of the candidate word's probability to the product of the probabilities of its component parts. The specific steps are:
(1) The coagulation degree M_2 of a binary candidate word is the ratio of the candidate word's probability to the product of its characters' probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N
where M_2 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the binary candidate word occurs in the corpus, N_i denotes the number of times the first character of the binary candidate word occurs in the corpus, N_{i+1} denotes the number of times its second character occurs in the corpus, N_{i,i+1} denotes the number of times the binary candidate word occurs in the corpus, N denotes the total number of characters in the corpus, s_{i+1} denotes the probability that the second character of the binary candidate word occurs in the corpus, and p(i, i+1) denotes the probability that the binary candidate word occurs in the corpus;
(2) The coagulation degree M_3 of a ternary candidate word is the smaller of the ratios over its two split points:
M_3 = min{ p(i, i+1, i+2) / (s_i · s_{i+1,i+2}), p(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) }
where M_3 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the ternary candidate word occurs in the corpus, s_{i+1,i+2} denotes the probability that the last two characters of the ternary candidate word occur together in the corpus, s_{i,i+1} denotes the probability that its first two characters occur together in the corpus, s_{i+2} denotes the probability that its last character occurs in the corpus, N_i denotes the number of times the first character of the ternary candidate word occurs in the corpus, N_{i+2} denotes the number of times its third character occurs in the corpus, N_{i+1,i+2} denotes the number of times its last two characters occur in the corpus, N_{i,i+1} denotes the number of times its first two characters occur in the corpus, N_{i,i+1,i+2} denotes the number of times the ternary candidate word occurs in the corpus, and p(i, i+1, i+2) denotes the probability that the ternary candidate word occurs in the corpus;
(3) The coagulation degree M_4 of a quaternary candidate word is the smallest of the ratios over its three split points:
M_4 = min{ p(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), p(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), p(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) }
where M_4 denotes the coagulation degree of the candidate word, s_i denotes the probability that the first character of the quaternary candidate word occurs in the corpus, s_{i+1,i+2,i+3} denotes the probability that its last three characters occur together in the corpus, s_{i,i+1,i+2} denotes the probability that its first three characters occur together in the corpus, s_{i,i+1} denotes the probability that its first two characters occur together in the corpus, s_{i+2,i+3} denotes the probability that its last two characters occur together in the corpus, s_{i+3} denotes the probability that its last character occurs in the corpus, N_i denotes the number of times the first character of the quaternary candidate word occurs in the corpus, N_{i+3} denotes the number of times its fourth character occurs in the corpus, N_{i,i+1} denotes the number of times its first two characters occur in the corpus, N_{i+2,i+3} denotes the number of times its last two characters occur in the corpus, N_{i+1,i+2,i+3} denotes the number of times its last three characters occur in the corpus, N_{i,i+1,i+2} denotes the number of times its first three characters occur in the corpus, and p(i, i+1, i+2, i+3) denotes the probability that the quaternary candidate word occurs in the corpus.
The filter method for ternary candidate words is: for a ternary candidate word, if its last two characters are present in the binary candidate dictionary, judge whether its first character forms a word with the left-adjacent character; if its first two characters are present in the binary candidate dictionary, judge whether its last character forms a word with the right-adjacent character; this determines whether the ternary candidate is a word.

If (A_{i-2}, A_{i-1}) ∉ {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)} and (A_{i+1}, A_{i+2}) ∉ {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)},

then (A_{i-1}A_iA_{i+1}) belongs to the ternary words;

where (A_{i-1}A_iA_{i+1}) is the ternary candidate word, A_{i-2} is a left-adjacent character of the ternary candidate word, A_{i+2} is a right-adjacent character of the ternary candidate word, {A_0, ..., A_i, ..., A_N} is the character set of the corpus, and {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)} is the binary candidate word set.
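A minimal Python sketch of this ternary filter, under the reading that a trigram survives only when neither flanking character pair is itself a binary candidate word (function and variable names are illustrative, not from the patent):

```python
def filter_trigram(left_char, trigram, right_char, bigram_dict):
    """Keep a trigram A[i-1]A[i]A[i+1] only if its first character does not
    bind to the left neighbour and its last character does not bind to the
    right neighbour, i.e. neither flanking pair is a known bigram."""
    binds_left = (left_char + trigram[0]) in bigram_dict
    binds_right = (trigram[2] + right_char) in bigram_dict
    return not binds_left and not binds_right
```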
The filter method for quaternary candidate words is: a quaternary candidate that may form a word is first split, the first two characters forming one segmentation fragment and the last two characters forming another; each fragment is matched against the binary dictionary already built, and a match qualifies the candidate as a pre-selected word. The middle two characters of the quaternary word are then split off and matched against the binary dictionary, and the absence of a match qualifies the candidate as a pre-selected word. If both conditions are met, the candidate is taken as a segmentation result:

{(A_{i-2},A_{i-1}), (A_i,A_{i+1})} ⊆ {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})} and (A_{i-1},A_i) ∉ {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})}

where (A_{i-2}A_{i-1}A_iA_{i+1}) denotes the quaternary candidate word, {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} denotes the front and back two-character fragments split from the quaternary candidate word, {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})} denotes the binary dictionary, and (A_{i-1},A_i) denotes the middle pair of the quaternary candidate word.
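The quaternary filter can be sketched the same way: both outer halves must match the binary dictionary while the middle pair must not (names are illustrative, not from the patent):

```python
def filter_quadgram(quad, bigram_dict):
    """Keep a 4-gram only when both its outer two-character halves are known
    bigrams but the middle pair is not, so the 4-gram cannot be re-segmented
    through the middle."""
    front, back = quad[:2], quad[2:]
    middle = quad[1:3]
    return front in bigram_dict and back in bigram_dict and middle not in bigram_dict
```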
Embodiment 2: As shown in Figure 1, a text to be processed is selected from the text library, and the text library is pre-processed, including removing symbols and forming the text into sentences; the sentences with symbols removed are used to build the corpus.
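A minimal sketch of this pre-processing step in Python, assuming that "symbols" are anything outside the CJK Unified Ideographs range and that common Chinese sentence-final punctuation delimits sentences (both assumptions are mine, not stated in the patent):

```python
import re

def build_corpus(text):
    """Split the raw text into sentences at Chinese sentence-final
    punctuation, then strip every non-CJK character from each sentence."""
    sentences = re.split(r"[。！？；!?;\n]+", text)
    cleaned = []
    for s in sentences:
        s = re.sub(r"[^\u4e00-\u9fff]", "", s)  # keep CJK characters only
        if s:
            cleaned.append(s)
    return cleaned
```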
Using the front-and-back splicing segmentation method, the corpus from step a1 is segmented to form segmentation fragments.
Using the binary front-and-back cut-and-splice method, the ternary front-and-back cut-and-splice method and the quaternary front-and-back cut-and-splice method, a binary candidate dictionary, a ternary candidate dictionary and a quaternary candidate dictionary are formed.
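The cut-and-splice step amounts to sliding an n-character window over each sentence; a sketch with illustrative names:

```python
def cut_and_splice(sentence, n):
    """Slide an n-character window over the sentence to produce the
    front-and-back spliced n-character fragments."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

def candidate_fragments(corpus, n):
    """Collect the n-character fragments of every sentence in the corpus;
    n = 2, 3, 4 gives the binary, ternary and quaternary candidates."""
    frags = []
    for sentence in corpus:
        frags.extend(cut_and_splice(sentence, n))
    return frags
```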
A word-frequency threshold is set for the candidate words whose frequencies have been counted, and a decision is made: candidates satisfying the threshold are retained, forming a new corpus.
The cohesion degree and degree of freedom of the candidate words are then computed. In this embodiment, the cohesion degree of a candidate word can be obtained by calculating the ratio of the candidate word's probability in the corpus to the product of the independent probabilities of its parts; the degree of freedom of a candidate word can be obtained by calculating the information entropies of its left- and right-adjacent character sets and taking the smaller as the degree of freedom.
The cohesion degree and degree of freedom of each candidate word are compared with the set thresholds, and the candidate words exceeding the thresholds are extracted as the candidate dictionary.
In this embodiment, the novel Journey to the West, one of the Four Great Classical Novels, was collected as the corpus. In the counted dictionary, if a word-forming text fragment is well distributed, its cohesion degree is higher than that of fragments which do not form words, and its degree of freedom is larger. If the characters adjacent to a word are regarded as random variables, the information entropy of the left- and right-adjacent character sets of a word reflects the randomness of its neighbours: the larger the entropy, the richer the word's left- or right-adjacent character set. We take the smaller of the two entropies as the degree of freedom.
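The degree of freedom described here, the smaller of the left- and right-neighbour entropies, can be sketched in Python (function names and the toy corpus are illustrative):

```python
import math
from collections import Counter

def entropy(neighbours):
    """Shannon entropy of a list of neighbouring characters."""
    counts = Counter(neighbours)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def freedom_degree(word, corpus):
    """Degree of freedom = min(left-neighbour entropy, right-neighbour entropy)."""
    left, right = [], []
    for sent in corpus:
        start = sent.find(word)
        while start != -1:
            if start > 0:
                left.append(sent[start - 1])
            end = start + len(word)
            if end < len(sent):
                right.append(sent[end])
            start = sent.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(entropy(left), entropy(right))
```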
In this embodiment, a word-forming fragment has a higher cohesion degree, indicating a stronger degree of association between its characters; when calculating the cohesion degree, we take the smallest value over the possible splits as the final cohesion degree.
We take the dictionary obtained from the degree-of-freedom and cohesion statistics as the candidate dictionary, then process the ternary candidate dictionary and the quaternary candidate dictionary with the ternary segment filter method and the quaternary segment filter method, and take the result as the final dictionary. The ternary and quaternary segment filter methods resolve the problem of fragments that subjectively appear to be words but actually are not, improving the validity of the ternary and quaternary dictionaries.
The embodiments of the present invention have been explained in detail above in conjunction with the accompanying drawings, but the present invention is not limited to the above embodiments; various changes may be made within the knowledge of a person skilled in the art without departing from the inventive concept.
Claims (8)
1. A Chinese word segmentation method based on word association features, characterized in that:
a. a text to be processed is selected from a text library, and the text library is pre-processed, including removing symbols and forming the text into sentences; the sentences with symbols removed are used to build a corpus;
b. using the front-and-back splicing segmentation method, the corpus from step a is segmented to form segmentation fragments;
c. using the binary front-and-back cut-and-splice method, the ternary front-and-back cut-and-splice method and the quaternary front-and-back cut-and-splice method, a binary candidate dictionary, a ternary candidate dictionary and a quaternary candidate dictionary are formed;
d. word-frequency statistics are taken over the binary candidate words, ternary candidate words and quaternary candidate words in the binary, ternary and quaternary candidate dictionaries;
e. a word-frequency threshold is set for the counted candidate words and a decision is made: candidate words meeting the threshold are retained, forming a new corpus, and candidate words not meeting the threshold are deleted;
f. the degree of freedom and the cohesion degree of the candidate words in the corpus processed in step e are calculated; a unified degree-of-freedom threshold and cohesion threshold are applied to all candidate words and a decision is made: candidate words meeting the decision are retained, and candidate words not meeting it are deleted;
g. the screened ternary candidate words and quaternary candidate words are further filtered using the segment filter methods, forming a new dictionary.
2. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the front-and-back splicing method refers to continuously cutting a Chinese text into fragments starting from the first character, so that all of its possible word fragments are cut out, specifically:
the text content contained in a Chinese text is assumed to be {a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
carrying out binary cut-and-splice on the text set using the binary front-and-back cut-and-splice method gives the binary text fragment set {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
carrying out ternary cut-and-splice on the text set using the ternary front-and-back cut-and-splice method gives the ternary text fragment set {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
carrying out quaternary cut-and-splice on the text set using the quaternary front-and-back cut-and-splice method gives the quaternary text fragment set {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
3. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the degree of freedom refers to: when a text fragment appears in a variety of different texts, it has a left-adjacent character set and a right-adjacent character set; the left-adjacent character set is the set of characters that appear immediately to the left of the text fragment, and the right-adjacent character set is the set of characters that appear immediately to its right; the information entropy of a text fragment is obtained by computing the information entropies of the left- and right-adjacent character sets, and the smaller of the two entropies is taken as the degree of freedom.
4. The Chinese word segmentation method based on word association features according to claim 3, characterized in that: the degree of freedom is: in the obtained text fragment set, when a text fragment appears in a variety of different texts, it has a left-adjacent character set and a right-adjacent character set; the information entropy H of the text fragment is obtained from the information entropies of the left- and right-adjacent character sets, i.e. H = min{s', s''}, where H denotes the degree of freedom of the candidate word, s' denotes the right entropy of the candidate word and s'' denotes its left entropy; the smaller entropy of the left- and right-adjacent character sets is taken as the degree of freedom.
5. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the cohesion degree means that, in a text, the probability that a new word occurs as a whole is higher than the product of the probabilities of its component words, i.e. P(AB) > P(A)P(B); let M = P(AB) / (P(A)P(B)); the smallest M is taken as the cohesion degree, where AB denotes a new word, P(AB) denotes the probability that the new word occurs in the text, A and B denote the component words, and P(A) and P(B) denote the probabilities that the component words occur in the text.
6. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the cohesion degree of a candidate word is obtained by calculating the ratio of the probability of the candidate word in the corpus to the product of the independent probabilities of its parts, with the specific steps:
(1) the cohesion degree M2 of a binary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its components:

M2 = P(i,i+1) / (s_i · s_{i+1})

where M2 denotes the cohesion degree of the candidate word, s_i denotes the probability that the first character of the binary candidate word occurs in the corpus, s_{i+1} the probability that its second character occurs in the corpus, N_i the number of times its first character occurs in the corpus, N_{i+1} the number of times its second character occurs in the corpus, N_{i,i+1} the number of times the binary candidate word occurs in the corpus, N the total character count of the corpus, and P(i,i+1) the probability that the binary candidate word occurs in the corpus;
(2) the cohesion degree M3 of a ternary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its components:

M3 = min( P(i,i+1,i+2) / (s_i · s_{i+1,i+2}), P(i,i+1,i+2) / (s_{i,i+1} · s_{i+2}) )

where M3 denotes the cohesion degree of the candidate word, s_i denotes the probability that the first character of the ternary candidate word occurs in the corpus, s_{i+1,i+2} the probability that its last two characters occur together in the corpus, s_{i,i+1} the probability that its first two characters occur together in the corpus, s_{i+2} the probability that its last character occurs in the corpus, N_i the number of times its first character occurs in the corpus, N_{i+2} the number of times its third character occurs in the corpus, N_{i+1,i+2} the number of times its last two characters occur in the corpus, N_{i,i+1} the number of times its first two characters occur in the corpus, N_{i,i+1,i+2} the number of times the ternary candidate word occurs in the corpus, and P(i,i+1,i+2) the probability that the ternary candidate word occurs in the corpus;
(3) the cohesion degree M4 of a quaternary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its components:

M4 = min( P(i,i+1,i+2,i+3) / (S_i · S_{i+1,i+2,i+3}), P(i,i+1,i+2,i+3) / (S_{i,i+1} · S_{i+2,i+3}), P(i,i+1,i+2,i+3) / (S_{i,i+1,i+2} · S_{i+3}) )

where M4 denotes the cohesion degree of the candidate word, S_i denotes the probability that the first character of the quaternary candidate word occurs in the corpus, S_{i+1,i+2,i+3} the probability that its last three characters occur together in the corpus, S_{i,i+1,i+2} the probability that its first three characters occur together in the corpus, S_{i,i+1} the probability that its first two characters occur in the corpus, S_{i+2,i+3} the probability that its last two characters occur in the corpus, S_{i+3} the probability that its fourth character occurs in the corpus, N_i the number of times its first character occurs in the corpus, N_{i+3} the number of times its fourth character occurs in the corpus, N_{i,i+1} the number of times its first two characters occur in the corpus, N_{i+2,i+3} the number of times its last two characters occur in the corpus, N_{i+1,i+2,i+3} the number of times its last three characters occur in the corpus, N_{i,i+1,i+2} the number of times its first three characters occur in the corpus, and P(i,i+1,i+2,i+3) the probability that the quaternary candidate word occurs in the corpus.
7. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the filter method for ternary candidate words is: for a ternary candidate word, if its last two characters are present in the binary candidate dictionary, judge whether its first character forms a word with the left-adjacent character; if its first two characters are present in the binary candidate dictionary, judge whether its last character forms a word with the right-adjacent character; this determines whether the ternary candidate is a word;
if (A_{i-2}, A_{i-1}) ∉ {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)} and (A_{i+1}, A_{i+2}) ∉ {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)},
then (A_{i-1}A_iA_{i+1}) belongs to the ternary words;
where (A_{i-1}A_iA_{i+1}) is the ternary candidate word, A_{i-2} is a left-adjacent character of the ternary candidate word, A_{i+2} is a right-adjacent character of the ternary candidate word, {A_0, ..., A_i, ..., A_N} is the character set of the corpus, and {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{N-1},A_N)} is the binary candidate word set.
8. The Chinese word segmentation method based on word association features according to claim 1, characterized in that: the filter method for quaternary candidate words is: a quaternary candidate that may form a word is first split, the first two characters forming one segmentation fragment and the last two characters forming another; each fragment is matched against the binary dictionary already built, and a match qualifies the candidate as a pre-selected word; the middle two characters of the quaternary word are then split off and matched against the binary dictionary, and the absence of a match qualifies the candidate as a pre-selected word; if both conditions are met, the candidate is taken as a segmentation result:

{(A_{i-2},A_{i-1}), (A_i,A_{i+1})} ⊆ {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})} and (A_{i-1},A_i) ∉ {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})}

where (A_{i-2}A_{i-1}A_iA_{i+1}) denotes the quaternary candidate word, {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} denotes the front and back two-character fragments split from the quaternary candidate word, {(A_0A_1), ..., (A_iA_{i+1}), ..., (A_NA_{N+1})} denotes the binary dictionary, and (A_{i-1},A_i) denotes the middle pair of the quaternary candidate word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711293044.8A CN108845982B (en) | 2017-12-08 | 2017-12-08 | Chinese word segmentation method based on word association characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108845982A true CN108845982A (en) | 2018-11-20 |
CN108845982B CN108845982B (en) | 2021-08-20 |
Family
ID=64211732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711293044.8A Active CN108845982B (en) | 2017-12-08 | 2017-12-08 | Chinese word segmentation method based on word association characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108845982B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622341A (en) * | 2012-04-20 | 2012-08-01 | 北京邮电大学 | Domain ontology concept automatic-acquisition method based on Bootstrapping technology |
CN105488098A (en) * | 2015-10-28 | 2016-04-13 | 北京理工大学 | Field difference based new word extraction method |
CN105955950A (en) * | 2016-04-29 | 2016-09-21 | 乐视控股(北京)有限公司 | New word discovery method and device |
CN106126495A (en) * | 2016-06-16 | 2016-11-16 | 北京捷通华声科技股份有限公司 | A kind of based on large-scale corpus prompter method and apparatus |
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device |
2017-12-08: application CN201711293044.8A filed; granted as CN108845982B (status: Active)
Non-Patent Citations (1)
Title |
---|
Wang Huixian et al., "Research on an improved forward maximum matching algorithm for Chinese word segmentation", Journal of Guizhou University (Natural Science Edition) * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN110334345A (en) * | 2019-06-17 | 2019-10-15 | 首都师范大学 | New word discovery method |
CN110287488A (en) * | 2019-06-18 | 2019-09-27 | 上海晏鼠计算机技术股份有限公司 | A kind of Chinese text segmenting method based on big data and Chinese feature |
CN110287493A (en) * | 2019-06-28 | 2019-09-27 | 中国科学技术信息研究所 | Risk phrase chunking method, apparatus, electronic equipment and storage medium |
CN110442861B (en) * | 2019-07-08 | 2023-04-07 | 万达信息股份有限公司 | Chinese professional term and new word discovery method based on real world statistics |
CN110442861A (en) * | 2019-07-08 | 2019-11-12 | 万达信息股份有限公司 | A method of Chinese technical term and new word discovery based on real world statistics |
CN111125329A (en) * | 2019-12-18 | 2020-05-08 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN111125329B (en) * | 2019-12-18 | 2023-07-21 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN112711944A (en) * | 2021-01-13 | 2021-04-27 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system and word segmentation device generation method and system |
CN112711944B (en) * | 2021-01-13 | 2023-03-10 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system, and word segmentation device generation method and system |
CN116431930A (en) * | 2023-06-13 | 2023-07-14 | 天津联创科技发展有限公司 | Technological achievement conversion data query method, system, terminal and storage medium |
CN116541527A (en) * | 2023-07-05 | 2023-08-04 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
CN116541527B (en) * | 2023-07-05 | 2023-09-29 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
Also Published As
Publication number | Publication date |
---|---|
CN108845982B (en) | 2021-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108845982A (en) | Chinese word segmentation method based on word association features | |
CN104572622B (en) | A kind of screening technique of term | |
CN104765769B (en) | The short text query expansion and search method of a kind of word-based vector | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN108287922B (en) | Text data viewpoint abstract mining method fusing topic attributes and emotional information | |
CN105786991B (en) | In conjunction with the Chinese emotion new word identification method and system of user feeling expression way | |
TWI518528B (en) | Method, apparatus and system for identifying target words | |
CN105975625A (en) | Chinglish inquiring correcting method and system oriented to English search engine | |
CN108509425A (en) | A kind of Chinese new word discovery method based on novel degree | |
CN104268160A (en) | Evaluation object extraction method based on domain dictionary and semantic roles | |
CN111190900B (en) | JSON data visualization optimization method in cloud computing mode | |
CN105069080B (en) | A kind of document retrieval method and system | |
CN108073571B (en) | Multi-language text quality evaluation method and system and intelligent text processing system | |
CN109947951A (en) | A kind of automatically updated emotion dictionary construction method for financial text analyzing | |
CN108763348A (en) | A kind of classification improved method of extension short text word feature vector | |
CN102662936A (en) | Chinese-English unknown words translating method blending Web excavation, multi-feature and supervised learning | |
CN113407842B (en) | Model training method, theme recommendation reason acquisition method and system and electronic equipment | |
CN108491512A (en) | The method of abstracting and device of headline | |
CN107688630A (en) | A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme | |
CN105956158B (en) | The method that network neologisms based on massive micro-blog text and user information automatically extract | |
CN109299248A (en) | A kind of business intelligence collection method based on natural language processing | |
US10970489B2 (en) | System for real-time expression of semantic mind map, and operation method therefor | |
CN116362243A (en) | Text key phrase extraction method, storage medium and device integrating incidence relation among sentences | |
CN109299463B (en) | Emotion score calculation method and related equipment | |
CN114048310A (en) | Dynamic intelligence event timeline extraction method based on LDA theme AP clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||