CN108845982A - A Chinese word segmentation method based on word-association features - Google Patents

A Chinese word segmentation method based on word-association features

Info

Publication number
CN108845982A
CN108845982A (application CN201711293044.8A)
Authority
CN
China
Prior art keywords
word
candidate
corpus
candidate word
indicate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711293044.8A
Other languages
Chinese (zh)
Other versions
CN108845982B (en)
Inventor
龙华
李康康
邵玉斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201711293044.8A
Publication of CN108845982A
Application granted
Publication of CN108845982B
Legal status: Active


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/279 — Recognition of textual entities
    • G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a Chinese word segmentation method based on word-association features, belonging to the field of information processing. Text to be processed is selected from a text library and the text library is pre-processed: symbols are removed and the text is formed into sentences, and a corpus is built from the symbol-free sentences. Using a front-and-back splicing segmentation method, the corpus is segmented into fragments. Binary, ternary and quaternary front-and-back splicing then form a binary candidate dictionary, a ternary candidate dictionary and a quaternary candidate dictionary. A word-frequency threshold is set for the counted candidate words and each candidate is judged against it; candidates that satisfy the threshold are retained, forming a new corpus.

Description

A Chinese word segmentation method based on word-association features
Technical field
The present invention relates to a Chinese word segmentation method based on word-association features, belonging to the field of information processing.
Background technique
Chinese word segmentation belongs to the field of natural language processing. In short, people can use their own knowledge to understand which strings are words and which are not, but how can a computer be made to understand this as well? The process by which it does so is the segmentation algorithm.
Existing segmentation algorithms fall into three categories: segmentation methods based on understanding, segmentation methods based on string matching, and traditional segmentation methods based on statistics.
The segmentation method based on understanding achieves word recognition by letting the computer simulate a person's understanding of a sentence. Its basic idea is to perform syntactic and semantic analysis while segmenting, using syntactic and semantic information to resolve ambiguity. It generally comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem can obtain syntactic and semantic information about words, sentences and so on to resolve segmentation ambiguity; that is, it simulates a person's process of understanding a sentence. This method requires a large amount of linguistic knowledge and information. Because of the generality and complexity of Chinese-language knowledge, it is difficult to organize the various kinds of linguistic information into a form a machine can read directly, so understanding-based segmentation systems are still at the experimental stage.
The segmentation method based on string matching, also called the mechanical segmentation method, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by length priority, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into simple segmentation methods and integrated methods that combine segmentation with tagging. Common mechanical segmentation methods include: (1) the forward maximum matching method (scanning left to right); (2) the reverse maximum matching method (scanning right to left); (3) minimum cutting (minimizing the number of words cut out of each sentence). These mechanical methods can also be combined with each other; for example, the forward and reverse maximum matching methods can be combined into a bidirectional matching method. Owing to the word-formation characteristics of Chinese, forward minimum matching and reverse minimum matching are rarely used. In general, the cutting precision of reverse matching is slightly higher than that of forward matching, and it encounters fewer ambiguities. Practical segmentation systems all use mechanical segmentation as an initial step and further improve cutting accuracy with various other kinds of linguistic information. One improvement is to refine the scanning method, called feature scanning or feature cutting: words with obvious features are preferentially identified and cut out of the string to be analyzed, and, using these words as breakpoints, the original string is divided into smaller strings before mechanical segmentation, reducing the matching error rate. Another improvement combines segmentation with part-of-speech tagging, using rich part-of-speech information to assist segmentation decisions, and in turn checking and adjusting the segmentation result during tagging, thereby greatly improving cutting accuracy.
Among the above string-matching (mechanical) segmentation methods, whether the forward maximum matching method, the reverse maximum matching method, or minimum cutting, the aim of the maximum matching methods is to make each cut match the longest possible word in the dictionary. The advantage of maximum matching is that the principle is simple and easy to implement; the disadvantage is that the maximum matching length is hard to choose: if it is too large, time complexity rises, and if it is too small, words longer than that length cannot be matched, lowering segmentation accuracy. The evaluation principle of maximum matching is "long words first". However, existing maximum matching methods, whether forward or reverse, whether adding or removing characters, all perform maximum matching within a local range: the matched range is always the first i or last i characters, so the "long words first" principle is not fully embodied.
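As an illustration of the mechanical approach described above, forward maximum matching can be sketched as follows; the dictionary, example sentence, and maximum length are illustrative only and not part of the invention:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right cut: always take the longest dictionary match."""
    result = []
    i = 0
    while i < len(text):
        matched = False
        # "Long words first": try the longest candidate span, then shrink.
        for j in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + j]
            if piece in dictionary:
                result.append(piece)
                i += j
                matched = True
                break
        if not matched:
            # No multi-character match: emit a single character.
            result.append(text[i])
            i += 1
    return result

# 研究生/命 instead of the intended 研究/生命 shows the greedy method's weakness.
print(forward_max_match("研究生命", {"研究", "研究生", "生命"}))
```

A reverse pass (scanning right to left) would cut this example as 研究/生命, which illustrates why reverse matching is often slightly more accurate in practice.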
The principle of the traditional statistics-based segmentation method is that, formally, a word is a stable combination of characters, so the more often adjacent characters co-occur in context, the more likely they are to constitute a word. The frequency or probability of adjacent character co-occurrence therefore reflects the credibility of word formation well. The frequency of each adjacent co-occurring character combination in the corpus can be counted and their mutual information computed. The mutual information of two characters is computed from the adjacent co-occurrence probabilities of the Chinese characters X and Y, and reflects the tightness of the bond between them. When the tightness exceeds some threshold, the character group can be considered likely to constitute a word. This method only needs to count character-group frequencies in the corpus and needs no cutting dictionary, so it is also called dictionary-free segmentation or statistical word extraction. But it has limitations: it often extracts character groups whose co-occurrence frequency is high but which are not words, such as "this", "one of", "having", "I", "many"; its recognition accuracy for everyday words is poor; and its time and space overhead is large.
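The statistical idea described above can be sketched with a toy mutual-information score over adjacent character pairs; the corpus string is invented purely for illustration:

```python
import math
from collections import Counter

# Toy corpus (illustrative only); real systems use large corpora.
corpus = "学习使我快乐我快乐因为学习我在学习的时候最快乐的我"
chars = Counter(corpus)
pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
n_chars = sum(chars.values())
n_pairs = sum(pairs.values())

def mutual_information(pair):
    """log P(xy) / (P(x)P(y)); larger values suggest a tighter pairing."""
    p_xy = pairs[pair] / n_pairs
    p_x = chars[pair[0]] / n_chars
    p_y = chars[pair[1]] / n_chars
    return math.log(p_xy / (p_x * p_y))

# The genuine word 学习 binds more tightly than the incidental pair 的我.
print(mutual_information("学习") > mutual_information("的我"))  # True
```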
Summary of the invention
The technical problem to be solved by the present invention is to provide a Chinese word segmentation method based on word-association features, to overcome the prior art's inability to effectively identify and extract words from a large-scale corpus, enabling a computer system to effectively identify and extract words from large-scale corpora.
The technical scheme of the invention is a Chinese word segmentation method based on word-association features, comprising the following steps:
a. Select the text to be processed from a text library and pre-process the text library: remove symbols and form the text into sentences, then build a corpus from the symbol-free sentences;
b. Using the front-and-back splicing segmentation method, segment the corpus of step a to form segmentation fragments;
c. Using the binary, ternary and quaternary front-and-back splicing methods, form a binary candidate dictionary, a ternary candidate dictionary and a quaternary candidate dictionary;
d. Perform word-frequency statistics on the binary, ternary and quaternary candidate words in the binary, ternary and quaternary candidate dictionaries;
e. Set a word-frequency threshold for the counted candidate words and make a decision on each: candidate words that meet the threshold are retained, forming a new corpus, and candidate words that do not meet it are deleted;
f. Calculate the degree of freedom and the cohesion of the candidate words in the corpus processed in step e, set a unified degree-of-freedom threshold and cohesion threshold for all candidate words, and make a decision: candidate words that satisfy the decision are retained, and those that do not are deleted;
g. Using the segmentation filtering method, further filter the screened ternary and quaternary candidate words to form a new dictionary.
The front-and-back splicing method cuts a Chinese sentence continuously, starting from the first character, into candidate character fragments. Specifically:
The text content contained in a Chinese text is assumed to be:
{ai, ai+1, ai+2, ai+3, ai+4, ai+5, ..., ai-1+n, ai+n}, where ai denotes a character in the text and n ∈ N;
Binary cutting and splicing of the text set gives the binary text-fragment set: {(aiai+1), (ai+1ai+2), (ai+2ai+3), (ai+3ai+4), ..., (ai-1+nai+n)};
Ternary cutting and splicing of the text set gives the ternary text-fragment set: {(aiai+1ai+2), (ai+1ai+2ai+3), (ai+2ai+3ai+4), ..., (ai-2+nai-1+nai+n)};
Quaternary cutting and splicing of the text set gives the quaternary text-fragment set: {(aiai+1ai+2ai+3), (ai+1ai+2ai+3ai+4), (ai+2ai+3ai+4ai+5), ..., (ai-3+nai-2+nai-1+nai+n)}.
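Under the stated reading that the cutting simply slides a fixed-size window over the sentence, the three fragment sets can be generated as follows; the example sentence is illustrative:

```python
def ngram_cut(sentence, n):
    """All runs of n consecutive characters, as in the fragment sets above."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

sentence = "中文分词方法"  # illustrative sentence
print(ngram_cut(sentence, 2))  # binary fragments
print(ngram_cut(sentence, 3))  # ternary fragments
print(ngram_cut(sentence, 4))  # quaternary fragments
```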
The degree of freedom refers to the following: when a text fragment appears in several different contexts, it has a left-adjacent character set and a right-adjacent character set. The left-adjacent set is the set of characters that appear immediately to the left of the fragment, and the right-adjacent set is the set of characters that appear immediately to its right. The information entropies of the left-adjacent and right-adjacent sets are computed, and the smaller of the two entropies is taken as the degree of freedom of the fragment.
In the obtained set of text fragments, when a fragment can appear in several different contexts it has both a left-adjacent character set and a right-adjacent character set. The information entropies of the two sets are computed, and the degree of freedom of the fragment is H = min{s′, s″}, where H denotes the degree of freedom of the candidate word, s′ its right entropy and s″ its left entropy; the smaller of the two entropies is taken as the degree of freedom.
The cohesion refers to the following: in a text, the probability that a new word occurs as a whole is higher than the product of the probabilities of its parts, i.e. P(AB) > P(A)·P(B). Let M = P(AB)/(P(A)·P(B)); the smallest such M over the possible splits is taken as the cohesion. Wherein AB denotes a new word, P(AB) the probability that the new word occurs in the text, A and B the two parts of the combination, and P(A) and P(B) the probabilities with which those parts occur in the text.
The cohesion of a candidate word is obtained by computing, over the corpus, the ratio of the candidate word's joint probability to the product of the independent probabilities of its parts. The specific steps are:
(1) The cohesion M2 of a binary candidate word is obtained from the ratio of the probability of the candidate word to the combined probability of its characters:
M2 = p(i,i+1)/(si·si+1), with si = Ni/N, si+1 = Ni+1/N, p(i,i+1) = Ni,i+1/N
Wherein, M2 is the cohesion of the candidate word; si is the probability that the first character of the binary candidate word occurs in the corpus; Ni is the number of times the first character of the binary candidate word occurs in the corpus; Ni+1 is the number of times the second character occurs in the corpus; Ni,i+1 is the number of times the binary candidate word occurs in the corpus; N is the total number of characters in the corpus; si+1 is the probability that the second character of the binary candidate word occurs in the corpus; and p(i,i+1) is the probability that the binary candidate word occurs in the corpus;
(2) The cohesion M3 of a ternary candidate word is obtained from the ratio of the probability of the candidate word to the combined probabilities of its two possible splits, taking the smaller value:
M3 = min{ P(i,i+1,i+2)/(si·si+1,i+2), P(i,i+1,i+2)/(si,i+1·si+2) }
Wherein, M3 is the cohesion of the candidate word; si = Ni/N is the probability that the first character of the ternary candidate word occurs in the corpus; si+1,i+2 = Ni+1,i+2/N is the probability that the last two characters occur together in the corpus; si,i+1 = Ni,i+1/N is the probability that the first two characters occur together in the corpus; si+2 = Ni+2/N is the probability that the last character occurs in the corpus; Ni is the number of times the first character occurs in the corpus; Ni+2 is the number of times the third character occurs; Ni+1,i+2 is the number of times the last two characters occur together; Ni,i+1 is the number of times the first two characters occur together; Ni,i+1,i+2 is the number of times the ternary candidate word occurs in the corpus; and P(i,i+1,i+2) = Ni,i+1,i+2/N is the probability that the ternary candidate word occurs in the corpus;
(3) The cohesion M4 of a quaternary candidate word is obtained from the ratio of the probability of the candidate word to the combined probabilities of its three possible splits, taking the smallest value:
M4 = min{ P(i,i+1,i+2,i+3)/(si·si+1,i+2,i+3), P(i,i+1,i+2,i+3)/(si,i+1·si+2,i+3), P(i,i+1,i+2,i+3)/(si,i+1,i+2·si+3) }
Wherein, M4 is the cohesion of the candidate word; si is the probability that the first character of the quaternary candidate word occurs in the corpus; si+1,i+2,i+3 is the probability that the last three characters occur together in the corpus; si,i+1,i+2 is the probability that the first three characters occur together; si,i+1 is the probability that the first two characters occur together; si+2,i+3 is the probability that the last two characters occur together; Ni is the number of times the first character occurs in the corpus; Ni+3 is the number of times the fourth character occurs; Ni,i+1 is the number of times the first two characters occur together; Ni+2,i+3 is the number of times the last two characters occur together; Ni+1,i+2,i+3 is the number of times the last three characters occur together; Ni,i+1,i+2 is the number of times the first three characters occur together; and P(i,i+1,i+2,i+3) is the probability that the quaternary candidate word occurs in the corpus.
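A minimal sketch of the cohesion computation, assuming the minimum-over-splits form described above; the toy corpus and counting scheme are illustrative, not part of the invention:

```python
from collections import Counter

def cohesion(candidate, counts, total):
    """Minimum over all binary splits of P(whole) / (P(left) * P(right))."""
    p_whole = counts[candidate] / total
    ratios = []
    for i in range(1, len(candidate)):
        left, right = candidate[:i], candidate[i:]
        ratios.append(p_whole / ((counts[left] / total) * (counts[right] / total)))
    return min(ratios)

# Count every substring of length 1..4 in a toy corpus.
corpus = "机器学习机器学习机器翻译学习机器"
counts = Counter()
for n in range(1, 5):
    for i in range(len(corpus) - n + 1):
        counts[corpus[i:i + n]] += 1

print(cohesion("机器", counts, len(corpus)))  # 4.0: 机 is always followed by 器
```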
The filtering method for ternary candidate words is: for a ternary candidate word, if its last two characters exist in the binary candidate dictionary, judge whether the first character forms a word with its left-adjacent character; if its first two characters exist in the binary candidate dictionary, judge whether the last character forms a word with its right-adjacent character; this determines whether the ternary candidate is kept as a word;
If (Ai-2Ai-1) ∉ {(A0,A1)...(Ai,Ai+1)...} and (Ai+1Ai+2) ∉ {(A0,A1)...(Ai,Ai+1)...}, then (Ai-1AiAi+1) belongs to the ternary words;
Wherein, (Ai-1AiAi+1) is the ternary candidate word, Ai-2 is the left-adjacent character of the ternary candidate word, Ai+2 is its right-adjacent character, {A0....Ai....AN} is the character set of the corpus, and {(A0,A1)...(Ai,Ai+1)...(Ai-2,Ai-1)} is the binary candidate word set.
The filtering method for quaternary candidate words is: for a quaternary candidate that may be a word, first split it into two segments, the first two characters and the last two characters, and match each segment against the binary dictionary already obtained; if neither segment matches, the candidate is pre-selected. Then split out the middle two characters of the quaternary word and match them against the binary dictionary; if they do not match either, the candidate is again pre-selected. If both conditions are met, the candidate is taken as a segmentation result:
If {(Ai-2,Ai-1), (Ai,Ai+1)} ⊄ {(A0A1)...(AiAi+1)...(ANAN+1)} and (Ai-1,Ai) ∉ {(A0A1)...(AiAi+1)...(ANAN+1)}, then (Ai-2Ai-1AiAi+1) is kept as a quaternary word;
Wherein, (Ai-2Ai-1AiAi+1) denotes the quaternary candidate word, {(Ai-2,Ai-1), (Ai,Ai+1)} the front and back character pairs split from it, {(A0A1)...(AiAi+1)...(ANAN+1)} the binary dictionary, and (Ai-1,Ai) the middle character pair of the quaternary candidate word.
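As a sketch of the ternary filter under the reading given above: a three-character candidate is kept only when neither boundary character pairs off with its outside neighbour in the binary dictionary. The dictionary entries and neighbours below are invented for illustration:

```python
def keep_ternary(left_char, candidate, right_char, binary_dict):
    """Reject the candidate when a boundary character belongs to a binary
    word that straddles the candidate's edge."""
    left_pair = left_char + candidate[0]     # (A_{i-2} A_{i-1})
    right_pair = candidate[-1] + right_char  # (A_{i+1} A_{i+2})
    return left_pair not in binary_dict and right_pair not in binary_dict

binary_dict = {"智能"}  # illustrative binary candidate dictionary
print(keep_ternary("的", "互联网", "上", binary_dict))  # True: edges are free
print(keep_ternary("用", "人工智", "能", binary_dict))  # False: 智 pairs with 能
```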
The beneficial effects of the invention are: the method has high correctness and validity, and the system can segment efficiently. The cohesion, the degree of freedom, and the ternary and quaternary segmentation methods designed in the present invention can well solve the problems of traditional segmentation methods based on statistical models.
Detailed description of the invention
Fig. 1 is a flow diagram of the invention.
Specific embodiment
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Embodiment 1: As shown in Figure 1, the method is carried out according to steps a through g set out above; the quantities used in the decisions are computed as follows.
In the obtained fragment set, the left-adjacent character set of a fragment is the set of characters that appear immediately to its left: for example, the fragment (ai+1ai+2) in the text set {ai, ai+1, ai+2, ai+3, ai+4, ai+5, ..., ai-1+n, ai+n} has the left-adjacent set {ai}. The right-adjacent character set of a fragment is the set of characters that appear immediately to its right: the same fragment (ai+1ai+2) has the right-adjacent set {ai+3}.
The degree of freedom of a candidate word is obtained by computing the entropies of its left- and right-adjacent character sets and taking the smaller of the two:
H = min{s′, s″}
Wherein, H is the degree of freedom of the candidate word and s′ is its right entropy:
s′ = −Σ(i=1..k) (nbi/Σ(j=1..k) nbj) · log(nbi/Σ(j=1..k) nbj)
where bi belongs to the right-adjacent character set of the candidate word, nbi is the number of times bi appears to the right of the candidate word, and k is the number of distinct characters in the right-adjacent set; s″ is the left entropy:
s″ = −Σ(i=1..M) (nmi/Σ(j=1..M) nmj) · log(nmi/Σ(j=1..M) nmj)
where mi belongs to the left-adjacent character set of the candidate word, nmi is the number of times mi appears to the left of the candidate word, and M is the number of distinct characters in the left-adjacent set.
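The entropy computation above can be sketched as follows; the tongue-twister corpus is purely illustrative:

```python
import math
from collections import Counter

def neighbour_entropy(neigh_counts):
    """Shannon entropy of a neighbour-character frequency table."""
    total = sum(neigh_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neigh_counts.values())

def freedom(fragment, corpus):
    """min(left entropy, right entropy) over the fragment's occurrences."""
    left, right = Counter(), Counter()
    start = corpus.find(fragment)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(fragment)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(fragment, start + 1)
    return min(neighbour_entropy(left), neighbour_entropy(right))

corpus = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
print(freedom("葡萄", corpus))  # left neighbours {吃, 吐} give entropy ln 2
```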
The filter method of the ternary candidate word is:For ternary candidate word, if latter two character is to be present in binary time It selects in dictionary, judges whether first character with left adjacent word constitutes a word, if the first two character is to be present in binary candidate word In library, judge whether the last character with right adjacent word constitutes a word, determine ternary candidate word whether candidate word;
If (A_{i-2}, A_{i-1}) ∉ {(A_0,A_1)...(A_i,A_{i+1})...} and (A_{i+1}, A_{i+2}) ∉ {(A_0,A_1)...(A_i,A_{i+1})...}, then (A_{i-1}A_iA_{i+1}) belongs to the ternary words;
Wherein, (A_{i-1}A_iA_{i+1}) is the ternary candidate word, A_{i-2} is the left-adjacent character of the ternary candidate word, A_{i+2} is its right-adjacent character, {A_0 ... A_i ... A_N} is the character set of the corpus, and {(A_0,A_1)...(A_i,A_{i+1})...(A_{i-2},A_{i-1})} is the binary candidate word set.
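One way to read this ternary filter in code — our interpretation of the condition, with the function name, the plain-string bigram dictionary, and the boundary-character arguments all being assumptions rather than the patent's notation:

```python
def keep_ternary(trigram, left_ch, right_ch, bigram_dict):
    """Keep a 3-character candidate when neither boundary character binds
    outward: (left neighbour + first char) and (last char + right
    neighbour) are not attested bigrams, while at least one of the
    trigram's own internal bigrams is attested."""
    a, b, c = trigram
    internal_ok = (a + b) in bigram_dict or (b + c) in bigram_dict
    binds_left = (left_ch + a) in bigram_dict
    binds_right = (c + right_ch) in bigram_dict
    return internal_ok and not binds_left and not binds_right
```

For example, with `bigram_dict = {"bc"}`, the trigram `"abc"` between neighbours `"x"` and `"y"` is kept, but adding `"xa"` to the dictionary rejects it, since the first character would bind to the left neighbour.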
The filter method for quaternary candidate words is: a quaternary candidate that may form a word is first split, the first two characters forming one segment and the last two characters another, and each segment is matched against the established binary dictionary; if both match, the candidate is a pre-selected word. The middle two characters of the quaternary candidate are then matched against the established binary dictionary, and if they do not match, the candidate is likewise pre-selected. If both conditions are met, the candidate is taken as a segmentation result:
If {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} ⊆ {(A_0A_1)...(A_iA_{i+1})...(A_NA_{N+1})} and (A_{i-1},A_i) ∉ {(A_0A_1)...(A_iA_{i+1})...(A_NA_{N+1})}, then (A_{i-2}A_{i-1}A_iA_{i+1}) belongs to the quaternary words;
Wherein, (A_{i-2}A_{i-1}A_iA_{i+1}) denotes the quaternary candidate word, {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} denotes the front and back two-character segments split from the quaternary candidate word, {(A_0A_1)...(A_iA_{i+1})...(A_NA_{N+1})} denotes the binary dictionary, and (A_{i-1},A_i) denotes the middle two characters of the quaternary candidate word.
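Under the same reading, the quaternary filter checks that the two halves are attested bigrams while the bigram straddling the middle boundary is not. The function name and the plain-string dictionary are our own illustrative choices:

```python
def keep_quaternary(quadgram, bigram_dict):
    """Keep a 4-character candidate when its front and back halves are
    both attested bigrams but the middle two characters are not."""
    front, middle, back = quadgram[:2], quadgram[1:3], quadgram[2:]
    return front in bigram_dict and back in bigram_dict and middle not in bigram_dict
```

With `bigram_dict = {"ab", "cd"}`, the quadgram `"abcd"` is kept; if `"bc"` is also an attested bigram, the middle boundary is ambiguous and the candidate is rejected.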
Embodiment 2: as shown in Figure 1, a text to be processed is selected from the text library, and the text library is pre-processed, including removing symbols and forming sentences; the sentences with the symbols removed are used to build the corpus.
Using the front-back splicing segmentation method, the corpus from step a1 is segmented to form segmentation fragments.
Using the binary, ternary and quaternary front-back cut-and-splice methods, a binary candidate dictionary, a ternary candidate dictionary and a quaternary candidate dictionary are formed.
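The cut-and-splice candidate generation amounts to sliding a fixed-size window over each sentence. A sketch (the function name is ours, not the patent's):

```python
def ngram_candidates(sentence, n):
    """Slide an n-character window over the sentence, yielding every
    contiguous n-character fragment as a candidate word."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

print(ngram_candidates("abcd", 2))  # ['ab', 'bc', 'cd']
```

Running it with n = 2, 3 and 4 over every sentence of the corpus yields the binary, ternary and quaternary candidate dictionaries respectively.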
A word-frequency threshold is set for the candidate words whose frequencies have been counted, and a decision is made: candidates satisfying the threshold are retained, forming a new corpus.
The cohesion degree and degree of freedom of the candidate words are counted. In the present embodiment, the cohesion degree of a candidate word is obtained by calculating the ratio of the independent probabilities and the joint probability of the candidate word in the corpus; the degree of freedom of a candidate word is obtained by calculating the information entropies of its left-adjacent and right-adjacent character sets and taking the smaller entropy as the degree of freedom.
The cohesion degree and the degree of freedom of each candidate word are compared with the set thresholds.
The candidate words exceeding the thresholds are extracted to form the candidate dictionary.
In the present embodiment, the novel Journey to the West, one of the four great classical masterpieces, is collected as the corpus. In the counted dictionary, if a text fragment that forms a word is sufficiently distributed, its cohesion degree will be higher than that of fragments which do not form words, and its degree of freedom larger. If the characters adjacent to a word on the left and right are regarded as random variables, the information entropy of the left- and right-adjacent character sets reflects the randomness of these neighbours: the larger the entropy, the richer the word's left-adjacent or right-adjacent character set. We take the smaller of the two entropies as the degree of freedom.
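The degree-of-freedom computation described here — the smaller of the left- and right-neighbour entropies — might be sketched as below. The raw-string scan and all names are our own simplifications, not the patent's implementation:

```python
import math
from collections import Counter

def freedom_degree(word, corpus):
    """Degree of freedom: collect the characters adjacent to every
    occurrence of `word` in `corpus`, compute the entropy of the left
    and right neighbour distributions, and return the smaller one."""
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(word, start + 1)

    def entropy(counter):
        n = sum(counter.values())
        return -sum(c / n * math.log(c / n) for c in counter.values()) if n else 0.0

    return min(entropy(left), entropy(right))
```

In the toy corpus `"abacad"`, the character `"a"` has right neighbours {b, c, d} (entropy ln 3) and left neighbours {b, c} (entropy ln 2), so its degree of freedom is ln 2.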
In the present embodiment, a fragment that forms a word will have a higher cohesion degree, indicating a stronger association between its characters; when calculating the cohesion degree, we take the smaller ratio as the final cohesion degree.
We take the dictionary obtained from the degree-of-freedom and cohesion-degree statistics as the candidate dictionary, process the ternary candidate dictionary and the quaternary candidate dictionary with the ternary and quaternary segmentation filter methods, and use the final result as the dictionary.
The ternary and quaternary segmentation filter methods remove fragments that subjectively appear to be words but actually are not, thereby improving the validity of the ternary dictionary and the quaternary dictionary.
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments; within the scope of knowledge possessed by a person skilled in the art, various changes can also be made without departing from the concept of the present invention.

Claims (8)

1. A Chinese word segmentation method based on word association characteristics, characterized in that:
A. A text to be processed is selected from the text library, and the text library is pre-processed, including removing symbols and forming sentences; the sentences with the symbols removed are used to build a corpus;
B. Using the front-back splicing segmentation method, the corpus from step A is segmented to form segmentation fragments;
C. Using the binary, ternary and quaternary front-back cut-and-splice methods, a binary candidate dictionary, a ternary candidate dictionary and a quaternary candidate dictionary are formed;
D. Word-frequency statistics are performed on the binary candidate words, ternary candidate words and quaternary candidate words in the binary, ternary and quaternary candidate dictionaries;
E. A word-frequency threshold is set for the candidate words whose frequencies have been counted, and a decision is made: candidate words satisfying the threshold are retained, forming a new corpus, and candidate words not satisfying the threshold are deleted;
F. The degree of freedom and the cohesion degree of the candidate words in the corpus processed in step E are calculated, a uniform degree-of-freedom threshold and cohesion-degree threshold are applied to all candidate words, and a decision is made: candidate words satisfying the decision are retained, and candidate words not satisfying it are deleted;
G. Using the segmentation filter methods, the screened ternary candidate words and quaternary candidate words are further filtered to form a new dictionary.
2. The Chinese word segmentation method based on word association characteristics according to claim 1, characterized in that the front-back splicing method refers to continuously cutting a Chinese text starting from its first character, so that every fragment that may form a word is cut out, specifically:
The content contained in a Chinese text is assumed to be:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Binary cut-and-splice is performed on the text set using the binary front-back splicing method, giving the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Ternary cut-and-splice is performed on the text set using the ternary front-back splicing method, giving the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Quaternary cut-and-splice is performed on the text set using the quaternary front-back splicing method, giving the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
3. The Chinese word segmentation method based on word association characteristics according to claim 1, characterized in that the degree of freedom refers to: when a text fragment appears in a variety of different text sets, it has a left-adjacent character set and a right-adjacent character set; the left-adjacent set is the set of characters appearing immediately to the left of the text fragment, and the right-adjacent set is the set of characters appearing immediately to its right; the information entropy of the text fragment is obtained by computing the entropies of the left-adjacent and right-adjacent sets, and the smaller of the two entropies is taken as the degree of freedom.
4. The Chinese word segmentation method based on word association characteristics according to claim 3, characterized in that the degree of freedom is: in the obtained text fragment set, when a text fragment appears in a variety of different text sets, it has a left-adjacent character set and a right-adjacent character set; the information entropy H of the text fragment is obtained from the entropies of these two sets, namely H = min{s', s''}, where H denotes the degree of freedom of the candidate word, s' = -Σ P(w)·log P(w) computed over the right-adjacent character set is the right entropy of the candidate word, and s'' is the analogous left entropy; the smaller entropy of the left-adjacent and right-adjacent sets is taken as the degree of freedom.
5. The Chinese word segmentation method based on word association characteristics according to claim 1, characterized in that the cohesion degree refers to: in a text, the probability that a new word occurs as a whole is higher than the product of the probabilities of its component words, i.e. P(AB) > P(A)P(B); letting M = P(AB)/(P(A)·P(B)), the smallest M is taken as the cohesion degree, wherein AB denotes a new word, P(AB) the probability that the new word occurs in the text, A and B the component words of the combination, and P(A) and P(B) the probabilities that the component words occur in the text.
6. The Chinese word segmentation method based on word association characteristics according to claim 1, characterized in that the cohesion degree of a candidate word is obtained by calculating the ratio of the independent probabilities and the joint probability of the candidate word in the corpus, with the specific steps:
(1) The cohesion degree M2 of a binary candidate word is obtained from the ratio of the probability of the candidate word to the joint probability of its parts: M2 = P(i,i+1)/(s_i · s_{i+1});
Wherein, M2 denotes the cohesion degree of the candidate word, s_i = N_i/N the probability that the first character of the binary candidate word occurs in the corpus, s_{i+1} = N_{i+1}/N the probability that its second character occurs in the corpus, N_i the number of times the first character of the binary candidate word occurs in the corpus, N_{i+1} the number of times its second character occurs, N_{i,i+1} the number of times the binary candidate word itself occurs, N the total number of characters in the corpus, and P(i,i+1) = N_{i,i+1}/N the probability that the binary candidate word occurs in the corpus;
(2) The cohesion degree M3 of a ternary candidate word is obtained from the ratio of the probability of the candidate word to the joint probability of its parts:
Wherein, M3 denotes the cohesion degree of the ternary candidate word, taken as the smaller of the two split ratios

M3 = min( P(i,i+1,i+2)/(s_i · s_{i+1,i+2}), P(i,i+1,i+2)/(s_{i,i+1} · s_{i+2}) ),

where s_i = N_i/N denotes the probability that the first character of the ternary candidate word occurs in the corpus, s_{i+1,i+2} = N_{i+1,i+2}/N the probability that its last two characters occur together in the corpus, s_{i,i+1} = N_{i,i+1}/N the probability that its first two characters occur together in the corpus, and s_{i+2} = N_{i+2}/N the probability that its last character occurs in the corpus; N_i denotes the number of times the first character of the ternary candidate word occurs in the corpus, N_{i+2} the number of times its third character occurs, N_{i+1,i+2} the number of times its last two characters occur together, N_{i,i+1} the number of times its first two characters occur together, and N_{i,i+1,i+2} the number of times the ternary candidate word itself occurs; N is the total number of characters in the corpus, and P(i,i+1,i+2) = N_{i,i+1,i+2}/N denotes the probability that the ternary candidate word occurs in the corpus;
(3) The cohesion degree M4 of a quaternary candidate word is obtained from the ratio of the probability of the candidate word to the joint probability of its parts:
Wherein, M4 denotes the cohesion degree of the quaternary candidate word, taken as the smallest of the three split ratios

M4 = min( P(i,i+1,i+2,i+3)/(s_i · s_{i+1,i+2,i+3}), P(i,i+1,i+2,i+3)/(s_{i,i+1} · s_{i+2,i+3}), P(i,i+1,i+2,i+3)/(s_{i,i+1,i+2} · s_{i+3}) ),

where s_i denotes the probability that the first character of the quaternary candidate word occurs in the corpus, s_{i+1,i+2,i+3} the probability that its last three characters occur together in the corpus, s_{i,i+1,i+2} the probability that its first three characters occur together, s_{i,i+1} the probability that its first two characters occur together, s_{i+2,i+3} the probability that its last two characters occur together, and s_{i+3} the probability that its last character occurs; N_i denotes the number of times the first character of the quaternary candidate word occurs in the corpus, N_{i+3} the number of times its fourth character occurs, N_{i,i+1} the number of times its first two characters occur together, N_{i+2,i+3} the number of times its last two characters occur together, N_{i+1,i+2,i+3} the number of times its last three characters occur together, and N_{i,i+1,i+2} the number of times its first three characters occur together; P(i,i+1,i+2,i+3) denotes the probability that the quaternary candidate word occurs in the corpus.
7. The Chinese word segmentation method based on word association characteristics according to claim 1, characterized in that the filter method for ternary candidate words is: for a ternary candidate word, if its last two characters are present in the binary candidate dictionary, judge whether its first character forms a word with the left-adjacent character; if its first two characters are present in the binary candidate dictionary, judge whether its last character forms a word with the right-adjacent character; on this basis, decide whether the ternary candidate word remains a candidate word;
If (A_{i-2}, A_{i-1}) ∉ {(A_0,A_1)...(A_i,A_{i+1})...} and (A_{i+1}, A_{i+2}) ∉ {(A_0,A_1)...(A_i,A_{i+1})...}, then (A_{i-1}A_iA_{i+1}) belongs to the ternary words;
Wherein, (A_{i-1}A_iA_{i+1}) is the ternary candidate word, A_{i-2} is the left-adjacent character of the ternary candidate word, A_{i+2} is its right-adjacent character, {A_0 ... A_i ... A_N} is the character set of the corpus, and {(A_0,A_1)...(A_i,A_{i+1})...(A_{i-2},A_{i-1})} is the binary candidate word set.
8. The Chinese word segmentation method based on word association characteristics according to claim 1, characterized in that the filter method for quaternary candidate words is: a quaternary candidate that may form a word is first split, the first two characters forming one segment and the last two characters another, and each segment is matched against the established binary dictionary; if both match, the candidate is a pre-selected word; the middle two characters of the quaternary candidate are then matched against the established binary dictionary, and if they do not match, the candidate is likewise pre-selected; if both conditions are met, the candidate is taken as a segmentation result:
If {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} ⊆ {(A_0A_1)...(A_iA_{i+1})...(A_NA_{N+1})} and (A_{i-1},A_i) ∉ {(A_0A_1)...(A_iA_{i+1})...(A_NA_{N+1})}, then (A_{i-2}A_{i-1}A_iA_{i+1}) belongs to the quaternary words;
Wherein, (A_{i-2}A_{i-1}A_iA_{i+1}) denotes the quaternary candidate word, {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} denotes the front and back two-character segments split from the quaternary candidate word, {(A_0A_1)...(A_iA_{i+1})...(A_NA_{N+1})} denotes the binary dictionary, and (A_{i-1},A_i) denotes the middle two characters of the quaternary candidate word.
CN201711293044.8A 2017-12-08 2017-12-08 Chinese word segmentation method based on word association characteristics Active CN108845982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711293044.8A CN108845982B (en) 2017-12-08 2017-12-08 Chinese word segmentation method based on word association characteristics


Publications (2)

Publication Number Publication Date
CN108845982A true CN108845982A (en) 2018-11-20
CN108845982B CN108845982B (en) 2021-08-20

Family

ID=64211732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711293044.8A Active CN108845982B (en) 2017-12-08 2017-12-08 Chinese word segmentation method based on word association characteristics

Country Status (1)

Country Link
CN (1) CN108845982B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622341A (en) * 2012-04-20 2012-08-01 北京邮电大学 Domain ontology concept automatic-acquisition method based on Bootstrapping technology
CN105488098A (en) * 2015-10-28 2016-04-13 北京理工大学 Field difference based new word extraction method
CN105955950A (en) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and device
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 Prompter method and apparatus based on a large-scale corpus
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 New word recognition method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Huixian et al.: "Research on an improved forward maximum matching Chinese word segmentation algorithm", Journal of Guizhou University (Natural Science Edition) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858011A (en) * 2018-11-30 2019-06-07 平安科技(深圳)有限公司 Standard dictionary segmenting method, device, equipment and computer readable storage medium
CN110334345A (en) * 2019-06-17 2019-10-15 首都师范大学 New word discovery method
CN110287488A (en) * 2019-06-18 2019-09-27 上海晏鼠计算机技术股份有限公司 Chinese text segmentation method based on big data and Chinese features
CN110287493A (en) * 2019-06-28 2019-09-27 中国科学技术信息研究所 Risk phrase chunking method, apparatus, electronic equipment and storage medium
CN110442861B (en) * 2019-07-08 2023-04-07 万达信息股份有限公司 Chinese professional term and new word discovery method based on real world statistics
CN110442861A (en) * 2019-07-08 2019-11-12 万达信息股份有限公司 A method of Chinese technical term and new word discovery based on real world statistics
CN111125329A (en) * 2019-12-18 2020-05-08 东软集团股份有限公司 Text information screening method, device and equipment
CN111125329B (en) * 2019-12-18 2023-07-21 东软集团股份有限公司 Text information screening method, device and equipment
CN112711944A (en) * 2021-01-13 2021-04-27 深圳前瞻资讯股份有限公司 Word segmentation method and system and word segmentation device generation method and system
CN112711944B (en) * 2021-01-13 2023-03-10 深圳前瞻资讯股份有限公司 Word segmentation method and system, and word segmentation device generation method and system
CN116431930A (en) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 Technological achievement conversion data query method, system, terminal and storage medium
CN116541527A (en) * 2023-07-05 2023-08-04 国网北京市电力公司 Document classification method based on model integration and data expansion
CN116541527B (en) * 2023-07-05 2023-09-29 国网北京市电力公司 Document classification method based on model integration and data expansion

Also Published As

Publication number Publication date
CN108845982B (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN108845982A (en) A kind of Chinese word cutting method of word-based linked character
CN104572622B (en) A kind of screening technique of term
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN108287922B (en) Text data viewpoint abstract mining method fusing topic attributes and emotional information
CN105786991B (en) In conjunction with the Chinese emotion new word identification method and system of user feeling expression way
TWI518528B (en) Method, apparatus and system for identifying target words
CN105975625A (en) Chinglish inquiring correcting method and system oriented to English search engine
CN108509425A (en) A kind of Chinese new word discovery method based on novel degree
CN104268160A (en) Evaluation object extraction method based on domain dictionary and semantic roles
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN105069080B (en) A kind of document retrieval method and system
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
CN109947951A (en) A kind of automatically updated emotion dictionary construction method for financial text analyzing
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN102662936A (en) Chinese-English unknown words translating method blending Web excavation, multi-feature and supervised learning
CN113407842B (en) Model training method, theme recommendation reason acquisition method and system and electronic equipment
CN108491512A (en) The method of abstracting and device of headline
CN107688630A (en) A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme
CN105956158B (en) The method that network neologisms based on massive micro-blog text and user information automatically extract
CN109299248A (en) A kind of business intelligence collection method based on natural language processing
US10970489B2 (en) System for real-time expression of semantic mind map, and operation method therefor
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
CN109299463B (en) Emotion score calculation method and related equipment
CN114048310A (en) Dynamic intelligence event timeline extraction method based on LDA theme AP clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant