CN106126495B - Word-discovery method and apparatus based on a large-scale corpus - Google Patents

Word-discovery method and apparatus based on a large-scale corpus

Info

Publication number
CN106126495B
CN106126495B CN201610429967.0A
Authority
CN
China
Prior art keywords
word
candidate word
candidate
indicate
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610429967.0A
Other languages
Chinese (zh)
Other versions
CN106126495A (en)
Inventor
曹骥
王富田
李健
张连毅
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Original Assignee
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP filed Critical BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority to CN201610429967.0A priority Critical patent/CN106126495B/en
Publication of CN106126495A publication Critical patent/CN106126495A/en
Application granted granted Critical
Publication of CN106126495B publication Critical patent/CN106126495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a word-discovery method and device based on a large-scale corpus, comprising the steps of: computing, from a collected corpus, the cohesion of each candidate word and the degree of freedom of each candidate word; multiplying the cohesion of the candidate word by its degree of freedom to obtain a word-formation score; and extracting the candidate words whose word-formation score exceeds a preset threshold. In the absence of a standard definition and segmentation of Chinese words, this enables a computer system to effectively identify and extract words from a large-scale corpus.

Description

Word-discovery method and apparatus based on a large-scale corpus
Technical field
The present invention relates to the field of language analysis, and in particular to a word-discovery method and apparatus based on a large-scale corpus.
Background art
In natural language processing of Chinese material, it is often necessary to extract words from a corpus. In the field of Chinese text processing, however, the definition of a word has always been ambiguous: there is still no generally accepted, authoritative standard for which single characters or character combinations count as a word. Chinese word discovery must therefore, in the absence of a standard dictionary, filter out of a corpus the text fragments most likely to form words; it is chiefly used to find words in a corpus. Since Chinese lacks a standard definition and segmentation of words, the standard by which words are defined is the key to extracting them from a corpus.
The key for a computer to perform Chinese word discovery is how to let a computer system find words in a Chinese text corpus and extract them. Chinese characters are the symbols that record the language; a word is composed of morphemes and is the smallest linguistic unit that can be used independently. In the text of an isolating language such as Chinese, however, there is no explicit boundary marker between words, such as a space. Chinese word discovery has therefore become an important task that computers face when processing isolating languages.
How to construct a word-discovery method and apparatus based on a large-scale corpus has thus become a technical problem in urgent need of a solution.
Summary of the invention
Embodiments of the present invention provide a word-discovery method and apparatus based on a large-scale corpus, to overcome the defect of the prior art that words cannot be effectively identified and extracted from a large-scale corpus, so that a computer system can effectively identify and extract words in a large-scale corpus.
To solve the above problems, the invention discloses a word-discovery method based on a large-scale corpus, comprising the steps of:
computing, from the collected corpus, the cohesion of each candidate word and the degree of freedom of each candidate word;
multiplying the cohesion of the candidate word by its degree of freedom to obtain a word-formation score;
extracting the candidate words whose word-formation score exceeds a preset threshold.
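The three claimed steps can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the cohesion stand-in below compares a candidate's observed count with the count expected if its characters were independent, rather than the entropy-based formula given later in the description, and `max_len` and `threshold` are assumed example parameters.

```python
import math
from collections import Counter

def entropy(counter):
    # Shannon entropy of a neighbor-character count distribution.
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def discover_words(corpus, max_len=4, threshold=0.5):
    # Step 1: count every substring of length <= max_len and its neighbors.
    freq, left, right = Counter(), {}, {}
    n_total = len(corpus)
    for i in range(n_total):
        for d in range(1, max_len + 1):
            if i + d > n_total:
                break
            w = corpus[i:i + d]
            freq[w] += 1
            if i > 0:
                left.setdefault(w, Counter())[corpus[i - 1]] += 1
            if i + d < n_total:
                right.setdefault(w, Counter())[corpus[i + d]] += 1
    scored = {}
    for w, n in freq.items():
        if len(w) < 2:
            continue  # single characters are not scored as candidates here
        # Cohesion stand-in: observed count vs. count expected if the
        # characters were independent (NOT the patented entropy formula).
        expected = n_total * math.prod(freq[c] / n_total for c in w)
        cohesion = n / expected
        # Degree of freedom: H = min{S', S''} over left/right neighbor sets.
        h = min(entropy(left.get(w, Counter())),
                entropy(right.get(w, Counter())))
        scored[w] = cohesion * h  # word-formation score G = T * H
    # Step 3: keep the candidates whose score clears the threshold.
    return {w: g for w, g in scored.items() if g > threshold}
```

On a toy corpus such as "的苹果和苹果树与苹果汁", the fragment "苹果" repeats with varied neighbors on both sides and clears the threshold, while one-off fragments like "果和" have a zero-entropy neighbor set and are rejected.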
In the method of the present invention,
the cohesion of a candidate word is obtained by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies.
In the method of the present invention,
the cohesion of the candidate word is
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word and d denotes the length of the candidate word;
S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
In the method of the present invention,
the degree of freedom of the candidate word is obtained by computing the information entropies of the left- and right-neighbor character sets of the candidate word and taking the smaller of the two as the degree of freedom.
In the method of the present invention,
the degree of freedom of the candidate word is
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..K} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{i=1..M} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of elements in the left-neighbor character set of the candidate word.
To solve the above problems, the invention also discloses a word-discovery device based on a large-scale corpus, comprising a corpus collection unit and further comprising:
a candidate-word cohesion computing unit, for computing the cohesion of candidate words;
a candidate-word freedom computing unit, for computing the degree of freedom of candidate words;
a word-formation score computing unit, for multiplying the cohesion of a candidate word by its degree of freedom to obtain a word-formation score;
a word-discovery unit, for extracting the candidate words whose word-formation score exceeds a preset threshold.
In the device of the present invention,
the candidate-word cohesion computing unit is further configured to obtain the cohesion of a candidate word by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies.
In the device of the present invention,
the candidate-word cohesion computing unit is further configured to compute the cohesion of the candidate word by the following formula:
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
In the device of the present invention,
the candidate-word freedom computing unit is further configured to take the smaller of the information entropies of the left- and right-neighbor character sets of the candidate word as its degree of freedom.
In the device of the present invention,
the candidate-word freedom computing unit is further configured to compute the degree of freedom of the candidate word by the following formula:
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..K} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{i=1..M} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of elements in the left-neighbor character set of the candidate word.
The word-discovery method and device based on a large-scale corpus provided by the embodiments of the present invention compute the cohesion of each candidate word by combining the information entropy between its internal characters with the frequency of the candidate word relative to each of those characters; compute the degree of freedom of each candidate word as the smaller of the information entropies of its left- and right-neighbor character sets; and take the product of cohesion and degree of freedom as the word-formation score. When performing word discovery on a large-scale corpus, the candidate words whose word-formation score exceeds a preset threshold are extracted. This enables a computer system to effectively identify and discover words in a large-scale corpus.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the steps of an embodiment of a word-discovery method based on a large-scale corpus according to the present invention;
Fig. 2 is a structural block diagram of an embodiment of a word-discovery device based on a large-scale corpus according to the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 1, a flow chart of the steps of a word-discovery method based on a large-scale corpus according to an embodiment of the present invention is shown.
The method of the present embodiment comprises the following steps:
Step 100: compute, from the collected corpus, the cohesion of each candidate word and the degree of freedom of each candidate word. In the present embodiment, the cohesion of a candidate word may be obtained by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies; the degree of freedom of a candidate word may be obtained by computing the information entropies of its left- and right-neighbor character sets and taking the smaller of the two.
Step 200: multiply the cohesion of the candidate word by its degree of freedom to obtain a word-formation score.
Step 300: extract the candidate words whose word-formation score exceeds a preset threshold.
For the present embodiment, news corpus data from every domain of sohu.com for the first half of April 2016 were collected. In a large-scale corpus, if a text fragment that forms a word occurs in sufficiently varied contexts, then its cohesion and its degree of freedom will both be higher than those of fragments that do not form words. If the characters adjacent to a word on the left and right are regarded as random variables, the information entropies of the word's left- and right-neighbor character sets reflect the randomness of those neighbors: the smaller the entropy, the more stable the word's left or right neighbor set.
Multiplying the right entropy of each character in a text fragment by the left entropy of the character to its right, and taking the maximum of these products, reflects the weakest internal bond of the fragment: the smaller this maximum, the more stable the fragment. The frequency of the candidate word relative to each of its characters, in turn, reflects how tightly the candidate word binds each character it contains: the larger this ratio, the closer the relationship between the candidate word and its characters. The reciprocal of the former multiplied by the latter therefore reflects the internal cohesion of the fragment. Accordingly, in the present invention we take the product of the reciprocal of the maximum entropy product between the internal characters of the candidate word and the relative frequency of the candidate word as its internal cohesion T; the larger its value, the higher the internal cohesion of the candidate word and the more likely it is to form a word.
For a candidate word of length d, the internal cohesion is
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character. S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character. p_j denotes the frequency of the candidate word relative to its j-th character:
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
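The cohesion computation just described can be sketched as follows. This is a hypothetical sketch: the patent's formula images are not reproduced in this text, so taking the minimum over p_j and the maximum over the entropy products is an assumption drawn from the surrounding description, and `neighbor_entropy` and `cohesion` are illustrative helper names, not functions named by the patent.

```python
import math
from collections import Counter

def neighbor_entropy(corpus, ch, side):
    # Entropy of the characters adjacent to `ch` on the given side,
    # normalized by n_i, the number of occurrences of `ch` in the corpus
    # (as in the patent's definition, boundary occurrences contribute no
    # neighbor, so the weights need not sum to exactly 1).
    counts, n_i = Counter(), 0
    for pos, c in enumerate(corpus):
        if c != ch:
            continue
        n_i += 1
        j = pos + 1 if side == "right" else pos - 1
        if 0 <= j < len(corpus):
            counts[corpus[j]] += 1
    return -sum((k / n_i) * math.log(k / n_i) for k in counts.values())

def cohesion(corpus, word):
    # T = min_j p_j / max_i (S'_i * S''_{i+1}), with p_j = n / n_j
    # (the min/max aggregation is the reconstructed, assumed form).
    assert len(word) >= 2, "cohesion is defined for multi-character candidates"
    n = corpus.count(word)                       # occurrences of the candidate
    p = [n / corpus.count(c) for c in word]      # p_j = n / n_j
    pairs = [neighbor_entropy(corpus, word[i], "right")
             * neighbor_entropy(corpus, word[i + 1], "left")
             for i in range(len(word) - 1)]
    denom = max(pairs)
    # A zero denominator means some internal bond is perfectly rigid.
    return min(p) / denom if denom > 0 else float("inf")
```

In a corpus where every internal character also appears in other contexts, both entropies are positive and T is finite; fully rigid fragments come out as infinitely cohesive under this sketch.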
In the present embodiment, the degree of freedom of a candidate word reflects the richness of the contexts in which it occurs. By the definition of information entropy, a larger value indicates stronger randomness of the random variable. Here the information entropies of the left- and right-neighbor character sets of the candidate word reflect how freely it is used; to capture the degree of freedom of the candidate word as a whole, the smaller of the two entropies is taken as its degree of freedom H. The larger its value, the more freely the candidate word is used and the more likely it is to form a word. The degree of freedom is
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..K} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of elements in the right-neighbor character set of the candidate word. S″ is the left entropy of the candidate word:
S″ = −Σ_{i=1..M} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of elements in the left-neighbor character set of the candidate word.
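The degree of freedom H = min{S', S″} can be computed directly from the characters seen immediately left and right of the candidate word. A minimal sketch follows; `freedom` is an illustrative helper name, and the naive `str.find` scan is an assumption suitable only for a small corpus.

```python
import math
from collections import Counter

def freedom(corpus, word):
    # H = min{S', S''}: the smaller of the entropies of the characters
    # appearing immediately to the left and right of the candidate word.
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(word, start + 1)  # allow overlapping matches

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0  # the word only ever touches the corpus boundary
        return -sum(c / total * math.log(c / total) for c in counter.values())

    return min(entropy(left), entropy(right))
```

For example, in "的苹果和苹果树与苹果汁" the fragment "苹果" has three distinct neighbors on each side, so H = log 3, while "果" always follows "苹" and gets H = 0.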
In the present embodiment, the word-formation score of a candidate word is G; the larger the score, the more likely the candidate word is to form a word, where
G = T × H
Word discovery is realized by treating every substring of the text of length no more than d as a potential candidate word, computing a word-formation score for each, setting a threshold on the score, and finally taking out all candidate words above the threshold as the extracted words. In concrete operation, the entire corpus can be regarded as a single character string, and all suffixes of this string are sorted in lexicographic order so that identical candidate words are grouped together. One scan from beginning to end then computes the right-neighbor entropy of each internal character of each candidate word, the right-neighbor entropy corresponding to the candidate word itself, and the frequencies of the candidate word and of each of its internal characters. The entire corpus is then reversed, all suffixes are sorted again, and a second scan computes the left-neighbor entropy of each internal character and the left-neighbor entropy corresponding to the candidate word. The word-formation score of each candidate word is computed, the candidates are sorted in descending order of score, and the words above the threshold are extracted. This completes the word-discovery algorithm for a large-scale corpus.
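The two-pass suffix-sorting scheme described above can be sketched as follows. Materializing suffix slices with Python's `sorted` is an assumed simplification that is only practical for a toy corpus (a production system would use a proper suffix array), and the function names are illustrative; with hash maps the sorted order is not strictly required, but it mirrors the sequential-scan scheme in the text.

```python
from collections import Counter

def candidate_counts(corpus, max_len):
    # Sort all suffixes lexicographically so that identical candidate words
    # sit next to each other, then scan once to count each candidate and
    # the characters appearing immediately to its right.
    order = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    freq, right = Counter(), {}
    for i in order:                      # scan suffixes in sorted order
        for d in range(1, max_len + 1):
            if i + d > len(corpus):
                break
            w = corpus[i:i + d]
            freq[w] += 1
            if i + d < len(corpus):
                right.setdefault(w, Counter())[corpus[i + d]] += 1
    return freq, right

def both_sides(corpus, max_len=4):
    freq, right = candidate_counts(corpus, max_len)
    # Second pass over the reversed corpus: a right neighbor of a reversed
    # candidate is exactly a left neighbor of the original candidate.
    _, rev_right = candidate_counts(corpus[::-1], max_len)
    left = {w[::-1]: c for w, c in rev_right.items()}
    return freq, left, right
```

From the returned frequency and neighbor tables, the entropies, cohesion, and degree of freedom of every candidate can be computed and the candidates sorted by score.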
The method of the present embodiment enables a computer system, in the absence of a standard definition and segmentation of Chinese words, to effectively identify and extract words from a large-scale corpus.
Embodiment two
Referring to Fig. 2, a structural block diagram of a word-discovery device based on a large-scale corpus according to an embodiment of the present invention is shown.
The device of the present embodiment comprises a corpus collection unit and further comprises:
a candidate-word cohesion computing unit, for computing the cohesion of candidate words; in the present embodiment it may obtain the cohesion of a candidate word by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies;
a candidate-word freedom computing unit, for computing the degree of freedom of candidate words; in the present embodiment it may compute the information entropies of the left- and right-neighbor character sets of the candidate word and take the smaller of the two as the degree of freedom;
a word-formation score computing unit, for multiplying the cohesion of a candidate word by its degree of freedom to obtain a word-formation score;
a word-discovery unit, for extracting the candidate words whose word-formation score exceeds a preset threshold.
In the present embodiment, the candidate-word cohesion computing unit may compute the cohesion of a candidate word by the following formula:
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
In the present embodiment, the candidate-word freedom computing unit may compute the degree of freedom of the candidate word by the following formula:
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..K} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{i=1..M} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of elements in the left-neighbor character set of the candidate word.
The word-discovery device based on a large-scale corpus of the present embodiment is used to implement the corresponding word-discovery method of Embodiment One above and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, and naturally also by hardware. Based on this understanding, the above technical solution, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some of the technical features with equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A word-discovery method based on a large-scale corpus, characterized by comprising the steps of:
computing, from the collected corpus, the cohesion of each candidate word and the degree of freedom of each candidate word;
multiplying the cohesion of the candidate word by its degree of freedom to obtain a word-formation score;
extracting the candidate words whose word-formation score exceeds a preset threshold;
wherein the cohesion of the candidate word is obtained by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies;
wherein the cohesion of the candidate word is
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word and d denotes the length of the candidate word,
S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
2. The method according to claim 1, characterized in that:
the degree of freedom of the candidate word is the smaller of the information entropies of the left- and right-neighbor character sets of the candidate word.
3. The method according to claim 2, characterized in that:
the degree of freedom of the candidate word is
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..U} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and U denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{i=1..q} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and q denotes the number of elements in the left-neighbor character set of the candidate word.
4. A word-discovery device based on a large-scale corpus, comprising a corpus collection unit, characterized by further comprising:
a candidate-word cohesion computing unit, for computing the cohesion of candidate words;
a candidate-word freedom computing unit, for computing the degree of freedom of candidate words;
a word-formation score computing unit, for multiplying the cohesion of a candidate word by its degree of freedom to obtain a word-formation score;
a word-discovery unit, for extracting the candidate words whose word-formation score exceeds a preset threshold;
wherein the candidate-word cohesion computing unit is further configured to obtain the cohesion of a candidate word by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies;
wherein the candidate-word cohesion computing unit is further configured to compute the cohesion of the candidate word by the following formula:
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
5. The device according to claim 4, characterized in that:
the candidate-word freedom computing unit is further configured to take the smaller of the information entropies of the left- and right-neighbor character sets of the candidate word as its degree of freedom.
6. The device according to claim 5, characterized in that:
the candidate-word freedom computing unit is further configured to compute the degree of freedom of the candidate word by the following formula:
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..U} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and U denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{j=1..q} p_mj log p_mj
where m_j belongs to the left-neighbor character set of the candidate word, p_mj denotes the frequency with which m_j appears to the left of the candidate word, and q denotes the number of elements in the left-neighbor character set of the candidate word.
CN201610429967.0A 2016-06-16 2016-06-16 Word-discovery method and apparatus based on a large-scale corpus Active CN106126495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610429967.0A CN106126495B (en) 2016-06-16 2016-06-16 Word-discovery method and apparatus based on a large-scale corpus


Publications (2)

Publication Number Publication Date
CN106126495A CN106126495A (en) 2016-11-16
CN106126495B true CN106126495B (en) 2019-03-12

Family

ID=57469834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610429967.0A Active CN106126495B (en) 2016-06-16 2016-06-16 Word extraction method and apparatus based on a large-scale corpus

Country Status (1)

Country Link
CN (1) CN106126495B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845982B (en) * 2017-12-08 2021-08-20 昆明理工大学 Chinese word segmentation method based on word association characteristics
CN112182448A (en) * 2019-07-05 2021-01-05 百度在线网络技术(北京)有限公司 Page information processing method, device and equipment
CN110991173B (en) * 2019-11-29 2023-09-29 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN115034211B (en) * 2022-05-19 2023-04-18 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260362A (en) * 2015-10-30 2016-01-20 小米科技有限责任公司 New word extraction method and device
CN105488098A (en) * 2015-10-28 2016-04-13 北京理工大学 Field difference based new word extraction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Mobile Application User Feedback Management System; Lin Zhenbin (林贞斌); China Master's Theses Full-text Database, Information Science and Technology; 2015-03-15; page 10, paragraph 5 to page 14, paragraph 4

Similar Documents

Publication Publication Date Title
WO2019227710A1 (en) Network public opinion analysis method and apparatus, and computer-readable storage medium
CN110968684B (en) Information processing method, device, equipment and storage medium
CN110457672B (en) Keyword determination method and device, electronic equipment and storage medium
CN106126495B (en) Word extraction method and apparatus based on a large-scale corpus
KR20190038751A (en) User keyword extraction apparatus, method and computer readable storage medium
CN111460153B (en) Hot topic extraction method, device, terminal equipment and storage medium
US20140344230A1 (en) Methods and systems for node and link identification
CN105630767B Text similarity comparison method and device
CN105357586A (en) Video bullet screen filtering method and device
RU2016122051A (en) METHOD AND DEVICE FOR RECOGNIZING IMAGE OBJECT CATEGORY
CN103279478A Document feature extraction method based on distributed mutual information
CN109657058A Method for extracting announcement information
CN105787121B Microblog event summary extraction method based on multiple story lines
MY189086A (en) System and method for dynamic entity sentiment analysis
CN105550359B Webpage ranking method and device based on vertical search, and server
CN110188359B (en) Text entity extraction method
Alghamdi et al. Topic detections in Arabic dark websites using improved vector space model
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN110674365A (en) Searching method, device, equipment and storage medium
US9830533B2 (en) Analyzing and exploring images posted on social media
CN105512300B Information filtering method and system
CN107688621B (en) Method and system for optimizing file
CN108536676A (en) Data processing method, device, electronic equipment and storage medium
CN108763192A Entity relation extraction method and device for text processing
CN108460016A Entity name analysis and recognition method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant