CN105488098B - A new word extraction method based on field difference - Google Patents
- Publication number
- CN105488098B (application number CN201510711219.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- field
- candidate
- difference
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a new word extraction method based on field difference and belongs to the technical field of natural language processing applications. The method first compares how characters are distributed across different fields to obtain difference character seeds, then expands the seeds by n-grams to build a candidate word set, next removes repeated words from the candidate word set according to the size of their field difference, and finally scores each word in the candidate word set by its field difference value (DV), cohesion degree (CV), and word formation rate (NWP), rejecting the candidates with low field difference to obtain the new words. Compared with the prior art, the invention selects seed characters using the difference information between corpora from different fields and expands them by n-grams into a candidate word set; it then automatically selects the new words among the candidates using information about each word itself and the difference between fields, which significantly improves both the number and the accuracy of the new words found.
Description
Technical field
The present invention relates to a new word extraction method, in particular to a new word extraction method based on field difference, and belongs to the technical field of natural language processing applications.
Background art
Network new words are special expressions that emerge and spread along with the Internet. They typically originate from popular terms in film, television, and online media, or are widely accepted words produced by some social phenomenon. They occur frequently in Internet text such as forum posts and microblogs. Statistics show that more than 1000 new words enter people's daily life in China every year. According to related research, more than 60% of word segmentation errors come from network new words, so the accuracy of new word recognition directly affects the performance of intelligent information processing systems. For example, in the text sentiment analysis task of intelligent information processing, fixed phrase collocations can carry sentiment polarity; if a new-word phrase is not recognized correctly, the judged polarity is distorted. Consider the product comment "expression very tall and big on", written by an Internet user. Here "on tall and big" should be treated as a single network new word that expresses the positive sentiment "high-end, impressive, and classy". In almost all current application systems, however, the tagged sequence produced by word segmentation is "expression/v very/adv tall/adj big/adj on/adv": the network new word is cut into single characters, and the wrong segmentation loses the positive sentiment, which seriously affects subsequent intelligent analysis. Effective recognition of new words is therefore very important in natural language processing.
At present, new word extraction methods fall into two classes: rule-based and statistics-based. Rule-based methods focus on the word-formation principles of new words. Using these principles as the theoretical basis, they build a general corpus that helps identify new words, then study the linguistic characteristics of words and construct a dedicated word-formation rule base from the natural properties of words. Rule-based methods recognize new words with high accuracy but demand strong linguistic expertise and background knowledge of the field. Statistics-based methods recognize new words in two main ways. The first treats new word extraction as an integral part of word segmentation and uses a statistical model to infer the most likely segmentation points, which yields the new words. Classical statistical models include conditional random fields (CRF) and gradient descent training models based on feature frequency information. The second treats new word extraction as a separate task, which usually requires part-of-speech (POS) tagging as preprocessing. Because network new words appear in real time, circulate widely, and change dynamically, purely rule-based methods are often ineffective, while purely statistical methods suffer from sparse training data and difficulty in extracting effective features. Most researchers currently combine rules and statistics to exploit the advantages of both. However, these methods all ignore an informational property of the corpus itself: the information (connotation) carried by the same word differs between field topics, which shows up as different character and word distributions for the same word under different field topics.
Summary of the invention
Aimed at the new words that are continually created and used on the Internet, the present invention proposes a new word extraction method based on field difference. The method makes full use of the characteristics of corpora from different fields and, under the existing general evaluation system, effectively improves the accuracy of new word recognition.
The idea of the invention is to obtain difference character seeds by comparing how characters are distributed across different fields, to expand the seeds by n-grams into a candidate word set, and then to score each word in the candidate word set by its field difference value, cohesion degree, and word formation rate in order to extract the new words.
The definitions used in the present invention are as follows:
Definition 1: A field difference character is a single character that reflects field difference. Such a character reflects the features of a field, and its frequency differs greatly between corpora from different fields. For example, if the ratio of the frequency $f_{internet}(c)$ of character c in a network corpus to its frequency $f_{news}(c)$ in a news corpus exceeds a threshold $\lambda$, then c is called a field difference character. The linguistic phenomenon of a single character forming a word can also express this difference, and the invention likewise identifies it through the character distribution.
Definition 2: Repeated words. If words $W_A$ and $W_B$ satisfy the condition $W_B \subset W_A$ (that is, $W_B$ is contained in $W_A$), then $W_B$ and $W_A$ are called repeated words of each other. For example: "happiness is big general to run quickly" ($W_A$) and "general greatly to run quickly" ($W_B$).
Definition 3: The field difference value DV (Difference Value) measures field difference. It is computed from the frequency $f_{internet}(W)$ of word W in the network corpus and its frequency $f_{news}(W)$ in the news corpus.
Definition 4: The cohesion degree CV (Concrete Value) is a quantitative index of whether a word is segmented correctly. For example, "cinema" can be composed as "film" + "institute" or as "electricity" + "movie theatre". For any word $W = c_1c_2$ (where $c_1$ and $c_2$ are the characters or words that make up W), all possible ways of composing W are enumerated, the corresponding weight of each is computed, and the minimum is taken as the cohesion degree of the word.
Definition 5: The word formation rate NWP (New Word Probability) is an index for judging whether a character sequence forms a word. For example, "like to say" and "love to eat" are made of single characters, but their NWP is very low, which indicates that neither constitutes a word.
The purpose of the invention is achieved through the following steps.
A new word extraction method based on field difference comprises the following steps:
Step 1: Compare the input corpus $S_1$ of the field from which new words are to be obtained with a corpus $S_2$ from another field to obtain field difference character seeds.
Preferably, the field difference character seeds are obtained as follows:
(1) Count the frequencies $f_{s1}(c)$ and $f_{s2}(c)$ with which each character c occurs in $S_1$ and $S_2$ respectively;
(2) Compute the difference value of each character between $S_1$ and $S_2$ by the following formula:
$D_{word\_seg}(c) = f_{s1}(c) / (1 + f_{s2}(c))$
(3) Set a threshold $\lambda$; if the difference value $D_{word\_seg}(c)$ of character c exceeds $\lambda$, c is taken as a difference character seed.
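Sub-steps (1) to (3) can be illustrated with a minimal Python sketch. It assumes that each corpus is given as a list of strings, reads the difference value as $f_{s1}(c) / (1 + f_{s2}(c))$, and uses $\lambda = 2$ as a default; the function names are hypothetical and are not part of the patent.

```python
from collections import Counter

def char_frequencies(corpus):
    """Count how often each non-space character occurs in a list of strings."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ch for ch in sentence if not ch.isspace())
    return counts

def difference_seeds(target_corpus, other_corpus, lam=2.0):
    """Return the characters whose difference value f_s1(c) / (1 + f_s2(c)) exceeds lam.

    Assumption: the comparison is strict (">"), following the wording of Step 1.
    """
    f1 = char_frequencies(target_corpus)
    f2 = char_frequencies(other_corpus)
    # A Counter returns 0 for characters that never occur in the other corpus.
    return {c for c, n in f1.items() if n / (1 + f2[c]) > lam}
```

The 1 in the denominator keeps the ratio defined for characters that never occur in the second corpus, which is the typical situation for field-specific characters.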
Step 2: Expand the field difference character seeds to construct the candidate word set $Set_{Candidate}$.
Preferably, the expansion uses n-grams, as follows:
(1) In corpus $S_1$, take n = 2, 3, 4, 5 in turn and obtain all corresponding n-grams. Keep every n-gram that contains any difference character, count its frequency, and add it to the candidate word set $Set_{Candidate}$;
(2) Compare the frequency f(W) of every candidate word W in $Set_{Candidate}$ with a preset threshold; if f(W) is below the threshold, delete W from $Set_{Candidate}$;
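The n-gram expansion can be sketched as follows. This is a minimal illustration under stated assumptions: the corpus is a list of strings, n-grams are character n-grams within each string, and a candidate is kept when its frequency is at least `min_freq` (the source does not say how a frequency equal to the threshold is treated). The function name and parameters are hypothetical.

```python
from collections import Counter

def ngram_candidates(corpus, seeds, min_freq, sizes=(2, 3, 4, 5)):
    """Collect character n-grams that contain at least one seed character.

    Returns a dict mapping each retained n-gram to its frequency in the corpus.
    """
    counts = Counter()
    for sentence in corpus:
        for n in sizes:
            # range() is empty when the sentence is shorter than n.
            for i in range(len(sentence) - n + 1):
                gram = sentence[i:i + n]
                if any(ch in seeds for ch in gram):
                    counts[gram] += 1
    # Assumption: frequencies equal to the threshold are kept.
    return {gram: f for gram, f in counts.items() if f >= min_freq}
```

Filtering by seed characters before counting keeps the candidate set much smaller than the full set of n-grams in the corpus.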
Step 3: Remove the repeated words from the candidate word set $Set_{Candidate}$ according to the size of the candidates' field difference.
Preferably, the field difference of a candidate word W is computed by the following formula:
$DV(W) = \log\frac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
where $f_{s1}(W)$ is the frequency of W in corpus $S_1$ and $f_{s2}(W)$ is the frequency of W in corpus $S_2$.
Further, to obtain a better de-duplication effect, the field difference of repeated words can combine the cohesion degree with the field difference value. That is, all repeated words in $Set_{Candidate}$ are found according to Definition 2 and compared; in each pair the word with the larger weight is kept and the word with the smaller weight is discarded. The process is repeated until $Set_{Candidate}$ no longer contains repeated words. The detailed procedure is:
(1) According to Definition 2, take n = 2, 3, 4, 5 and compare all words in $Set_{Candidate}$ to find all repeated words, where n is the number of characters in a word of $Set_{Candidate}$;
(2) According to Definitions 3 and 4, compute the cohesion degree CV(W) and the field difference value DV(W) of each repeated word. The formulas are:
Cohesion degree:
$CV(W) = \min_{W = c_1c_2} \frac{f(W)}{f(c_1)\,f(c_2)}$
Field difference value:
$DV(W) = \log\frac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
Further, compare each pair of repeated words $w_1$ and $w_2$ by the weight V given below and keep the word with the larger weight:
$V(W) = \alpha^n \cdot DV(W) + CV(W)$
where $\alpha$ is a parameter that measures the difference permitted between n-grams of different lengths, n is the number of characters in word W, $c_i$ is the i-th character or word in W, f(W) is the frequency of W in the text corpus, and $w_1$ and $w_2$ are two words that are repeated words of each other.
(3) Repeat steps (1) and (2) until the candidate word set no longer contains repeated words.
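The de-duplication procedure can be sketched in Python as below. The sketch rests on several assumptions that are not stated in this form in the source: the logarithm is base 10 (which matches the worked values in the embodiment), the cohesion degree is the minimum of f(W)/(f(c1)·f(c2)) over all two-part splits, two words are repeated words when one contains the other, and `f1` holds the $S_1$ frequency of every candidate and of every part produced by splitting it. All function names are hypothetical.

```python
import math

def dv(word, f1, f2):
    """Field difference value: log10((1 + f_s1(W)) / (1 + f_s2(W)))."""
    return math.log10((1 + f1.get(word, 0)) / (1 + f2.get(word, 0)))

def cv(word, f1):
    """Cohesion degree: minimum over all two-part splits of f(W) / (f(c1) * f(c2)).

    Requires len(word) >= 2 and a non-zero frequency for every part.
    """
    return min(f1[word] / (f1[word[:i]] * f1[word[i:]]) for i in range(1, len(word)))

def remove_repeated(candidates, f1, f2, alpha=1.1):
    """Drop the lower-weight word of each pair in which one word contains the other."""
    def weight(w):
        return alpha ** len(w) * dv(w, f1, f2) + cv(w, f1)

    kept = set(candidates)
    changed = True
    while changed:  # repeat until no repeated words remain
        changed = False
        for a in sorted(kept):
            for b in sorted(kept):
                if a != b and a in kept and b in kept and b in a:
                    kept.discard(a if weight(a) < weight(b) else b)
                    changed = True
    return kept
```

The factor $\alpha^n$ raises the weight of longer words, so a longer candidate is not discarded merely because its raw frequency is lower than that of the shorter word it contains.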
Step 4: Remove the candidate words with low field difference from $Set_{Candidate}$; add the candidate words whose field difference is higher than a preset threshold $\gamma$ to the new word set Y and output it to obtain all new words.
Preferably, the field difference of a candidate word W is computed by the following formula:
$DV(W) = \log\frac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
where $f_{s1}(W)$ is the frequency of W in corpus $S_1$ and $f_{s2}(W)$ is the frequency of W in corpus $S_2$.
Further, the field difference can be characterized as follows: for each candidate word in $Set_{Candidate}$, compute its field difference value (DV), word formation rate (NWP), and cohesion degree (CV) according to Definitions 3, 4, and 5, and combine them in a certain proportion. Specifically:
(1) Compute the difference value DV(W) of candidate word W according to the following formula:
$DV(W) = \log\frac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
(2) Compute the word formation rate NWP(W) of candidate word W according to the following formula:
where $f(c_i)$ is the frequency of the single character $c_i$ in W, and $Single(c_i)$ is the frequency of $c_i$ after a word segmentation tool has been applied;
(3) Compute the cohesion degree CV(W) of candidate word W according to the following formula:
$CV(W) = \min_{W = c_1c_2} \frac{f(W)}{f(c_1)\,f(c_2)}$
(4) Normalize the difference value (DV), the word formation rate (NWP), and the cohesion degree (CV) separately, using the following formula:
$X_j' = \frac{X_j - X_{min}}{X_{max} - X_{min}}$
where $X_j$ is the current value of the j-th word (difference value, word formation rate, or cohesion degree), $X_{min}$ is the minimum of that value over all words, and $X_{max}$ is the maximum of that value over all words;
(5) Compute the weight V of candidate word W according to the following formula:
$V(W) = a \cdot DV(W) + b \cdot CV(W) + c \cdot NWP(W)$
where a, b, and c are the proportions of the weight V contributed by the difference value, the cohesion degree, and the word formation rate respectively.
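Sub-steps (4) and (5), together with the final thresholding, can be sketched as follows. The sketch assumes min-max normalization, takes the DV, CV, and NWP values as precomputed dictionaries keyed by word, and uses the parameter values of the embodiment (a = 0.6, b = 0.4, c = -0.2, γ = 0.4) as defaults. Keeping words whose weight equals γ is an assumption, and the function names are hypothetical.

```python
def min_max(values):
    """Min-max normalize a dict of word -> value into the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # all values identical: avoid division by zero
        return {w: 0.0 for w in values}
    return {w: (x - lo) / (hi - lo) for w, x in values.items()}

def combined_weight(dv, cv, nwp, a=0.6, b=0.4, c=-0.2):
    """V(W) = a*DV(W) + b*CV(W) + c*NWP(W), computed on normalized values."""
    dv_n, cv_n, nwp_n = min_max(dv), min_max(cv), min_max(nwp)
    return {w: a * dv_n[w] + b * cv_n[w] + c * nwp_n[w] for w in dv}

def select_new_words(weights, gamma=0.4):
    """Keep the candidates whose weight is not below the threshold gamma."""
    return {w for w, v in weights.items() if v >= gamma}
```

Normalizing first puts the three measures on a common scale, so the coefficients a, b, and c express their relative influence directly.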
Beneficial effects
Compared with the prior art, the invention selects seed characters using the difference information between corpora from different fields and expands them by n-grams into a candidate word set; it then automatically selects the new words among the candidates using information about each word itself and the difference between fields, which significantly improves the number and accuracy of the new words found.
Brief description of the drawings
Fig. 1 is a flow diagram of the new word extraction method based on field difference according to an embodiment of the invention;
Fig. 2 is a schematic diagram comparing the method of the invention with four existing new word extraction methods in terms of the number of new words recognized and accuracy.
Detailed description of the embodiments
The method of the invention is described in further detail below with reference to the drawings and an embodiment.
Embodiment
This embodiment describes the method in detail using a network corpus as $S_1$ and a news corpus as $S_2$.
The network corpus is a post from an online forum, as shown in Table 1:
Table 1:
The news corpus is a news item from April 4, 2001, as shown in Table 2:
Table 2:
A new word extraction method based on field difference, whose process flow is shown in Fig. 1, comprises the following steps:
Step 1: Obtain the field difference character seeds.
A field difference character is a character that occurs significantly more often in one corpus than in other corpora. There are many ways to obtain such characters; this embodiment simply decides whether a character is a field difference character seed by whether the difference between its frequencies in the two corpora is higher than a preset threshold. Specifically:
Count the frequency of each character in the network corpus and its frequency in the news corpus; then compute the difference value of the two; finally set the threshold $\lambda$ to 2 and take the characters whose difference value is greater than or equal to $\lambda$ as difference characters. The resulting difference character set is shown in Table 3:
Table 3:
Step 2: Expand the difference character seeds to obtain the candidate word set.
There are many ways to expand difference characters into candidate words, for example with a dictionary or with n-grams. This embodiment uses n-grams: in the network corpus, take n = 2, 3, 4, and 5 in turn and obtain all n-gram strings; keep the n-grams that contain any difference character, and delete meaningless strings. For example, the following n-grams can be extracted from "good beautiful mew star people":
2-gram: {"good drift", "beautiful", "bright", "mew", "mew star", "star people"},
3-gram: {"good beautiful", "beautiful", "bright mew", "mew star", "mew star people"},
4-gram: {"good beautiful", "beautiful mew", "bright mew star", "mew star people"},
5-gram: {"good drift bright mew", "beautiful mew star", "bright mew star people"}.
Then count the frequency of each of these n-grams and set a threshold. A word W whose frequency f(W) exceeds the threshold and which contains any of the above difference characters is selected as a candidate word. The resulting candidate word set is shown in Table 4:
Table 4:
Step 3: Remove repeated words.
First, find all repeated words in the candidate word set $Set_{Candidate}$ according to Definition 2. Taking "mew star people" as an example, the repeated-word pairs found are: {mew star, mew star people}, {star people, mew star people}, {mew star people, mew star people}, {mew star people, the mew star people of love};
Second, for each pair of repeated words, keep the candidate with the larger field difference. The field difference could simply be characterized by the frequencies of the candidate in the two corpora. To overcome the effect that a plain frequency difference has when the corpora differ, this embodiment characterizes it by the logarithm of the ratio of the two, as shown below:
$DV(W) = \log\frac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
Further, experimental results show that a better de-duplication effect is obtained if the field difference considers not only the field difference value DV above but also the cohesion degree CV. That is, the field difference is obtained as the combined weight of the two, as shown below:
$V(W) = \alpha^n \cdot DV(W) + CV(W)$
Therefore, according to Definitions 3 and 4, compute the cohesion degree and difference value of each of the above words. Take removing the repeated pair {mew star people, the mew star people of love} as an example. The frequency of "mew star people" is 6, the frequency of "the mew star people of love" is 3, and both have frequency 0 in the news corpus. Then:
DV(mew star people) = log((6+1)/(0+1)) = 0.845
DV(the mew star people of love) = log((3+1)/(0+1)) = 0.602
"mew star people" can be composed in two ways, "mew" + "star people" and "mew star" + "people", with the cohesion values:
CV("mew" + "star people") = 6/(8*6) = 0.125
CV("mew star" + "people") = 6/(6*7) = 0.143
The smaller value is taken as the cohesion degree of the word "mew star people":
CV(mew star people) = 0.125
Similarly, the five-character word "the mew star people of love" can be composed in four ways, by splitting it after the first, second, third, or fourth character.
The cohesion values are, in that order:
CV("love" + "mew star people") = 3/(4*4) = 0.1875
CV("love" + "mew star people") = 3/(3*6) = 0.167
CV("mew of love" + "star people") = 3/(3*6) = 0.167
CV("the mew star of love" + "people") = 3/(3*7) = 0.143
The smallest value is taken as the cohesion degree of the word "the mew star people of love":
CV(the mew star people of love) = 0.143
Taking the parameter $\alpha$ = 1.1:
V(mew star people) = 0.845 * 1.1^3 + 0.125 = 1.249
V(the mew star people of love) = 0.602 * 1.1^5 + 0.143 = 1.113
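The arithmetic of this worked example can be reproduced with a few lines of Python. The sketch assumes a base-10 logarithm, which matches the values 0.845 and 0.602 above; the frequencies are the ones quoted in the text, and the variable and function names are hypothetical.

```python
import math

ALPHA = 1.1  # value of the parameter alpha used in this embodiment

def weight(f_s1, f_s2, n_chars, cv):
    """V(W) = alpha^n * DV(W) + CV(W), with DV taken as a base-10 logarithm."""
    dv = math.log10((1 + f_s1) / (1 + f_s2))
    return ALPHA ** n_chars * dv + cv

# Cohesion degree: minimum over all two-part splits of f(W) / (f(c1) * f(c2)).
cv_short = min(6 / (8 * 6), 6 / (6 * 7))
cv_long = min(3 / (4 * 4), 3 / (3 * 6), 3 / (3 * 6), 3 / (3 * 7))

v_short = weight(6, 0, 3, cv_short)  # "mew star people", 3 characters
v_long = weight(3, 0, 5, cv_long)    # "the mew star people of love", 5 characters
kept = "mew star people" if v_short > v_long else "the mew star people of love"
print(round(v_short, 3), round(v_long, 3), kept)
```

Because the sketch carries full floating-point precision through each step, its printed values can differ in the last digit from the figures in the text, which are computed from intermediate values already cut to three decimals.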
Therefore, when this pair of candidates is de-duplicated, "mew star people" is kept and "the mew star people of love" is deleted. Step 3 is executed for all repeated words in $Set_{Candidate}$ until no repeated words remain. The final candidate words are shown in Table 5:
Table 5:
Step 4: Screen the candidate words by field difference to obtain and output the new word set.
As in Step 3, the field difference can be characterized by the logarithm of the ratio of a candidate's frequencies in the two corpora. Experiments show, however, that a better result is obtained if the field difference combines the field difference value DV, the word formation rate NWP, and the cohesion degree CV in a certain proportion, as shown below:
$V(W) = a \cdot DV(W) + b \cdot CV(W) + c \cdot NWP(W)$
For each candidate word in $Set_{Candidate}$, compute its field difference value, word formation rate, and cohesion degree according to Definitions 3, 4, and 5.
Again taking the word "mew star people" as an example:
Difference value: DV(mew star people) = log((6+1)/(0+1)) = 0.845
Cohesion degree: CV(mew star people) = 6/(8*6) = 0.125 (the minimum, obtained from "mew" + "star people")
Word formation rate: after segmenting the above corpus with the ICTCLAS word segmentation tool, this embodiment obtains single(mew) = 8, single(star) = 6, and single(people) = 7; in addition f(mew) = 8, f(star) = 6, f(people) = 7, and f(mew star people) = 6, from which NWP(mew star people) is computed.
Further, to obtain a better extraction result, the three values are normalized before they are combined into the field difference weight.
The maximum and minimum of the three values over the 7 words shown in Table 5 are:
DVmax = 0.903; DVmin = 0.176;
CVmax = 0.25; CVmin = 0.071;
NWPmax = 1; NWPmin = 0;
After normalization, the three values of "mew star people" are:
DV = 0.920, CV = 0.302, NWP = 0
Taking a = 0.6, b = 0.4, c = -0.2:
V(mew star people) = 0.6*0.920 + 0.4*0.302 - 0.2*0 = 0.6728
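The normalized values and the final weight can be checked with the short Python sketch below. It assumes min-max normalization, rounds the normalized values to three decimals as the text does, and takes the normalized NWP of "mew star people" as 0, the value used in the weight calculation above. The names are hypothetical.

```python
def min_max(x, lo, hi):
    """Min-max normalization of a single value."""
    return (x - lo) / (hi - lo)

# Raw values of "mew star people" and the extremes over the 7 words in Table 5.
dv_n = round(min_max(0.845, 0.176, 0.903), 3)
cv_n = round(min_max(0.125, 0.071, 0.25), 3)
nwp_n = 0.0  # normalized NWP, as used in the weight calculation in the text

a, b, c = 0.6, 0.4, -0.2
v = a * dv_n + b * cv_n + c * nwp_n
print(dv_n, cv_n, round(v, 4))
```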
The field difference of every word in Table 5 obtained in this way is shown in Table 6:
Table 6:
Taking the threshold $\gamma$ = 0.4 and filtering out all words whose field difference is below $\gamma$ gives the new word set {building-owner, mew star people, tinkling of pieces of jade body}.
Experimental results:
To verify the effectiveness of the new word extraction method based on field difference of this embodiment, the experiment uses as the network corpus 10,237,813 Sina Weibo microblogs posted over the three days of June 6 to 8, together with 3,524,584 posts from the Baidu forum "the big Supreme Being of Li Yi", and uses as the news corpus 9,517,292 sentences of news published by Xinhua News Agency from 1993 to 2004. The existing new word extraction methods CV, NWP, EMI, and PNWD are compared with the DV and DV+CV+NWP methods proposed by the invention in terms of the number of new words recognized and accuracy; the comparison is shown in Fig. 2.
CV and NWP are statistical new word extraction methods well known to those skilled in the art and are not described again here.
EMI: the Enhanced Mutual Information algorithm proposed by Zhang et al. in 2009, with the formula:
where the word $W = w_1w_2 \ldots w_n$, $w_i$ is each character that makes up the word, and n is the number of characters that make up the word. F is the number of occurrences of word W, and $F_i$ is the number of occurrences of character $w_i$. The idea of the algorithm is to measure how strongly the word depends on each of its characters; the larger the value, the more likely the string is a word.
PNWD: the pattern-based new word detection (Pattern New Word Detection) algorithm proposed by Huang et al. in 2014. Its core idea is to use POS tagging information and a seed vocabulary to automatically select matching phrase patterns such as <ad, *, au>, and then to extract newly appearing vocabulary automatically with these patterns.
In Fig. 2, the x-axis is the top k words and the y-axis is the average accuracy AP(k) of the top k words. The figure shows that DV and DV+CV+NWP perform better than the baselines EMI, CV, and NWP. DV and DV+CV+NWP also perform better than the baseline PNWD, whereas CV and NWP are slightly less accurate than PNWD when the result set is small and improve noticeably as the result set grows. This is because PNWD can only discover adjective new words and ignores new words of other parts of speech; once the adjective new words have been identified, its recognition rate for new words of other parts of speech declines. DV performs very well mainly because it makes full use of the difference between fields, and new words are good at reflecting this field difference. CV and NWP have slightly lower recognition accuracy mainly because they judge 2-gram words poorly: a 2-gram word can be split into 2 single characters, and single characters occur with high probability, so both values are extremely low for such a word and it is hard to identify. Since a large share of new words are 2-gram words, neither method works well. DV+CV+NWP combines the advantages of the DV, CV, and NWP methods and obtains the best result. Compared with conventional methods, the new word extraction method based on field difference proposed by the invention therefore achieves higher accuracy and discovers more new words.
The basic principles, main features, and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the above embodiment; the embodiment and the description only illustrate the principle of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention, which is defined by the appended claims and their equivalents.
Claims (8)
1. a kind of new words extraction method based on field otherness, which comprises the following steps:
Certain field of neologisms to be obtained is inputted corpus S by step 11With other field corpus S2Compare acquisition field difference word
Seed;The field difference word, refers to the individual character of embodiment field otherness, and even individual character c occurs in certain class field corpus
Frequency finternet(c) in another kind of field corpus frequency of occurrences fnewsIt the ratio between (c) is more than threshold value λ, then c is referred to as field difference
Word;
Step 2 expands field difference word seed by n-gram mode, constructs candidate word set SetCandidate, detailed process is such as
Under:
(1) in corpus S1In, take n=2 respectively, 3,4,5, its corresponding all n-gram word is obtained, to these n-gram words,
If including any field difference word, retain, and count these n-gram word frequencies of occurrences, candidate word set is added
SetCandidate;
(2) to candidate word set SetCandidateIn all candidate word W, with preset thresholdCompare, if its word frequencyIn candidate word set SetCandidateIn leave out W;Step 3 is removed according to the field difference size of candidate word
Candidate word set SetCandidateIn repetitor;
The field difference of the candidate word W is calculated by the following formula:
DV (W)=log (1+fs1(W)/(1+fs2(W)))
Wherein fs1(W) indicate word W in corpus S1The frequency of middle appearance, fs2(W) indicate word W in corpus S2The frequency of middle appearance;
Step 4, removal SetCandidateThe middle lower candidate word of field difference, the candidate word that will be above preset threshold γ are added newly
Set of words Y and export obtain all neologisms.
2. a kind of new words extraction method based on field otherness according to claim 1, which is characterized in that the field
Difference word seed is obtained by following procedure:
(1) S is counted respectively1And S2In each word " c " occur frequency fs1(c) and fs2(c);
(2) each word is calculated in S by following formula1And S2In difference value:
Dword_seg(c)=fs1(c)/fs2(c)
(3) given threshold λ, if the difference value D of word " c "word_seg(c) it is more than discrepancy threshold λ, word " c " is used as difference word kind
Son.
3. a kind of new words extraction method based on field otherness according to claim 2, which is characterized in that λ=2.
4. a kind of new words extraction method based on field otherness according to claim 1 to 3, which is characterized in that institute
It states and candidate word set Set is removed according to field difference sizeCandidateIn repetitor pass through following steps carry out:
(1) n=2,3,4 or 5 are taken, to SetCandidateIn all words be compared, find out all repetitors, n is indicated
SetCandidateThe number for the word for including in the word of set;
(2) repetitor found is comprehensively considered and coagulates right CV and field difference value DV and is calculate by the following formula its weight V, and
Retain the biggish word of weight, the removal lesser word of weight to achieve the purpose that duplicate removal:
V (W)=αn*DV(W)+CV(W);
DV (W)=log (1+fs1(W)/(1+fs2(W)));
Wherein, α is parameter, indicates the measurement of permitted difference between different n-gram, ciIndicate i-th of word or word in word W,
And W=c1c2;Wherein, f (W) indicates the frequency that word W occurs in corpus of text;
(3) repeat step (1), (2), until no longer containing repetitor in candidate word set.
5. a kind of new words extraction method based on field otherness according to claim 4, which is characterized in that α=1.1.
6. The new word extraction method based on field otherness according to any one of claims 1 to 3, characterized in that the "field difference" used in removing the candidate words with lower field difference from $Set_{Candidate}$ is the value obtained by combining the field difference value DV, the word formation rate NWP and the cohesion CV in given proportions, namely the weight V, which is obtained through the following procedure:
(1) computing the difference value DV(W) of candidate word W according to the following formula:
$$DV(W) = \log\left(1 + \frac{f_{s1}(W)}{1 + f_{s2}(W)}\right)$$
(2) computing the word formation rate NWP(W) of candidate word W according to the following formula:
where $f(c_i)$ denotes the frequency of occurrence of $c_i$; $Single(c_i)$ denotes the frequency of occurrence of $c_i$ after a word segmentation tool has been applied; i indexes the characters or words that make up W; and n denotes the number of characters or words that make up W;
(3) computing the cohesion CV(W) of candidate word W according to the following formula:
(4) normalizing the difference value (DV), the word formation rate (NWP) and the cohesion (CV) separately, using the following normalization formula:
where $X_j$ is the current value for the j-th word, this value being the difference value, the word formation rate or the cohesion; $X_{min}$ denotes the minimum of that value over all words; and $X_{max}$ denotes the maximum of that value over all words;
(5) computing the weight V of candidate word W according to the following formula:
$$V(W) = a \cdot DV(W) + b \cdot CV(W) + c \cdot NWP(W)$$
where a, b and c denote the proportions of the weight V contributed by the difference value, the cohesion and the word formation rate, respectively.
7. The new word extraction method based on field otherness according to claim 6, characterized in that a = 0.6, b = 0.4 and c = -0.2.
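The weighting of claims 6 and 7 can be sketched as below. This is a hedged illustration rather than the patented implementation: the normalization formula itself is not reproduced in the text above, so min-max scaling is assumed from the definitions of X_j, X_min and X_max; the function names are hypothetical; and DV, CV and NWP are taken as precomputed inputs because the NWP and CV formulas are likewise not reproduced.

```python
from typing import Dict

# Coefficients stated in claim 7.
A, B, C = 0.6, 0.4, -0.2


def min_max(values: Dict[str, float]) -> Dict[str, float]:
    """Assumed normalization: (X_j - X_min) / (X_max - X_min) over all words."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All words share one value; map them to 0.0 to avoid division by zero.
        return {w: 0.0 for w in values}
    return {w: (x - lo) / (hi - lo) for w, x in values.items()}


def combined_weights(
    dv: Dict[str, float],
    cv: Dict[str, float],
    nwp: Dict[str, float],
    a: float = A,
    b: float = B,
    c: float = C,
) -> Dict[str, float]:
    """V(W) = a*DV(W) + b*CV(W) + c*NWP(W), computed on normalized scores."""
    dv_n, cv_n, nwp_n = min_max(dv), min_max(cv), min_max(nwp)
    return {w: a * dv_n[w] + b * cv_n[w] + c * nwp_n[w] for w in dv}


if __name__ == "__main__":
    # Made-up scores for three placeholder candidates, for illustration only.
    dv = {"x": 0.0, "y": 2.0, "z": 4.0}
    cv = {"x": 1.0, "y": 1.0, "z": 3.0}
    nwp = {"x": 0.5, "y": 0.0, "z": 1.0}
    for word, weight in sorted(combined_weights(dv, cv, nwp).items()):
        print(word, round(weight, 3))
```

Because c is negative in claim 7, a higher NWP lowers the weight V of a candidate under these coefficients.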
8. The new word extraction method based on field otherness according to any one of claims 1 to 3, 5 or 7, characterized in that γ = 0.4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510711219.7A CN105488098B (en) | 2015-10-28 | 2015-10-28 | A kind of new words extraction method based on field otherness |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105488098A CN105488098A (en) | 2016-04-13 |
CN105488098B true CN105488098B (en) | 2019-02-05 |
Family
ID=55675073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510711219.7A Active CN105488098B (en) | 2015-10-28 | 2015-10-28 | A kind of new words extraction method based on field otherness |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105488098B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126495B (en) * | 2016-06-16 | 2019-03-12 | 北京捷通华声科技股份有限公司 | One kind being based on large-scale corpus prompter method and apparatus |
CN108845982B (en) * | 2017-12-08 | 2021-08-20 | 昆明理工大学 | Chinese word segmentation method based on word association characteristics |
CN110634145B (en) * | 2018-06-22 | 2022-04-12 | 日日顺供应链科技股份有限公司 | Warehouse checking method based on image processing |
CN110472140B (en) * | 2019-07-17 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Object word recommendation method and device and electronic equipment |
CN112668331A (en) * | 2021-03-18 | 2021-04-16 | 北京沃丰时代数据科技有限公司 | Special word mining method and device, electronic equipment and storage medium |
CN113051912B (en) * | 2021-04-08 | 2023-01-20 | 云南电网有限责任公司电力科学研究院 | Domain word recognition method and device based on word forming rate |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1340804A (en) * | 2000-08-30 | 2002-03-20 | 国际商业机器公司 | Automatic new term fetch method and system |
CN101119334A (en) * | 2007-09-21 | 2008-02-06 | 腾讯科技(深圳)有限公司 | Method, system and equipment for obtaining neology |
CN102708147A (en) * | 2012-03-26 | 2012-10-03 | 北京新发智信科技有限责任公司 | Recognition method for new words of scientific and technical terminology |
CN103294664A (en) * | 2013-07-04 | 2013-09-11 | 清华大学 | Method and system for discovering new words in open fields |
Non-Patent Citations (6)
Title |
---|
A Method for Automatic POS Guessing of Chinese Unknown Words; Qiu L et al.;《Proceedings of the 22nd International Conference on Computational Linguistics》;20081231; pp. 705-712 |
New Word Detection for Sentiment Analysis; Minlie Huang et al.;《Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics》;20141231; pp. 531–541 |
一种快速获取领域新词语的新方法; 刘华;《中文信息学报》;20061231; pp. 17-23 |
中文新词识别技术综述; 张海军 et al.;《计算机科学》;20100331; pp. 6-10 |
基于N-Gram的专业领域中文新词识别研究; 段宇锋 et al.;《现代图书情报技术》;20121231; pp. 41-47 |
面向互联网数据的新词发现平台的设计与实现; 杜聪慧;《万方数据》;20140331; pp. 1-60 |
Also Published As
Publication number | Publication date |
---|---|
CN105488098A (en) | 2016-04-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105488098B (en) | A kind of new words extraction method based on field otherness | |
CN109241538B (en) | Chinese entity relation extraction method based on dependency of keywords and verbs | |
CN106156204B (en) | Text label extraction method and device | |
Ramakrishna et al. | Linguistic analysis of differences in portrayal of movie characters | |
CN109815336B (en) | Text aggregation method and system | |
CN108920456A (en) | A kind of keyword Automatic method | |
CN103793447B (en) | The estimation method and estimating system of semantic similarity between music and image | |
CN106933972B (en) | The method and device of data element are defined using natural language processing technique | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN112989802B (en) | Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium | |
CN108804595B (en) | Short text representation method based on word2vec | |
CN107688630B (en) | Semantic-based weakly supervised microblog multi-emotion dictionary expansion method | |
CN108009135A (en) | The method and apparatus for generating documentation summary | |
CN110502742A (en) | A kind of complexity entity abstracting method, device, medium and system | |
CN109214445A (en) | A kind of multi-tag classification method based on artificial intelligence | |
CN110134781A (en) | A kind of automatic abstracting method of finance text snippet | |
CN107341142B (en) | Enterprise relation calculation method and system based on keyword extraction and analysis | |
CN107038155A (en) | The extracting method of text feature is realized based on improved small-world network model | |
CN114048310A (en) | Dynamic intelligence event timeline extraction method based on LDA theme AP clustering | |
Song et al. | A novel automatic ontology construction method based on web data | |
CN112528640A (en) | Automatic domain term extraction method based on abnormal subgraph detection | |
CN104331396A (en) | Intelligent advertisement identifying method | |
CN110413985B (en) | Related text segment searching method and device | |
CN108920475A (en) | A kind of short text similarity calculating method | |
JP4326713B2 (en) | News topic analysis device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||