CN105488098B - A new word extraction method based on domain differences - Google Patents

A new word extraction method based on domain differences

Info

Publication number
CN105488098B
Authority
CN
China
Prior art keywords
word
field
candidate
difference
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510711219.7A
Other languages
Chinese (zh)
Other versions
CN105488098A (en)
Inventor
史树敏
周新宇
黄河燕
史胜清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Application filed by Beijing Institute of Technology BIT
Priority to CN201510711219.7A
Publication of CN105488098A
Application granted
Publication of CN105488098B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a new word extraction method based on domain differences and belongs to the field of applied natural language processing. The method first compares how characters are distributed across different domains to obtain difference-word seeds. It then expands the seeds by the n-gram approach to build a candidate word set, and removes repeated words from the candidate set according to the size of their domain difference. Finally, for each word in the candidate set, it uses the domain difference value, the cohesion degree, and the word-formation rate as measures, rejects the candidates with low domain difference, and obtains the new words. Compared with the prior art, the invention selects seed words using the difference information between corpora from different domains and expands them by n-gram to obtain the candidate word set; it then selects the new words among the candidates automatically, using information about the word itself together with the difference information between domains, and thereby significantly improves both the number and the accuracy of the new words discovered.

Description

A new word extraction method based on domain differences
Technical field
The present invention relates to a new word extraction method, in particular to a new word extraction method based on domain differences, and belongs to the field of applied natural language processing.
Background art
Internet new words are special expressions or characters that have emerged and spread along with the Internet. They typically originate from popular film, television, and online catchphrases, or are widely accepted words produced by some social phenomenon. Internet new words occur frequently in online text such as forum posts and microblogs. Statistics show that more than 1,000 new words enter people's daily life in China every year. According to related research, more than 60% of word segmentation errors come from Internet new words, so the accuracy of new word recognition directly affects the performance of intelligent information processing systems. For example, in the text sentiment analysis task of intelligent information processing, fixed phrase collocations can express sentiment polarity; if a new-word phrase cannot be recognized correctly, the judged sentiment polarity is distorted. Take the product comment "expression very tall and big on" written by an Internet user: here "tall and big on" should be treated as a single Internet new word that conveys the positive sentiment "high-end, impressive, and upscale". In almost all current application systems, however, the tagged sequence produced by word segmentation is "expression/v very/adv tall/adj big/adj on/adv"; that is, the new word is cut into single characters. This incorrect segmentation loses the positive sentiment of the phrase and seriously affects the subsequent intelligent analysis. Effective recognition of new words is therefore of great importance in natural language processing.
At present, new word extraction methods fall into two classes: rule-based and statistics-based. The main idea of the rule-based approach is to start from the word-formation principles of new words and, on that theoretical basis, build a common corpus that helps identify new words; the linguistic characteristics of words are then studied, and a dedicated word-formation rule base is built from the natural attributes of words. Rule-based methods recognize new words with high accuracy but require strong linguistic expertise and domain background knowledge. Statistics-based methods recognize new words in two main ways. The first treats new word extraction as an integral part of word segmentation and uses a statistical model to infer the most likely segmentation points, from which the new words are obtained. Classical statistical models include conditional random fields (CRF) and gradient-descent training models based on feature frequency information. The second treats new word extraction as a separate task, which usually requires part-of-speech (POS) tagging as preprocessing. Because Internet new words are real-time, spread rapidly, and change dynamically, purely rule-based methods often perform poorly; obtaining Internet new words by purely statistical means also suffers from sparse training data and difficult extraction of effective features. Most researchers currently combine rules and statistics in the hope of exploiting the strengths of both. However, these methods all ignore an informational advantage of the corpus itself: the same word carries different information (connotation) under different domain topics, which shows up as different distributions of that word under different domain topics.
Summary of the invention
To handle the new words that are constantly created and used on the Internet, the present invention proposes a new word extraction method based on domain differences. The method makes full use of the characteristics of corpora from different domains and, under the existing general evaluation system, effectively improves the accuracy of new word recognition.
The idea of the invention is to obtain difference-word seeds by comparing how characters are distributed across different domains, to expand the difference words by the n-gram approach to build a candidate word set, and then to evaluate each word in the candidate set by its domain difference value, cohesion degree, and word-formation rate in order to extract the new words.
The definitions used in the present invention are as follows:
Definition 1: A domain difference word is a single character that embodies domain difference. Such a character reflects the features of a domain, and its frequency of occurrence differs greatly between corpora of different domains. For example, if the ratio of the frequency $f_{internet}(c)$ of a character c in the network corpus to its frequency $f_{news}(c)$ in the news domain exceeds a threshold λ, c is called a domain difference word. Where a single character forms a word on its own and expresses such a difference, the invention likewise determines its difference from the character distribution.
Definition 2: Repeated words. If word $W_A$ and word $W_B$ satisfy the repetition condition, $W_B$ and $W_A$ are called repeated words of each other. Example: "happiness is big general to run quickly" ($W_A$) and "general greatly to run quickly" ($W_B$), where $W_B$ is part of $W_A$.
Definition 3: The domain difference value DV (Difference Value) measures domain difference. It is calculated from the frequency $f_{internet}(W)$ with which word W occurs in the network corpus and the frequency $f_{news}(W)$ with which it occurs in the news corpus.
Definition 4: The cohesion degree CV (Concrete Value) is a quantitative index of whether a word is segmented correctly. For example, "cinema" can be formed in two ways: "film" + "institute" and "electricity" + "movie theatre". For any word $W = c_1c_2$ (where $c_1$ and $c_2$ are the characters or words that make up W), all possible ways of forming the word are enumerated, the corresponding weight is calculated for each, and the minimum is taken as the cohesion degree of the word.
Definition 5: The word-formation rate NWP (New Word Probability) is an index for judging whether a sequence of single characters forms a word. For example, "liking to say" and "love is eaten" are each made up of single characters, but their NWP is very low, which indicates that neither forms a word.
The object of the invention is achieved through the following steps:
A new word extraction method based on domain differences, comprising the following steps:
Step 1: compare the input corpus $S_1$ of the domain from which new words are to be obtained with a corpus $S_2$ from another domain to obtain the domain difference-word seeds;
Preferably, the domain difference-word seeds are obtained as follows:
(1) Count the frequencies $f_{s1}(c)$ and $f_{s2}(c)$ with which each character c occurs in $S_1$ and $S_2$, respectively;
(2) Compute the difference value of each character between $S_1$ and $S_2$ by the following formula:
$D_{word\_seg}(c) = f_{s1}(c) / (1 + f_{s2}(c))$
(3) Set a threshold λ; if the difference value $D_{word\_seg}(c)$ of character c exceeds λ, c is taken as a difference-word seed.
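The three sub-steps of seed selection can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `difference_seeds`, the use of raw character counts as the "frequency", the reading of the formula as f_s1(c) / (1 + f_s2(c)), and the choice to keep characters whose value is greater than or equal to λ (as the embodiment does) are all assumptions.

```python
from collections import Counter


def difference_seeds(s1: str, s2: str, lam: float = 2.0) -> set[str]:
    """Return the characters of s1 whose difference value reaches lam.

    s1 is the target-domain corpus, s2 the contrast corpus (hypothetical inputs).
    """
    # Raw character counts stand in for "frequency" (assumption).
    f1 = Counter(ch for ch in s1 if not ch.isspace())
    f2 = Counter(ch for ch in s2 if not ch.isspace())
    seeds = set()
    for ch, count_s1 in f1.items():
        # Assumed reading of the formula: add 1 to the denominator so that
        # characters absent from s2 do not cause a division by zero.
        difference = count_s1 / (1 + f2[ch])
        if difference >= lam:
            seeds.add(ch)
    return seeds
```

A relative frequency (count divided by corpus size) could be used instead of the raw count when the two corpora differ greatly in size; the text does not say which is intended.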
Step 2: expand the domain difference-word seeds to build the candidate word set $Set_{Candidate}$.
Preferably, the expansion uses the n-gram approach, as follows:
(1) In corpus $S_1$, take n = 2, 3, 4, and 5 in turn and obtain all corresponding n-gram strings. Keep each n-gram that contains any difference word, count its frequency of occurrence, and add it to the candidate word set $Set_{Candidate}$;
(2) For every candidate word W in $Set_{Candidate}$, compare its word frequency with a preset threshold; if the word frequency is below the threshold, delete W from $Set_{Candidate}$;
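The n-gram expansion and the frequency filter can be sketched as follows. The names `candidate_words` and `min_freq` are hypothetical, each line of the corpus is treated as one sentence so that n-grams do not cross line breaks, and keeping n-grams whose count is greater than or equal to `min_freq` is an assumption about how the threshold is applied.

```python
from collections import Counter


def candidate_words(s1, seeds, min_freq, n_values=(2, 3, 4, 5)):
    """Map each surviving n-gram of s1 to its frequency.

    An n-gram survives if it contains at least one seed character and
    occurs at least min_freq times (assumed threshold direction).
    """
    counts = Counter()
    for line in s1.splitlines():
        for n in n_values:
            # Slide a window of width n over the line.
            for start in range(len(line) - n + 1):
                gram = line[start:start + n]
                if any(ch in seeds for ch in gram):
                    counts[gram] += 1
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}
```

For instance, `candidate_words("xabxab", {"a"}, 2)` keeps the three n-grams that contain the seed character and occur twice.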
Step 3: remove the repeated words in the candidate word set $Set_{Candidate}$ according to the size of the candidates' domain difference;
Preferably, the domain difference of a candidate word W can be calculated by the following formula:
$DV(W) = \log \dfrac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
where $f_{s1}(W)$ is the frequency with which W occurs in corpus $S_1$ and $f_{s2}(W)$ is the frequency with which W occurs in corpus $S_2$.
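As a small sketch, the domain difference value can be computed as below. The function name `domain_difference_value` is hypothetical. The formula is read as log((1 + f_s1) / (1 + f_s2)) with a base-10 logarithm, because that reading reproduces the worked values in the embodiment (frequency 6 against 0 gives 0.845, and 3 against 0 gives 0.602).

```python
import math


def domain_difference_value(f_s1: int, f_s2: int) -> float:
    """DV(W) = log10((1 + f_s1(W)) / (1 + f_s2(W))).

    f_s1 and f_s2 are the occurrence counts of W in the target corpus S1
    and the contrast corpus S2; the +1 terms keep the ratio defined when
    W never occurs in S2.
    """
    return math.log10((1 + f_s1) / (1 + f_s2))
```

A word that is equally frequent in both corpora gets DV = 0, and a word that is more frequent in the contrast corpus gets a negative value.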
Further, to obtain a better deduplication effect, the domain difference of repeated words can be obtained by considering the cohesion degree together with the domain difference value. That is, according to Definition 2, all repeated words in the candidate word set $Set_{Candidate}$ are found and compared; the one with the larger weight is kept and the one with the smaller weight is discarded. This process is repeated until $Set_{Candidate}$ no longer contains repeated words. The detailed procedure is as follows:
(1) According to Definition 2, take n = 2, 3, 4, and 5, compare all words in $Set_{Candidate}$, and find all repeated words, where n is the number of single characters in a word of $Set_{Candidate}$;
(2) According to Definitions 3 and 4, compute the cohesion degree CV(W) and the domain difference value DV(W) of each repeated word with the following formulas:
Cohesion degree:
$CV(W) = \min_{W = c_1 c_2} \dfrac{f(W)}{f(c_1)\, f(c_2)}$
Domain difference value:
$DV(W) = \log \dfrac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
Further, the repeated words are compared pairwise by the weight V given by the following formula, and the word with the larger weight is kept:
$V(W) = \alpha^{n} \cdot DV(W) + CV(W)$
where α is a parameter that measures the difference permitted between n-grams of different lengths, n is the number of single characters in word W, $c_i$ is the i-th character or word in W, f(W) is the frequency with which W occurs in the corpus, and $w_1$ and $w_2$ are two words that repeat each other.
(3) Repeat steps (1) and (2) until the candidate word set no longer contains repeated words.
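The deduplication loop can be sketched as follows. The names `dedup_weight` and `remove_repeated_words` are hypothetical, and the sketch assumes that two candidates are "repeated" when one is a contiguous substring of the other, which is what the examples in the text suggest. The weight is V(W) = α^n · DV(W) + CV(W) with n the number of characters in W.

```python
def dedup_weight(n_chars: int, dv: float, cv: float, alpha: float = 1.1) -> float:
    """V(W) = alpha**n * DV(W) + CV(W), n being the number of characters in W."""
    return alpha ** n_chars * dv + cv


def remove_repeated_words(weights: dict[str, float]) -> set[str]:
    """Drop the lower-weight member of each repeated pair until none remain.

    weights maps every candidate word to its weight V(W).
    Assumption: a pair is "repeated" when one word is a substring of the other.
    On a tie the longer word is dropped (an arbitrary choice).
    """
    kept = set(weights)
    changed = True
    while changed:
        changed = False
        for short in sorted(kept, key=lambda w: (len(w), w)):
            longer = next(
                (w for w in sorted(kept) if w != short and short in w), None
            )
            if longer is not None:
                loser = short if weights[short] < weights[longer] else longer
                kept.discard(loser)
                changed = True
                break  # restart the scan on the reduced set
    return kept
```

With α = 1.1, a three-character word with DV = 0.845 and CV = 0.125 outweighs a five-character word that contains it with DV = 0.602 and CV = 0.143, so the shorter word is the one kept.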
Step 4: remove the candidates in $Set_{Candidate}$ whose domain difference is low; add the candidates whose domain difference is above a preset threshold γ to the new word set Y, and output Y to obtain all new words.
Preferably, the domain difference of a candidate word W can be calculated by the following formula:
$DV(W) = \log \dfrac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
where $f_{s1}(W)$ is the frequency with which W occurs in corpus $S_1$ and $f_{s2}(W)$ is the frequency with which W occurs in corpus $S_2$.
Further, the domain difference can be characterized as follows: for each candidate word in $Set_{Candidate}$, compute its domain difference value (DV), word-formation rate (NWP), and cohesion degree (CV) according to Definitions 3, 5, and 4, respectively, and combine them in certain proportions, specifically:
(1) Compute the difference value DV(W) of candidate word W according to the following formula:
$DV(W) = \log \dfrac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
(2) Compute the word-formation rate NWP(W) of candidate word W according to the following formula:
where $f(c_i)$ is the frequency of occurrence of the single character $c_i$ in W, and $Single(c_i)$ is the frequency of occurrence of $c_i$ after a word segmentation tool has been applied;
(3) Compute the cohesion degree CV(W) of candidate word W according to the following formula:
$CV(W) = \min_{W = c_1 c_2} \dfrac{f(W)}{f(c_1)\, f(c_2)}$
(4) Normalize the difference value (DV), the word-formation rate (NWP), and the cohesion degree (CV) separately with the following formula:
$X_j' = \dfrac{X_j - X_{min}}{X_{max} - X_{min}}$
where $X_j$ is the current value (difference value, word-formation rate, or cohesion degree) of the j-th word, $X_{min}$ is the minimum of that value over all words, and $X_{max}$ is the maximum of that value over all words;
(5) Compute the weight V of candidate word W according to the following formula:
$V(W) = a \cdot DV(W) + b \cdot CV(W) + c \cdot NWP(W)$
where a, b, and c are the proportions of the weight V contributed by the difference value, the cohesion degree, and the word-formation rate, respectively.
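The normalization and weighting of step 4 can be sketched as below. The helper names `min_max`, `combined_weight`, and `select_new_words` are hypothetical; the normalization is assumed to be plain min-max scaling, and a word is assumed to be kept when its weight is greater than or equal to γ. The sample numbers are the ones the embodiment reports for one candidate; NWP is passed as 0.0 because the embodiment reports a normalized NWP of 0 for that word and a minimum of 0.

```python
def min_max(value: float, lo: float, hi: float) -> float:
    """Min-max normalization onto [0, 1]; returns 0.0 if all values are equal."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0


def combined_weight(dv, cv, nwp, ranges, a=0.6, b=0.4, c=-0.2):
    """V(W) = a*DV' + b*CV' + c*NWP', each term min-max normalized first.

    ranges maps "DV", "CV", "NWP" to the (min, max) over all candidates.
    """
    dv_n = min_max(dv, *ranges["DV"])
    cv_n = min_max(cv, *ranges["CV"])
    nwp_n = min_max(nwp, *ranges["NWP"])
    return a * dv_n + b * cv_n + c * nwp_n


def select_new_words(weights, gamma=0.4):
    """Keep the candidates whose weight reaches the threshold gamma."""
    return {word for word, v in weights.items() if v >= gamma}


# Values reported in the embodiment for one candidate word.
ranges = {"DV": (0.176, 0.903), "CV": (0.071, 0.25), "NWP": (0.0, 1.0)}
v = combined_weight(dv=0.845, cv=0.125, nwp=0.0, ranges=ranges)
```

Here `v` comes out at about 0.673, in line with the 0.6728 given in the embodiment, so the word passes the threshold γ = 0.4.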
Beneficial effects
Compared with the prior art, the invention selects seed words using the difference information between corpora from different domains and expands them by n-gram to obtain the candidate word set; it then selects the new words among the candidates automatically, using information about the word itself together with the difference information between domains, which significantly improves both the number and the accuracy of the new words discovered.
Brief description of the drawings
Fig. 1 is a flow diagram of the new word extraction method based on domain differences according to an embodiment of the invention;
Fig. 2 is a schematic comparison of the method of the invention with four existing new word extraction methods in terms of the number of new words recognized and accuracy.
Detailed description of the embodiments
The method of the invention is described in further detail below with reference to the drawings and an embodiment.
Embodiment
This embodiment describes the method of the invention in detail, taking a network corpus as $S_1$ and a news corpus as $S_2$.
The network corpus is a post selected from an online forum (post bar), as shown in Table 1:
Table 1:
The news corpus is a news item from April 4, 2001, as shown in Table 2:
Table 2:
A new word extraction method based on domain differences, whose processing flow is shown in Fig. 1, comprises the following steps:
Step 1: obtain the domain difference-word seeds:
A domain difference word is a character that occurs markedly more often in one corpus than in the other. There are many ways to obtain domain difference words; this embodiment simply checks whether the difference between a character's frequencies in the two corpora is above a preset threshold, as follows:
Count the frequency with which each character occurs in the network corpus and the frequency with which it occurs in the news corpus; then compute the difference value between the two; finally set the threshold λ to 2 and take the characters whose difference value is greater than or equal to λ as difference words. The resulting set of difference words is shown in Table 3:
Table 3:
Step 2: expand the difference-word seeds to obtain the candidate word set.
There are many ways to expand difference words into candidate words, for example with a dictionary or with n-grams. This embodiment uses n-grams, as follows: in the network corpus, take n = 2, 3, 4, and 5 in turn and obtain all n-gram strings; keep an n-gram if it contains any difference word, and delete it if it is a meaningless string. For example, write the seven-character phrase "good beautiful mew star people" as $c_1c_2c_3c_4c_5c_6c_7$, where $c_5c_6c_7$ is "mew star people". The following n-grams can be extracted from it:
2-gram {$c_1c_2$, $c_2c_3$, $c_3c_4$, $c_4c_5$, $c_5c_6$ ("mew star"), $c_6c_7$ ("star people")},
3-gram {$c_1c_2c_3$, $c_2c_3c_4$, $c_3c_4c_5$, $c_4c_5c_6$, $c_5c_6c_7$ ("mew star people")},
4-gram {$c_1c_2c_3c_4$, $c_2c_3c_4c_5$, $c_3c_4c_5c_6$, $c_4c_5c_6c_7$}, and
5-gram {$c_1c_2c_3c_4c_5$, $c_2c_3c_4c_5c_6$, $c_3c_4c_5c_6c_7$}.
Then the word frequency of each n-gram is counted and a threshold is set. A word W is selected as a candidate word when its word frequency f(W) exceeds the threshold and it contains any of the difference words above. The resulting candidate word set is shown in Table 4:
Table 4:
Step 3: remove repeated words.
First, according to Definition 2, find all repeated words in the candidate word set $Set_{Candidate}$. Taking "mew star people" as an example, the repeated-word pairs found are: {"mew star", "mew star people"}, {"star people", "mew star people"}, {"mew star people", the 4-gram ending in "mew star people"}, and {"mew star people", "the mew star people of love"};
Second, compare each pair of repeated words by domain difference and keep the candidate with the larger domain difference. The domain difference could be characterized simply by the difference between the frequencies with which a candidate occurs in the two corpora; to avoid the way a plain frequency difference varies with the corpus, this embodiment instead uses the logarithm of the ratio of the two frequencies, as in the following formula:
$DV(W) = \log \dfrac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
Further, experiments show that a better deduplication effect is obtained if the domain difference considers not only the domain difference value DV above but also the cohesion degree CV, that is, if the domain difference is the combined weight of the two given by the following formula:
$V(W) = \alpha^{n} \cdot DV(W) + CV(W)$
Therefore, according to Definitions 3 and 4, the cohesion degree and difference value of each of the words above are computed. Take removing the repeated pair {"mew star people", "the mew star people of love"} as an example: the word frequency of "mew star people" is 6, the word frequency of "the mew star people of love" is 3, and both have word frequency 0 in the news corpus, so:
DV(mew star people) = log((6 + 1)/(0 + 1)) = 0.845
DV(the mew star people of love) = log((3 + 1)/(0 + 1)) = 0.602
"mew star people" can be formed in two ways, "mew" + "star people" and "mew star" + "people", whose cohesion values are:
CV("mew" + "star people") = 6/(8 × 6) = 0.125
CV("mew star" + "people") = 6/(6 × 7) = 0.143
The smaller value is taken as the cohesion degree of the word "mew star people":
CV(mew star people) = 0.125
Similarly, the five-character word "the mew star people of love" can be formed in four ways, with the split placed after its first, second, third, or fourth character. The second split ends in "mew star people", the third in "star people", and the fourth in "people".
The cohesion values are:
CV(split after the first character) = 3/(4 × 4) = 0.188
CV(split after the second character) = 3/(3 × 6) = 0.167
CV(split after the third character) = 3/(3 × 6) = 0.167
CV(split after the fourth character) = 3/(3 × 7) = 0.143
The smallest value is taken as the cohesion degree of the word "the mew star people of love":
CV(the mew star people of love) = 0.143
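The two cohesion computations above can be reproduced with a short sketch. It assumes that the cohesion degree is the minimum of f(W) / (f(left) · f(right)) over every two-part split of W, which is the pattern the worked values follow. The function name `cohesion_degree` is hypothetical, and the single letters M, S, P, L, D are placeholder characters standing for the five characters of the longer word (M, S, P for "mew", "star", "people"); the frequencies are the ones used in the worked example.

```python
def cohesion_degree(word: str, freq: dict[str, int]) -> float:
    """Minimum of f(W) / (f(left) * f(right)) over all two-part splits of W."""
    f_word = freq[word]
    return min(
        f_word / (freq[word[:cut]] * freq[word[cut:]])
        for cut in range(1, len(word))
    )


# Frequencies taken from the worked example; letters are placeholders.
freq = {
    "MSP": 6, "M": 8, "SP": 6, "MS": 6, "P": 7,    # three-character word
    "LDMSP": 3, "L": 4, "DMSP": 4, "LD": 3,         # five-character word
    "LDM": 3, "LDMS": 3,
}

cv_short = cohesion_degree("MSP", freq)    # min(6/(8*6), 6/(6*7))
cv_long = cohesion_degree("LDMSP", freq)   # min over the four splits
```

Taking the minimum means a word is only as cohesive as its weakest split, so a single split into two frequent parts is enough to pull the score down.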
Take the parameter α = 1.1:
V(mew star people) = 0.845 × $1.1^{3}$ + 0.125 = 1.249
V(the mew star people of love) = 0.602 × $1.1^{5}$ + 0.143 = 1.113
So in this round of candidate deduplication "mew star people" is kept and "the mew star people of love" is deleted. Step 3 is executed for all repeated words in $Set_{Candidate}$ until no repeated words remain. The final candidate words are shown in Table 5:
Table 5:
Step 4: screen the candidate words by domain difference to obtain and output the new word set.
As in step 3, the domain difference can be characterized by the logarithm of the ratio of a candidate word's frequencies in the two corpora. Experiments show, however, that a better result is obtained if the domain difference combines the domain difference value DV, the word-formation rate NWP, and the cohesion degree CV in certain proportions, as in the following formula:
$V(W) = a \cdot DV(W) + b \cdot CV(W) + c \cdot NWP(W)$
For each candidate word in $Set_{Candidate}$, compute its domain difference value, word-formation rate, and cohesion degree according to Definitions 3, 5, and 4, respectively:
Again taking the word "mew star people" as an example:
Difference value: DV(mew star people) = log((6 + 1)/(0 + 1)) = 0.845
Cohesion degree: CV(mew star people) = 6/(8 × 6) = 0.125 (the minimum, obtained from "mew" + "star people")
Word-formation rate:
This embodiment segments the text above with the ICTCLAS word segmentation tool, which gives Single(mew) = 8, Single(star) = 6, and Single(people) = 7; in addition, f(mew) = 8, f(star) = 6, f(people) = 7, and f(mew star people) = 6.
Further, to obtain a better extraction effect, the three values above are normalized and then combined to obtain the domain difference weight;
Among the 7 words shown in Table 5, the maximum and minimum of the three values are:
$DV_{max}$ = 0.903; $DV_{min}$ = 0.176;
$CV_{max}$ = 0.25; $CV_{min}$ = 0.071;
$NWP_{max}$ = 1; $NWP_{min}$ = 0;
After normalization, the three values for "mew star people" are:
DV' = (0.845 - 0.176)/(0.903 - 0.176) = 0.920; CV' = (0.125 - 0.071)/(0.25 - 0.071) = 0.302; NWP' = 0
Take a = 0.6, b = 0.4, c = -0.2;
V(mew star people) = 0.6 × 0.920 + 0.4 × 0.302 - 0.2 × 0 = 0.6728
The domain differences of all words in Table 5 obtained in this way are shown in Table 6:
Table 6:
Take the threshold γ = 0.4 and filter out all words whose domain difference is below γ; the resulting new word set is {"building-owner", "mew star people", "tinkling of pieces of jade body"}.
Experimental results:
To verify the effectiveness of the new word extraction method based on domain differences according to the embodiment of the invention, this experiment uses as the network corpus the Sina Weibo microblogs from the three days of June 6 to 8, 10,237,813 in total, together with 3,524,584 posts from the Baidu forum "the big Supreme Being of Li Yi". As the news corpus it uses all news data published by Xinhua News Agency from 1993 to 2004, 9,517,292 sentences in total. The existing new word extraction methods CV, NWP, EMI, and PNWD are compared with the DV and DV+CV+NWP methods proposed by the invention in terms of the number of new words recognized and accuracy; the comparison is shown in Fig. 2.
CV and NWP are statistical new word extraction methods well known to those skilled in the art and are not described again here.
EMI: the Enhanced Mutual Information algorithm proposed by Zhang et al. in 2009, with the formula:
where word $W = w_1w_2 \ldots w_n$, $w_i$ is each character that makes up the word, and n is the number of characters that make up the word. F is the number of occurrences of word W, and $F_i$ is the number of occurrences of character $w_i$. The idea of the algorithm is to measure how strongly the word depends on each of its characters; the larger the value, the more likely the string is a word.
PNWD: the Pattern-based New Word Detection algorithm proposed by Huang et al. in 2014. Its core idea is to use POS tagging information and a seed vocabulary to select matching phrase patterns automatically, such as <ad, *, au>, and then to extract newly occurring words automatically with these patterns.
In Fig. 2, the x-axis is the top k words and the y-axis is the average precision AP(k) of the top k words. The figure shows that DV and DV+CV+NWP perform better than the baselines EMI, CV, and NWP. DV and DV+CV+NWP also perform better than the baseline PNWD, whereas CV and NWP are slightly less accurate than PNWD when the result set is small and improve markedly as the result set grows. This is because PNWD can only find adjective new words and ignores new words of other parts of speech; once the adjective new words have been identified, its recognition rate for new words of other parts of speech drops. DV achieves very good results mainly because it makes full use of the differences between domains, and new words tend to express exactly this domain difference. CV and NWP are slightly less accurate mainly because they judge 2-gram words poorly: a 2-gram word can be split into two single characters, and single characters occur with high probability, so both values are very low for 2-grams and such words are hard to identify. Since a large share of new words are 2-grams, neither method is satisfactory. DV+CV+NWP combines the strengths of DV, CV, and NWP and obtains the best result. Compared with conventional methods, the new word extraction method based on domain differences proposed by the invention therefore achieves higher accuracy and discovers more new words.
The foregoing shows and describes the basic principles, main features, and advantages of the invention. Those skilled in the art should understand that the invention is not limited to the above embodiment; the embodiment and the description only illustrate the principle of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention, which is defined by the appended claims and their equivalents.

Claims (8)

1. A new word extraction method based on domain differences, characterized by comprising the following steps:
Step 1: compare the input corpus $S_1$ of the domain from which new words are to be obtained with a corpus $S_2$ from another domain to obtain domain difference-word seeds; a domain difference word is a single character that embodies domain difference, that is, if the ratio of the frequency $f_{internet}(c)$ with which a character c occurs in the corpus of one domain to its frequency $f_{news}(c)$ in the corpus of another domain exceeds a threshold λ, c is called a domain difference word;
Step 2: expand the domain difference-word seeds by the n-gram approach to build the candidate word set $Set_{Candidate}$, as follows:
(1) In corpus $S_1$, take n = 2, 3, 4, and 5 in turn and obtain all corresponding n-gram strings; keep each n-gram that contains any domain difference word, count its frequency of occurrence, and add it to the candidate word set $Set_{Candidate}$;
(2) For every candidate word W in $Set_{Candidate}$, compare its word frequency with a preset threshold; if the word frequency is below the threshold, delete W from $Set_{Candidate}$;
Step 3: remove the repeated words in $Set_{Candidate}$ according to the size of the candidates' domain difference;
The domain difference of candidate word W is calculated by the following formula:
$DV(W) = \log \dfrac{1 + f_{s1}(W)}{1 + f_{s2}(W)}$
where $f_{s1}(W)$ is the frequency with which word W occurs in corpus $S_1$ and $f_{s2}(W)$ is the frequency with which word W occurs in corpus $S_2$;
Step 4: remove the candidates in $Set_{Candidate}$ whose domain difference is low; add the candidates above a preset threshold γ to the new word set Y, and output Y to obtain all new words.
2. The new word extraction method based on domain differences according to claim 1, characterized in that the domain difference-word seeds are obtained by the following procedure:
(1) Count the frequencies $f_{s1}(c)$ and $f_{s2}(c)$ with which each character c occurs in $S_1$ and $S_2$, respectively;
(2) Compute the difference value of each character between $S_1$ and $S_2$ by the following formula:
$D_{word\_seg}(c) = f_{s1}(c) / f_{s2}(c)$
(3) Set a threshold λ; if the difference value $D_{word\_seg}(c)$ of character c exceeds the difference threshold λ, c is taken as a difference-word seed.
3. The new word extraction method based on domain differences according to claim 2, characterized in that λ = 2.
4. The new words extraction method based on field otherness according to any one of claims 1 to 3, characterized in that said removing of repeated words from the candidate word set $Set_{Candidate}$ according to field difference is carried out by the following steps:
(1) taking n = 2, 3, 4 or 5, compare all words in $Set_{Candidate}$ and find all repeated words, where n denotes the number of characters contained in a word of the set $Set_{Candidate}$;
(2) for the repeated words found, calculate the weight V by the following formula, which jointly considers the cohesion degree CV and the field difference value DV, then retain the words with the larger weight and remove the words with the smaller weight so as to eliminate the repeats:
$$V(W) = \alpha^{n} \cdot DV(W) + CV(W);$$
$$DV(W) = \log\left(1 + \frac{f_{S_1}(W)}{1 + f_{S_2}(W)}\right);$$
where α is a parameter measuring the permitted difference between different n-grams; $c_i$ denotes the i-th character or word in word W, with $W = c_1 c_2$; and $f(W)$ denotes the frequency with which word W occurs in the text corpus;
(3) repeat steps (1) and (2) until the candidate word set no longer contains repeated words.
5. The new words extraction method based on field otherness according to claim 4, characterized in that α = 1.1.
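The duplicate-removal loop of claims 4 and 5 might look roughly like the Python sketch below. Several points are assumptions rather than statements of the claims: two candidates are treated as "repeated" when one is a substring of the other, the flattened term in the weight formula is read as α raised to the power n, n is taken as the character length of the word, and the DV and CV values are supplied precomputed because the CV formula is not reproduced in the text. The names `weight` and `remove_repeats` are hypothetical.

```python
def weight(word, dv, cv, alpha=1.1):
    """V(W) = alpha**n * DV(W) + CV(W), with n = len(word) (assumed reading)."""
    return (alpha ** len(word)) * dv[word] + cv[word]


def remove_repeats(candidates, dv, cv, alpha=1.1):
    """Drop the lower-weight word of each overlapping pair until none remain."""
    kept = set(candidates)
    changed = True
    while changed:
        changed = False
        for a in sorted(kept):
            for b in sorted(kept):
                # Assumption: "repeated" means a is contained in b.
                if a != b and a in b:
                    wa = weight(a, dv, cv, alpha)
                    wb = weight(b, dv, cv, alpha)
                    # On a tie the longer word b is removed.
                    kept.discard(a if wa < wb else b)
                    changed = True
                    break
            if changed:
                break
    return kept
```

Because α is slightly greater than 1, the factor grows with n, which gives longer n-grams a mild advantage over the shorter strings they contain when their DV values are similar.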
6. The new words extraction method based on field otherness according to any one of claims 1 to 3, characterized in that the "field difference" in said removing of candidate words in $Set_{Candidate}$ with a low field difference is the value obtained by combining the field difference value DV, the word-formation rate NWP and the cohesion degree CV in certain proportions, i.e. the weight V, which is obtained by the following procedure:
(1) calculate the difference value DV(W) of candidate word W according to the following formula:
$$DV(W) = \log\left(1 + \frac{f_{S_1}(W)}{1 + f_{S_2}(W)}\right)$$
(2) calculate the word-formation rate NWP(W) of candidate word W according to the following formula:
where $f(c_i)$ denotes the frequency of occurrence of $c_i$; $Single(c_i)$ denotes the frequency of occurrence of $c_i$ after a word segmentation tool is applied; i is the index of the characters or words constituting W; and n is the number of characters or words constituting word W;
(3) calculate the cohesion degree CV(W) of candidate word W according to the following formula:
(4) normalize the difference value (DV), the word-formation rate (NWP) and the cohesion degree (CV) separately, using the following normalization formula:
where $X_j$ is the current value for the j-th word, that value being the difference value, the word-formation rate or the cohesion degree; $X_{min}$ is the minimum of that value over all words; and $X_{max}$ is the maximum of that value over all words;
(5) calculate the weight V of candidate word W according to the following formula:
$$V(W) = a \cdot DV(W) + b \cdot CV(W) + c \cdot NWP(W)$$
where a, b and c denote the proportions of the weight V accounted for by the difference value, the cohesion degree and the word-formation rate, respectively.
7. The new words extraction method based on field otherness according to claim 6, characterized in that a = 0.6, b = 0.4, c = -0.2.
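The weighting of claims 6 and 7 can be sketched in Python as below. The normalization formula itself is not reproduced in the text, so min-max scaling, $(X_j - X_{min}) / (X_{max} - X_{min})$, is an assumption suggested by the variables $X_j$, $X_{min}$ and $X_{max}$. The DV, CV and NWP scores are taken as precomputed inputs, and the names `min_max` and `combined_weight` are hypothetical.

```python
def min_max(values):
    """Scale a {word: value} mapping to [0, 1] (assumed min-max normalization)."""
    lo = min(values.values())
    hi = max(values.values())
    if hi == lo:
        # All values are equal; map everything to 0 to avoid dividing by zero.
        return {w: 0.0 for w in values}
    return {w: (x - lo) / (hi - lo) for w, x in values.items()}


def combined_weight(dv, cv, nwp, a=0.6, b=0.4, c=-0.2):
    """V(W) = a*DV(W) + b*CV(W) + c*NWP(W) on the normalized scores."""
    dv_n, cv_n, nwp_n = min_max(dv), min_max(cv), min_max(nwp)
    return {w: a * dv_n[w] + b * cv_n[w] + c * nwp_n[w] for w in dv}
```

With the coefficients of claim 7, the negative value of c means that a higher normalized NWP lowers the weight V of a candidate.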
8. The new words extraction method based on field otherness according to any one of claims 1-3, 5 or 7, characterized in that γ = 0.4.
CN201510711219.7A 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness Active CN105488098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510711219.7A CN105488098B (en) 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness

Publications (2)

Publication Number Publication Date
CN105488098A CN105488098A (en) 2016-04-13
CN105488098B true CN105488098B (en) 2019-02-05

Family

ID=55675073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510711219.7A Active CN105488098B (en) 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness

Country Status (1)

Country Link
CN (1) CN105488098B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126495B (en) * 2016-06-16 2019-03-12 北京捷通华声科技股份有限公司 One kind being based on large-scale corpus prompter method and apparatus
CN108845982B (en) * 2017-12-08 2021-08-20 昆明理工大学 Chinese word segmentation method based on word association characteristics
CN110634145B (en) * 2018-06-22 2022-04-12 日日顺供应链科技股份有限公司 Warehouse checking method based on image processing
CN110472140B (en) * 2019-07-17 2023-10-31 腾讯科技(深圳)有限公司 Object word recommendation method and device and electronic equipment
CN112668331A (en) * 2021-03-18 2021-04-16 北京沃丰时代数据科技有限公司 Special word mining method and device, electronic equipment and storage medium
CN113051912B (en) * 2021-04-08 2023-01-20 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1340804A (en) * 2000-08-30 2002-03-20 国际商业机器公司 Automatic new term fetch method and system
CN101119334A (en) * 2007-09-21 2008-02-06 腾讯科技(深圳)有限公司 Method, system and equipment for obtaining neology
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103294664A (en) * 2013-07-04 2013-09-11 清华大学 Method and system for discovering new words in open fields

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
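A Method for Automatic POS Guessing of Chinese Unknown Words; Qiu L et al.; Proceedings of the 22nd International Conference on Computational Linguistics; 20081231; pp. 705-712
New Word Detection for Sentiment Analysis; Minlie Huang et al.; Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics; 20141231; pp. 531-541
A New Method for Quickly Acquiring Domain New Words; 刘华; Journal of Chinese Information Processing (中文信息学报); 20061231; pp. 17-23
A Survey of Chinese New Word Identification Techniques; 张海军 et al.; Computer Science (计算机科学); 20100331; pp. 6-10
Research on N-Gram-Based Chinese New Word Identification in Specialized Domains; 段宇锋 et al.; New Technology of Library and Information Service (现代图书情报技术); 20121231; pp. 41-47
Design and Implementation of a New Word Discovery Platform for Internet Data; 杜聪慧; Wanfang Data (万方数据); 20140331; pp. 1-60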

Also Published As

Publication number Publication date
CN105488098A (en) 2016-04-13

Similar Documents

Publication Publication Date Title
CN105488098B (en) A kind of new words extraction method based on field otherness
CN109241538B (en) Chinese entity relation extraction method based on dependency of keywords and verbs
CN106156204B (en) Text label extraction method and device
Ramakrishna et al. Linguistic analysis of differences in portrayal of movie characters
CN109815336B (en) Text aggregation method and system
CN108920456A (en) A kind of keyword Automatic method
CN103793447B (en) The estimation method and estimating system of semantic similarity between music and image
CN106933972B (en) The method and device of data element are defined using natural language processing technique
CN107992542A (en) A kind of similar article based on topic model recommends method
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN108804595B (en) Short text representation method based on word2vec
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN108009135A (en) The method and apparatus for generating documentation summary
CN110502742A (en) A kind of complexity entity abstracting method, device, medium and system
CN109214445A (en) A kind of multi-tag classification method based on artificial intelligence
CN110134781A (en) A kind of automatic abstracting method of finance text snippet
CN107341142B (en) Enterprise relation calculation method and system based on keyword extraction and analysis
CN107038155A (en) The extracting method of text feature is realized based on improved small-world network model
CN114048310A (en) Dynamic intelligence event timeline extraction method based on LDA theme AP clustering
Song et al. A novel automatic ontology construction method based on web data
CN112528640A (en) Automatic domain term extraction method based on abnormal subgraph detection
CN104331396A (en) Intelligent advertisement identifying method
CN110413985B (en) Related text segment searching method and device
CN108920475A (en) A kind of short text similarity calculating method
JP4326713B2 (en) News topic analysis device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant