CN105955950A - New word discovery method and device - Google Patents


Info

Publication number
CN105955950A
CN105955950A (application CN201610282625.0A)
Authority
CN
China
Prior art keywords
morpheme
subset
word
tuples
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610282625.0A
Other languages
Chinese (zh)
Inventor
康潮明
Current Assignee
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Original Assignee
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by LeTV Holding Beijing Co Ltd and LeTV Information Technology Beijing Co Ltd
Priority claimed from CN201610282625.0A
Publication of CN105955950A
Related PCT application: PCT/CN2016/102448 (WO2017185674A1)
Legal status: Pending

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F40/00 Handling natural language data → G06F40/20 Natural language analysis
        • G06F40/237 Lexical tools
        • G06F40/279 Recognition of textual entities → G06F40/284 Lexical analysis, e.g. tokenisation or collocates
        • G06F40/205 Parsing → G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a new word discovery method and device. The method comprises: extracting morphemes from each target text in a target text library, building a morpheme set H, counting the frequency of occurrence of each morpheme, representing each morpheme together with its frequency as a two-tuple, and collecting the two-tuples into a set T; computing, with an information entropy algorithm, the context relatedness d of the subset w of each morpheme t_i, and gathering the subsets w whose d value is greater than or equal to a preset relatedness threshold into a first candidate word set W_s; computing the support and confidence of each morpheme t_i with an association rule algorithm, and gathering the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t; and taking the intersection of W_s and W_t as the candidate new word set W_h, filtering W_h, and extracting the new words into a new word set W. By combining information entropy analysis with association rule analysis, the method and device can effectively improve the accuracy of new word discovery.

Description

New word discovery method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a new word discovery method and device.
Background technology
When computers are used to analyze and process natural language, new word extraction is frequently required. At present there are two main approaches: statistics-based methods, and methods combining a dictionary with rules. Statistics-based methods are well suited to finding short phrases, but purely statistical methods ignore characteristics such as the internal structure of words and the word-building capacity between words. Dictionary-and-rule methods are usually confined to a specific field, because rules are generally formulated for a particular domain; they are inflexible, and drafting a comprehensive and suitable set of rules is a cumbersome, time-consuming job.
Summary of the invention
The technical problem to be solved by the present invention is to provide a new word discovery method that effectively improves the accuracy of new word discovery.
A further technical problem to be solved by the present invention is to provide a new word discovery device that effectively improves the accuracy of new word discovery.
To solve the above technical problems, the present invention provides the following technical scheme: a new word discovery method comprising the following steps:
analyzing each target text in a target text library, extracting morphemes from the target text, building a morpheme set H, counting the frequency of occurrence of each morpheme, representing each morpheme and its frequency as a two-tuple, and forming a two-tuple set T;
obtaining the left-adjacent and right-adjacent characters of the subset w of the morpheme t_i in each two-tuple of the set T, computing the context relatedness d of the subset w of t_i according to an information entropy algorithm, and gathering the subsets w whose d value is greater than or equal to a preset relatedness threshold into a first candidate word set W_s;
computing the support and confidence of the morpheme t_i in each two-tuple of the set T with an association rule algorithm, and gathering the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t; and
taking the intersection of the first candidate word set W_s and the second candidate word set W_t as the candidate new word set W_h, then filtering W_h, extracting the new words and saving them into a new word set W.
Further, analyzing the target text in the target text library and extracting morphemes to build the morpheme set specifically includes:
splitting the target text at predetermined separator symbols to obtain a sentence set S, each short sentence in S being S_i = {c_1 c_2 c_3 ... c_n}, where c_i denotes a character of the sentence;
for each short sentence S_i = {c_1 c_2 c_3 ... c_n}, sliding a window of size m over the characters in order to build a set P = {C_1, C_2, ..., C_n}, where each subset C_i = c_i c_{i+1} c_{i+2} ... c_{i+m};
keeping the order of the characters within C_i unchanged, segmenting each subset C_i of P by character to build a morpheme set h_i, and collecting the morpheme sets built from all subsets of P into the morpheme set H = {h_1, h_2, ..., h_n} of this target text, where h_i is the morpheme set built from the i-th element of P in the manner described for C_i; and
processing all target texts one by one in the above manner to build their morpheme sets H.
Further, computing the context relatedness d of the subset w of the morpheme t_i according to the information entropy algorithm specifically comprises the following steps:
cutting the morpheme t_i = {c_1 c_2 ... c_n} (n >= 3) of each two-tuple in the set T to obtain the subset w = (c_2 ... c_{n-1}) of t_i together with its left-adjacent character c_1 and right-adjacent character c_n;
analyzing the morphemes t_i of all two-tuples in T, extracting from the morphemes t_i that contain the subset w all left-adjacent characters of w into the set L = {l_1, l_2, ..., l_n} and all right-adjacent characters of w into the set R = {r_1, r_2, ..., r_n};
computing the probability p(l_i) of each left-adjacent character l_i in L, then computing its information entropy H(l_i) with the entropy formula, so that the left entropy of w is H(L) = Σ H(l_i); computing the right entropy H(R) = Σ H(r_i) of w in the same manner; and
taking the context relatedness of the subset w as d = min{H(L), H(R)}.
Further, computing the support and confidence of the morpheme t_i in each two-tuple of the set T with the association rule algorithm specifically comprises the following steps:
selecting from the set T any two two-tuples obtained by analyzing the same target text;
denoting the morphemes t_i of the two selected two-tuples as wordA and wordB respectively, and computing the support and confidence of wordA and wordB; and
judging whether the support and confidence of the two-tuples' morphemes t_i are greater than or equal to the corresponding minimum thresholds, and adding the morphemes t_i whose support and confidence both meet the thresholds to the second candidate word set W_t.
In another aspect, an embodiment of the present invention further provides a new word discovery device, comprising:
a two-tuple set building module, which analyzes each target text in a target text library, extracts morphemes from the target text, builds a morpheme set H, counts the frequency of occurrence of each morpheme, represents each morpheme and its frequency as a two-tuple, and forms a two-tuple set T;
an information entropy analysis module, which obtains the left-adjacent and right-adjacent characters of the subset w of the morpheme t_i in each two-tuple of T, computes the context relatedness d of the subset w according to an information entropy algorithm, and gathers the subsets w whose d value is greater than or equal to a preset relatedness threshold into a first candidate word set W_s;
an association rule analysis module, which computes the support and confidence of the morpheme t_i in each two-tuple of T with an association rule algorithm, and gathers the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t; and
a new word extraction module, which takes the intersection of W_s and W_t as the candidate new word set W_h, then filters W_h, extracts the new words and saves them into a new word set W.
Further, the two-tuple set building module in turn includes:
a cutting unit, which splits the target text at predetermined separator symbols to obtain a sentence set S, each short sentence in S being S_i = {c_1 c_2 c_3 ... c_n}, where c_i denotes a character of the sentence;
a subset building unit, which, for each short sentence S_i = {c_1 c_2 c_3 ... c_n}, slides a window of size m over the characters in order to build a set P = {C_1, C_2, ..., C_n}, where each subset C_i = c_i c_{i+1} c_{i+2} ... c_{i+m};
a morpheme set building unit, which keeps the order of the characters within C_i unchanged, segments each subset C_i of P by character to build a morpheme set h_i, and collects the morpheme sets built from all subsets of P into the morpheme set H = {h_1, h_2, ..., h_n} of this target text; and
a collecting unit, which processes all target texts one by one in the above manner and merges the morpheme sets H built for each into a total morpheme set.
Further, the information entropy analysis module includes:
an adjacent-character acquisition unit, which cuts the morpheme t_i = {c_1 c_2 ... c_n} (n >= 3) of each two-tuple in T to obtain the subset w = (c_2 ... c_{n-1}) together with its left-adjacent character c_1 and right-adjacent character c_n;
an adjacent-character collecting unit, which analyzes the morphemes t_i of all two-tuples in T, extracts from the morphemes t_i that contain the subset w all left-adjacent characters of w into the set L = {l_1, l_2, ..., l_n} and all right-adjacent characters of w into the set R = {r_1, r_2, ..., r_n};
an entropy computing unit, which computes the probability p(l_i) of each left-adjacent character l_i in L, computes its entropy H(l_i) with the entropy formula, obtains the left entropy H(L) = Σ H(l_i) of w, and computes the right entropy H(R) = Σ H(r_i) of w in the same manner; and
a relatedness comparing and collecting unit, which obtains the context relatedness d = min{H(L), H(R)} of the subset w, compares d with a preset threshold, and adds w to the first candidate word set W_s if d exceeds the threshold.
Further, the association rule analysis module includes:
a selecting unit, which selects from the set T any two two-tuples obtained by analyzing the same target text;
a support and confidence computing unit, which denotes the morphemes t_i of the two selected two-tuples as wordA and wordB respectively and computes the support and confidence of wordA and wordB; and
a judging and collecting unit, which judges whether the support and confidence of the two-tuples' morphemes t_i are greater than or equal to the corresponding minimum thresholds, and adds the morphemes t_i whose support and confidence both meet the thresholds to the second candidate word set W_t.
With the above technical scheme, the present invention has at least the following advantages. The invention is a computer-implemented method for discovering new words in unstructured text. On the one hand, the method and device analyze, via the information entropy algorithm, the contextual adjacency of the morphemes extracted from the target text, making full use of the structural information of the text and effectively improving the accuracy of judging new word boundaries. On the other hand, by means of the association rule algorithm, the invention fully combines the internal and external information of candidate words, effectively improving the accuracy of new word discovery.
Accompanying drawing explanation
Fig. 1 is a flowchart of the new word discovery method of the present invention.
Fig. 2 is a block diagram of the new word discovery device of the present invention.
Fig. 3 is a block diagram of the information entropy analysis module of the new word discovery device of the present invention.
Fig. 4 is a block diagram of the association rule analysis module of the new word discovery device of the present invention.
Detailed description of the invention
The application is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the illustrative embodiments described here serve only to explain the present invention and do not limit it; moreover, where no conflict arises, the embodiments of the application and the features within them may be combined with one another.
As shown in Fig. 1, the present invention provides a new word discovery method comprising the following steps:
Step S1: analyze each target text in the target text library, extract morphemes from it, and build a morpheme set H; when there are multiple target texts, merge the morpheme sets H obtained for each; then count the frequency of occurrence of each morpheme, represent each morpheme and its frequency as a two-tuple, and form a two-tuple set T.
Step S2: obtain the left-adjacent and right-adjacent characters of the subset w of the morpheme t_i in each two-tuple of T, compute the context relatedness d of the subset w according to the information entropy algorithm, and gather the subsets w whose d value is greater than or equal to a preset relatedness threshold into a first candidate word set W_s.
Step S3: compute the support and confidence of the morpheme t_i in each two-tuple of T with the association rule algorithm, and gather the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t.
Step S4: take the intersection of W_s and W_t as the candidate new word set W_h, then filter W_h, extract the new words and save them into a new word set W.
The concrete operations of each step are described in detail below.
Step S1: building the morpheme set
Analyzing a single target text comprises the following sub-steps:
Step S11: split the target text at predetermined separator symbols (typically punctuation marks) to obtain a sentence set S, each short sentence in S being S_i = {c_1 c_2 c_3 ... c_n}, where c_i denotes a character of the sentence.
Step S12: for each short sentence S_i = {c_1 c_2 c_3 ... c_n}, slide a window of size m over the characters in order to build a set P = {C_1, C_2, ..., C_n}, where C_i = c_i c_{i+1} c_{i+2} ... c_{i+m}.
Step S13: keeping the order of the characters within C_i unchanged, segment each subset C_i of P by character to build a morpheme set h_i, and collect the morpheme sets built from all subsets of P into the morpheme set H = {h_1, h_2, ..., h_n} of this target text.
Step S14: process all target texts in the above manner, merge the morpheme sets H extracted from all target texts, count the frequency of occurrence of each morpheme in the merged set, represent each morpheme as a two-tuple <morpheme, frequency>, and collect all two-tuples into the two-tuple set T.
In one embodiment of the invention, m = 4 is set, so C_i = c_i c_{i+1} c_{i+2} c_{i+3}. Taking C_1 = c_1 c_2 c_3 c_4 as an example, splitting C_1 by character yields the morpheme set h_1 = {c_1, c_2, c_3, c_4, c_1c_2, c_2c_3, c_3c_4, c_1c_2c_3, c_2c_3c_4}. Each subset C_i of P is processed in the same manner as C_1 to build h_i, finally yielding the total morpheme set of this text.
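The window-and-substring construction of steps S11 through S14 can be sketched in Python. This is a minimal illustration under our own naming (the function `build_two_tuple_set` and the separator character class are not from the patent), and frequencies are counted per window occurrence, which is a simplification of the merge described in S14:

```python
import re
from collections import Counter

def build_two_tuple_set(texts, m=4):
    """Sketch of steps S11-S14: split texts into short sentences,
    slide a window of size m over the characters, enumerate the
    proper sub-strings of each window (the morphemes), and count
    frequencies into a set of <morpheme, frequency> two-tuples."""
    counts = Counter()
    for text in texts:
        # S11: split at punctuation (the predetermined separator symbols)
        sentences = [s for s in
                     re.split(r"[,.!?;:\u3002\uff0c\uff01\uff1f\uff1b\uff1a]", text) if s]
        for sent in sentences:
            # S12: windows C_i of size m, character order preserved
            windows = [sent[i:i + m] for i in range(max(len(sent) - m + 1, 1))]
            for win in windows:
                # S13: all proper sub-strings of the window, as in the h_1 example
                n = len(win)
                for length in range(1, n):
                    for start in range(n - length + 1):
                        counts[win[start:start + length]] += 1
    # S14: the two-tuple set T of <morpheme, frequency> pairs
    return set(counts.items())

T = build_two_tuple_set(["abcd"], m=4)
morphemes = {w for w, f in T}
print(sorted(morphemes))
```

Running this on the single window "abcd" reproduces exactly the h_1 set of the embodiment above: the four single characters, the three bigrams, and the two trigrams.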
Step S2: analyzing contextual adjacency with the information entropy algorithm
Information entropy is a relatively abstract concept; it can be understood in terms of the probability of occurrence of a specific piece of information, and it reflects the amount of information carried by a variable. The formula is:
H(x_i) = -p(x_i) log(p(x_i)), where p(x_i) denotes the probability that event x_i occurs.
In text processing, the left and right entropy of a string reflects its degree of contextual relatedness. If a string has high left and right entropy, its context collocations are rich, and its usage is flexible and independent. An independent word exhibits exactly these features; therefore, the present invention judges whether a string is a new word by computing its left and right entropy.
In step S2, for the subset w of the morpheme t_i in each two-tuple of the set T, the context relatedness d is computed with the information entropy algorithm as follows:
Step S21: cut the morpheme t_i = {c_1 c_2 ... c_n} (n >= 3) of each two-tuple in T to obtain the subset w = (c_2 ... c_{n-1}) together with its left-adjacent character c_1 and right-adjacent character c_n.
Step S22: analyze the morphemes t_i of all two-tuples in T; from the morphemes t_i that contain the subset w, extract all left-adjacent characters of w into the set L = {l_1, l_2, ..., l_n} and all right-adjacent characters of w into the set R = {r_1, r_2, ..., r_n}.
Step S23: compute the probability p(l_i) of each left-adjacent character l_i in L and its entropy H(l_i) with the entropy formula, so that the left entropy of w is H(L) = Σ H(l_i); compute the right entropy H(R) = Σ H(r_i) of w in the same manner.
Step S24: obtain the context relatedness d = min{H(L), H(R)} of the subset w, compare d with the preset threshold, and add w to the first candidate word set W_s if d exceeds the threshold.
Step S25: process every element of the two-tuple set T according to the above steps, finally obtaining the set W_s = {w_1, w_2, ..., w_n}.
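Steps S21 through S24 can be illustrated with a short Python sketch. The function and variable names are ours, and two details are assumptions not fixed by the text: the probabilities p(l_i) are estimated from the adjacent-character frequencies, and the natural logarithm is used:

```python
import math
from collections import Counter

def context_relatedness(w, morpheme_freqs):
    """Sketch of steps S21-S24: d = min(H(L), H(R)) for a candidate
    string w, where L/R collect the characters adjacent to w inside
    longer morphemes. `morpheme_freqs` maps each morpheme to its
    frequency, i.e. the two-tuple set T as a dict."""
    left, right = Counter(), Counter()
    for t, freq in morpheme_freqs.items():
        # S21/S22: t must be w with one character on each side (t = c1 + w + cn)
        if len(t) == len(w) + 2 and t[1:-1] == w:
            left[t[0]] += freq
            right[t[-1]] += freq

    def entropy(counter):
        # S23: H = sum of -p * log(p) over the adjacent-character distribution
        total = sum(counter.values())
        return (-sum((f / total) * math.log(f / total)
                     for f in counter.values()) if total else 0.0)

    # S24: context relatedness is the smaller of the two entropies
    return min(entropy(left), entropy(right))

# Tiny made-up frequency table: 'ab' appears with neighbors x_y, z_w, x_w
T = {"xaby": 2, "zabw": 2, "xabw": 2, "ab": 6}
d = context_relatedness("ab", T)
```

Here both the left distribution {x: 4, z: 2} and the right distribution {y: 2, w: 4} have entropy -(2/3 · ln 2/3 + 1/3 · ln 1/3) ≈ 0.637, so d ≈ 0.637; a subset w with only one possible neighbor on either side would get d = 0 and be rejected.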
Step S3: mining frequent itemsets of morphemes with the association rule algorithm
The association rule algorithm (the Apriori algorithm) was proposed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant. Its core idea is a recursive method based on frequent itemsets, whose goal is to mine from data the associations between items whose support and confidence are not less than a given minimum support threshold and minimum confidence threshold.
For items A and B, the Apriori algorithm generally proceeds in the following steps:
(1) Compute the support, i.e. the joint probability of A and B:
P(A, B) = count(A ∩ B) / (count(A) + count(B))
where count(A ∩ B) denotes the frequency with which A and B occur together, count(A) the frequency of A, and count(B) the frequency of B.
(2) Obtain the frequent itemsets: the (A, B) tuples whose support P(A, B) is greater than or equal to the preset minimum support threshold.
(3) Compute the confidence, i.e. the probability that B occurs given that A occurs:
P(B|A) = P(A, B) / P(A)
where P(A, B) is the support computed in the previous step and P(A) is the probability that A occurs.
(4) Obtain the association set: among the frequent itemsets obtained in step (2), the tuples whose confidence P(B|A) exceeds the preset minimum confidence threshold form the final association set.
In step S3 of the method of the present invention, the support and confidence of the morpheme t_i in each two-tuple of the set T are computed with the association rule algorithm as follows:
Step S31: select from the set T any two two-tuples obtained by analyzing the same target text; preferably, the two selected two-tuples are obtained from the same short sentence.
Step S32: denote the morphemes t_i of the two selected two-tuples as wordA and wordB respectively, and compute the support and confidence of wordA and wordB, i.e. the support and confidence of the corresponding two-tuples' morphemes t_i.
Step S33: judge whether the computed support and confidence are greater than or equal to the corresponding minimum thresholds, and add the morphemes t_i whose support and confidence both meet the thresholds to the second candidate word set W_t.
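As a worked illustration of formulas (1) and (3) applied in step S32, here is a hedged Python sketch. The support formula follows the patent text, P(A, B) = count(A ∩ B) / (count(A) + count(B)); the estimate P(A) = count(A) / total is our assumption, since the description leaves the denominator of P(A) unspecified:

```python
def support_and_confidence(count_a, count_b, count_ab, total):
    """Support and confidence for a pair of candidate words wordA/wordB.
    support follows the description's formula (1); P(A) = count_a/total
    is an assumed estimate for formula (3)."""
    support = count_ab / (count_a + count_b)   # (1) P(A,B)
    p_a = count_a / total                      # assumed estimate of P(A)
    confidence = support / p_a                 # (3) P(B|A) = P(A,B) / P(A)
    return support, confidence

# wordA seen 30 times, wordB 10 times, co-occurring 8 times, 100 tuples total
s, c = support_and_confidence(count_a=30, count_b=10, count_ab=8, total=100)
```

With these numbers, support = 8/40 = 0.2 and confidence = 0.2 / 0.3 ≈ 0.667; step S33 would then compare both values against the preset minimum thresholds.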
Step S4: filtering and extracting the new words
Step S4 uses a dictionary of common words to filter the candidate word set and extract the new words into the new word set. Its concrete operations are:
Step S41: take the intersection of the first candidate word set W_s and the second candidate word set W_t as the candidate new word set W_h.
Step S42: filter the candidate new word set W_h with the common-word dictionary, removing the words already contained in the dictionary; the remaining words are saved into the new word set W as the extracted new words.
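Steps S41 and S42 amount to a set intersection followed by a dictionary filter; a minimal sketch with made-up example words (the identifiers and sample strings are illustrative only):

```python
def extract_new_words(ws, wt, common_words):
    """Sketch of steps S41-S42: intersect the two candidate sets,
    then drop anything already in the common-word dictionary."""
    wh = set(ws) & set(wt)          # S41: candidate new word set W_h
    return wh - set(common_words)   # S42: remaining words form W

W = extract_new_words({"foo", "bar", "baz"}, {"bar", "baz", "qux"},
                      common_words={"baz"})
```

Here W_h = {"bar", "baz"}, and after removing the dictionary word "baz", the new word set W contains only "bar".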
In another aspect, to better implement the above method, an embodiment of the present invention further provides a new word discovery device, comprising:
a two-tuple set building module 10, which analyzes each target text in the target text library one by one, extracts morphemes from it to build a morpheme set H, merges the morpheme sets H obtained for each target text, then counts the frequency of occurrence of each morpheme, represents each morpheme as a two-tuple, and forms a two-tuple set T;
an information entropy analysis module 20, which computes with the information entropy algorithm the context relatedness of the subset w of the morpheme t_i in each two-tuple of T, and gathers the subsets w whose context relatedness is greater than or equal to a preset relatedness threshold into a first candidate word set W_s;
an association rule analysis module 30, which computes with the association rule algorithm the support and confidence of the morpheme t_i in each two-tuple of T, and gathers the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t; and
a new word extraction module 40, which takes the intersection of W_s and W_t as the candidate new word set W_h, then filters W_h, extracts the new words and saves them into a new word set W.
The two-tuple set building module 10 in turn includes:
a cutting unit 100, which splits the target text at predetermined separator symbols (typically punctuation marks) to obtain a sentence set S, each short sentence in S being S_i = {c_1 c_2 c_3 ... c_n}, where c_i denotes a character of the sentence;
a subset building unit 102, which, for each short sentence S_i = {c_1 c_2 c_3 ... c_n}, slides a window of size m over the characters in order to build a set P = {C_1, C_2, ..., C_n}, where each subset C_i = c_i c_{i+1} c_{i+2} ... c_{i+m};
a morpheme set building unit 104, which keeps the order of the characters within C_i unchanged, segments each subset C_i of P by character to build a morpheme set h_i, and collects the morpheme sets built from all subsets of P into the morpheme set H = {h_1, h_2, ..., h_n} of this target text; and
a collecting unit 106, which processes all target texts one by one in the above manner and merges the morpheme sets H built for each into a total morpheme set.
As shown in Fig. 3, the information entropy analysis module 20 may further include:
an adjacent-character acquisition unit 200, which cuts the morpheme t_i = {c_1 c_2 ... c_n} (n >= 3) of each two-tuple in T to obtain the subset w = (c_2 ... c_{n-1}) together with its left-adjacent character c_1 and right-adjacent character c_n;
an adjacent-character collecting unit 202, which analyzes the morphemes t_i of all two-tuples in T, extracts from the morphemes t_i that contain the subset w all left-adjacent characters of w into the set L = {l_1, l_2, ..., l_n} and all right-adjacent characters of w into the set R = {r_1, r_2, ..., r_n};
an entropy computing unit 204, which computes the probability p(l_i) of each left-adjacent character l_i in L, computes its entropy H(l_i) with the entropy formula, obtains the left entropy H(L) = Σ H(l_i) of w, and computes the right entropy H(R) = Σ H(r_i) of w in the same manner; and
a relatedness comparing and collecting unit 206, which obtains the context relatedness d = min{H(L), H(R)} of the subset w, compares d with a preset threshold, and adds w to the set W_s if d exceeds the threshold.
As shown in Fig. 4, the association rule analysis module 30 may further include:
a selecting unit 300, which selects from the set T any two two-tuples obtained by analyzing the same target text, preferably two two-tuples obtained from the same short sentence;
a support and confidence computing unit 302, which denotes the morphemes t_i of the two selected two-tuples as wordA and wordB respectively and computes the support and confidence of wordA and wordB, i.e. of the corresponding two-tuples' morphemes t_i; and
a judging and collecting unit 304, which judges whether the computed support and confidence are greater than or equal to the corresponding minimum thresholds, and adds the morphemes t_i whose support and confidence both meet the thresholds to the second candidate word set W_t.
On the one hand, the method and device of the present invention analyze, via the information entropy algorithm, the contextual adjacency of the morphemes extracted from the target text, making full use of the structural information of the text and effectively improving the accuracy of judging new word boundaries. On the other hand, by means of the association rule algorithm, the invention fully combines the internal and external information of candidate words, effectively improving the accuracy of new word discovery.
If the functions described in the embodiments of the present invention are implemented as software functional modules or units and sold or used as independent products, they may be stored in a computing-device-readable storage medium. Based on this understanding, the part of the technical scheme of the embodiments that contributes beyond the prior art may be embodied as a software product stored in a storage medium and including instructions that cause a computing device (which may be a personal computer, a server, a mobile computing device, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc. The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will appreciate that various changes, revisions, replacements, and modifications may be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A new word discovery method, characterized by comprising the following steps:
analyzing and processing each target text in a target text library, extracting morphemes from the target text to build a morpheme set H, counting the frequency with which each morpheme occurs, and representing each morpheme together with its frequency as a 2-tuple, forming a 2-tuple set T;
obtaining the left-adjacent and right-adjacent characters of the subset w of the morpheme ti in each 2-tuple in the 2-tuple set T, calculating the context relation degree d of the subset w of the morpheme ti according to an information-entropy algorithm, and collecting the subsets w whose context relation degree d is greater than or equal to a preset association threshold to form a first candidate word set Ws;
calculating, with an association-rule algorithm, the support and confidence of the morpheme ti in each 2-tuple in the 2-tuple set T, and collecting the morphemes ti whose support and confidence are both greater than or equal to the corresponding minimum thresholds to form a second candidate word set Wt; and
taking the intersection of the first candidate word set Ws and the second candidate word set Wt as a candidate new-word set Wh, then filtering the candidate new-word set Wh, extracting the new words, and saving them as a new-word set W.
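The final steps of claim 1 can be sketched as follows. This is a minimal, hypothetical Python sketch: the set names Ws, Wt, Wh, and W follow the claim, while `known_words` stands in for the filtering step, whose exact criterion the claim leaves unspecified.

```python
def discover_new_words(ws, wt, known_words):
    """Intersect the entropy-based candidates (Ws) with the
    association-rule candidates (Wt) to form the candidate new-word
    set Wh, then filter out already-known words to obtain W."""
    wh = ws & wt                     # candidate new-word set Wh
    return {w for w in wh if w not in known_words}  # new-word set W
```

For example, `discover_new_words({"xy", "ab"}, {"ab", "cd"}, {"xy"})` yields `{"ab"}`: only "ab" survives both the intersection and the known-word filter.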
2. The new word discovery method according to claim 1, wherein analyzing and processing the target text in the target text library and extracting morphemes from the target text to build the morpheme set specifically comprises:
splitting the target text using predetermined delimiter symbols as the segmentation standard to obtain a sentence set S, each short sentence in the set S being Si = {c1c2c3...cn}, where ci denotes each character in the sentence;
for each short sentence Si = {c1c2c3...cn} in the set S, taking subsets with a window of size m in character order to build a set P = {C1, C2, ..., Cn}, where each subset Ci = cici+1ci+2...ci+m;
keeping the order of the characters within each Ci unchanged, extracting each subset Ci of the set P and segmenting it by character to build a morpheme set hi, and collecting the morpheme sets built from the subsets of P to obtain the morpheme set H = {h1, h2, ..., hn} of the target text, where hi is the morpheme set built from the corresponding element of P in the manner of Ci; and
processing all target texts in the above manner to build their respective morpheme sets H.
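As one illustration of claim 2's windowed construction, the sketch below splits a text into short sentences, slides a window of size m over each sentence, and collects every order-preserving contiguous substring of each window as a morpheme. The delimiter set and default window size are assumptions for illustration; the claim does not fix them.

```python
import re

def build_morphemes(text, window=3):
    """Build the morpheme set H for one target text (sketch of claim 2).

    Splits on an assumed set of delimiters, takes windows Ci of size
    `window` over each short sentence Si, and segments each window into
    all contiguous character substrings (the morpheme sets hi)."""
    sentences = [s for s in re.split(r"[,.!?;:\s]+", text) if s]
    morphemes = set()
    for s in sentences:
        for i in range(len(s)):
            chunk = s[i:i + window]          # window subset Ci
            # all contiguous substrings of Ci, character order unchanged
            for a in range(len(chunk)):
                for b in range(a + 1, len(chunk) + 1):
                    morphemes.add(chunk[a:b])
    return morphemes
```

Note that morphemes never cross a delimiter: with input `"abcd, efg"` the set contains `"bcd"` and `"efg"` but nothing that mixes the two short sentences.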
3. The new word discovery method according to claim 1, wherein calculating the context relation degree d of the left-adjacent and right-adjacent characters according to the information-entropy algorithm specifically comprises the following steps:
cutting the morpheme ti = {c1c2...cn} (n >= 3) in each 2-tuple in the 2-tuple set T to obtain the left-adjacent character c1 and the right-adjacent character cn of the subset w = (c2...cn-1) of the morpheme ti;
analyzing the morphemes ti of all 2-tuples in the set T, extracting from the morphemes ti that contain the subset w all left-adjacent characters of w to form a set L = {l1, l2, ..., ln}, and extracting all right-adjacent characters of w to form a set R = {r1, r2, ..., rn};
calculating the probability p(li) with which each left-adjacent character li in the set L occurs, then using the information-entropy formula to calculate the entropy H(li) of that left-adjacent character, the left-adjacent entropy of w being H(L) = ΣH(li), and calculating the right-adjacent entropy H(R) = ΣH(ri) of w in the same manner; and
obtaining the context relation degree d = min{H(L), H(R)} corresponding to the subset w.
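The entropies of claim 3 can be computed as Shannon entropies over the empirical neighbor distributions. The sketch below uses the conventional H = -Σ p·log2(p) reading of the information-entropy formula; the function names are illustrative, not taken from the specification.

```python
import math
from collections import Counter

def branching_entropy(neighbors):
    """Shannon entropy over the observed neighbor characters of a
    candidate subset w (claim 3's H(L) or H(R))."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

def context_relation(left_neighbors, right_neighbors):
    """Context relation degree d = min{H(L), H(R)} for subset w."""
    return min(branching_entropy(left_neighbors),
               branching_entropy(right_neighbors))
```

The min captures the intuition behind the claim: a string is a plausible word only if it occurs in varied contexts on *both* sides, so a single predictable side (entropy near zero) pulls d down.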
4. The new word discovery method according to claim 1, wherein calculating with the association-rule algorithm the support and confidence of the morpheme ti in each 2-tuple in the 2-tuple set T specifically comprises the following steps:
selecting from the 2-tuple set T any two 2-tuples derived from the same target text;
denoting the morphemes ti in the two selected 2-tuples as wordA and wordB respectively, and calculating the support and confidence of the two words wordA and wordB; and
judging whether the calculated support and confidence of the morphemes ti of the 2-tuples are greater than or equal to the corresponding minimum thresholds, and adding the morphemes ti whose support and confidence both meet or exceed those thresholds to the second candidate word set Wt.
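Under the standard association-rule definitions, and treating each target text as a transaction as claim 4 suggests, support and confidence for a pair (wordA, wordB) can be sketched as follows. The substring containment test and variable names are assumptions for illustration.

```python
def support_confidence(docs, word_a, word_b):
    """Support and confidence of the pair (wordA, wordB) over a list
    of target texts, each text treated as one transaction.

    support    = P(wordA and wordB co-occur in a text)
    confidence = P(wordB occurs | wordA occurs)"""
    n = len(docs)
    n_a = sum(1 for d in docs if word_a in d)
    n_ab = sum(1 for d in docs if word_a in d and word_b in d)
    sup = n_ab / n if n else 0.0
    conf = n_ab / n_a if n_a else 0.0
    return sup, conf
```

A candidate then enters Wt only if `sup >= min_support and conf >= min_confidence` for chosen minimum thresholds, mirroring the judging step of the claim.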
5. A new word discovery device, characterized by comprising:
a 2-tuple set construction module, which analyzes and processes each target text in a target text library, extracts morphemes from the target text to build a morpheme set H, counts the frequency with which each morpheme occurs, and represents each morpheme together with its frequency as a 2-tuple, forming a 2-tuple set T;
an information-entropy analysis module, which obtains the left-adjacent and right-adjacent characters of the subset w of the morpheme ti in each 2-tuple in the 2-tuple set T, calculates the context relation degree d of the subset w of the morpheme ti according to an information-entropy algorithm, and collects the subsets w whose context relation degree d is greater than or equal to a preset association threshold to form a first candidate word set Ws;
an association-rule analysis module, which uses an association-rule algorithm to calculate the support and confidence of the morpheme ti in each 2-tuple in the 2-tuple set T, and collects the morphemes ti whose support and confidence are both greater than or equal to the corresponding minimum thresholds to form a second candidate word set Wt; and
a new-word extraction module, which takes the intersection of the first candidate word set Ws and the second candidate word set Wt as a candidate new-word set Wh, then filters the candidate new-word set Wh, extracts the new words, and saves them as a new-word set W.
6. The new word discovery device according to claim 5, wherein the 2-tuple set construction module further comprises:
a cutting unit, which splits the target text using predetermined delimiter symbols as the segmentation standard to obtain a sentence set S, each short sentence in the set S being Si = {c1c2c3...cn}, where ci denotes each character in the sentence;
a subset construction unit, which, for each short sentence Si = {c1c2c3...cn} in the set S, takes subsets with a window of size m in character order to build a set P = {C1, C2, ..., Cn}, where each subset Ci = cici+1ci+2...ci+m;
a morpheme set construction unit, which keeps the order of the characters within each Ci unchanged, extracts each subset Ci of the set P and segments it by character to build a morpheme set hi, and collects the morpheme sets built from the subsets of P to obtain the morpheme set H = {h1, h2, ..., hn} of the target text, where hi is the morpheme set built from the corresponding element of P in the manner of Ci; and
a collection unit, which processes all target texts in the above manner to build their respective morpheme sets H and collects them to obtain a total morpheme set.
7. The new word discovery device according to claim 5, wherein the information-entropy analysis module comprises:
an adjacent-character acquisition unit, which cuts the morpheme ti = {c1c2...cn} (n >= 3) in each 2-tuple in the 2-tuple set T to obtain the left-adjacent character c1 and the right-adjacent character cn of the subset w = (c2...cn-1) of the morpheme ti;
an adjacent-character aggregation unit, which analyzes the morphemes ti of all 2-tuples in the set T, extracts from the morphemes ti that contain the subset w all left-adjacent characters of w to form a set L = {l1, l2, ..., ln}, and extracts all right-adjacent characters of w to form a set R = {r1, r2, ..., rn};
an entropy calculation unit, which calculates the probability p(li) with which each left-adjacent character li in the set L occurs, then uses the information-entropy formula to calculate the entropy H(li) of that left-adjacent character, the left-adjacent entropy of w being H(L) = ΣH(li), and calculates the right-adjacent entropy H(R) = ΣH(ri) of w in the same manner; and
a comparison and collection unit, which obtains the context relation degree d = min{H(L), H(R)} corresponding to the subset w, compares the value of d with a preset threshold, and, if d is greater than the threshold, adds the subset w to the first candidate word set Ws.
8. The new word discovery device according to claim 5, wherein the association-rule analysis module comprises:
a selection module, which selects from the 2-tuple set T any two 2-tuples derived from the same target text;
a support and confidence calculation unit, which denotes the morphemes ti in the two selected 2-tuples as wordA and wordB respectively, and calculates the support and confidence of wordA and wordB; and
a judging and collecting unit, which judges whether the calculated support and confidence of the morphemes ti of the 2-tuples are greater than or equal to the corresponding minimum thresholds, and adds the morphemes ti whose support and confidence both meet or exceed those thresholds to the second candidate word set Wt.
CN201610282625.0A 2016-04-29 2016-04-29 New word discovery method and device Pending CN105955950A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610282625.0A CN105955950A (en) 2016-04-29 2016-04-29 New word discovery method and device
PCT/CN2016/102448 WO2017185674A1 (en) 2016-04-29 2016-10-18 Method and apparatus for discovering new word


Publications (1)

Publication Number Publication Date
CN105955950A true CN105955950A (en) 2016-09-21

Family

ID=56914877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610282625.0A Pending CN105955950A (en) 2016-04-29 2016-04-29 New word discovery method and device

Country Status (2)

Country Link
CN (1) CN105955950A (en)
WO (1) WO2017185674A1 (en)


Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992766B (en) * 2017-12-29 2024-02-06 北京京东尚科信息技术有限公司 Method and device for extracting target words
CN108829658B (en) * 2018-05-02 2022-05-24 石家庄天亮教育科技有限公司 Method and device for discovering new words
CN109492224B (en) * 2018-11-07 2024-05-03 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN109670170B (en) * 2018-11-21 2023-04-07 东软集团股份有限公司 Professional vocabulary mining method and device, readable storage medium and electronic equipment
CN111368535B (en) * 2018-12-26 2024-01-16 珠海金山数字网络科技有限公司 Sensitive word recognition method, device and equipment
CN111400377B (en) * 2020-04-27 2023-09-08 新奥新智科技有限公司 Method and device for determining target data set
CN111753531B (en) * 2020-06-28 2024-03-12 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium based on artificial intelligence
CN111768842B (en) * 2020-07-06 2023-08-11 宁波方太厨具有限公司 Identification method and system of traditional Chinese medicine certification, electronic equipment and readable storage medium
CN112732934B (en) * 2021-01-11 2022-05-27 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method
CN113051912B (en) * 2021-04-08 2023-01-20 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate
CN112800173B (en) * 2021-04-14 2021-07-09 北京金山云网络技术有限公司 Standardized database and medical text library construction method and device and electronic equipment
CN113609844B (en) * 2021-07-30 2024-03-08 国网山西省电力公司晋城供电公司 Electric power professional word stock construction method based on hybrid model and clustering algorithm
CN115982390B (en) * 2023-03-17 2023-06-23 北京邮电大学 Industrial chain construction and iterative expansion development method
CN117056869A (en) * 2023-10-11 2023-11-14 轩创(广州)网络科技有限公司 Electronic information data association method and system based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955450A (en) * 2014-05-06 2014-07-30 杭州东信北邮信息技术有限公司 Automatic extraction method of new words
CN105224682A (en) * 2015-10-27 2016-01-06 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105260362A (en) * 2015-10-30 2016-01-20 小米科技有限责任公司 New word extraction method and device
CN105512109A (en) * 2015-12-11 2016-04-20 北京锐安科技有限公司 New word discovery method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216874B (en) * 2014-09-22 2017-03-29 广西财经学院 Positive and negative mode excavation method and system are weighted between the Chinese word based on coefficient correlation
CN105955950A (en) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and device


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN106776543A (en) * 2016-11-23 2017-05-31 上海智臻智能网络科技股份有限公司 New word discovery method, device, terminal and server
CN106776543B (en) * 2016-11-23 2019-09-06 上海智臻智能网络科技股份有限公司 New word discovery method, apparatus, terminal and server
CN108228712A (en) * 2017-11-30 2018-06-29 北京三快在线科技有限公司 A kind of entity method for digging and device, electronic equipment
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108845982B (en) * 2017-12-08 2021-08-20 昆明理工大学 Chinese word segmentation method based on word association characteristics
CN110851610A (en) * 2018-07-25 2020-02-28 百度在线网络技术(北京)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN110851610B (en) * 2018-07-25 2022-09-27 百度在线网络技术(北京)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN110807322A (en) * 2019-09-19 2020-02-18 平安科技(深圳)有限公司 Method, device, server and storage medium for identifying new words based on information entropy
CN110807322B (en) * 2019-09-19 2024-03-01 平安科技(深圳)有限公司 Method, device, server and storage medium for identifying new words based on information entropy
CN112560448A (en) * 2021-02-20 2021-03-26 京华信息科技股份有限公司 New word extraction method and device
CN116361442A (en) * 2023-06-02 2023-06-30 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence
CN116361442B (en) * 2023-06-02 2023-10-17 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence

Also Published As

Publication number Publication date
WO2017185674A1 (en) 2017-11-02

Similar Documents

Publication Publication Date Title
CN105955950A (en) New word discovery method and device
US10146862B2 (en) Context-based metadata generation and automatic annotation of electronic media in a computer network
CN106874292B (en) Topic processing method and device
AU2015203818B2 (en) Providing contextual information associated with a source document using information from external reference documents
CN107943792B (en) Statement analysis method and device, terminal device and storage medium
US20140379719A1 (en) System and method for tagging and searching documents
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
Alami et al. Cybercrime profiling: Text mining techniques to detect and predict criminal activities in microblog posts
CN111291177A (en) Information processing method and device and computer storage medium
US20100023505A1 (en) Search method, similarity calculation method, similarity calculation, same document matching system, and program thereof
WO2017091985A1 (en) Method and device for recognizing stop word
CN105095434A (en) Recognition method and device for timeliness requirement
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN112883734A (en) Block chain security event public opinion monitoring method and system
CN104657376A (en) Searching method and searching device for video programs based on program relationship
CN113076735A (en) Target information acquisition method and device and server
CN115795030A (en) Text classification method and device, computer equipment and storage medium
Consoli et al. A quartet method based on variable neighborhood search for biomedical literature extraction and clustering
US10353927B2 (en) Categorizing columns in a data table
CN110399464B (en) Similar news judgment method and system and electronic equipment
CN110472058B (en) Entity searching method, related equipment and computer storage medium
CN111737461A (en) Text processing method and device, electronic equipment and computer readable storage medium
Oliveira et al. A concept-based ILP approach for multi-document summarization exploring centrality and position
CN113868508B (en) Writing material query method and device, electronic equipment and storage medium
CN112784046B (en) Text clustering method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20160921)