CN105955950A - New word discovery method and device - Google Patents


Info

Publication number
CN105955950A
CN105955950A (application CN201610282625.0A)
Authority
CN
China
Prior art keywords
morpheme
subset
word
tuples
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610282625.0A
Other languages
Chinese (zh)
Inventor
康潮明
Current Assignee
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Original Assignee
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by LeTV Holding Beijing Co Ltd and LeTV Information Technology Beijing Co Ltd
Priority claimed from CN201610282625.0A
Publication of CN105955950A
Related PCT application: PCT/CN2016/102448 (WO2017185674A1)
Legal status: Pending

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F40/00 Handling natural language data → G06F40/20 Natural language analysis
        • G06F40/237 Lexical tools
        • G06F40/279 Recognition of textual entities → G06F40/284 Lexical analysis, e.g. tokenisation or collocates
        • G06F40/205 Parsing → G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a new word discovery method and device. The method comprises: extracting morphemes from each target text in a target text library, building a morpheme set H, counting the frequency of occurrence of each morpheme, representing each morpheme together with its frequency as a two-tuple, and collecting the two-tuples into a set T; computing, with an information entropy algorithm, the context relatedness d of the subset w of each morpheme t_i, and gathering the subsets w whose d value is greater than or equal to a preset relatedness threshold into a first candidate word set W_s; computing the support and confidence of each morpheme t_i with an association rule algorithm, and gathering the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t; and taking the intersection of W_s and W_t as the candidate new word set W_h, filtering W_h, and extracting the new words into a new word set W. By combining information entropy analysis with association rule analysis, the method and device can effectively improve the accuracy of new word discovery.

Description

New word discovery method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a new word discovery method and device.
Background technology
When computers are used to analyze and process natural language, new word extraction is frequently required. At present there are two main approaches: statistics-based methods, and methods combining a dictionary with rules. Statistics-based methods are well suited to finding short phrases, but purely statistical methods ignore characteristics such as the internal structure of words and the word-building capacity between words. Dictionary-and-rule methods are usually confined to a specific field, because rules are generally formulated for a particular domain; they are inflexible, and drafting a comprehensive and suitable set of rules is a cumbersome, time-consuming job.
Summary of the invention
The technical problem to be solved by the present invention is to provide a new word discovery method that effectively improves the accuracy of new word discovery.
A further technical problem to be solved by the present invention is to provide a new word discovery device that effectively improves the accuracy of new word discovery.
To solve the above technical problems, the present invention provides the following technical scheme: a new word discovery method comprising the following steps:
analyzing each target text in a target text library, extracting morphemes from the target text, building a morpheme set H, counting the frequency of occurrence of each morpheme, representing each morpheme and its frequency as a two-tuple, and forming a two-tuple set T;
obtaining the left-adjacent and right-adjacent characters of the subset w of the morpheme t_i in each two-tuple of the set T, computing the context relatedness d of the subset w of t_i according to an information entropy algorithm, and gathering the subsets w whose d value is greater than or equal to a preset relatedness threshold into a first candidate word set W_s;
computing the support and confidence of the morpheme t_i in each two-tuple of the set T with an association rule algorithm, and gathering the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t; and
taking the intersection of the first candidate word set W_s and the second candidate word set W_t as the candidate new word set W_h, then filtering W_h, extracting the new words and saving them into a new word set W.
Further, analyzing the target text in the target text library and extracting morphemes to build the morpheme set specifically includes:
splitting the target text at predetermined separator symbols to obtain a sentence set S, each short sentence in S being S_i = {c_1 c_2 c_3 ... c_n}, where c_i denotes a character of the sentence;
for each short sentence S_i = {c_1 c_2 c_3 ... c_n}, sliding a window of size m over the characters in order to build a set P = {C_1, C_2, ..., C_n}, where each subset C_i = c_i c_{i+1} c_{i+2} ... c_{i+m};
keeping the order of the characters within C_i unchanged, segmenting each subset C_i of P by character to build a morpheme set h_i, and collecting the morpheme sets built from all subsets of P into the morpheme set H = {h_1, h_2, ..., h_n} of this target text, where h_i is the morpheme set built from the i-th element of P in the manner described for C_i; and
processing all target texts one by one in the above manner to build their morpheme sets H.
Further, computing the context relatedness d of the subset w of the morpheme t_i according to the information entropy algorithm specifically comprises the following steps:
cutting the morpheme t_i = {c_1 c_2 ... c_n} (n >= 3) of each two-tuple in the set T to obtain the subset w = (c_2 ... c_{n-1}) of t_i together with its left-adjacent character c_1 and right-adjacent character c_n;
analyzing the morphemes t_i of all two-tuples in T, extracting from the morphemes t_i that contain the subset w all left-adjacent characters of w into the set L = {l_1, l_2, ..., l_n} and all right-adjacent characters of w into the set R = {r_1, r_2, ..., r_n};
computing the probability p(l_i) of each left-adjacent character l_i in L, then computing its information entropy H(l_i) with the entropy formula, so that the left entropy of w is H(L) = Σ H(l_i); computing the right entropy H(R) = Σ H(r_i) of w in the same manner; and
taking the context relatedness of the subset w as d = min{H(L), H(R)}.
Further, computing the support and confidence of the morpheme t_i in each two-tuple of the set T with the association rule algorithm specifically comprises the following steps:
selecting from the set T any two two-tuples obtained by analyzing the same target text;
denoting the morphemes t_i of the two selected two-tuples as wordA and wordB respectively, and computing the support and confidence of wordA and wordB; and
judging whether the support and confidence of the two-tuples' morphemes t_i are greater than or equal to the corresponding minimum thresholds, and adding the morphemes t_i whose support and confidence both meet the thresholds to the second candidate word set W_t.
In another aspect, an embodiment of the present invention further provides a new word discovery device, comprising:
a two-tuple set building module, which analyzes each target text in a target text library, extracts morphemes from the target text, builds a morpheme set H, counts the frequency of occurrence of each morpheme, represents each morpheme and its frequency as a two-tuple, and forms a two-tuple set T;
an information entropy analysis module, which obtains the left-adjacent and right-adjacent characters of the subset w of the morpheme t_i in each two-tuple of T, computes the context relatedness d of the subset w according to an information entropy algorithm, and gathers the subsets w whose d value is greater than or equal to a preset relatedness threshold into a first candidate word set W_s;
an association rule analysis module, which computes the support and confidence of the morpheme t_i in each two-tuple of T with an association rule algorithm, and gathers the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t; and
a new word extraction module, which takes the intersection of W_s and W_t as the candidate new word set W_h, then filters W_h, extracts the new words and saves them into a new word set W.
Further, the two-tuple set building module in turn includes:
a cutting unit, which splits the target text at predetermined separator symbols to obtain a sentence set S, each short sentence in S being S_i = {c_1 c_2 c_3 ... c_n}, where c_i denotes a character of the sentence;
a subset building unit, which, for each short sentence S_i = {c_1 c_2 c_3 ... c_n}, slides a window of size m over the characters in order to build a set P = {C_1, C_2, ..., C_n}, where each subset C_i = c_i c_{i+1} c_{i+2} ... c_{i+m};
a morpheme set building unit, which keeps the order of the characters within C_i unchanged, segments each subset C_i of P by character to build a morpheme set h_i, and collects the morpheme sets built from all subsets of P into the morpheme set H = {h_1, h_2, ..., h_n} of this target text; and
a collecting unit, which processes all target texts one by one in the above manner and merges the morpheme sets H built for each into a total morpheme set.
Further, the information entropy analysis module includes:
an adjacent-character acquisition unit, which cuts the morpheme t_i = {c_1 c_2 ... c_n} (n >= 3) of each two-tuple in T to obtain the subset w = (c_2 ... c_{n-1}) together with its left-adjacent character c_1 and right-adjacent character c_n;
an adjacent-character collecting unit, which analyzes the morphemes t_i of all two-tuples in T, extracts from the morphemes t_i that contain the subset w all left-adjacent characters of w into the set L = {l_1, l_2, ..., l_n} and all right-adjacent characters of w into the set R = {r_1, r_2, ..., r_n};
an entropy computing unit, which computes the probability p(l_i) of each left-adjacent character l_i in L, computes its entropy H(l_i) with the entropy formula, obtains the left entropy H(L) = Σ H(l_i) of w, and computes the right entropy H(R) = Σ H(r_i) of w in the same manner; and
a relatedness comparing and collecting unit, which obtains the context relatedness d = min{H(L), H(R)} of the subset w, compares d with a preset threshold, and adds w to the first candidate word set W_s if d exceeds the threshold.
Further, the association rule analysis module includes:
a selecting unit, which selects from the set T any two two-tuples obtained by analyzing the same target text;
a support and confidence computing unit, which denotes the morphemes t_i of the two selected two-tuples as wordA and wordB respectively and computes the support and confidence of wordA and wordB; and
a judging and collecting unit, which judges whether the support and confidence of the two-tuples' morphemes t_i are greater than or equal to the corresponding minimum thresholds, and adds the morphemes t_i whose support and confidence both meet the thresholds to the second candidate word set W_t.
With the above technical scheme, the present invention has at least the following advantages. The invention is a computer-implemented method for discovering new words in unstructured text. On the one hand, the method and device analyze, via the information entropy algorithm, the contextual adjacency of the morphemes extracted from the target text, making full use of the structural information of the text and effectively improving the accuracy of judging new word boundaries. On the other hand, by means of the association rule algorithm, the invention fully combines the internal and external information of candidate words, effectively improving the accuracy of new word discovery.
Accompanying drawing explanation
Fig. 1 is a flowchart of the new word discovery method of the present invention.
Fig. 2 is a block diagram of the new word discovery device of the present invention.
Fig. 3 is a block diagram of the information entropy analysis module of the new word discovery device of the present invention.
Fig. 4 is a block diagram of the association rule analysis module of the new word discovery device of the present invention.
Detailed description of the invention
The application is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the illustrative embodiments described here serve only to explain the present invention and do not limit it; moreover, where no conflict arises, the embodiments of the application and the features within them may be combined with one another.
As shown in Fig. 1, the present invention provides a new word discovery method comprising the following steps:
Step S1: analyze each target text in the target text library, extract morphemes from it, and build a morpheme set H; when there are multiple target texts, merge the morpheme sets H obtained for each; then count the frequency of occurrence of each morpheme, represent each morpheme and its frequency as a two-tuple, and form a two-tuple set T.
Step S2: obtain the left-adjacent and right-adjacent characters of the subset w of the morpheme t_i in each two-tuple of T, compute the context relatedness d of the subset w according to the information entropy algorithm, and gather the subsets w whose d value is greater than or equal to a preset relatedness threshold into a first candidate word set W_s.
Step S3: compute the support and confidence of the morpheme t_i in each two-tuple of T with the association rule algorithm, and gather the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t.
Step S4: take the intersection of W_s and W_t as the candidate new word set W_h, then filter W_h, extract the new words and save them into a new word set W.
The concrete operations of each step are described in detail below.
Step S1: building the morpheme set
Analyzing a single target text comprises the following sub-steps:
Step S11: split the target text at predetermined separator symbols (typically punctuation marks) to obtain a sentence set S, each short sentence in S being S_i = {c_1 c_2 c_3 ... c_n}, where c_i denotes a character of the sentence.
Step S12: for each short sentence S_i = {c_1 c_2 c_3 ... c_n}, slide a window of size m over the characters in order to build a set P = {C_1, C_2, ..., C_n}, where C_i = c_i c_{i+1} c_{i+2} ... c_{i+m}.
Step S13: keeping the order of the characters within C_i unchanged, segment each subset C_i of P by character to build a morpheme set h_i, and collect the morpheme sets built from all subsets of P into the morpheme set H = {h_1, h_2, ..., h_n} of this target text.
Step S14: process all target texts in the above manner, merge the morpheme sets H extracted from all target texts, count the frequency of occurrence of each morpheme in the merged set, represent each morpheme as a two-tuple <morpheme, frequency>, and collect all two-tuples into the two-tuple set T.
In one embodiment of the invention, m = 4 is set, so C_i = c_i c_{i+1} c_{i+2} c_{i+3}. Taking C_1 = c_1 c_2 c_3 c_4 as an example, splitting C_1 by character yields the morpheme set h_1 = {c_1, c_2, c_3, c_4, c_1c_2, c_2c_3, c_3c_4, c_1c_2c_3, c_2c_3c_4}. Each subset C_i of P is processed in the same manner as C_1 to build h_i, finally yielding the total morpheme set of this text.
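The window-and-substring construction of steps S11 through S14 can be sketched in Python. This is a minimal illustration under our own naming (the function `build_two_tuple_set` and the separator character class are not from the patent), and frequencies are counted per window occurrence, which is a simplification of the merge described in S14:

```python
import re
from collections import Counter

def build_two_tuple_set(texts, m=4):
    """Sketch of steps S11-S14: split texts into short sentences,
    slide a window of size m over the characters, enumerate the
    proper sub-strings of each window (the morphemes), and count
    frequencies into a set of <morpheme, frequency> two-tuples."""
    counts = Counter()
    for text in texts:
        # S11: split at punctuation (the predetermined separator symbols)
        sentences = [s for s in
                     re.split(r"[,.!?;:\u3002\uff0c\uff01\uff1f\uff1b\uff1a]", text) if s]
        for sent in sentences:
            # S12: windows C_i of size m, character order preserved
            windows = [sent[i:i + m] for i in range(max(len(sent) - m + 1, 1))]
            for win in windows:
                # S13: all proper sub-strings of the window, as in the h_1 example
                n = len(win)
                for length in range(1, n):
                    for start in range(n - length + 1):
                        counts[win[start:start + length]] += 1
    # S14: the two-tuple set T of <morpheme, frequency> pairs
    return set(counts.items())

T = build_two_tuple_set(["abcd"], m=4)
morphemes = {w for w, f in T}
print(sorted(morphemes))
```

Running this on the single window "abcd" reproduces exactly the h_1 set of the embodiment above: the four single characters, the three bigrams, and the two trigrams.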
Step S2: analyzing contextual adjacency with the information entropy algorithm
Information entropy is a relatively abstract concept; it can be understood in terms of the probability of occurrence of a specific piece of information, and it reflects the amount of information carried by a variable. The formula is:
H(x_i) = -p(x_i) log(p(x_i)), where p(x_i) denotes the probability that event x_i occurs.
In text processing, the left and right entropy of a string reflects its degree of contextual relatedness. If a string has high left and right entropy, its context collocations are rich, and its usage is flexible and independent. An independent word exhibits exactly these features; therefore, the present invention judges whether a string is a new word by computing its left and right entropy.
In step S2, for the subset w of the morpheme t_i in each two-tuple of the set T, the context relatedness d is computed with the information entropy algorithm as follows:
Step S21: cut the morpheme t_i = {c_1 c_2 ... c_n} (n >= 3) of each two-tuple in T to obtain the subset w = (c_2 ... c_{n-1}) together with its left-adjacent character c_1 and right-adjacent character c_n.
Step S22: analyze the morphemes t_i of all two-tuples in T; from the morphemes t_i that contain the subset w, extract all left-adjacent characters of w into the set L = {l_1, l_2, ..., l_n} and all right-adjacent characters of w into the set R = {r_1, r_2, ..., r_n}.
Step S23: compute the probability p(l_i) of each left-adjacent character l_i in L and its entropy H(l_i) with the entropy formula, so that the left entropy of w is H(L) = Σ H(l_i); compute the right entropy H(R) = Σ H(r_i) of w in the same manner.
Step S24: obtain the context relatedness d = min{H(L), H(R)} of the subset w, compare d with the preset threshold, and add w to the first candidate word set W_s if d exceeds the threshold.
Step S25: process every element of the two-tuple set T according to the above steps, finally obtaining the set W_s = {w_1, w_2, ..., w_n}.
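Steps S21 through S24 can be illustrated with a short Python sketch. The function and variable names are ours, and two details are assumptions not fixed by the text: the probabilities p(l_i) are estimated from the adjacent-character frequencies, and the natural logarithm is used:

```python
import math
from collections import Counter

def context_relatedness(w, morpheme_freqs):
    """Sketch of steps S21-S24: d = min(H(L), H(R)) for a candidate
    string w, where L/R collect the characters adjacent to w inside
    longer morphemes. `morpheme_freqs` maps each morpheme to its
    frequency, i.e. the two-tuple set T as a dict."""
    left, right = Counter(), Counter()
    for t, freq in morpheme_freqs.items():
        # S21/S22: t must be w with one character on each side (t = c1 + w + cn)
        if len(t) == len(w) + 2 and t[1:-1] == w:
            left[t[0]] += freq
            right[t[-1]] += freq

    def entropy(counter):
        # S23: H = sum of -p * log(p) over the adjacent-character distribution
        total = sum(counter.values())
        return (-sum((f / total) * math.log(f / total)
                     for f in counter.values()) if total else 0.0)

    # S24: context relatedness is the smaller of the two entropies
    return min(entropy(left), entropy(right))

# Tiny made-up frequency table: 'ab' appears with neighbors x_y, z_w, x_w
T = {"xaby": 2, "zabw": 2, "xabw": 2, "ab": 6}
d = context_relatedness("ab", T)
```

Here both the left distribution {x: 4, z: 2} and the right distribution {y: 2, w: 4} have entropy -(2/3 · ln 2/3 + 1/3 · ln 1/3) ≈ 0.637, so d ≈ 0.637; a subset w with only one possible neighbor on either side would get d = 0 and be rejected.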
Step S3: mining frequent itemsets of morphemes with the association rule algorithm
The association rule algorithm (the Apriori algorithm) was proposed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant. Its core idea is a recursive method based on frequent itemsets, whose goal is to mine from data the associations between items whose support and confidence are not less than a given minimum support threshold and minimum confidence threshold.
For items A and B, the Apriori algorithm generally proceeds in the following steps:
(1) Compute the support, i.e. the joint probability of A and B:
P(A, B) = count(A ∩ B) / (count(A) + count(B))
where count(A ∩ B) denotes the frequency with which A and B occur together, count(A) the frequency of A, and count(B) the frequency of B.
(2) Obtain the frequent itemsets: the (A, B) tuples whose support P(A, B) is greater than or equal to the preset minimum support threshold.
(3) Compute the confidence, i.e. the probability that B occurs given that A occurs:
P(B|A) = P(A, B) / P(A)
where P(A, B) is the support computed in the previous step and P(A) is the probability that A occurs.
(4) Obtain the association set: among the frequent itemsets obtained in step (2), the tuples whose confidence P(B|A) exceeds the preset minimum confidence threshold form the final association set.
In step S3 of the method of the present invention, the support and confidence of the morpheme t_i in each two-tuple of the set T are computed with the association rule algorithm as follows:
Step S31: select from the set T any two two-tuples obtained by analyzing the same target text; preferably, the two selected two-tuples are obtained from the same short sentence.
Step S32: denote the morphemes t_i of the two selected two-tuples as wordA and wordB respectively, and compute the support and confidence of wordA and wordB, i.e. the support and confidence of the corresponding two-tuples' morphemes t_i.
Step S33: judge whether the computed support and confidence are greater than or equal to the corresponding minimum thresholds, and add the morphemes t_i whose support and confidence both meet the thresholds to the second candidate word set W_t.
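As a worked illustration of formulas (1) and (3) applied in step S32, here is a hedged Python sketch. The support formula follows the patent text, P(A, B) = count(A ∩ B) / (count(A) + count(B)); the estimate P(A) = count(A) / total is our assumption, since the description leaves the denominator of P(A) unspecified:

```python
def support_and_confidence(count_a, count_b, count_ab, total):
    """Support and confidence for a pair of candidate words wordA/wordB.
    support follows the description's formula (1); P(A) = count_a/total
    is an assumed estimate for formula (3)."""
    support = count_ab / (count_a + count_b)   # (1) P(A,B)
    p_a = count_a / total                      # assumed estimate of P(A)
    confidence = support / p_a                 # (3) P(B|A) = P(A,B) / P(A)
    return support, confidence

# wordA seen 30 times, wordB 10 times, co-occurring 8 times, 100 tuples total
s, c = support_and_confidence(count_a=30, count_b=10, count_ab=8, total=100)
```

With these numbers, support = 8/40 = 0.2 and confidence = 0.2 / 0.3 ≈ 0.667; step S33 would then compare both values against the preset minimum thresholds.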
Step S4: filtering and extracting the new words
Step S4 uses a dictionary of common words to filter the candidate word set and extract the new words into the new word set. Its concrete operations are:
Step S41: take the intersection of the first candidate word set W_s and the second candidate word set W_t as the candidate new word set W_h.
Step S42: filter the candidate new word set W_h with the common-word dictionary, removing the words already contained in the dictionary; the remaining words are saved into the new word set W as the extracted new words.
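Steps S41 and S42 amount to a set intersection followed by a dictionary filter; a minimal sketch with made-up example words (the identifiers and sample strings are illustrative only):

```python
def extract_new_words(ws, wt, common_words):
    """Sketch of steps S41-S42: intersect the two candidate sets,
    then drop anything already in the common-word dictionary."""
    wh = set(ws) & set(wt)          # S41: candidate new word set W_h
    return wh - set(common_words)   # S42: remaining words form W

W = extract_new_words({"foo", "bar", "baz"}, {"bar", "baz", "qux"},
                      common_words={"baz"})
```

Here W_h = {"bar", "baz"}, and after removing the dictionary word "baz", the new word set W contains only "bar".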
In another aspect, to better implement the above method, an embodiment of the present invention further provides a new word discovery device, comprising:
a two-tuple set building module 10, which analyzes each target text in the target text library one by one, extracts morphemes from it to build a morpheme set H, merges the morpheme sets H obtained for each target text, then counts the frequency of occurrence of each morpheme, represents each morpheme as a two-tuple, and forms a two-tuple set T;
an information entropy analysis module 20, which computes with the information entropy algorithm the context relatedness of the subset w of the morpheme t_i in each two-tuple of T, and gathers the subsets w whose context relatedness is greater than or equal to a preset relatedness threshold into a first candidate word set W_s;
an association rule analysis module 30, which computes with the association rule algorithm the support and confidence of the morpheme t_i in each two-tuple of T, and gathers the morphemes t_i whose support and confidence are both greater than or equal to the corresponding minimum thresholds into a second candidate word set W_t; and
a new word extraction module 40, which takes the intersection of W_s and W_t as the candidate new word set W_h, then filters W_h, extracts the new words and saves them into a new word set W.
The two-tuple set building module 10 in turn includes:
a cutting unit 100, which splits the target text at predetermined separator symbols (typically punctuation marks) to obtain a sentence set S, each short sentence in S being S_i = {c_1 c_2 c_3 ... c_n}, where c_i denotes a character of the sentence;
a subset building unit 102, which, for each short sentence S_i = {c_1 c_2 c_3 ... c_n}, slides a window of size m over the characters in order to build a set P = {C_1, C_2, ..., C_n}, where each subset C_i = c_i c_{i+1} c_{i+2} ... c_{i+m};
a morpheme set building unit 104, which keeps the order of the characters within C_i unchanged, segments each subset C_i of P by character to build a morpheme set h_i, and collects the morpheme sets built from all subsets of P into the morpheme set H = {h_1, h_2, ..., h_n} of this target text; and
a collecting unit 106, which processes all target texts one by one in the above manner and merges the morpheme sets H built for each into a total morpheme set.
As shown in Fig. 3, the information entropy analysis module 20 may further include:
an adjacent-character acquisition unit 200, which cuts the morpheme t_i = {c_1 c_2 ... c_n} (n >= 3) of each two-tuple in T to obtain the subset w = (c_2 ... c_{n-1}) together with its left-adjacent character c_1 and right-adjacent character c_n;
an adjacent-character collecting unit 202, which analyzes the morphemes t_i of all two-tuples in T, extracts from the morphemes t_i that contain the subset w all left-adjacent characters of w into the set L = {l_1, l_2, ..., l_n} and all right-adjacent characters of w into the set R = {r_1, r_2, ..., r_n};
an entropy computing unit 204, which computes the probability p(l_i) of each left-adjacent character l_i in L, computes its entropy H(l_i) with the entropy formula, obtains the left entropy H(L) = Σ H(l_i) of w, and computes the right entropy H(R) = Σ H(r_i) of w in the same manner; and
a relatedness comparing and collecting unit 206, which obtains the context relatedness d = min{H(L), H(R)} of the subset w, compares d with a preset threshold, and adds w to the set W_s if d exceeds the threshold.
As shown in Fig. 4, the association rule analysis module 30 may further include:
a selecting unit 300, which selects from the set T any two two-tuples obtained by analyzing the same target text, preferably two two-tuples obtained from the same short sentence;
a support and confidence computing unit 302, which denotes the morphemes t_i of the two selected two-tuples as wordA and wordB respectively and computes the support and confidence of wordA and wordB, i.e. of the corresponding two-tuples' morphemes t_i; and
a judging and collecting unit 304, which judges whether the computed support and confidence are greater than or equal to the corresponding minimum thresholds, and adds the morphemes t_i whose support and confidence both meet the thresholds to the second candidate word set W_t.
On the one hand, the method and device of the present invention analyze, via the information entropy algorithm, the contextual adjacency of the morphemes extracted from the target text, making full use of the structural information of the text and effectively improving the accuracy of judging new word boundaries. On the other hand, by means of the association rule algorithm, the invention fully combines the internal and external information of candidate words, effectively improving the accuracy of new word discovery.
If the functions described in the embodiments of the present invention are implemented as software functional modules or units and sold or used as independent products, they may be stored in a computing-device-readable storage medium. Based on this understanding, the part of the technical scheme of the embodiments that contributes beyond the prior art may be embodied as a software product stored in a storage medium and including instructions that cause a computing device (which may be a personal computer, a server, a mobile computing device, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc. The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will appreciate that various changes, revisions, replacements, and modifications may be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A new word discovery method, characterized by comprising the following steps:
analyzing and processing each target text in a target text library, extracting morphemes from the target text to build a morpheme set H, counting the frequency with which each morpheme occurs, and representing each morpheme together with its frequency as a 2-tuple, forming a 2-tuple set T;
obtaining the left-adjacent and right-adjacent characters of the subset w of the morpheme ti in each 2-tuple in the 2-tuple set T, calculating the context relation degree d of the subset w of the morpheme ti according to an information-entropy algorithm, and collecting the subsets w whose context relation degree d is greater than or equal to a preset association threshold to form a first candidate word set Ws;
calculating, with an association-rule algorithm, the support and confidence of the morpheme ti in each 2-tuple in the 2-tuple set T, and collecting the morphemes ti whose support and confidence are both greater than or equal to the corresponding minimum thresholds to form a second candidate word set Wt; and
taking the intersection of the first candidate word set Ws and the second candidate word set Wt as a candidate new-word set Wh, then filtering the candidate new-word set Wh, extracting the new words, and saving them as a new-word set W.
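The final steps of claim 1 can be sketched as follows. This is a minimal, hypothetical Python sketch: the set names Ws, Wt, Wh, and W follow the claim, while `known_words` stands in for the filtering step, whose exact criterion the claim leaves unspecified.

```python
def discover_new_words(ws, wt, known_words):
    """Intersect the entropy-based candidates (Ws) with the
    association-rule candidates (Wt) to form the candidate new-word
    set Wh, then filter out already-known words to obtain W."""
    wh = ws & wt                     # candidate new-word set Wh
    return {w for w in wh if w not in known_words}  # new-word set W
```

For example, `discover_new_words({"xy", "ab"}, {"ab", "cd"}, {"xy"})` yields `{"ab"}`: only "ab" survives both the intersection and the known-word filter.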
2. The new word discovery method according to claim 1, wherein analyzing and processing the target text in the target text library and extracting morphemes from the target text to build the morpheme set specifically comprises:
splitting the target text using predetermined delimiter symbols as the segmentation standard to obtain a sentence set S, each short sentence in the set S being Si = {c1c2c3...cn}, where ci denotes each character in the sentence;
for each short sentence Si = {c1c2c3...cn} in the set S, taking subsets with a window of size m in character order to build a set P = {C1, C2, ..., Cn}, where each subset Ci = cici+1ci+2...ci+m;
keeping the order of the characters within each Ci unchanged, extracting each subset Ci of the set P and segmenting it by character to build a morpheme set hi, and collecting the morpheme sets built from the subsets of P to obtain the morpheme set H = {h1, h2, ..., hn} of the target text, where hi is the morpheme set built from the corresponding element of P in the manner of Ci; and
processing all target texts in the above manner to build their respective morpheme sets H.
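As one illustration of claim 2's windowed construction, the sketch below splits a text into short sentences, slides a window of size m over each sentence, and collects every order-preserving contiguous substring of each window as a morpheme. The delimiter set and default window size are assumptions for illustration; the claim does not fix them.

```python
import re

def build_morphemes(text, window=3):
    """Build the morpheme set H for one target text (sketch of claim 2).

    Splits on an assumed set of delimiters, takes windows Ci of size
    `window` over each short sentence Si, and segments each window into
    all contiguous character substrings (the morpheme sets hi)."""
    sentences = [s for s in re.split(r"[,.!?;:\s]+", text) if s]
    morphemes = set()
    for s in sentences:
        for i in range(len(s)):
            chunk = s[i:i + window]          # window subset Ci
            # all contiguous substrings of Ci, character order unchanged
            for a in range(len(chunk)):
                for b in range(a + 1, len(chunk) + 1):
                    morphemes.add(chunk[a:b])
    return morphemes
```

Note that morphemes never cross a delimiter: with input `"abcd, efg"` the set contains `"bcd"` and `"efg"` but nothing that mixes the two short sentences.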
3. The new word discovery method according to claim 1, wherein calculating the context relation degree d of the left-adjacent and right-adjacent characters according to the information-entropy algorithm specifically comprises the following steps:
cutting the morpheme ti = {c1c2...cn} (n >= 3) in each 2-tuple in the 2-tuple set T to obtain the left-adjacent character c1 and the right-adjacent character cn of the subset w = (c2...cn-1) of the morpheme ti;
analyzing the morphemes ti of all 2-tuples in the set T, extracting from the morphemes ti that contain the subset w all left-adjacent characters of w to form a set L = {l1, l2, ..., ln}, and extracting all right-adjacent characters of w to form a set R = {r1, r2, ..., rn};
calculating the probability p(li) with which each left-adjacent character li in the set L occurs, then using the information-entropy formula to calculate the entropy H(li) of that left-adjacent character, the left-adjacent entropy of w being H(L) = ΣH(li), and calculating the right-adjacent entropy H(R) = ΣH(ri) of w in the same manner; and
obtaining the context relation degree d = min{H(L), H(R)} corresponding to the subset w.
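The entropies of claim 3 can be computed as Shannon entropies over the empirical neighbor distributions. The sketch below uses the conventional H = -Σ p·log2(p) reading of the information-entropy formula; the function names are illustrative, not taken from the specification.

```python
import math
from collections import Counter

def branching_entropy(neighbors):
    """Shannon entropy over the observed neighbor characters of a
    candidate subset w (claim 3's H(L) or H(R))."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

def context_relation(left_neighbors, right_neighbors):
    """Context relation degree d = min{H(L), H(R)} for subset w."""
    return min(branching_entropy(left_neighbors),
               branching_entropy(right_neighbors))
```

The min captures the intuition behind the claim: a string is a plausible word only if it occurs in varied contexts on *both* sides, so a single predictable side (entropy near zero) pulls d down.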
4. The new word discovery method according to claim 1, wherein calculating with the association-rule algorithm the support and confidence of the morpheme ti in each 2-tuple in the 2-tuple set T specifically comprises the following steps:
selecting from the 2-tuple set T any two 2-tuples derived from the same target text;
denoting the morphemes ti in the two selected 2-tuples as wordA and wordB respectively, and calculating the support and confidence of the two words wordA and wordB; and
judging whether the calculated support and confidence of the morphemes ti of the 2-tuples are greater than or equal to the corresponding minimum thresholds, and adding the morphemes ti whose support and confidence both meet or exceed those thresholds to the second candidate word set Wt.
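Under the standard association-rule definitions, and treating each target text as a transaction as claim 4 suggests, support and confidence for a pair (wordA, wordB) can be sketched as follows. The substring containment test and variable names are assumptions for illustration.

```python
def support_confidence(docs, word_a, word_b):
    """Support and confidence of the pair (wordA, wordB) over a list
    of target texts, each text treated as one transaction.

    support    = P(wordA and wordB co-occur in a text)
    confidence = P(wordB occurs | wordA occurs)"""
    n = len(docs)
    n_a = sum(1 for d in docs if word_a in d)
    n_ab = sum(1 for d in docs if word_a in d and word_b in d)
    sup = n_ab / n if n else 0.0
    conf = n_ab / n_a if n_a else 0.0
    return sup, conf
```

A candidate then enters Wt only if `sup >= min_support and conf >= min_confidence` for chosen minimum thresholds, mirroring the judging step of the claim.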
5. A new word discovery device, characterized by comprising:
a 2-tuple set construction module, which analyzes and processes each target text in a target text library, extracts morphemes from the target text to build a morpheme set H, counts the frequency with which each morpheme occurs, and represents each morpheme together with its frequency as a 2-tuple, forming a 2-tuple set T;
an information-entropy analysis module, which obtains the left-adjacent and right-adjacent characters of the subset w of the morpheme ti in each 2-tuple in the 2-tuple set T, calculates the context relation degree d of the subset w of the morpheme ti according to an information-entropy algorithm, and collects the subsets w whose context relation degree d is greater than or equal to a preset association threshold to form a first candidate word set Ws;
an association-rule analysis module, which uses an association-rule algorithm to calculate the support and confidence of the morpheme ti in each 2-tuple in the 2-tuple set T, and collects the morphemes ti whose support and confidence are both greater than or equal to the corresponding minimum thresholds to form a second candidate word set Wt; and
a new-word extraction module, which takes the intersection of the first candidate word set Ws and the second candidate word set Wt as a candidate new-word set Wh, then filters the candidate new-word set Wh, extracts the new words, and saves them as a new-word set W.
6. The new word discovery device according to claim 5, wherein the 2-tuple set construction module further comprises:
a cutting unit, which splits the target text using predetermined delimiter symbols as the segmentation standard to obtain a sentence set S, each short sentence in the set S being Si = {c1c2c3...cn}, where ci denotes each character in the sentence;
a subset construction unit, which, for each short sentence Si = {c1c2c3...cn} in the set S, takes subsets with a window of size m in character order to build a set P = {C1, C2, ..., Cn}, where each subset Ci = cici+1ci+2...ci+m;
a morpheme set construction unit, which keeps the order of the characters within each Ci unchanged, extracts each subset Ci of the set P and segments it by character to build a morpheme set hi, and collects the morpheme sets built from the subsets of P to obtain the morpheme set H = {h1, h2, ..., hn} of the target text, where hi is the morpheme set built from the corresponding element of P in the manner of Ci; and
a collection unit, which processes all target texts in the above manner to build their respective morpheme sets H and collects them to obtain a total morpheme set.
7. The new word discovery device according to claim 5, wherein the information-entropy analysis module comprises:
an adjacent-character acquisition unit, which cuts the morpheme ti = {c1c2...cn} (n >= 3) in each 2-tuple in the 2-tuple set T to obtain the left-adjacent character c1 and the right-adjacent character cn of the subset w = (c2...cn-1) of the morpheme ti;
an adjacent-character aggregation unit, which analyzes the morphemes ti of all 2-tuples in the set T, extracts from the morphemes ti that contain the subset w all left-adjacent characters of w to form a set L = {l1, l2, ..., ln}, and extracts all right-adjacent characters of w to form a set R = {r1, r2, ..., rn};
an entropy calculation unit, which calculates the probability p(li) with which each left-adjacent character li in the set L occurs, then uses the information-entropy formula to calculate the entropy H(li) of that left-adjacent character, the left-adjacent entropy of w being H(L) = ΣH(li), and calculates the right-adjacent entropy H(R) = ΣH(ri) of w in the same manner; and
a comparison and collection unit, which obtains the context relation degree d = min{H(L), H(R)} corresponding to the subset w, compares the value of d with a preset threshold, and, if d is greater than the threshold, adds the subset w to the first candidate word set Ws.
8. The new word discovery device according to claim 5, wherein the association-rule analysis module comprises:
a selection module, which selects from the 2-tuple set T any two 2-tuples derived from the same target text;
a support and confidence calculation unit, which denotes the morphemes ti in the two selected 2-tuples as wordA and wordB respectively, and calculates the support and confidence of wordA and wordB; and
a judging and collecting unit, which judges whether the calculated support and confidence of the morphemes ti of the 2-tuples are greater than or equal to the corresponding minimum thresholds, and adds the morphemes ti whose support and confidence both meet or exceed those thresholds to the second candidate word set Wt.
CN201610282625.0A 2016-04-29 2016-04-29 New word discovery method and device Pending CN105955950A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610282625.0A CN105955950A (en) 2016-04-29 2016-04-29 New word discovery method and device
PCT/CN2016/102448 WO2017185674A1 (en) 2016-04-29 2016-10-18 Method and apparatus for discovering new word


Publications (1)

Publication Number Publication Date
CN105955950A true CN105955950A (en) 2016-09-21

Family

ID=56914877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610282625.0A Pending CN105955950A (en) 2016-04-29 2016-04-29 New word discovery method and device

Country Status (2)

Country Link
CN (1) CN105955950A (en)
WO (1) WO2017185674A1 (en)


Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992766B (en) * 2017-12-29 2024-02-06 北京京东尚科信息技术有限公司 Method and device for extracting target words
CN108829658B (en) * 2018-05-02 2022-05-24 石家庄天亮教育科技有限公司 Method and device for discovering new words
CN109492224B (en) * 2018-11-07 2024-05-03 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN109670170B (en) * 2018-11-21 2023-04-07 东软集团股份有限公司 Professional vocabulary mining method and device, readable storage medium and electronic equipment
CN111368535B (en) * 2018-12-26 2024-01-16 珠海金山数字网络科技有限公司 Sensitive word recognition method, device and equipment
CN111400377B (en) * 2020-04-27 2023-09-08 新奥新智科技有限公司 Method and device for determining target data set
CN111753531B (en) * 2020-06-28 2024-03-12 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium based on artificial intelligence
CN111768842B (en) * 2020-07-06 2023-08-11 宁波方太厨具有限公司 Identification method and system of traditional Chinese medicine certification, electronic equipment and readable storage medium
CN112732934B (en) * 2021-01-11 2022-05-27 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method
CN113051912B (en) * 2021-04-08 2023-01-20 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate
CN112800173B (en) * 2021-04-14 2021-07-09 北京金山云网络技术有限公司 Standardized database and medical text library construction method and device and electronic equipment
CN113609844B (en) * 2021-07-30 2024-03-08 国网山西省电力公司晋城供电公司 Electric power professional word stock construction method based on hybrid model and clustering algorithm
CN115982390B (en) * 2023-03-17 2023-06-23 北京邮电大学 Industrial chain construction and iterative expansion development method
CN117056869A (en) * 2023-10-11 2023-11-14 轩创(广州)网络科技有限公司 Electronic information data association method and system based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955450A (en) * 2014-05-06 2014-07-30 杭州东信北邮信息技术有限公司 Automatic extraction method of new words
CN105224682A (en) * 2015-10-27 2016-01-06 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105260362A (en) * 2015-10-30 2016-01-20 小米科技有限责任公司 New word extraction method and device
CN105512109A (en) * 2015-12-11 2016-04-20 北京锐安科技有限公司 New word discovery method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216874B (en) * 2014-09-22 2017-03-29 广西财经学院 Positive and negative mode excavation method and system are weighted between the Chinese word based on coefficient correlation
CN105955950A (en) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and device


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN106776543A (en) * 2016-11-23 2017-05-31 上海智臻智能网络科技股份有限公司 New word discovery method, device, terminal and server
CN106776543B (en) * 2016-11-23 2019-09-06 上海智臻智能网络科技股份有限公司 New word discovery method, apparatus, terminal and server
CN108228712A (en) * 2017-11-30 2018-06-29 北京三快在线科技有限公司 A kind of entity method for digging and device, electronic equipment
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108845982B (en) * 2017-12-08 2021-08-20 昆明理工大学 Chinese word segmentation method based on word association characteristics
CN110851610A (en) * 2018-07-25 2020-02-28 百度在线网络技术(北京)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN110851610B (en) * 2018-07-25 2022-09-27 百度在线网络技术(北京)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN110807322A (en) * 2019-09-19 2020-02-18 平安科技(深圳)有限公司 Method, device, server and storage medium for identifying new words based on information entropy
CN110807322B (en) * 2019-09-19 2024-03-01 平安科技(深圳)有限公司 Method, device, server and storage medium for identifying new words based on information entropy
CN112560448A (en) * 2021-02-20 2021-03-26 京华信息科技股份有限公司 New word extraction method and device
CN116361442A (en) * 2023-06-02 2023-06-30 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence
CN116361442B (en) * 2023-06-02 2023-10-17 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence

Also Published As

Publication number Publication date
WO2017185674A1 (en) 2017-11-02

Similar Documents

Publication Publication Date Title
CN105955950A (en) New word discovery method and device
US10146862B2 (en) Context-based metadata generation and automatic annotation of electronic media in a computer network
CN106874292B (en) Topic processing method and device
AU2015203818B2 (en) Providing contextual information associated with a source document using information from external reference documents
CN107943792B (en) Statement analysis method and device, terminal device and storage medium
US20140379719A1 (en) System and method for tagging and searching documents
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
Alami et al. Cybercrime profiling: Text mining techniques to detect and predict criminal activities in microblog posts
CN111291177A (en) Information processing method and device and computer storage medium
US20100023505A1 (en) Search method, similarity calculation method, similarity calculation, same document matching system, and program thereof
WO2017091985A1 (en) Method and device for recognizing stop word
CN105095434A (en) Recognition method and device for timeliness requirement
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN112883734A (en) Block chain security event public opinion monitoring method and system
CN104657376A (en) Searching method and searching device for video programs based on program relationship
CN113076735A (en) Target information acquisition method and device and server
CN115795030A (en) Text classification method and device, computer equipment and storage medium
Consoli et al. A quartet method based on variable neighborhood search for biomedical literature extraction and clustering
US10353927B2 (en) Categorizing columns in a data table
CN110399464B (en) Similar news judgment method and system and electronic equipment
CN110472058B (en) Entity searching method, related equipment and computer storage medium
CN111737461A (en) Text processing method and device, electronic equipment and computer readable storage medium
Oliveira et al. A concept-based ILP approach for multi-document summarization exploring centrality and position
CN113868508B (en) Writing material query method and device, electronic equipment and storage medium
CN112784046B (en) Text clustering method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20160921)