CN108038109A - Method, system, and computer program for extracting feature words from unstructured text - Google Patents

Method, system, and computer program for extracting feature words from unstructured text

Info

Publication number
CN108038109A
Authority
CN
China
Prior art keywords
word
feature words
structured text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810120746.4A
Other languages
Chinese (zh)
Inventor
孙宏亮
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Co., Ltd.
Original Assignee
Global Tone Communication Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Tone Communication Technology Co., Ltd.
Priority to CN201810120746.4A
Publication of CN108038109A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of computer software and discloses a method, system, and computer program for extracting feature words from unstructured text. For each piece of text, the sentence is first segmented into words by a Hidden Markov Model, and each word is mapped to a vector using word2vec; all words appearing in the text are then clustered by the k-means algorithm. If a word belongs to one of the top-K classes it is not a keyword; otherwise it is a keyword. Experimental results show that the proposed feature word extraction method is clearly superior to TF-IDF in terms of both recognition rate and false recognition rate: TF-IDF achieves a recognition rate of 34.13% with a false recognition rate of 82.9%, while the proposed method achieves a recognition rate of 81.65% with a false recognition rate of 40.25%, improving the recognition rate by 47.52 percentage points while reducing the false recognition rate by 42.65 percentage points.

Description

Method, system, and computer program for extracting feature words from unstructured text
Technical field
The invention belongs to the technical field of computer software, and more particularly relates to a method, system, and computer program for extracting feature words from unstructured text.
Background technology
At present, the prior art commonly used in the trade is as follows: keywords are an important means and a convenient tool for managing and retrieving resources in the information age, and automatic keyword tagging is an important aid by which people obtain information from massive data; automatic keyword tagging has therefore become a popular research topic in natural language processing and information retrieval. At present, keyword extraction is mainly applied to ranking search-engine results and to personalized news recommendation, and the extracted object is typically long text. To judge whether a word is important in an article, an obvious metric is word frequency: important words tend to occur repeatedly in an article. The classical TF-IDF algorithm extracts keywords from exactly this statistical perspective, assigning greater weight to words that occur repeatedly. However, TF-IDF is not suited to extracting keywords from users' self-introductions. A user's self-introduction is an unstructured text of roughly 100 words, in which feature words (such as education, employer, and skills) usually occur only once. Below is a sample user self-introduction:
I graduated from Stockholm University in Sweden in 2010 and have worked in Sweden ever since; I once worked at a travel agency in Sweden handling reception for foreign tourists, and I am proficient in Swedish.
In this text, the keywords we want are "Stockholm University" and "Swedish", but each appears only once, whereas "Sweden" and "work" appear repeatedly; the TF-IDF algorithm therefore extracts the latter two words as keywords.
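To make this failure mode concrete, the following minimal sketch ranks the terms of a toy, pre-tokenized stand-in for the text above with scikit-learn's TfidfVectorizer; the tokens and the second comparison document are hypothetical, added only so that IDF is defined:

```python
# Minimal sketch: TF-IDF favors repeated words over one-off feature words.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, pre-tokenized stand-in for the segmented self-introduction above,
# plus a second document so that IDF is defined (both hypothetical).
docs = [
    "graduated stockholm_university sweden work sweden travel_agency sweden work swedish",
    "graduated beijing work china work chinese",
]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()[0]

# Rank the first document's terms by TF-IDF weight.
for term, w in sorted(zip(vec.get_feature_names_out(), weights), key=lambda p: -p[1]):
    if w > 0:
        print(f"{term}\t{w:.3f}")
# The repeated "sweden" and "work" outrank the one-off feature words
# "stockholm_university" and "swedish".
```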
In summary, the problems in the prior art are:
(1) keywords with very low word frequency cannot be extracted;
(2) the assumption that "important words occur repeatedly" does not hold in the scenario we encounter.
Difficulty and significance of solving the above technical problems: in some scenarios (e.g., users' self-introductions), the feature words are not the most frequent words in the text; on the contrary, their frequency is low, and they may occur only once. This contradicts the basic assumption of the TF-IDF algorithm, which is therefore powerless in such scenarios.
Summary of the invention
In view of the problems in the prior art, the present invention provides a method, system, and computer program for extracting feature words from unstructured text.
The present invention is achieved as follows: in the method for extracting feature words from unstructured text, for each piece of text, the sentence is first segmented into words by a Hidden Markov Model, and each word is mapped to a vector using word2vec; all words appearing in the text are clustered by the k-means algorithm; if a word belongs to one of the top-K classes it is not a keyword, otherwise it is a keyword.
Further, the method for extracting feature words from unstructured text comprises the following steps:
Step 1: perform word segmentation on the text using a Hidden Markov Model to obtain word set A;
Step 2: for each word a in A, perform steps 3 and 4;
Step 3: map word a to vector v using the word2vec algorithm;
Step 4: if vector v belongs to any of the K classes, a is a non-feature word; otherwise a is a feature word and a is added to set B. The resulting set B is the feature word set.
Further, the Hidden Markov Model is described by the five-tuple of parameters $\lambda = (N, M, \pi, A, B)$, where:
(1) $N$: the number of states of model $\lambda$; if $q_t$ denotes the state at time $t$, then $q_t \in \{S_1, S_2, \ldots, S_N\}$;
(2) $M$: the number of possible observations for each state; denoting the $M$ observations $V_1, V_2, \ldots, V_M$ and the observation at time $t$ as $O_t$, then $O_t \in \{V_1, V_2, \ldots, V_M\}$;
(3) $A$: the state transition probability matrix, $A = (a_{ij})_{N \times N}$, where $a_{ij}$ is the probability of being in state $S_i$ at time $t$ and transitioning to state $S_j$ at time $t+1$:
$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N$;
(4) $B$: the observation probability matrix, $B = (b_{jk})_{N \times M}$, where $b_{jk}$ is the probability of emitting symbol $V_k$ in state $S_j$, i.e. $b_{jk} = P(O_t = V_k \mid q_t = S_j), \ 1 \le j \le N, \ 1 \le k \le M$;
(5) $\pi$: the initial state probabilities, $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$, where $\pi_i = P(q_1 = S_i), \ 1 \le i \le N$, describing the probability distribution over the model's states at time $t = 1$;
Since $a_{ij}$, $b_{jk}$, and $\pi_i$ are all probabilities, they must satisfy the normalization conditions:
$\sum_{j=1}^{N} a_{ij} = 1, \qquad \sum_{k=1}^{M} b_{jk} = 1, \qquad \sum_{i=1}^{N} \pi_i = 1.$
Further, word2vec includes the CBOW model and the Skip-gram model; both models comprise three layers: an input layer, a projection layer, and an output layer;
CBOW model:
Input layer: contains the word vectors of the $2c$ words in $\mathrm{Context}(\omega)$: $v(\mathrm{Context}(\omega)_1), v(\mathrm{Context}(\omega)_2), \ldots, v(\mathrm{Context}(\omega)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the length of the word vectors;
Projection layer: the $2c$ input vectors are summed:
$x_w = \sum_{i=1}^{2c} v(\mathrm{Context}(\omega)_i) \in \mathbb{R}^m;$
Output layer: corresponds to a binary tree, namely a Huffman tree constructed with the words appearing in the corpus as leaf nodes and each word's occurrence count as its weight; this Huffman tree has $N$ ($= |D|$) leaf nodes, each corresponding to a word in dictionary $D$, and $N - 1$ non-leaf nodes;
Skip-gram model:
Input layer: contains only the word vector $v(\omega) \in \mathbb{R}^m$ of the center word $\omega$ of the current sample;
Projection layer: an identity projection; $v(\omega)$ is projected to $v(\omega)$;
Output layer: like the CBOW model, the output layer is also a Huffman tree.
Further, the K-means clustering algorithm specifically includes:
1) randomly select $K$ documents from the $N$ documents as centroids;
2) for each remaining document, measure its distance to each centroid and assign it to the class of the nearest centroid;
3) recompute the centroid of each resulting class;
4) iterate steps 2) to 3) until the new centroids equal the previous ones or the change falls below a specified threshold, at which point the algorithm terminates;
the objective is:
$v = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2.$
Another object of the present invention is to provide a computer program implementing the described method for extracting feature words from unstructured text.
Another object of the present invention is to provide an information data processing terminal implementing the described method for extracting feature words from unstructured text.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the described method for extracting feature words from unstructured text.
Another object of the present invention is to provide a system for extracting feature words from unstructured text that implements the described method, the system comprising:
a word segmentation module, for performing word segmentation on the corpus using an HMM;
a mapping module, for mapping words to vectors using word2vec;
a classification module, for clustering the resulting vectors using the k-means algorithm to obtain the top-K classes.
Another object of the present invention is to provide an information data processing terminal equipped with the described system for extracting feature words from unstructured text.
In summary, the advantages and positive effects of the present invention are as follows: the proposed feature word extraction method is clearly superior to TF-IDF in terms of both recognition rate and false recognition rate. The application effect of the present invention is explained in detail below with reference to an experiment. To verify the accuracy of the proposed method, 1000 translators' self-introduction texts and their corresponding tags (3750 in total) were randomly extracted from the backend database of the "Looking for Translation" APP as the validation data set. Since the tags are filled in by the translators themselves and correspond to the self-introductions, they can be regarded as the feature words of the self-introductions. TF-IDF and the proposed algorithm were each used to extract feature words from the 1000 texts; the experimental results are shown in the following table.
The results show that TF-IDF achieves a recognition rate of 34.13% with a false recognition rate of 82.9%, while the feature word extraction method proposed in this patent achieves a recognition rate of 81.65% with a false recognition rate of 40.25%: the recognition rate improves by 47.52 percentage points while the false recognition rate drops by 42.65 percentage points. The experimental results therefore show that the proposed feature word extraction method is clearly superior to TF-IDF in terms of both recognition rate and false recognition rate.
Brief description of the drawings
Fig. 1 is a flowchart of the method for extracting feature words from unstructured text provided by an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the system for extracting feature words from unstructured text provided by an embodiment of the present invention;
In the figure: 1, word segmentation module; 2, mapping module; 3, classification module.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to embodiments. It should be understood that the specific embodiments described here are merely illustrative of the present invention and are not intended to limit it.
Keywords are an important means and a convenient tool for managing and retrieving resources in the information age; automatic keyword tagging is an important aid by which people obtain information from massive data.
As shown in Fig. 1, the method for extracting feature words from unstructured text provided by an embodiment of the present invention includes the following steps:
S101: for each corpus text, perform steps S102 and S103;
S102: perform word segmentation on the corpus using an HMM;
S103: map the words resulting from step S102 to vectors using word2vec;
S104: cluster the vectors obtained in steps S101 to S103 using the k-means algorithm to obtain the top-K classes.
As shown in Fig. 2, the system for extracting feature words from unstructured text provided by an embodiment of the present invention includes:
a word segmentation module 1, for performing word segmentation on the corpus using an HMM;
a mapping module 2, for mapping words to vectors using word2vec;
a classification module 3, for clustering the resulting vectors using the k-means algorithm to obtain the top-K classes.
The application principle of the present invention is further described below with reference to a specific embodiment.
Based on the non-feature word set, for an unstructured self-introduction text, the steps for extracting feature words are as follows (a concrete code sketch follows the list):
1. perform word segmentation on the text using the HMM to obtain word set A;
2. for each word a in A, perform steps 3 and 4;
3. map word a to vector v using the word2vec algorithm;
4. if vector v belongs to any of the K classes, a is a non-feature word; otherwise a is a feature word, and a is added to set B;
5. the resulting set B is the feature word set.
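As a concrete illustration of steps 1 to 5, the following is a minimal Python sketch, not the patent's reference implementation: it assumes the jieba library for HMM-based segmentation, gensim for word2vec, and scikit-learn for k-means, and its reading of the "Top K classes" as the top_k largest clusters, like all parameter values, is an assumption:

```python
# Minimal sketch of steps 1-5, assuming jieba (HMM-based segmentation),
# gensim (word2vec), and scikit-learn (k-means).
from collections import Counter

import jieba  # segments Chinese text; uses an HMM for out-of-vocabulary words
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans


def extract_feature_words(corpus, k=10, top_k=3):
    """corpus: list of raw text strings. Returns one feature-word set per text."""
    # Step 1: segment each text into a word list (word set A per text).
    segmented = [[w for w in jieba.cut(text, HMM=True) if w.strip()]
                 for text in corpus]

    # Step 3: train word2vec on the segmented corpus; each word a maps to a vector v.
    w2v = Word2Vec(sentences=segmented, vector_size=100, window=5, min_count=1)
    vocab = list(w2v.wv.index_to_key)
    vectors = np.array([w2v.wv[w] for w in vocab])

    # Cluster all word vectors into K classes.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)

    # Treat the top_k largest clusters as the non-feature ("common word") classes.
    sizes = Counter(km.labels_.tolist())
    common = {label for label, _ in sizes.most_common(top_k)}
    cluster_of = dict(zip(vocab, km.labels_))

    # Steps 2, 4, 5: a word in a common cluster is a non-feature word;
    # every other word is added to the feature-word set B.
    return [{w for w in set(words) if cluster_of[w] not in common}
            for words in segmented]
```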
Basic principle of the HMM (Hidden Markov Model): if the "future" of a process depends only on the "present" and not on the "past", the process has the Markov property and is called a Markov process. A Markov chain is a Markov process in which both time and state parameters are discrete. The HMM was developed on the basis of the Markov chain: because real problems are more complex than what a Markov chain model can describe, the observations are not in one-to-one correspondence with the states but are related to them through a set of probability distributions; such a model is called an HMM. An HMM is a doubly stochastic process: one component is a Markov chain, the basic stochastic process, which describes the state transitions and is hidden; the other stochastic process describes the statistical correspondence between states and observations, and can be observed.
Definition of the HMM: an HMM with $N$ states (denoted $S_1, S_2, \ldots, S_N$) is described by the five-tuple of parameters $\lambda = (N, M, \pi, A, B)$, where:
(1) $N$: the number of states of model $\lambda$; if $q_t$ denotes the state at time $t$, then $q_t \in \{S_1, S_2, \ldots, S_N\}$.
(2) $M$: the number of possible observations for each state; denoting the $M$ observations $V_1, V_2, \ldots, V_M$ and the observation at time $t$ as $O_t$, then $O_t \in \{V_1, V_2, \ldots, V_M\}$.
(3) $A$: the state transition probability matrix, $A = (a_{ij})_{N \times N}$, where $a_{ij}$ is the probability of being in state $S_i$ at time $t$ and transitioning to state $S_j$ at time $t+1$:
$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N.$
(4) $B$: the observation probability matrix, $B = (b_{jk})_{N \times M}$, where $b_{jk}$ is the probability of emitting symbol $V_k$ in state $S_j$, i.e. $b_{jk} = P(O_t = V_k \mid q_t = S_j), \ 1 \le j \le N, \ 1 \le k \le M.$
(5) $\pi$: the initial state probabilities, $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$, where $\pi_i = P(q_1 = S_i), \ 1 \le i \le N$, describing the probability distribution over the model's states at time $t = 1$.
Since $a_{ij}$, $b_{jk}$, and $\pi_i$ are all probabilities, they must satisfy the normalization conditions:
$\sum_{j=1}^{N} a_{ij} = 1, \qquad \sum_{k=1}^{M} b_{jk} = 1, \qquad \sum_{i=1}^{N} \pi_i = 1.$
An HMM thus consists of two parts. The first is a Markov chain, described by $\pi$ and $A$, whose state transitions are associated with a set of probability distributions; it describes how each short-time stationary segment transitions to the next, and the output this process produces is the state sequence. The second is a stochastic process describing the statistical relationship between states and observations, by which the hidden states are characterized through the observed sequence; it is described by $B$, and the output it produces is the observation sequence.
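To make the doubly stochastic process concrete for segmentation, below is a minimal Viterbi decoding sketch under the common BMES tagging scheme, in which the hidden states tag each character as the Begin, Middle, or End of a word or as a Single-character word; the toy parameters are hypothetical, not the patent's trained model:

```python
import math

# Hidden states of a BMES segmentation HMM (hypothetical toy parameters;
# a real segmenter estimates pi, A, and B from a labeled corpus).
states = ["B", "M", "E", "S"]
start_p = {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4}                # pi
trans_p = {"B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.3, "E": 0.7},  # A
           "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.6, "S": 0.4}}


def viterbi(obs, emit_p):
    """Most likely BMES tag sequence for the character sequence obs.

    emit_p[s] maps a character to its emission probability b_s(o);
    unseen characters fall back to a small smoothing value.
    """
    logp = lambda x: math.log(x) if x > 0 else float("-inf")
    V = [{s: logp(start_p[s]) + logp(emit_p[s].get(obs[0], 1e-8)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] + logp(trans_p[p].get(s, 0.0))
                 + logp(emit_p[s].get(obs[t], 1e-8)), p)
                for p in states
            )
            V[t][s], back[t][s] = prob, prev
    # Backtrack from the best final state.
    s = max(states, key=lambda st: V[-1][st])
    tags = [s]
    for t in range(len(obs) - 1, 0, -1):
        s = back[t][s]
        tags.append(s)
    return list(reversed(tags))


emit_p = {s: {} for s in states}      # empty tables: all emissions equally likely
print(viterbi(list("ABCD"), emit_p))  # ['B', 'E', 'B', 'E'] under these toy parameters
```

Cutting the character sequence after every E or S tag yields the word set A used in step 1.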
Word2vec is a tool that converts words into vector form. The processing of text content can thereby be reduced to vector operations in a vector space, and similarity in the vector space can be computed to represent the semantic similarity of texts. Word2vec uses two important models: the CBOW model and the Skip-gram model. Both models comprise three layers: an input layer, a projection layer, and an output layer.
Taking the CBOW model with a sample $(\mathrm{Context}(\omega), \omega)$ as an example (assuming $\mathrm{Context}(\omega)$ consists of the $c$ words before and after $\omega$), the three layers are briefly described as follows.
Input layer: contains the word vectors of the $2c$ words in $\mathrm{Context}(\omega)$: $v(\mathrm{Context}(\omega)_1), v(\mathrm{Context}(\omega)_2), \ldots, v(\mathrm{Context}(\omega)_{2c}) \in \mathbb{R}^m$. Here $m$ denotes the length of the word vectors.
Projection layer: the $2c$ input vectors are summed, i.e.
$x_w = \sum_{i=1}^{2c} v(\mathrm{Context}(\omega)_i) \in \mathbb{R}^m.$
Output layer: corresponds to a binary tree, namely a Huffman tree constructed with the words appearing in the corpus as leaf nodes and each word's occurrence count as its weight. This Huffman tree has $N$ ($= |D|$) leaf nodes, each corresponding to a word in dictionary $D$, and $N - 1$ non-leaf nodes.
Taking the Skip-gram model with a sample $(\omega, \mathrm{Context}(\omega))$ as an example, the three layers are briefly described as follows.
1. Input layer: contains only the word vector $v(\omega) \in \mathbb{R}^m$ of the center word $\omega$ of the current sample.
2. Projection layer: an identity projection; $v(\omega)$ is projected to $v(\omega)$.
3. Output layer: like the CBOW model, the output layer is also a Huffman tree.
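Both models are available in the open-source gensim library; the following minimal sketch (toy corpus and illustrative parameters, not the patent's configuration) trains each variant, where sg selects CBOW (sg=0) or Skip-gram (sg=1) and hs=1 enables the hierarchical-softmax Huffman output layer described above:

```python
# Minimal gensim sketch: CBOW vs. Skip-gram with a Huffman (hierarchical
# softmax) output layer. The corpus here is a hypothetical toy example.
from gensim.models import Word2Vec

corpus = [["I", "graduated", "from", "Stockholm", "University"],
          ["I", "am", "proficient", "in", "Swedish"]]

cbow = Word2Vec(sentences=corpus, vector_size=100, window=2,
                min_count=1, sg=0, hs=1)   # sg=0: CBOW
skip = Word2Vec(sentences=corpus, vector_size=100, window=2,
                min_count=1, sg=1, hs=1)   # sg=1: Skip-gram

print(cbow.wv["Swedish"].shape)            # (100,): the vector v for one word
print(skip.wv.most_similar("Swedish"))     # nearest words in the vector space
```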
K-means is a clustering algorithm in which K denotes the number of classes and "means" denotes the mean. As the name suggests, K-means clusters data points by their means. The K-means algorithm partitions similar data points using a preset value of K and an initial centroid for each class, and obtains the optimal clustering result by iteratively optimizing the centroids after each partition.
The algorithm proceeds as follows:
1) randomly select K documents from the N documents as centroids;
2) for each remaining document, measure its distance to each centroid and assign it to the class of the nearest centroid;
3) recompute the centroid of each resulting class;
4) iterate steps 2) to 3) until the new centroids equal the previous ones or the change falls below a specified threshold, at which point the algorithm terminates;
the objective is:
$v = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2.$
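Steps 1) to 4) and the objective above correspond directly to scikit-learn's KMeans, whose inertia_ attribute is exactly the within-cluster sum of squared distances v; a minimal sketch with random stand-in data (purely illustrative):

```python
# Minimal sketch: k-means via scikit-learn; inertia_ equals the objective v above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # stand-in for 200 word vectors in R^100

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])            # cluster index of each point
print(km.cluster_centers_.shape)  # (10, 100): the centroids mu_i
print(km.inertia_)                # v = sum_i sum_{x_j in S_i} (x_j - mu_i)^2
```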
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in whole or in part in the form of a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)).
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A method for extracting feature words from unstructured text, characterized in that, for each piece of text, the sentence is first segmented into words by a Hidden Markov Model, and each word is mapped to a vector using word2vec; all words appearing in the text are clustered by the k-means algorithm; if a word belongs to one of the top-K classes it is not a keyword, otherwise it is a keyword.
2. The method for extracting feature words from unstructured text according to claim 1, characterized in that the method comprises the following steps:
    Step 1: perform word segmentation on the text using a Hidden Markov Model to obtain word set A;
    Step 2: for each word a in A, perform steps 3 and 4;
    Step 3: map word a to vector v using the word2vec algorithm;
    Step 4: if vector v belongs to any of the K classes, a is a non-feature word; otherwise a is a feature word and a is added to set B; the resulting set B is the feature word set.
3. The method for extracting feature words from unstructured text according to claim 2, characterized in that the Hidden Markov Model is described by the five-tuple of parameters $\lambda = (N, M, \pi, A, B)$, where:
    (1) $N$: the number of states of model $\lambda$; if $q_t$ denotes the state at time $t$, then $q_t \in \{S_1, S_2, \ldots, S_N\}$;
    (2) $M$: the number of possible observations for each state; denoting the $M$ observations $V_1, V_2, \ldots, V_M$ and the observation at time $t$ as $O_t$, then $O_t \in \{V_1, V_2, \ldots, V_M\}$;
    (3) $A$: the state transition probability matrix, $A = (a_{ij})_{N \times N}$, where $a_{ij}$ is the probability of being in state $S_i$ at time $t$ and transitioning to state $S_j$ at time $t+1$:
    $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N$;
    (4) $B$: the observation probability matrix, $B = (b_{jk})_{N \times M}$, where $b_{jk}$ is the probability of emitting symbol $V_k$ in state $S_j$, i.e. $b_{jk} = P(O_t = V_k \mid q_t = S_j), \ 1 \le j \le N, \ 1 \le k \le M$;
    (5) $\pi$: the initial state probabilities, $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$, where $\pi_i = P(q_1 = S_i), \ 1 \le i \le N$, describing the probability distribution over the model's states at time $t = 1$;
    since $a_{ij}$, $b_{jk}$, and $\pi_i$ are all probabilities, they must satisfy the normalization conditions:
    $\sum_{j=1}^{N} a_{ij} = 1, \qquad \sum_{k=1}^{M} b_{jk} = 1, \qquad \sum_{i=1}^{N} \pi_i = 1.$
4. The method for extracting feature words from unstructured text according to claim 2, characterized in that word2vec includes the CBOW model and the Skip-gram model, both comprising three layers: an input layer, a projection layer, and an output layer;
    CBOW model:
    Input layer: contains the word vectors of the $2c$ words in $\mathrm{Context}(\omega)$: $v(\mathrm{Context}(\omega)_1), v(\mathrm{Context}(\omega)_2), \ldots, v(\mathrm{Context}(\omega)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the length of the word vectors;
    Projection layer: the $2c$ input vectors are summed:
    $x_w = \sum_{i=1}^{2c} v(\mathrm{Context}(\omega)_i) \in \mathbb{R}^m;$
    Output layer: corresponds to a binary tree, namely a Huffman tree constructed with the words appearing in the corpus as leaf nodes and each word's occurrence count as its weight; this Huffman tree has $N$ ($= |D|$) leaf nodes, each corresponding to a word in dictionary $D$, and $N - 1$ non-leaf nodes;
    Skip-gram model:
    Input layer: contains only the word vector $v(\omega) \in \mathbb{R}^m$ of the center word $\omega$ of the current sample;
    Projection layer: an identity projection; $v(\omega)$ is projected to $v(\omega)$;
    Output layer: like the CBOW model, the output layer is also a Huffman tree.
5. The method for extracting feature words from unstructured text according to claim 2, characterized in that the K-means clustering algorithm specifically includes:
    1) randomly select $K$ documents from the $N$ documents as centroids;
    2) for each remaining document, measure its distance to each centroid and assign it to the class of the nearest centroid;
    3) recompute the centroid of each resulting class;
    4) iterate steps 2) to 3) until the new centroids equal the previous ones or the change falls below a specified threshold, at which point the algorithm terminates;
    the objective being:
    $v = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2.$
6. A computer program implementing the method for extracting feature words from unstructured text according to any one of claims 1 to 5.
7. An information data processing terminal implementing the method for extracting feature words from unstructured text according to any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method for extracting feature words from unstructured text according to any one of claims 1 to 5.
9. A system for extracting feature words from unstructured text implementing the method according to claim 1, characterized in that the system comprises:
    a word segmentation module, for performing word segmentation on the corpus using an HMM;
    a mapping module, for mapping words to vectors using word2vec;
    a classification module, for clustering the resulting vectors using the k-means algorithm to obtain the top-K classes.
10. An information data processing terminal provided with the system for extracting feature words from unstructured text according to claim 9.
CN201810120746.4A 2018-02-07 2018-02-07 Method, system, and computer program for extracting feature words from unstructured text Pending CN108038109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810120746.4A CN108038109A (en) 2018-02-07 2018-02-07 Method, system, and computer program for extracting feature words from unstructured text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810120746.4A CN108038109A (en) 2018-02-07 2018-02-07 Method, system, and computer program for extracting feature words from unstructured text

Publications (1)

Publication Number Publication Date
CN108038109A true CN108038109A (en) 2018-05-15

Family

ID=62096792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810120746.4A Pending CN108038109A (en) Method, system, and computer program for extracting feature words from unstructured text

Country Status (1)

Country Link
CN (1) CN108038109A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493936A (en) * 2018-10-16 2019-03-19 华东理工大学 A method for detecting abnormal medication using an improved continuous bag-of-words model
CN110852097A (en) * 2019-10-15 2020-02-28 平安科技(深圳)有限公司 Feature word extraction method, text similarity calculation method, device and equipment
CN112541083A (en) * 2020-12-23 2021-03-23 西安交通大学 Text classification method based on active learning hybrid neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145623B1 (en) * 2009-05-01 2012-03-27 Google Inc. Query ranking based on query clustering and categorization
CN106326481A (en) * 2016-08-31 2017-01-11 中译语通科技(北京)有限公司 Detection method for Weibo hot topics based on burstiness
CN106354714A (en) * 2016-08-29 2017-01-25 广东工业大学 A Chinese word segmentation tool based on the NLPIR Chinese word segmentation system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145623B1 (en) * 2009-05-01 2012-03-27 Google Inc. Query ranking based on query clustering and categorization
CN106354714A (en) * 2016-08-29 2017-01-25 广东工业大学 A Chinese word segmentation tool based on the NLPIR Chinese word segmentation system
CN106326481A (en) * 2016-08-31 2017-01-11 中译语通科技(北京)有限公司 Detection method for Weibo hot topics based on burstiness

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
刘海天 et al.: "Hidden Markov Models and Their Applications in Natural Language Processing", 《微处理机》 (Microprocessors) *
周练: "Exploring the Working Principle and Applications of Word2vec", 《科技情报开发与经济》 (Sci-Tech Information Development & Economy) *
李跃鹏 et al.: "A Keyword Extraction Algorithm Based on word2vec", 《科研信息化技术与应用》 (E-Science Technology & Application) *
熊志斌 et al.: "Research and Application of the K-means Clustering Algorithm", 《电脑编程技巧与维护》 (Computer Programming Skills & Maintenance) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493936A (en) * 2018-10-16 2019-03-19 华东理工大学 A method for detecting abnormal medication using an improved continuous bag-of-words model
CN109493936B (en) * 2018-10-16 2022-02-15 华东理工大学 Method for detecting abnormal medication by using improved continuous bag-of-words model
CN110852097A (en) * 2019-10-15 2020-02-28 平安科技(深圳)有限公司 Feature word extraction method, text similarity calculation method, device and equipment
CN110852097B (en) * 2019-10-15 2022-02-01 平安科技(深圳)有限公司 Feature word extraction method, text similarity calculation method, device and equipment
CN112541083A (en) * 2020-12-23 2021-03-23 西安交通大学 Text classification method based on active learning hybrid neural network

Similar Documents

Publication Publication Date Title
US10565244B2 (en) System and method for text categorization and sentiment analysis
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
Garreta et al. Learning scikit-learn: machine learning in python
CN111353303B (en) Word vector construction method and device, electronic equipment and storage medium
CN111159409B (en) Text classification method, device, equipment and medium based on artificial intelligence
WO2022126944A1 (en) Text clustering method, electronic device and storage medium
US11645447B2 (en) Encoding textual information for text analysis
KR102444457B1 (en) Method for dialogue summarization with word graphs
CN113836295B (en) Text abstract extraction method, system, terminal and storage medium
WO2021190662A1 (en) Medical text sorting method and apparatus, electronic device, and storage medium
KR20180129001A (en) Method and System for Entity summarization based on multilingual projected entity space
CA3131157A1 (en) System and method for text categorization and sentiment analysis
CN108038109A (en) Method and system, the computer program of Feature Words are extracted from non-structured text
CN110674297A (en) Public opinion text classification model construction method, public opinion text classification device and public opinion text classification equipment
CN114416926A (en) Keyword matching method and device, computing equipment and computer readable storage medium
JP2024012152A (en) Method for identify word corresponding to target word in text information
US20230032208A1 (en) Augmenting data sets for machine learning models
CN112364666B (en) Text characterization method and device and computer equipment
US9910890B2 (en) Synthetic events to chain queries against structured data
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN113704466A (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN110633363A (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision
Huang et al. [Retracted] Hybrid Graph Neural Network Model Design and Modeling Reasoning for Text Feature Extraction and Recognition
CN113220841B (en) Method, apparatus, electronic device and storage medium for determining authentication information
CN117149999B (en) Class case recommendation method and device based on legal element hierarchical network and text characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180515