CN108038109A - Method, system, and computer program for extracting feature words from unstructured text - Google Patents

Method, system, and computer program for extracting feature words from unstructured text

Info

Publication number
CN108038109A
Authority
CN
China
Prior art keywords
word
feature words
structured text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810120746.4A
Other languages
Chinese (zh)
Inventor
孙宏亮
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Co., Ltd.
Original Assignee
Global Tone Communication Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Tone Communication Technology Co., Ltd.
Priority to CN201810120746.4A
Publication of CN108038109A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of computer software and discloses a method, system, and computer program for extracting feature words from unstructured text. For each piece of text, the sentence is first segmented into words by a Hidden Markov Model, and each word is mapped to a vector using word2vec; all words appearing in the text are then clustered by the k-means algorithm. If a word belongs to one of the top-K classes it is not a keyword; otherwise it is a keyword. Experimental results show that the proposed feature word extraction method is clearly superior to TF-IDF in terms of both recognition rate and false recognition rate: TF-IDF achieves a recognition rate of 34.13% with a false recognition rate of 82.9%, while the proposed method achieves a recognition rate of 81.65% with a false recognition rate of 40.25%, improving the recognition rate by 47.52 percentage points while reducing the false recognition rate by 42.65 percentage points.

Description

Method, system, and computer program for extracting feature words from unstructured text
Technical field
The invention belongs to the technical field of computer software, and more particularly relates to a method, system, and computer program for extracting feature words from unstructured text.
Background technology
At present, the prior art commonly used in the trade is as follows: keywords are an important means and a convenient tool for managing and retrieving resources in the information age, and automatic keyword tagging is an important aid by which people obtain information from massive data; automatic keyword tagging has therefore become a popular research topic in natural language processing and information retrieval. At present, keyword extraction is mainly applied to ranking search-engine results and to personalized news recommendation, and the extracted object is typically long text. To judge whether a word is important in an article, an obvious metric is word frequency: important words tend to occur repeatedly in an article. The classical TF-IDF algorithm extracts keywords from exactly this statistical perspective, assigning greater weight to words that occur repeatedly. However, TF-IDF is not suited to extracting keywords from users' self-introductions. A user's self-introduction is an unstructured text of roughly 100 words, in which feature words (such as education, employer, and skills) usually occur only once. Below is a sample user self-introduction:
I graduated from Stockholm University in Sweden in 2010 and have worked in Sweden ever since; I once worked at a travel agency in Sweden handling reception for foreign tourists, and I am proficient in Swedish.
In this text, the keywords we want are "Stockholm University" and "Swedish", but each appears only once, whereas "Sweden" and "work" appear repeatedly; the TF-IDF algorithm therefore extracts the latter two words as keywords.
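To make this failure mode concrete, the following minimal sketch ranks the terms of a toy, pre-tokenized stand-in for the text above with scikit-learn's TfidfVectorizer; the tokens and the second comparison document are hypothetical, added only so that IDF is defined:

```python
# Minimal sketch: TF-IDF favors repeated words over one-off feature words.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, pre-tokenized stand-in for the segmented self-introduction above,
# plus a second document so that IDF is defined (both hypothetical).
docs = [
    "graduated stockholm_university sweden work sweden travel_agency sweden work swedish",
    "graduated beijing work china work chinese",
]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()[0]

# Rank the first document's terms by TF-IDF weight.
for term, w in sorted(zip(vec.get_feature_names_out(), weights), key=lambda p: -p[1]):
    if w > 0:
        print(f"{term}\t{w:.3f}")
# The repeated "sweden" and "work" outrank the one-off feature words
# "stockholm_university" and "swedish".
```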
In summary, the problems in the prior art are:
(1) keywords with very low word frequency cannot be extracted;
(2) the assumption that "important words occur repeatedly" does not hold in the scenario we encounter.
Difficulty and significance of solving the above technical problems: in some scenarios (e.g., users' self-introductions), the feature words are not the most frequent words in the text; on the contrary, their frequency is low, and they may occur only once. This contradicts the basic assumption of the TF-IDF algorithm, which is therefore powerless in such scenarios.
Summary of the invention
In view of the problems in the prior art, the present invention provides a method, system, and computer program for extracting feature words from unstructured text.
The present invention is achieved as follows: in the method for extracting feature words from unstructured text, for each piece of text, the sentence is first segmented into words by a Hidden Markov Model, and each word is mapped to a vector using word2vec; all words appearing in the text are clustered by the k-means algorithm; if a word belongs to one of the top-K classes it is not a keyword, otherwise it is a keyword.
Further, the method for extracting feature words from unstructured text comprises the following steps:
Step 1: perform word segmentation on the text using a Hidden Markov Model to obtain word set A;
Step 2: for each word a in A, perform steps 3 and 4;
Step 3: map word a to vector v using the word2vec algorithm;
Step 4: if vector v belongs to any of the K classes, a is a non-feature word; otherwise a is a feature word and a is added to set B. The resulting set B is the feature word set.
Further, the Hidden Markov Model is described by the five-tuple of parameters $\lambda = (N, M, \pi, A, B)$, where:
(1) $N$: the number of states of model $\lambda$; if $q_t$ denotes the state at time $t$, then $q_t \in \{S_1, S_2, \ldots, S_N\}$;
(2) $M$: the number of possible observations for each state; denoting the $M$ observations $V_1, V_2, \ldots, V_M$ and the observation at time $t$ as $O_t$, then $O_t \in \{V_1, V_2, \ldots, V_M\}$;
(3) $A$: the state transition probability matrix, $A = (a_{ij})_{N \times N}$, where $a_{ij}$ is the probability of being in state $S_i$ at time $t$ and transitioning to state $S_j$ at time $t+1$:
$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N$;
(4) $B$: the observation probability matrix, $B = (b_{jk})_{N \times M}$, where $b_{jk}$ is the probability of emitting symbol $V_k$ in state $S_j$, i.e. $b_{jk} = P(O_t = V_k \mid q_t = S_j), \ 1 \le j \le N, \ 1 \le k \le M$;
(5) $\pi$: the initial state probabilities, $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$, where $\pi_i = P(q_1 = S_i), \ 1 \le i \le N$, describing the probability distribution over the model's states at time $t = 1$;
Since $a_{ij}$, $b_{jk}$, and $\pi_i$ are all probabilities, they must satisfy the normalization conditions:
$\sum_{j=1}^{N} a_{ij} = 1, \qquad \sum_{k=1}^{M} b_{jk} = 1, \qquad \sum_{i=1}^{N} \pi_i = 1.$
Further, word2vec includes the CBOW model and the Skip-gram model; both models comprise three layers: an input layer, a projection layer, and an output layer;
CBOW model:
Input layer: contains the word vectors of the $2c$ words in $\mathrm{Context}(\omega)$: $v(\mathrm{Context}(\omega)_1), v(\mathrm{Context}(\omega)_2), \ldots, v(\mathrm{Context}(\omega)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the length of the word vectors;
Projection layer: the $2c$ input vectors are summed:
$x_w = \sum_{i=1}^{2c} v(\mathrm{Context}(\omega)_i) \in \mathbb{R}^m;$
Output layer: corresponds to a binary tree, namely a Huffman tree constructed with the words appearing in the corpus as leaf nodes and each word's occurrence count as its weight; this Huffman tree has $N$ ($= |D|$) leaf nodes, each corresponding to a word in dictionary $D$, and $N - 1$ non-leaf nodes;
Skip-gram model:
Input layer: contains only the word vector $v(\omega) \in \mathbb{R}^m$ of the center word $\omega$ of the current sample;
Projection layer: an identity projection; $v(\omega)$ is projected to $v(\omega)$;
Output layer: like the CBOW model, the output layer is also a Huffman tree.
Further, the K-means clustering algorithm specifically includes:
1) randomly select $K$ documents from the $N$ documents as centroids;
2) for each remaining document, measure its distance to each centroid and assign it to the class of the nearest centroid;
3) recompute the centroid of each resulting class;
4) iterate steps 2) to 3) until the new centroids equal the previous ones or the change falls below a specified threshold, at which point the algorithm terminates;
the objective is:
$v = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2.$
Another object of the present invention is to provide a computer program implementing the described method for extracting feature words from unstructured text.
Another object of the present invention is to provide an information data processing terminal implementing the described method for extracting feature words from unstructured text.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the described method for extracting feature words from unstructured text.
Another object of the present invention is to provide a system for extracting feature words from unstructured text that implements the described method, the system comprising:
a word segmentation module, for performing word segmentation on the corpus using an HMM;
a mapping module, for mapping words to vectors using word2vec;
a classification module, for clustering the resulting vectors using the k-means algorithm to obtain the top-K classes.
Another object of the present invention is to provide an information data processing terminal equipped with the described system for extracting feature words from unstructured text.
In summary, the advantages and positive effects of the present invention are as follows: the proposed feature word extraction method is clearly superior to TF-IDF in terms of both recognition rate and false recognition rate. The application effect of the present invention is explained in detail below with reference to an experiment. To verify the accuracy of the proposed method, 1000 translators' self-introduction texts and their corresponding tags (3750 in total) were randomly extracted from the backend database of the "Looking for Translation" APP as the validation data set. Since the tags are filled in by the translators themselves and correspond to the self-introductions, they can be regarded as the feature words of the self-introductions. TF-IDF and the proposed algorithm were each used to extract feature words from the 1000 texts; the experimental results are shown in the following table.
The results show that TF-IDF achieves a recognition rate of 34.13% with a false recognition rate of 82.9%, while the feature word extraction method proposed in this patent achieves a recognition rate of 81.65% with a false recognition rate of 40.25%: the recognition rate improves by 47.52 percentage points while the false recognition rate drops by 42.65 percentage points. The experimental results therefore show that the proposed feature word extraction method is clearly superior to TF-IDF in terms of both recognition rate and false recognition rate.
Brief description of the drawings
Fig. 1 is a flowchart of the method for extracting feature words from unstructured text provided by an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the system for extracting feature words from unstructured text provided by an embodiment of the present invention;
In the figure: 1, word segmentation module; 2, mapping module; 3, classification module.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to embodiments. It should be understood that the specific embodiments described here are merely illustrative of the present invention and are not intended to limit it.
Keywords are an important means and a convenient tool for managing and retrieving resources in the information age; automatic keyword tagging is an important aid by which people obtain information from massive data.
As shown in Fig. 1, the method for extracting feature words from unstructured text provided by an embodiment of the present invention includes the following steps:
S101: for each corpus text, perform steps S102 and S103;
S102: perform word segmentation on the corpus using an HMM;
S103: map the words resulting from step S102 to vectors using word2vec;
S104: cluster the vectors obtained in steps S101 to S103 using the k-means algorithm to obtain the top-K classes.
As shown in Fig. 2, the system for extracting feature words from unstructured text provided by an embodiment of the present invention includes:
a word segmentation module 1, for performing word segmentation on the corpus using an HMM;
a mapping module 2, for mapping words to vectors using word2vec;
a classification module 3, for clustering the resulting vectors using the k-means algorithm to obtain the top-K classes.
The application principle of the present invention is further described below with reference to a specific embodiment.
Based on the non-feature word set, for an unstructured self-introduction text, the steps for extracting feature words are as follows (a concrete code sketch follows the list):
1. perform word segmentation on the text using the HMM to obtain word set A;
2. for each word a in A, perform steps 3 and 4;
3. map word a to vector v using the word2vec algorithm;
4. if vector v belongs to any of the K classes, a is a non-feature word; otherwise a is a feature word, and a is added to set B;
5. the resulting set B is the feature word set.
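As a concrete illustration of steps 1 to 5, the following is a minimal Python sketch, not the patent's reference implementation: it assumes the jieba library for HMM-based segmentation, gensim for word2vec, and scikit-learn for k-means, and its reading of the "Top K classes" as the top_k largest clusters, like all parameter values, is an assumption:

```python
# Minimal sketch of steps 1-5, assuming jieba (HMM-based segmentation),
# gensim (word2vec), and scikit-learn (k-means).
from collections import Counter

import jieba  # segments Chinese text; uses an HMM for out-of-vocabulary words
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans


def extract_feature_words(corpus, k=10, top_k=3):
    """corpus: list of raw text strings. Returns one feature-word set per text."""
    # Step 1: segment each text into a word list (word set A per text).
    segmented = [[w for w in jieba.cut(text, HMM=True) if w.strip()]
                 for text in corpus]

    # Step 3: train word2vec on the segmented corpus; each word a maps to a vector v.
    w2v = Word2Vec(sentences=segmented, vector_size=100, window=5, min_count=1)
    vocab = list(w2v.wv.index_to_key)
    vectors = np.array([w2v.wv[w] for w in vocab])

    # Cluster all word vectors into K classes.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)

    # Treat the top_k largest clusters as the non-feature ("common word") classes.
    sizes = Counter(km.labels_.tolist())
    common = {label for label, _ in sizes.most_common(top_k)}
    cluster_of = dict(zip(vocab, km.labels_))

    # Steps 2, 4, 5: a word in a common cluster is a non-feature word;
    # every other word is added to the feature-word set B.
    return [{w for w in set(words) if cluster_of[w] not in common}
            for words in segmented]
```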
Basic principle of the HMM (Hidden Markov Model): if the "future" of a process depends only on the "present" and not on the "past", the process has the Markov property and is called a Markov process. A Markov chain is a Markov process in which both time and state parameters are discrete. The HMM was developed on the basis of the Markov chain: because real problems are more complex than what a Markov chain model can describe, the observations are not in one-to-one correspondence with the states but are related to them through a set of probability distributions; such a model is called an HMM. An HMM is a doubly stochastic process: one component is a Markov chain, the basic stochastic process, which describes the state transitions and is hidden; the other stochastic process describes the statistical correspondence between states and observations, and can be observed.
Definition of the HMM: an HMM with $N$ states (denoted $S_1, S_2, \ldots, S_N$) is described by the five-tuple of parameters $\lambda = (N, M, \pi, A, B)$, where:
(1) $N$: the number of states of model $\lambda$; if $q_t$ denotes the state at time $t$, then $q_t \in \{S_1, S_2, \ldots, S_N\}$.
(2) $M$: the number of possible observations for each state; denoting the $M$ observations $V_1, V_2, \ldots, V_M$ and the observation at time $t$ as $O_t$, then $O_t \in \{V_1, V_2, \ldots, V_M\}$.
(3) $A$: the state transition probability matrix, $A = (a_{ij})_{N \times N}$, where $a_{ij}$ is the probability of being in state $S_i$ at time $t$ and transitioning to state $S_j$ at time $t+1$:
$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N.$
(4) $B$: the observation probability matrix, $B = (b_{jk})_{N \times M}$, where $b_{jk}$ is the probability of emitting symbol $V_k$ in state $S_j$, i.e. $b_{jk} = P(O_t = V_k \mid q_t = S_j), \ 1 \le j \le N, \ 1 \le k \le M.$
(5) $\pi$: the initial state probabilities, $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$, where $\pi_i = P(q_1 = S_i), \ 1 \le i \le N$, describing the probability distribution over the model's states at time $t = 1$.
Since $a_{ij}$, $b_{jk}$, and $\pi_i$ are all probabilities, they must satisfy the normalization conditions:
$\sum_{j=1}^{N} a_{ij} = 1, \qquad \sum_{k=1}^{M} b_{jk} = 1, \qquad \sum_{i=1}^{N} \pi_i = 1.$
An HMM thus consists of two parts. The first is a Markov chain, described by $\pi$ and $A$, whose state transitions are associated with a set of probability distributions; it describes how each short-time stationary segment transitions to the next, and the output this process produces is the state sequence. The second is a stochastic process describing the statistical relationship between states and observations, by which the hidden states are characterized through the observed sequence; it is described by $B$, and the output it produces is the observation sequence.
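To make the doubly stochastic process concrete for segmentation, below is a minimal Viterbi decoding sketch under the common BMES tagging scheme, in which the hidden states tag each character as the Begin, Middle, or End of a word or as a Single-character word; the toy parameters are hypothetical, not the patent's trained model:

```python
import math

# Hidden states of a BMES segmentation HMM (hypothetical toy parameters;
# a real segmenter estimates pi, A, and B from a labeled corpus).
states = ["B", "M", "E", "S"]
start_p = {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4}                # pi
trans_p = {"B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.3, "E": 0.7},  # A
           "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.6, "S": 0.4}}


def viterbi(obs, emit_p):
    """Most likely BMES tag sequence for the character sequence obs.

    emit_p[s] maps a character to its emission probability b_s(o);
    unseen characters fall back to a small smoothing value.
    """
    logp = lambda x: math.log(x) if x > 0 else float("-inf")
    V = [{s: logp(start_p[s]) + logp(emit_p[s].get(obs[0], 1e-8)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] + logp(trans_p[p].get(s, 0.0))
                 + logp(emit_p[s].get(obs[t], 1e-8)), p)
                for p in states
            )
            V[t][s], back[t][s] = prob, prev
    # Backtrack from the best final state.
    s = max(states, key=lambda st: V[-1][st])
    tags = [s]
    for t in range(len(obs) - 1, 0, -1):
        s = back[t][s]
        tags.append(s)
    return list(reversed(tags))


emit_p = {s: {} for s in states}      # empty tables: all emissions equally likely
print(viterbi(list("ABCD"), emit_p))  # ['B', 'E', 'B', 'E'] under these toy parameters
```

Cutting the character sequence after every E or S tag yields the word set A used in step 1.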
Word2vec is a tool that converts words into vector form. The processing of text content can thereby be reduced to vector operations in a vector space, and similarity in the vector space can be computed to represent the semantic similarity of texts. Word2vec uses two important models: the CBOW model and the Skip-gram model. Both models comprise three layers: an input layer, a projection layer, and an output layer.
Taking the CBOW model with a sample $(\mathrm{Context}(\omega), \omega)$ as an example (assuming $\mathrm{Context}(\omega)$ consists of the $c$ words before and after $\omega$), the three layers are briefly described as follows.
Input layer: contains the word vectors of the $2c$ words in $\mathrm{Context}(\omega)$: $v(\mathrm{Context}(\omega)_1), v(\mathrm{Context}(\omega)_2), \ldots, v(\mathrm{Context}(\omega)_{2c}) \in \mathbb{R}^m$. Here $m$ denotes the length of the word vectors.
Projection layer: the $2c$ input vectors are summed, i.e.
$x_w = \sum_{i=1}^{2c} v(\mathrm{Context}(\omega)_i) \in \mathbb{R}^m.$
Output layer: corresponds to a binary tree, namely a Huffman tree constructed with the words appearing in the corpus as leaf nodes and each word's occurrence count as its weight. This Huffman tree has $N$ ($= |D|$) leaf nodes, each corresponding to a word in dictionary $D$, and $N - 1$ non-leaf nodes.
Taking the Skip-gram model with a sample $(\omega, \mathrm{Context}(\omega))$ as an example, the three layers are briefly described as follows.
1. Input layer: contains only the word vector $v(\omega) \in \mathbb{R}^m$ of the center word $\omega$ of the current sample.
2. Projection layer: an identity projection; $v(\omega)$ is projected to $v(\omega)$.
3. Output layer: like the CBOW model, the output layer is also a Huffman tree.
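Both models are available in the open-source gensim library; the following minimal sketch (toy corpus and illustrative parameters, not the patent's configuration) trains each variant, where sg selects CBOW (sg=0) or Skip-gram (sg=1) and hs=1 enables the hierarchical-softmax Huffman output layer described above:

```python
# Minimal gensim sketch: CBOW vs. Skip-gram with a Huffman (hierarchical
# softmax) output layer. The corpus here is a hypothetical toy example.
from gensim.models import Word2Vec

corpus = [["I", "graduated", "from", "Stockholm", "University"],
          ["I", "am", "proficient", "in", "Swedish"]]

cbow = Word2Vec(sentences=corpus, vector_size=100, window=2,
                min_count=1, sg=0, hs=1)   # sg=0: CBOW
skip = Word2Vec(sentences=corpus, vector_size=100, window=2,
                min_count=1, sg=1, hs=1)   # sg=1: Skip-gram

print(cbow.wv["Swedish"].shape)            # (100,): the vector v for one word
print(skip.wv.most_similar("Swedish"))     # nearest words in the vector space
```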
K-means is a clustering algorithm in which K denotes the number of classes and "means" denotes the mean. As the name suggests, K-means clusters data points by their means. The K-means algorithm partitions similar data points using a preset value of K and an initial centroid for each class, and obtains the optimal clustering result by iteratively optimizing the centroids after each partition.
The algorithm proceeds as follows:
1) randomly select K documents from the N documents as centroids;
2) for each remaining document, measure its distance to each centroid and assign it to the class of the nearest centroid;
3) recompute the centroid of each resulting class;
4) iterate steps 2) to 3) until the new centroids equal the previous ones or the change falls below a specified threshold, at which point the algorithm terminates;
the objective is:
$v = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2.$
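Steps 1) to 4) and the objective above correspond directly to scikit-learn's KMeans, whose inertia_ attribute is exactly the within-cluster sum of squared distances v; a minimal sketch with random stand-in data (purely illustrative):

```python
# Minimal sketch: k-means via scikit-learn; inertia_ equals the objective v above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # stand-in for 200 word vectors in R^100

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])            # cluster index of each point
print(km.cluster_centers_.shape)  # (10, 100): the centroids mu_i
print(km.inertia_)                # v = sum_i sum_{x_j in S_i} (x_j - mu_i)^2
```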
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in whole or in part in the form of a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)).
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A method for extracting feature words from unstructured text, characterized in that, for each piece of text, the sentence is first segmented into words by a Hidden Markov Model, and each word is mapped to a vector using word2vec; all words appearing in the text are clustered by the k-means algorithm; if a word belongs to one of the top-K classes it is not a keyword, otherwise it is a keyword.
2. The method for extracting feature words from unstructured text according to claim 1, characterized in that the method comprises the following steps:
    Step 1: perform word segmentation on the text using a Hidden Markov Model to obtain word set A;
    Step 2: for each word a in A, perform steps 3 and 4;
    Step 3: map word a to vector v using the word2vec algorithm;
    Step 4: if vector v belongs to any of the K classes, a is a non-feature word; otherwise a is a feature word and a is added to set B; the resulting set B is the feature word set.
3. The method for extracting feature words from unstructured text according to claim 2, characterized in that the Hidden Markov Model is described by the five-tuple of parameters $\lambda = (N, M, \pi, A, B)$, where:
    (1) $N$: the number of states of model $\lambda$; if $q_t$ denotes the state at time $t$, then $q_t \in \{S_1, S_2, \ldots, S_N\}$;
    (2) $M$: the number of possible observations for each state; denoting the $M$ observations $V_1, V_2, \ldots, V_M$ and the observation at time $t$ as $O_t$, then $O_t \in \{V_1, V_2, \ldots, V_M\}$;
    (3) $A$: the state transition probability matrix, $A = (a_{ij})_{N \times N}$, where $a_{ij}$ is the probability of being in state $S_i$ at time $t$ and transitioning to state $S_j$ at time $t+1$:
    $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N$;
    (4) $B$: the observation probability matrix, $B = (b_{jk})_{N \times M}$, where $b_{jk}$ is the probability of emitting symbol $V_k$ in state $S_j$, i.e. $b_{jk} = P(O_t = V_k \mid q_t = S_j), \ 1 \le j \le N, \ 1 \le k \le M$;
    (5) $\pi$: the initial state probabilities, $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$, where $\pi_i = P(q_1 = S_i), \ 1 \le i \le N$, describing the probability distribution over the model's states at time $t = 1$;
    since $a_{ij}$, $b_{jk}$, and $\pi_i$ are all probabilities, they must satisfy the normalization conditions:
    $\sum_{j=1}^{N} a_{ij} = 1, \qquad \sum_{k=1}^{M} b_{jk} = 1, \qquad \sum_{i=1}^{N} \pi_i = 1.$
4. The method for extracting feature words from unstructured text according to claim 2, characterized in that word2vec includes the CBOW model and the Skip-gram model, both comprising three layers: an input layer, a projection layer, and an output layer;
    CBOW model:
    Input layer: contains the word vectors of the $2c$ words in $\mathrm{Context}(\omega)$: $v(\mathrm{Context}(\omega)_1), v(\mathrm{Context}(\omega)_2), \ldots, v(\mathrm{Context}(\omega)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the length of the word vectors;
    Projection layer: the $2c$ input vectors are summed:
    $x_w = \sum_{i=1}^{2c} v(\mathrm{Context}(\omega)_i) \in \mathbb{R}^m;$
    Output layer: corresponds to a binary tree, namely a Huffman tree constructed with the words appearing in the corpus as leaf nodes and each word's occurrence count as its weight; this Huffman tree has $N$ ($= |D|$) leaf nodes, each corresponding to a word in dictionary $D$, and $N - 1$ non-leaf nodes;
    Skip-gram model:
    Input layer: contains only the word vector $v(\omega) \in \mathbb{R}^m$ of the center word $\omega$ of the current sample;
    Projection layer: an identity projection; $v(\omega)$ is projected to $v(\omega)$;
    Output layer: like the CBOW model, the output layer is also a Huffman tree.
5. The method for extracting feature words from unstructured text according to claim 2, characterized in that the K-means clustering algorithm specifically includes:
    1) randomly select $K$ documents from the $N$ documents as centroids;
    2) for each remaining document, measure its distance to each centroid and assign it to the class of the nearest centroid;
    3) recompute the centroid of each resulting class;
    4) iterate steps 2) to 3) until the new centroids equal the previous ones or the change falls below a specified threshold, at which point the algorithm terminates;
    the objective being:
    $v = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2.$
6. A computer program implementing the method for extracting feature words from unstructured text according to any one of claims 1 to 5.
7. An information data processing terminal implementing the method for extracting feature words from unstructured text according to any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method for extracting feature words from unstructured text according to any one of claims 1 to 5.
9. A system for extracting feature words from unstructured text implementing the method according to claim 1, characterized in that the system comprises:
    a word segmentation module, for performing word segmentation on the corpus using an HMM;
    a mapping module, for mapping words to vectors using word2vec;
    a classification module, for clustering the resulting vectors using the k-means algorithm to obtain the top-K classes.
10. An information data processing terminal provided with the system for extracting feature words from unstructured text according to claim 9.
CN201810120746.4A 2018-02-07 2018-02-07 Method, system, and computer program for extracting feature words from unstructured text Pending CN108038109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810120746.4A CN108038109A (en) 2018-02-07 2018-02-07 Method, system, and computer program for extracting feature words from unstructured text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810120746.4A CN108038109A (en) 2018-02-07 2018-02-07 Method, system, and computer program for extracting feature words from unstructured text

Publications (1)

Publication Number Publication Date
CN108038109A true CN108038109A (en) 2018-05-15

Family

ID=62096792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810120746.4A Pending CN108038109A (en) Method, system, and computer program for extracting feature words from unstructured text

Country Status (1)

Country Link
CN (1) CN108038109A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493936A (en) * 2018-10-16 2019-03-19 华东理工大学 A method for detecting abnormal medication using an improved continuous bag-of-words model
CN110852097A (en) * 2019-10-15 2020-02-28 平安科技(深圳)有限公司 Feature word extraction method, text similarity calculation method, device and equipment
CN112541083A (en) * 2020-12-23 2021-03-23 西安交通大学 Text classification method based on active learning hybrid neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145623B1 (en) * 2009-05-01 2012-03-27 Google Inc. Query ranking based on query clustering and categorization
CN106326481A (en) * 2016-08-31 2017-01-11 中译语通科技(北京)有限公司 Detection method for Weibo hot topics based on burstiness
CN106354714A (en) * 2016-08-29 2017-01-25 广东工业大学 A Chinese word segmentation tool based on the NLPIR Chinese word segmentation system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145623B1 (en) * 2009-05-01 2012-03-27 Google Inc. Query ranking based on query clustering and categorization
CN106354714A (en) * 2016-08-29 2017-01-25 广东工业大学 A Chinese word segmentation tool based on the NLPIR Chinese word segmentation system
CN106326481A (en) * 2016-08-31 2017-01-11 中译语通科技(北京)有限公司 Detection method for Weibo hot topics based on burstiness

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
刘海天 et al.: "Hidden Markov Models and Their Applications in Natural Language Processing", 《微处理机》 (Microprocessors) *
周练: "Exploring the Working Principle and Applications of Word2vec", 《科技情报开发与经济》 (Sci-Tech Information Development & Economy) *
李跃鹏 et al.: "A Keyword Extraction Algorithm Based on word2vec", 《科研信息化技术与应用》 (E-Science Technology & Application) *
熊志斌 et al.: "Research and Application of the K-means Clustering Algorithm", 《电脑编程技巧与维护》 (Computer Programming Skills & Maintenance) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493936A (en) * 2018-10-16 2019-03-19 华东理工大学 A method for detecting abnormal medication using an improved continuous bag-of-words model
CN109493936B (en) * 2018-10-16 2022-02-15 华东理工大学 Method for detecting abnormal medication by using improved continuous bag-of-words model
CN110852097A (en) * 2019-10-15 2020-02-28 平安科技(深圳)有限公司 Feature word extraction method, text similarity calculation method, device and equipment
CN110852097B (en) * 2019-10-15 2022-02-01 平安科技(深圳)有限公司 Feature word extraction method, text similarity calculation method, device and equipment
CN112541083A (en) * 2020-12-23 2021-03-23 西安交通大学 Text classification method based on active learning hybrid neural network

Similar Documents

Publication Publication Date Title
US10565244B2 (en) System and method for text categorization and sentiment analysis
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
Garreta et al. Learning scikit-learn: machine learning in python
CN111353303B (en) Word vector construction method and device, electronic equipment and storage medium
CN111159409B (en) Text classification method, device, equipment and medium based on artificial intelligence
WO2022126944A1 (en) Text clustering method, electronic device and storage medium
US11645447B2 (en) Encoding textual information for text analysis
KR102444457B1 (en) Method for dialogue summarization with word graphs
CN113836295B (en) Text abstract extraction method, system, terminal and storage medium
WO2021190662A1 (en) Medical text sorting method and apparatus, electronic device, and storage medium
KR20180129001A (en) Method and System for Entity summarization based on multilingual projected entity space
CA3131157A1 (en) System and method for text categorization and sentiment analysis
CN108038109A (en) Method and system, the computer program of Feature Words are extracted from non-structured text
CN110674297A (en) Public opinion text classification model construction method, public opinion text classification device and public opinion text classification equipment
CN114416926A (en) Keyword matching method and device, computing equipment and computer readable storage medium
JP2024012152A (en) Method for identify word corresponding to target word in text information
US20230032208A1 (en) Augmenting data sets for machine learning models
CN112364666B (en) Text characterization method and device and computer equipment
US9910890B2 (en) Synthetic events to chain queries against structured data
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN113704466A (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN110633363A (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision
Huang et al. [Retracted] Hybrid Graph Neural Network Model Design and Modeling Reasoning for Text Feature Extraction and Recognition
CN113220841B (en) Method, apparatus, electronic device and storage medium for determining authentication information
CN117149999B (en) Class case recommendation method and device based on legal element hierarchical network and text characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180515