CN111859898B - Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium - Google Patents
- Publication number
- CN111859898B (application No. CN202010623820.1A)
- Authority
- CN
- China
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G06F40/205 Parsing (G06F40/20 Natural language analysis; G06F40/00 Handling natural language data)
- G06F16/35 Clustering; Classification (G06F16/30 Information retrieval of unstructured textual data)
- G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (G06F40/279 Recognition of textual entities)
- Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the field of computer natural language processing and provides a computer-readable storage medium storing a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method. The method comprises the following steps: preprocess the corpus to obtain subject, feature and viewpoint word sets, and count the co-occurrence frequency matrices of the subject, feature and viewpoint words in the corpus; perform bidirectional enhanced clustering on the three word sets according to the co-occurrence frequency matrices; compute association strengths and construct a subject-feature-viewpoint hidden association network; extract implicit features using the hidden association network. Addressing the poor performance of existing implicit feature extraction methods on multi-domain text, the method takes into account both the associations between features and domain knowledge when constructing the subject-feature-viewpoint hidden association network, and can thus better extract implicit features from multi-domain text.
Description
This application is a divisional application of Chinese application No. 201910304794.3, filed April 16, 2019, entitled "Hidden association network-based multi-domain text implicit feature extraction method".
Technical Field
The invention relates to the field of computer natural language processing, in particular to a hidden association network-based multi-domain text implicit feature extraction method.
Background
With the rise of electronic commerce and social networks, short texts carrying users' subjective feelings, such as microblog posts and product reviews, are growing rapidly. This user-generated content is a precious resource: the subjective emotions and opinions it contains can help people make decisions, so mining the views expressed in such text has attracted a great deal of research. In particular, more and more researchers focus on finer-grained opinion mining, which extracts people's views on a particular aspect of a thing; such views are called feature-level opinions in these studies.
Most research in this field focuses on finding explicit features in text. In many cases, however, the feature is expressed only implicitly through the viewpoint word. For example, in a sentence like "this computer is a rip-off", the subject "computer" carries the viewpoint "rip-off" about the implicit feature "price"; features that never appear explicitly in the text are called implicit features. Existing research on implicit features mainly considers only the association between feature words and viewpoint words in text: hidden associations are mined from their co-occurrence frequency matrix over the corpus, so that, given a viewpoint word, these associations can be used to predict the likely implicit feature.
Much of today's text is mixed-domain text containing content from many domains, such as politics, biology and economics. Previous implicit feature recognition methods consider only the association between feature words and viewpoint words and ignore the multi-domain setting, so they perform poorly on the increasingly common mixed-domain text.
Disclosure of Invention
The invention aims to solve the problem that existing implicit feature recognition methods perform poorly on multi-domain text, and provides a hidden association network-based multi-domain text implicit feature extraction method. The method adds subject words as prior domain-knowledge constraints to the construction of the hidden association network, and considers the hidden associations among the subject-feature-viewpoint triple, so that it applies well to implicit feature extraction from multi-domain text.
To achieve the object of the present invention, an embodiment of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method, the method comprising:
step 1: word vector training is carried out by using language to obtain word vectors of each word in the corpus, the language is preprocessed to obtain main bodies, characteristics and viewpoint word sets, and co-occurrence frequency matrixes of each word among the word sets in the corpus are obtained through statistics;
step 2: performing bidirectional enhanced clustering on the main body-feature and feature-viewpoint word sets according to the co-occurrence frequency matrix, and then re-clustering to obtain a clustering result in each word set;
step 3: calculating mutual information between classes of two word sets by using the co-occurrence frequency matrix to serve as the association strength between the classes, constructing a two-part graph between a main body and the feature, between the feature and the viewpoint word set, and forming a main body-feature-viewpoint association network;
step 4: and for sentences needing implicit feature extraction, obtaining main body and viewpoint words in the sentences, judging the classes in the respective word sets, determining possible implicit feature classes according to a main body-feature-viewpoint association network, and finally obtaining the most possible implicit feature words from the implicit feature classes.
In step 1, word vector training is performed on the corpus to obtain a word vector for each word; sentence segmentation, part-of-speech tagging and dependency analysis preprocessing are applied to the corpus to obtain the subject, feature and viewpoint words of each sentence, finally yielding the subject, feature and viewpoint word sets of the corpus; and the co-occurrence frequency matrices of the words between the subject-feature and feature-viewpoint word sets are obtained by counting over the corpus.
In step 2, preliminary clustering is first performed inside each of the three word sets using the word vectors trained in step 1. Then, between the subject-feature and between the feature-viewpoint word sets, an association matrix is built from the association between each word of one set and the current clusters of the other set; mutually enhanced clustering between the two word sets is performed using both the association similarity and the content similarity between words, converging to the clustering results of the subject-feature and feature-viewpoint word sets. Finally, the feature word clusters obtained from the feature-viewpoint mutual clustering are re-clustered using the subject word clusters obtained from the subject-feature mutual clustering, ensuring that the final feature word clusters contain both subject and viewpoint information.
In clustering, the similarity measure between words is defined as:

S(w_i, w_j) = λ · S_content(w_i, w_j) + (1 − λ) · S_rel(w_i, w_j)

where S_content(w_i, w_j) is the content similarity between words w_i and w_j (the similarity of their word vectors), S_rel(w_i, w_j) is the association similarity (the similarity of their corresponding association vectors in the association matrix), and λ ∈ [0, 1] is the weight given to the content similarity.
The bidirectional enhanced clustering between two word sets F and O proceeds as follows:
a. considering only the internal similarity, i.e. the cosine similarity between word vectors, cluster the words of set F into k classes;
b. update the association matrix M_1 of set O according to the clustering result of set F: the association vector of word o_i with respect to the F clusters is R_i = (r_{i1}, r_{i2}, ..., r_{ik}), where each component corresponds to one of the k classes of F and r_{ix}, the weight between word o_i and the x-th class, is the sum of the co-occurrence frequencies of o_i with all words in that class; the n association vectors form the new n×k association matrix M_1;
c. cluster the data objects of set O into l classes based on the updated association matrix M_1 between set O and set F;
d. update the association matrix M_2 of set F according to the clustering result of set O: the association vector of word f_i with respect to the O clusters is R_i = (r_{i1}, r_{i2}, ..., r_{il}), where each component corresponds to one of the l classes of O and r_{ix}, the weight between word f_i and the x-th class, is the sum of the co-occurrence frequencies of f_i with all words in that class; the m association vectors form the new m×l association matrix M_2;
e. re-cluster the data objects of set F into k classes based on the updated association matrix M_2 between set F and set O;
f. iterate steps b-e until the clustering results of the two word sets converge.
The re-clustering of the feature word clusters proceeds as follows: for the feature word clustering result F_r to be re-clustered, the association vector of feature word y_i with respect to the subject word clustering result S_r is R_i = (r_{i1}, ..., r_{ip}), where each component corresponds to one of the p classes of S_r and r_{ix} is the association weight between y_i and the x-th subject class. Inside each class of F_r, the association vector similarity is computed for every pair of feature words; feature words whose vector similarity is below the threshold t are split into new classes, finally yielding the re-clustered feature word set F_fr.
In step 3, according to the clustering results of step 2, the association strengths between the clusters of the subject-feature and feature-viewpoint word sets are computed from the co-occurrence frequency matrices, and the subject-feature-viewpoint association network is finally constructed. The association strength between two classes is represented by their pointwise mutual information (PMI), defined as:

PMI(c_1, c_2) = log( P'(c_1, c_2) / (P(c_1) · P(c_2)) )

where P(c_1) and P(c_2) are the occurrence frequencies in the corpus of the words of classes c_1 and c_2, and P'(c_1, c_2) is the sum of the sentence-level co-occurrence frequencies in the corpus of all words of c_1 with all words of c_2. Using PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked, constructing the subject-feature-viewpoint association network.
In step 4, candidate implicit features in a sentence are extracted using the subject-feature-viewpoint association network. The basic flow is: for a sentence requiring implicit feature extraction, obtain its subject words and viewpoint words using word segmentation, part-of-speech tagging, dependency analysis and similar techniques; determine the subject class and viewpoint class they belong to; from the subject-feature-viewpoint association network, obtain the feature class with the highest weighted association to both classes; and finally predict the most probable feature word as the implicit feature. Because the association with subject words is taken into account, this implicit feature recognition also works well on multi-domain text.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a diagram of a subject-feature-perspective association network;
FIG. 3 is a flow chart for constructing a subject-feature-perspective association network;
fig. 4 is an example of implicit feature recognition using a subject-feature-perspective association network.
Detailed Description
The present invention will now be described in further detail with reference to the drawings and examples, which are not intended to limit the scope of the invention.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method, the method comprising:
referring to fig. 1, a multi-domain text implicit feature extraction method based on a hidden association network includes the following steps:
ST1: word vector training is carried out by using language to obtain word vector of each word in the corpus, the language is preprocessed to obtain main body, feature and viewpoint word sets, and co-occurrence frequency matrixes of each word among the word sets in the corpus are obtained through statistics. The specific flow is as follows:
a. Apply sentence segmentation and word segmentation to the corpus to obtain training data, and train word vectors on it to obtain the vector corresponding to each word in the corpus.
b. Apply sentence segmentation, word segmentation, part-of-speech tagging and dependency analysis to the corpus. If the word vector similarity between a noun in a sentence and the labeled subject of the sentence exceeds a threshold T, the noun is added to the subject word set as a subject word; otherwise it becomes a feature word candidate, while the adjectives of the sentence become viewpoint word candidates. According to the dependency tree obtained from dependency analysis, candidate feature words and candidate viewpoint words connected by specific relations on the tree (typically the "amod" and "nsubj" edges) are added to the feature word set and the viewpoint word set, finally yielding the subject, feature and viewpoint word sets of the corpus.
c. In each sentence whose subject word s has been determined as above, determine the feature word f and viewpoint word o, count the co-occurrence frequencies among s, f and o, and traverse all sentences of the corpus to finally obtain the co-occurrence frequency matrices M_sf and M_fo of the words between the subject-feature and feature-viewpoint word sets.
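As a concrete illustration of this counting step, the sketch below builds the two sentence-level co-occurrence matrices from already-extracted word sets. The dict-of-pairs representation and the field names ("subjects", "features", "viewpoints") are assumptions for illustration, not part of the patent.

```python
from collections import defaultdict

def build_cooccurrence(sentences):
    """Count sentence-level co-occurrence frequencies: M_sf over
    (subject, feature) pairs and M_fo over (feature, viewpoint) pairs,
    returned as sparse dicts keyed by word pairs."""
    m_sf = defaultdict(int)
    m_fo = defaultdict(int)
    for sent in sentences:
        for s in sent["subjects"]:
            for f in sent["features"]:
                m_sf[(s, f)] += 1  # subject and feature share a sentence
        for f in sent["features"]:
            for o in sent["viewpoints"]:
                m_fo[(f, o)] += 1  # feature and viewpoint share a sentence
    return dict(m_sf), dict(m_fo)
```

Each matrix entry is simply the number of sentences in which the two words co-occur, which is all the later PMI computation needs.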
ST2: based on the co-occurrence frequency matrix M obtained by statistics in ST1 sf And M fo And carrying out bidirectional enhanced clustering among the main body-feature and feature-viewpoint word sets, and then re-clustering to obtain a clustering result inside each word set.
Concrete embodiments
First, preliminary clustering is performed inside the three word sets using the word vectors trained in ST1. Then, between the subject-feature and feature-viewpoint word sets, the co-occurrence frequency matrices M_sf and M_fo are used to build association matrices capturing the association between each word of one set and the classes of the other set. Mutually enhanced clustering is performed between the two word sets using both the association similarity and the content similarity between words, finally converging to the bidirectional enhanced clustering results of the subject-feature and feature-viewpoint word sets.
In clustering, the similarity measure between words is defined as:

S(w_i, w_j) = λ · S_content(w_i, w_j) + (1 − λ) · S_rel(w_i, w_j)

where S_content(w_i, w_j) is the content similarity between words w_i and w_j, i.e. the similarity of their word vectors, S_rel(w_i, w_j) is the association similarity, i.e. the similarity of their corresponding association vectors in the association matrix, and λ ∈ [0, 1] is the weight given to the content similarity.
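A minimal sketch of this combined measure, assuming each word is represented by a (word vector, association vector) pair; the parameter `lam` plays the role of λ in the formula above.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def word_similarity(vec_i, vec_j, assoc_i, assoc_j, lam=0.5):
    """S = lam * S_content + (1 - lam) * S_rel, both terms as cosines."""
    return lam * cosine(vec_i, vec_j) + (1 - lam) * cosine(assoc_i, assoc_j)
```

With lam = 1 the measure reduces to plain word-vector similarity (the preliminary clustering of step a), while lam < 1 lets the co-occurrence structure of the other word set pull the clustering, which is the enhancement the method relies on.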
The specific flow of mutually enhanced clustering between two word sets F and O is:
a. considering only the internal similarity, i.e. the cosine similarity between word vectors, cluster the words of set F into k classes;
b. update the association matrix M_1 of set O according to the clustering result of set F: the association vector of word o_i with respect to the F clusters is R_i = (r_{i1}, r_{i2}, ..., r_{ik}), where each component corresponds to one of the k classes of F and r_{ix}, the weight between word o_i and the x-th class, is the sum of the co-occurrence frequencies of o_i with all words in that class; the n association vectors form the new n×k association matrix M_1;
c. cluster the data objects of set O into l classes based on the updated association matrix M_1 between set O and set F;
d. update the association matrix M_2 of set F according to the clustering result of set O: the association vector of word f_i with respect to the O clusters is R_i = (r_{i1}, r_{i2}, ..., r_{il}), where each component corresponds to one of the l classes of O and r_{ix}, the weight between word f_i and the x-th class, is the sum of the co-occurrence frequencies of f_i with all words in that class; the m association vectors form the new m×l association matrix M_2;
e. re-cluster the data objects of set F into k classes based on the updated association matrix M_2 between set F and set O;
f. iterate steps b-e until the clustering results of the two word sets converge or the relative error falls below a given level.
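The iteration a-f can be sketched as follows. A tiny deterministic k-means stands in for the clustering step, and the combined distance simply concatenates λ-weighted, row-normalized word vectors and association vectors; this is an approximation of the similarity measure above for illustration, not the patent's exact procedure.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal deterministic k-means (first k points as initial centers);
    returns a class label per point."""
    centers = points[:k].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def assoc_matrix(cooc, labels_other, k):
    """r_ix = sum of co-occurrence frequencies of word i with all words
    of the x-th class of the other word set (cooc is n_self x n_other)."""
    m = np.zeros((cooc.shape[0], k))
    for x in range(k):
        m[:, x] = cooc[:, labels_other == x].sum(axis=1)
    return m

def normed(m):
    """Row-normalize so concatenation weights the two parts comparably."""
    n = np.linalg.norm(m, axis=1, keepdims=True)
    n[n == 0] = 1.0
    return m / n

def mutual_enhanced(vec_f, vec_o, cooc_fo, k, l, lam=0.5, rounds=5):
    labels_f = kmeans(vec_f, k)                        # step a: F by word vectors only
    labels_o = np.zeros(len(vec_o), dtype=int)
    for _ in range(rounds):                            # steps b-e
        m1 = assoc_matrix(cooc_fo.T, labels_f, k)      # b: O's association vectors
        feats_o = np.hstack([lam * normed(vec_o), (1 - lam) * normed(m1)])
        labels_o = kmeans(feats_o, l)                  # c: cluster O
        m2 = assoc_matrix(cooc_fo, labels_o, l)        # d: F's association vectors
        feats_f = np.hstack([lam * normed(vec_f), (1 - lam) * normed(m2)])
        labels_f = kmeans(feats_f, k)                  # e: re-cluster F
    return labels_f, labels_o
```

A fixed number of rounds stands in for the convergence test of step f.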
Finally, the subject word clustering result S_r obtained from the subject-feature mutual clustering is used to re-cluster the feature word clustering result F_r obtained from the feature-viewpoint mutual clustering, ensuring that the final feature word clustering result F_fr contains both subject and viewpoint information. The re-clustering process is as follows:
For the feature word clustering result F_r to be re-clustered, i.e. the feature word set already clustered by the feature-viewpoint mutual enhancement, the association vector of feature word y_i with respect to the subject word clustering result S_r is R_i = (r_{i1}, ..., r_{ip}), where each component corresponds to one of the p classes of S_r and r_{ix} is the association weight between y_i and the x-th subject class. Inside each class of F_r, the association vector similarity is computed for every pair of feature words; feature words whose vector similarity is below the threshold t are split into new classes, finally yielding the re-clustered feature word set F_fr.
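The splitting rule can be sketched as a greedy grouping: within each feature class, a word joins the first subgroup whose representative has a similar enough association vector (cosine ≥ t), otherwise it starts a new subgroup. The greedy representative choice is an assumption; the patent only specifies pairwise similarity against the threshold t.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recluster(feature_classes, assoc, t=0.5):
    """Split each class of F_r so that feature words whose association
    vectors (rows of `assoc`, keyed by word) disagree end up in separate
    classes, yielding F_fr as a new list of classes."""
    result = []
    for cls in feature_classes:
        subgroups = []  # list of (representative_word, members)
        for w in cls:
            for rep, members in subgroups:
                if cosine(assoc[w], assoc[rep]) >= t:
                    members.append(w)  # same subject-association profile
                    break
            else:
                subgroups.append((w, [w]))  # disagrees with all: new class
        result.extend(members for _, members in subgroups)
    return result
```

Two feature words that co-occur with the same viewpoint words but with different subject classes (e.g. a multi-domain class) thus get separated, which is exactly the purpose of this step.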
ST3: and calculating mutual information between classes of two word sets by using the co-occurrence frequency matrix to serve as the association strength between the classes, constructing a bipartite graph between a main body and the feature, between the feature and the viewpoint word set, and forming a main body-feature-viewpoint association network.
Body-feature-perspective association network referring to fig. 2, where words are divided into three parts: a main word set, a characteristic word set and a viewpoint word set. The three word sets are clustered into a plurality of classes through the clustering in the ST2, the part encircled by each dotted line in the graph represents one class, the classes of the main body-feature word set and the feature-viewpoint word set contain correlations, the correlations among the classes are represented in the graph by the dotted line, and the correlations represent that the words in the two classes jointly appear in sentences in the corpus.
In fig. 2, the associations between classes are drawn as dotted lines. The method uses the pointwise mutual information (PMI) between two classes as their association strength, computed as:

PMI(c_1, c_2) = log( P'(c_1, c_2) / (P(c_1) · P(c_2)) )

where P(c_1) and P(c_2) are the occurrence frequencies in the corpus of the words of classes c_1 and c_2, and P'(c_1, c_2) is the sum of the sentence-level co-occurrence frequencies in the corpus of all words of c_1 with all words of c_2.
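A sketch of this inter-class PMI, assuming `freq` maps each word to its (normalized) occurrence frequency in the corpus and `cooc` maps word pairs to sentence-level co-occurrence frequencies; class pairs with no co-occurrence get -inf, i.e. no edge in the network. The container names are illustrative.

```python
import math

def class_pmi(c1, c2, freq, cooc):
    """PMI(c1, c2) = log( P'(c1, c2) / (P(c1) * P(c2)) )."""
    p1 = sum(freq.get(w, 0.0) for w in c1)            # P(c1)
    p2 = sum(freq.get(w, 0.0) for w in c2)            # P(c2)
    p12 = sum(cooc.get((a, b), 0.0)                   # P'(c1, c2)
              for a in c1 for b in c2)
    if not (p1 and p2 and p12):
        return float("-inf")                          # no association edge
    return math.log(p12 / (p1 * p2))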
Referring to fig. 3, a specific construction flow of the body-feature-perspective association network is as follows:
a. Cluster the feature word set F into k classes using only the internal similarity, i.e. the cosine similarity between word vectors, obtaining the preliminarily clustered feature word set F_1;
b. Following the mutually enhanced clustering method of ST2, perform bidirectional enhanced clustering between F_1 and the subject word set S to obtain the clustered subject word set S_1, and between F_1 and the viewpoint word set O to obtain the clustered viewpoint word set O_1 and feature word set F_2;
c. Since some classes of F_2 contain features from multiple domains, F_2 must be re-clustered according to the association weight matrix with the subject word set S_1; the re-clustering method is as described in ST2, finally yielding the re-clustered feature word set F_3;
d. From the subject-feature and feature-viewpoint co-occurrence frequency matrices M_sf and M_fo counted over the corpus, compute the inter-class association strengths between S_1 and F_3 and between F_3 and O_1, represented by the PMI described above. Using PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked, giving the clustering results and association information of the three word sets (the number of classes, the center vector of each class, the class label of each word, and the association strengths between classes), which together constitute the subject-feature-viewpoint association network.
ST4: for sentences needing implicit feature extraction, firstly obtaining main body and viewpoint words in the sentences, then judging the classes in the respective word sets, determining possible implicit feature classes according to a main body-feature-viewpoint association network, and finally obtaining the most possible implicit feature words from the classes. The specific flow is shown in fig. 4:
a. Perform word segmentation, part-of-speech tagging and dependency analysis on the sentence whose implicit feature is to be recognized, taking nouns as subject word candidates and adjectives as viewpoint words. For each noun-adjective pair connected by a specific relation on the dependency tree, check whether the noun is in the feature word set: if so, extract it as an explicit feature; otherwise take it as a subject word;
b. determine the subject class s and viewpoint class o of the recognized subject and viewpoint words, and, according to the inter-class association strengths of the subject-feature and feature-viewpoint word sets stored in the association network, select the feature class with the strongest average association strength to both s and o;
c. extract the most probable word from that feature class as the implicit feature word; here the word of the class with the highest occurrence frequency in the corpus is taken as the implicit feature word w.
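Steps a-c reduce to a lookup over the stored inter-class strengths. In this sketch `sf_pmi` and `fo_pmi` map class-index pairs to PMI values, `feature_classes` lists the words of each feature class, and `corpus_freq` maps words to corpus frequency; all names are illustrative assumptions.

```python
def predict_implicit_feature(s_cls, o_cls, sf_pmi, fo_pmi,
                             feature_classes, corpus_freq):
    """Choose the feature class with the strongest average association to
    the given subject class s_cls and viewpoint class o_cls, then return
    that class's most frequent word as the implicit feature."""
    neg_inf = float("-inf")
    best = max(
        range(len(feature_classes)),
        key=lambda f: (sf_pmi.get((s_cls, f), neg_inf)      # subject-feature edge
                       + fo_pmi.get((f, o_cls), neg_inf)) / 2,  # feature-viewpoint edge
    )
    return max(feature_classes[best], key=lambda w: corpus_freq.get(w, 0))
```

Missing edges default to -inf, so only feature classes associated with both the subject class and the viewpoint class can win, mirroring step b.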
Referring to fig. 4, implicit feature extraction is illustrated on the concrete example sentence "Zhang is still very young, but her performance has already won approval":
a. Perform word segmentation, part-of-speech tagging and dependency analysis on the sentence. The noun "Zhang" is connected to the adjective "young" by the specified relation "nsubj" on the dependency tree; since "Zhang" is not in the feature word set, "Zhang" is taken as the subject word and the adjective "young" as the viewpoint word;
b. For the subject word "Zhang" and the viewpoint word "young" recognized in a, compute the similarity between their word vectors and the class center vectors of the subject classes and viewpoint classes respectively, and select the most similar subject class ("person") and viewpoint class ("size") as their classes. Then, according to the constructed subject-feature-viewpoint association network, among the feature classes associated with both the subject class "person" and the viewpoint class "size", take the one with the strongest average association strength as the most probable feature class;
c. From the most probable feature class obtained in b, select the most probable feature word as the predicted implicit feature; here the feature word of the class with the highest occurrence frequency in the corpus is selected as the implicit feature word.
Claims (5)
1. A computer-readable storage medium having a program stored thereon, the program when executed implementing a hidden association network-based multi-domain text implicit feature extraction method, comprising the steps of:
step 1: word vector training is carried out by using language to obtain word vectors of each word in the corpus, the language is preprocessed to obtain main bodies, characteristics and viewpoint word sets, and co-occurrence frequency matrixes of each word among the word sets in the corpus are obtained through statistics;
step 2: performing bidirectional enhanced clustering on the main body-feature and feature-viewpoint word sets according to the co-occurrence frequency matrix, and then re-clustering to obtain a clustering result in each word set;
step 3: calculating mutual information between classes of two word sets by using the co-occurrence frequency matrix to serve as the association strength between the classes, constructing a two-part graph between a main body and the feature, between the feature and the viewpoint word set, and forming a main body-feature-viewpoint association network;
step 4: for a sentence requiring implicit feature extraction, the subject and viewpoint words in the sentence are obtained, the classes they belong to in their respective word sets are determined, the possible implicit feature class is determined according to the subject-feature-viewpoint association network, and finally the most probable implicit feature word is obtained from the implicit feature class.
2. The computer-readable storage medium of claim 1, wherein, in step 1, word vector training is performed on the corpus to obtain a word vector for each word in the corpus, the corpus is preprocessed to obtain the subject, feature and viewpoint word sets, and the co-occurrence frequency matrices between the word sets are obtained by counting over the corpus, specifically: sentence segmentation and word segmentation are performed on the corpus to obtain training data, and word vector training is performed on the training data to obtain a word vector for each word in the corpus; sentence segmentation, word segmentation, part-of-speech tagging and dependency analysis are performed on the corpus; candidate nouns are selected from the sentences as subject words and added to the subject word set, while adjectives in the sentences are selected as viewpoint word candidates; according to the dependency tree obtained by dependency analysis, candidate feature words and candidate viewpoint words connected by specific dependency relations are added to the feature word set and the viewpoint word set respectively; and the co-occurrence frequency matrices of the words between the subject-feature and feature-viewpoint word sets are counted over the corpus.
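The sentence-level co-occurrence counting in this claim can be sketched roughly as below. The tokenized sentences and word sets are invented for illustration; the actual method would obtain the sets via part-of-speech tagging and dependency analysis rather than hard-coded lists.

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, set_a, set_b):
    # Count how often a word from set_a and a word from set_b
    # appear in the same sentence (sentence-level co-occurrence).
    counts = defaultdict(int)
    for tokens in sentences:
        present_a = [w for w in tokens if w in set_a]
        present_b = [w for w in tokens if w in set_b]
        for a in present_a:
            for b in present_b:
                counts[(a, b)] += 1
    return counts

# Hypothetical pre-segmented sentences and word sets.
sents = [["phone", "screen", "bright"], ["phone", "battery", "large"]]
m = cooccurrence_matrix(sents, {"phone"}, {"screen", "battery"})
print(m[("phone", "screen")])  # -> 1
```

The same counting would be run once for the subject-feature set pair and once for the feature-viewpoint set pair.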
3. The computer-readable storage medium of claim 1, wherein, in step 2, bidirectional enhanced clustering is performed on the subject-feature and feature-viewpoint word set pairs according to the co-occurrence frequency matrices, followed by re-clustering to obtain a clustering result within each word set, specifically: first, preliminary clustering is performed within the three word sets according to the word vectors trained in step 1; then, between the subject-feature and the feature-viewpoint word set pairs, the association between each word of one word set and the fixed clusters of the other word set is considered to obtain an inter-association matrix, and mutually enhanced iterative clustering is performed using both the association similarity and the content similarity between words until convergence, yielding the clustering results of the subject-feature and feature-viewpoint word set pairs; finally, the subject clustering result obtained from the mutually enhanced clustering of the subject-feature word sets is used to re-cluster the feature clustering result obtained from the mutually enhanced clustering of the feature-viewpoint word sets, ensuring that the final feature clusters incorporate both subject and viewpoint information;
in clustering, the similarity measure between two words is defined as:
S(W_i, W_j) = λ · S_content(W_i, W_j) + (1 − λ) · S_rel(W_i, W_j)
where S_content(W_i, W_j), called the content similarity between words W_i and W_j, is the similarity of their word vectors; S_rel(W_i, W_j), called the association similarity between W_i and W_j, is the similarity of their corresponding association vectors in the inter-association matrix; and λ is the weight given to the content similarity. The mutual enhancement clustering procedure between the two word sets F and O is as follows:
a. considering only content similarity, i.e. the cosine similarity between word vectors, the words in set F are clustered into k classes;
b. the inter-association matrix M1 of set O is updated according to the clustering result of set F: for any word O_i in set O, the association vector of O_i over the clustering result of set F has k components, one per class of the clustered set F; its x-th component, the weight between word O_i and the x-th class of set F, is the sum of the co-occurrence frequencies of O_i with all words in the x-th class, x ∈ [1, k]; finally, the association vectors of the n words in set O form the new n×k inter-association matrix M1;
c. the words in set O are clustered into l classes according to the inter-association matrix M1 between set O and set F updated in step b;
d. the inter-association matrix M2 of set F is updated according to the clustering result of set O: for any word F_i in set F, the association vector of F_i over the clustering result of set O has l components, one per class of the clustered set O; its y-th component, the weight between word F_i and the y-th class of set O, is the sum of the co-occurrence frequencies of F_i with all words in the y-th class, y ∈ [1, l]; finally, the association vectors of the m words in set F form the new m×l inter-association matrix M2;
e. the words in set F are re-clustered into k classes according to the inter-association matrix M2 between set F and set O updated in step d;
f. steps b-e are iterated until the clustering results of the two word sets converge;
the subject clustering result S_r obtained from the mutually enhanced clustering of the subject-feature word sets is used to re-cluster the feature clustering result F_r obtained from the mutually enhanced clustering of the feature-viewpoint word sets, as follows:
suppose the subject clustering result S_r contains p classes obtained by bidirectional enhanced clustering, and the feature clustering result F_r contains q such classes; for the feature clustering result F_r to be re-clustered, the association vector between any feature word Y_i in F_r and the subject clustering result S_r is denoted R'_i; each component of R'_i corresponds to one of the p classes of S_r, its z-th component being the weight between feature word Y_i and the z-th class of S_r, z ∈ [1, p]; within each class of F_r, the association-vector similarity is computed for every pair of feature words, and feature words whose association-vector similarity is below a threshold t are split off into new classes, finally yielding the re-clustered feature word set F_fr.
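Step b of the mutual enhancement procedure (building the inter-association matrix from co-occurrence counts and a fixed clustering of the other set) might look like the minimal numpy sketch below; the words, clusters and counts are invented for illustration and are not from the patent's corpus.

```python
import numpy as np

def association_vectors(words_o, clusters_f, cooc):
    # For each word in set O, build a k-dimensional association vector whose
    # x-th component is the sum of its co-occurrence frequencies with every
    # word in the x-th class of set F; rows form the n x k matrix M1.
    k = len(clusters_f)
    M1 = np.zeros((len(words_o), k))
    for i, o in enumerate(words_o):
        for x, cls in enumerate(clusters_f):
            M1[i, x] = sum(cooc.get((o, f), 0) for f in cls)
    return M1

# Hypothetical co-occurrence counts and a 2-class clustering of set F.
cooc = {("bright", "screen"): 3, ("large", "battery"): 2, ("large", "screen"): 1}
clusters_f = [["screen"], ["battery"]]
M1 = association_vectors(["bright", "large"], clusters_f, cooc)
print(M1)  # one association vector per word of set O
```

Clustering the rows of M1 gives the association-similarity side of the iteration; step d is the same construction with the roles of F and O swapped.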
4. The computer-readable storage medium of claim 1, wherein, in step 3, the mutual information between the classes of two word sets is calculated from the co-occurrence frequency matrix as the inter-class association strength, and bipartite graphs are constructed between the subject and feature word sets and between the feature and viewpoint word sets to form the subject-feature-viewpoint association network, specifically:
a. the feature word set F is clustered into k classes according to content similarity only, i.e. the cosine similarity between word vectors, yielding the preliminarily clustered feature word set F1;
b. according to the bidirectional enhanced clustering method of step 2, bidirectional enhanced clustering is performed between set F1 and the subject word set S to obtain the clustered subject word set S1, and between set F1 and the viewpoint word set O to obtain the clustered viewpoint word set O1 and feature word set F2;
c. since some classes of the feature word set F2, obtained by bidirectional enhanced clustering between set F1 and the viewpoint word set O, contain multi-domain features, F2 must be re-clustered according to the inter-association matrix M between the feature word set F2 and the subject word set S1; M is composed of the association vector between each feature word of F2 and the subject word set S1, each component of an association vector representing the weight between the corresponding feature word and one class of S1; F2 is re-clustered according to M by the method described in step 2, finally yielding the re-clustered feature word set F3;
d. according to the subject-feature-viewpoint co-occurrence frequency matrices counted from the corpus, the inter-class association strengths between the subject word set S1 and the feature word set F3, and between the feature word set F3 and the viewpoint word set O1, are constructed; the association strength is measured by pointwise mutual information (PMI):
PMI(c1, c2) = log( P'(c1, c2) / (P(c1) · P(c2)) )
where P(c1) and P(c2) are the occurrence frequencies in the corpus of the words of class c1 and of class c2, and P'(c1, c2) is the sum of the sentence-level co-occurrence frequencies in the corpus between all words of class c1 and all words of class c2; using PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked to construct the subject-feature-viewpoint association network.
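Under the definition above, the inter-class PMI can be computed as in this small sketch; the frequency values passed in are hypothetical.

```python
import math

def class_pmi(p_c1, p_c2, p_joint):
    # PMI between two classes: log of the joint (sentence-level co-occurrence)
    # frequency over the product of the classes' individual frequencies.
    return math.log(p_joint / (p_c1 * p_c2))

# Hypothetical frequencies normalized over the corpus.
print(round(class_pmi(0.1, 0.2, 0.05), 3))  # -> 0.916
```

A positive value means the two classes co-occur more often than independence would predict, which is why PMI serves as the edge weight of the association network.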
5. The computer-readable storage medium of claim 1, wherein, in step 4, for a sentence requiring implicit feature extraction, the subject and viewpoint words in it are obtained, the classes they belong to in their respective word sets are determined, the possible implicit feature class is determined according to the subject-feature-viewpoint association network, and the most probable implicit feature word is finally obtained from the implicit feature class, specifically: word segmentation, part-of-speech tagging and dependency analysis are performed on the sentence whose implicit feature is to be identified, and candidate subject words and viewpoint words are identified in it; the subject class s and the viewpoint class o to which the identified subject and viewpoint words belong are determined, and the feature class f with the strongest average association with the subject class s and the viewpoint class o is selected according to the inter-class association strengths between the subject-feature and feature-viewpoint word sets in the association network; and the word occurring most frequently in the corpus is extracted from the feature class f as the implicit feature word w.
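The inference step of claim 5 (choosing the feature class with the strongest average association to the identified subject and viewpoint classes, then returning its most frequent word) might be sketched as follows; the association-network fragment, class names and frequency counts are invented for illustration.

```python
def most_probable_feature(s_cls, o_cls, strength, feature_classes, freq):
    # strength[(a, b)]: association strength (e.g. PMI) between classes a and b.
    # Pick the feature class maximizing the mean strength to the subject class
    # and the viewpoint class, then return its most frequent word.
    best = max(feature_classes,
               key=lambda f: (strength[(s_cls, f)] + strength[(f, o_cls)]) / 2)
    return max(feature_classes[best], key=lambda w: freq[w])

# Hypothetical fragment of a subject-feature-viewpoint association network.
strength = {("people", "appearance"): 1.2, ("appearance", "size"): 0.9,
            ("people", "price"): 0.1, ("price", "size"): 0.2}
classes = {"appearance": ["height", "figure"], "price": ["cost"]}
freq = {"height": 30, "figure": 12, "cost": 5}
print(most_probable_feature("people", "size", strength, classes, freq))  # -> height
```

Here "appearance" wins with mean strength 1.05 versus 0.15 for "price", and "height" is its most frequent member in the toy counts.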
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010623820.1A CN111859898B (en) | 2019-04-16 | 2019-04-16 | Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910304794.3A CN110020439B (en) | 2019-04-16 | 2019-04-16 | Hidden associated network-based multi-field text implicit feature extraction method |
CN202010623820.1A CN111859898B (en) | 2019-04-16 | 2019-04-16 | Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910304794.3A Division CN110020439B (en) | 2019-04-16 | 2019-04-16 | Hidden associated network-based multi-field text implicit feature extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111859898A CN111859898A (en) | 2020-10-30 |
CN111859898B true CN111859898B (en) | 2024-01-16 |
Family
ID=67191503
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910304794.3A Active CN110020439B (en) | 2019-04-16 | 2019-04-16 | Hidden associated network-based multi-field text implicit feature extraction method |
CN202010623820.1A Active CN111859898B (en) | 2019-04-16 | 2019-04-16 | Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910304794.3A Active CN110020439B (en) | 2019-04-16 | 2019-04-16 | Hidden associated network-based multi-field text implicit feature extraction method |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110020439B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113821587B (en) * | 2021-06-02 | 2024-05-17 | 腾讯科技(深圳)有限公司 | Text relevance determining method, model training method, device and storage medium |
CN115168600B (en) * | 2022-06-23 | 2023-07-11 | 广州大学 | Value chain knowledge discovery method under personalized customization |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006338342A (en) * | 2005-06-02 | 2006-12-14 | Nippon Telegr & Teleph Corp <Ntt> | Word vector generation device, word vector generation method and program |
CN103365999A (en) * | 2013-07-16 | 2013-10-23 | 盐城工学院 | Text clustering integrated method based on similarity degree matrix spectral factorization |
CN103412880A (en) * | 2013-07-17 | 2013-11-27 | 百度在线网络技术(北京)有限公司 | Method and device for determining implicit associated information between multimedia resources |
CN103646097A (en) * | 2013-12-18 | 2014-03-19 | 北京理工大学 | Constraint relationship based opinion objective and emotion word united clustering method |
CN105007262A (en) * | 2015-06-03 | 2015-10-28 | 浙江大学城市学院 | WLAN multi-step attack intention pre-recognition method |
CN106354754A (en) * | 2016-08-16 | 2017-01-25 | 清华大学 | Bootstrap-type implicit characteristic mining method and system based on dispersed independent component analysis |
CN106372117A (en) * | 2016-08-23 | 2017-02-01 | 电子科技大学 | Word co-occurrence-based text classification method and apparatus |
CN107358014A (en) * | 2016-11-02 | 2017-11-17 | 华南师范大学 | The clinical pre-treating method and system of a kind of physiological data |
CN107391575A (en) * | 2017-06-20 | 2017-11-24 | 浙江理工大学 | A kind of implicit features recognition methods of word-based vector model |
CN107562717A (en) * | 2017-07-24 | 2018-01-09 | 南京邮电大学 | A kind of text key word abstracting method being combined based on Word2Vec with Term co-occurrence |
CN108140061A (en) * | 2015-06-05 | 2018-06-08 | 凯撒斯劳滕工业大学 | Network die body automatically determines |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5697202B2 (en) * | 2011-03-08 | 2015-04-08 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | Method, program and system for finding correspondence of terms |
US20140272914A1 (en) * | 2013-03-15 | 2014-09-18 | William Marsh Rice University | Sparse Factor Analysis for Learning Analytics and Content Analytics |
US9594746B2 (en) * | 2015-02-13 | 2017-03-14 | International Business Machines Corporation | Identifying word-senses based on linguistic variations |
2019
- 2019-04-16 CN CN201910304794.3A patent/CN110020439B/en active Active
- 2019-04-16 CN CN202010623820.1A patent/CN111859898B/en active Active
Non-Patent Citations (4)
Title |
---|
Analysis of and Countermeasures for BIM Application Problems; Ren Haoying; 城市住宅 (Urban Housing); vol. 25, no. 07; 37-40 *
Bipartite Graph Filter Banks: Polyphase Analysis and Generalization; David B. H. Tay; IEEE Transactions on Signal Processing; vol. 65, no. 18; 4833-4846 *
A Chinese Semantic Disambiguation Method Based on Sense-Class Co-occurrence Frequency; Zhang Yongkui; Journal of Computer Research and Development; no. 07; 1-5 *
Research on Personalized Recommendation Methods in Multi-source Heterogeneous Environments; Wei Jianzhen; China Masters' Theses Full-text Database; no. 03; I138-2137 *
Also Published As
Publication number | Publication date |
---|---|
CN111859898A (en) | 2020-10-30 |
CN110020439A (en) | 2019-07-16 |
CN110020439B (en) | 2020-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109241524B (en) | Semantic analysis method and device, computer-readable storage medium and electronic equipment | |
Shi et al. | Functional and contextual attention-based LSTM for service recommendation in mashup creation | |
CN110717106B (en) | Information pushing method and device | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN113268995B (en) | Chinese academy keyword extraction method, device and storage medium | |
CN110765260A (en) | Information recommendation method based on convolutional neural network and joint attention mechanism | |
CN111190997B (en) | Question-answering system implementation method using neural network and machine learning ordering algorithm | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN107844533A (en) | A kind of intelligent Answer System and analysis method | |
CN110705247B (en) | Based on x2-C text similarity calculation method | |
CN112989208B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN111222318A (en) | Trigger word recognition method based on two-channel bidirectional LSTM-CRF network | |
CN110807086A (en) | Text data labeling method and device, storage medium and electronic equipment | |
CN114936277A (en) | Similarity problem matching method and user similarity problem matching system | |
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis | |
CN111191031A (en) | Entity relation classification method of unstructured text based on WordNet and IDF | |
CN112347758A (en) | Text abstract generation method and device, terminal equipment and storage medium | |
CN111859898B (en) | Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium | |
CN111737560A (en) | Content search method, field prediction model training method, device and storage medium | |
CN114997288A (en) | Design resource association method | |
CN110569355B (en) | Viewpoint target extraction and target emotion classification combined method and system based on word blocks | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
Swamy et al. | Nit-agartala-nlp-team at semeval-2020 task 8: Building multimodal classifiers to tackle internet humor | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||