CN111859898B - Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium - Google Patents
- Publication number
- CN111859898B (application No. CN202010623820.1A)
- Authority
- CN
- China
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G06F40/205 Parsing (G06F40/20 Natural language analysis; G06F40/00 Handling natural language data)
- G06F16/35 Clustering; Classification (G06F16/30 Information retrieval of unstructured textual data)
- G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (G06F40/279 Recognition of textual entities)
- Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the field of computer natural language processing and provides a computer-readable storage medium storing a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method. The method comprises the following steps: preprocess the corpus to obtain subject, feature and viewpoint word sets, and count the co-occurrence frequency matrices of the subject, feature and viewpoint words in the corpus; perform bidirectional enhanced clustering on the three word sets according to the co-occurrence frequency matrices; compute association strengths and construct a subject-feature-viewpoint hidden association network; extract implicit features using the hidden association network. Addressing the poor performance of existing implicit feature extraction methods on multi-domain text, the method takes into account both the associations between features and domain knowledge when constructing the subject-feature-viewpoint hidden association network, and can thus better extract implicit features from multi-domain text.
Description
This application is a divisional application of Chinese application No. 201910304794.3, filed April 16, 2019, entitled "Hidden association network-based multi-domain text implicit feature extraction method".
Technical Field
The invention relates to the field of computer natural language processing, in particular to a hidden association network-based multi-domain text implicit feature extraction method.
Background
With the rise of electronic commerce and social networks, short texts carrying users' subjective feelings, such as microblog posts and product reviews, are growing rapidly. This user-generated content is a precious resource: the subjective emotions and opinions it contains can help people make decisions, so mining the views expressed in such text has attracted a great deal of research. In particular, more and more researchers focus on finer-grained opinion mining, which extracts people's views on a particular aspect of a thing; such views are called feature-level opinions in these studies.
Most research in this field focuses on finding explicit features in text. In many cases, however, the feature is expressed only implicitly through the viewpoint word. For example, in a sentence like "this computer is a rip-off", the subject "computer" carries the viewpoint "rip-off" about the implicit feature "price"; features that never appear explicitly in the text are called implicit features. Existing research on implicit features mainly considers only the association between feature words and viewpoint words in text: hidden associations are mined from their co-occurrence frequency matrix over the corpus, so that, given a viewpoint word, these associations can be used to predict the likely implicit feature.
Much of today's text is mixed-domain text containing content from many domains, such as politics, biology and economics. Previous implicit feature recognition methods consider only the association between feature words and viewpoint words and ignore the multi-domain setting, so they perform poorly on the increasingly common mixed-domain text.
Disclosure of Invention
The invention aims to solve the problem that existing implicit feature recognition methods perform poorly on multi-domain text, and provides a hidden association network-based multi-domain text implicit feature extraction method. The method adds subject words as prior domain-knowledge constraints to the construction of the hidden association network, and considers the hidden associations among the subject-feature-viewpoint triple, so that it applies well to implicit feature extraction from multi-domain text.
To achieve the object of the present invention, an embodiment of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method, the method comprising:
step 1: word vector training is carried out by using language to obtain word vectors of each word in the corpus, the language is preprocessed to obtain main bodies, characteristics and viewpoint word sets, and co-occurrence frequency matrixes of each word among the word sets in the corpus are obtained through statistics;
step 2: performing bidirectional enhanced clustering on the main body-feature and feature-viewpoint word sets according to the co-occurrence frequency matrix, and then re-clustering to obtain a clustering result in each word set;
step 3: calculating mutual information between classes of two word sets by using the co-occurrence frequency matrix to serve as the association strength between the classes, constructing a two-part graph between a main body and the feature, between the feature and the viewpoint word set, and forming a main body-feature-viewpoint association network;
step 4: and for sentences needing implicit feature extraction, obtaining main body and viewpoint words in the sentences, judging the classes in the respective word sets, determining possible implicit feature classes according to a main body-feature-viewpoint association network, and finally obtaining the most possible implicit feature words from the implicit feature classes.
In step 1, word vector training is performed on the corpus to obtain a word vector for each word; sentence segmentation, part-of-speech tagging and dependency analysis preprocessing are applied to the corpus to obtain the subject, feature and viewpoint words of each sentence, finally yielding the subject, feature and viewpoint word sets of the corpus; and the co-occurrence frequency matrices of the words between the subject-feature and feature-viewpoint word sets are obtained by counting over the corpus.
In step 2, preliminary clustering is first performed inside each of the three word sets using the word vectors trained in step 1. Then, between the subject-feature and between the feature-viewpoint word sets, an association matrix is built from the association between each word of one set and the current clusters of the other set; mutually enhanced clustering between the two word sets is performed using both the association similarity and the content similarity between words, converging to the clustering results of the subject-feature and feature-viewpoint word sets. Finally, the feature word clusters obtained from the feature-viewpoint mutual clustering are re-clustered using the subject word clusters obtained from the subject-feature mutual clustering, ensuring that the final feature word clusters contain both subject and viewpoint information.
In clustering, the similarity measure between words is defined as:

S(w_i, w_j) = λ · S_content(w_i, w_j) + (1 − λ) · S_rel(w_i, w_j)

where S_content(w_i, w_j) is the content similarity between words w_i and w_j (the similarity of their word vectors), S_rel(w_i, w_j) is the association similarity (the similarity of their corresponding association vectors in the association matrix), and λ ∈ [0, 1] is the weight given to the content similarity.
The bidirectional enhanced clustering between two word sets F and O proceeds as follows:
a. considering only the internal similarity, i.e. the cosine similarity between word vectors, cluster the words of set F into k classes;
b. update the association matrix M_1 of set O according to the clustering result of set F: the association vector of word o_i with respect to the F clusters is R_i = (r_{i1}, r_{i2}, ..., r_{ik}), where each component corresponds to one of the k classes of F and r_{ix}, the weight between word o_i and the x-th class, is the sum of the co-occurrence frequencies of o_i with all words in that class; the n association vectors form the new n×k association matrix M_1;
c. cluster the data objects of set O into l classes based on the updated association matrix M_1 between set O and set F;
d. update the association matrix M_2 of set F according to the clustering result of set O: the association vector of word f_i with respect to the O clusters is R_i = (r_{i1}, r_{i2}, ..., r_{il}), where each component corresponds to one of the l classes of O and r_{ix}, the weight between word f_i and the x-th class, is the sum of the co-occurrence frequencies of f_i with all words in that class; the m association vectors form the new m×l association matrix M_2;
e. re-cluster the data objects of set F into k classes based on the updated association matrix M_2 between set F and set O;
f. iterate steps b-e until the clustering results of the two word sets converge.
The re-clustering of the feature word clusters proceeds as follows: for the feature word clustering result F_r to be re-clustered, the association vector of feature word y_i with respect to the subject word clustering result S_r is R_i = (r_{i1}, ..., r_{ip}), where each component corresponds to one of the p classes of S_r and r_{ix} is the association weight between y_i and the x-th subject class. Inside each class of F_r, the association vector similarity is computed for every pair of feature words; feature words whose vector similarity is below the threshold t are split into new classes, finally yielding the re-clustered feature word set F_fr.
In step 3, according to the clustering results of step 2, the association strengths between the clusters of the subject-feature and feature-viewpoint word sets are computed from the co-occurrence frequency matrices, and the subject-feature-viewpoint association network is finally constructed. The association strength between two classes is represented by their pointwise mutual information (PMI), defined as:

PMI(c_1, c_2) = log( P'(c_1, c_2) / (P(c_1) · P(c_2)) )

where P(c_1) and P(c_2) are the occurrence frequencies in the corpus of the words of classes c_1 and c_2, and P'(c_1, c_2) is the sum of the sentence-level co-occurrence frequencies in the corpus of all words of c_1 with all words of c_2. Using PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked, constructing the subject-feature-viewpoint association network.
In step 4, candidate implicit features in a sentence are extracted using the subject-feature-viewpoint association network. The basic flow is: for a sentence requiring implicit feature extraction, obtain its subject words and viewpoint words using word segmentation, part-of-speech tagging, dependency analysis and similar techniques; determine the subject class and viewpoint class they belong to; from the subject-feature-viewpoint association network, obtain the feature class with the highest weighted association to both classes; and finally predict the most probable feature word as the implicit feature. Because the association with subject words is taken into account, this implicit feature recognition also works well on multi-domain text.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a diagram of a subject-feature-perspective association network;
FIG. 3 is a flow chart for constructing a subject-feature-perspective association network;
fig. 4 is an example of implicit feature recognition using a subject-feature-perspective association network.
Detailed Description
The present invention will now be described in further detail with reference to the drawings and examples, which are not intended to limit the scope of the invention.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method, the method comprising:
referring to fig. 1, a multi-domain text implicit feature extraction method based on a hidden association network includes the following steps:
ST1: word vector training is carried out by using language to obtain word vector of each word in the corpus, the language is preprocessed to obtain main body, feature and viewpoint word sets, and co-occurrence frequency matrixes of each word among the word sets in the corpus are obtained through statistics. The specific flow is as follows:
a. Apply sentence segmentation and word segmentation to the corpus to obtain training data, and train word vectors on it to obtain the vector corresponding to each word in the corpus.
b. Apply sentence segmentation, word segmentation, part-of-speech tagging and dependency analysis to the corpus. If the word vector similarity between a noun in a sentence and the labeled subject of the sentence exceeds a threshold T, the noun is added to the subject word set as a subject word; otherwise it becomes a feature word candidate, while the adjectives of the sentence become viewpoint word candidates. According to the dependency tree obtained from dependency analysis, candidate feature words and candidate viewpoint words connected by specific relations on the tree (typically the "amod" and "nsubj" edges) are added to the feature word set and the viewpoint word set, finally yielding the subject, feature and viewpoint word sets of the corpus.
c. In each sentence whose subject word s has been determined as above, determine the feature word f and viewpoint word o, count the co-occurrence frequencies among s, f and o, and traverse all sentences of the corpus to finally obtain the co-occurrence frequency matrices M_sf and M_fo of the words between the subject-feature and feature-viewpoint word sets.
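As a concrete illustration of this counting step, the sketch below builds the two sentence-level co-occurrence matrices from already-extracted word sets. The dict-of-pairs representation and the field names ("subjects", "features", "viewpoints") are assumptions for illustration, not part of the patent.

```python
from collections import defaultdict

def build_cooccurrence(sentences):
    """Count sentence-level co-occurrence frequencies: M_sf over
    (subject, feature) pairs and M_fo over (feature, viewpoint) pairs,
    returned as sparse dicts keyed by word pairs."""
    m_sf = defaultdict(int)
    m_fo = defaultdict(int)
    for sent in sentences:
        for s in sent["subjects"]:
            for f in sent["features"]:
                m_sf[(s, f)] += 1  # subject and feature share a sentence
        for f in sent["features"]:
            for o in sent["viewpoints"]:
                m_fo[(f, o)] += 1  # feature and viewpoint share a sentence
    return dict(m_sf), dict(m_fo)
```

Each matrix entry is simply the number of sentences in which the two words co-occur, which is all the later PMI computation needs.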
ST2: based on the co-occurrence frequency matrix M obtained by statistics in ST1 sf And M fo And carrying out bidirectional enhanced clustering among the main body-feature and feature-viewpoint word sets, and then re-clustering to obtain a clustering result inside each word set.
Concrete embodiments
First, preliminary clustering is performed inside the three word sets using the word vectors trained in ST1. Then, between the subject-feature and feature-viewpoint word sets, the co-occurrence frequency matrices M_sf and M_fo are used to build association matrices capturing the association between each word of one set and the classes of the other set. Mutually enhanced clustering is performed between the two word sets using both the association similarity and the content similarity between words, finally converging to the bidirectional enhanced clustering results of the subject-feature and feature-viewpoint word sets.
In clustering, the similarity measure between words is defined as:

S(w_i, w_j) = λ · S_content(w_i, w_j) + (1 − λ) · S_rel(w_i, w_j)

where S_content(w_i, w_j) is the content similarity between words w_i and w_j, i.e. the similarity of their word vectors, S_rel(w_i, w_j) is the association similarity, i.e. the similarity of their corresponding association vectors in the association matrix, and λ ∈ [0, 1] is the weight given to the content similarity.
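A minimal sketch of this combined measure, assuming each word is represented by a (word vector, association vector) pair; the parameter `lam` plays the role of λ in the formula above.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def word_similarity(vec_i, vec_j, assoc_i, assoc_j, lam=0.5):
    """S = lam * S_content + (1 - lam) * S_rel, both terms as cosines."""
    return lam * cosine(vec_i, vec_j) + (1 - lam) * cosine(assoc_i, assoc_j)
```

With lam = 1 the measure reduces to plain word-vector similarity (the preliminary clustering of step a), while lam < 1 lets the co-occurrence structure of the other word set pull the clustering, which is the enhancement the method relies on.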
The specific flow of mutually enhanced clustering between two word sets F and O is:
a. considering only the internal similarity, i.e. the cosine similarity between word vectors, cluster the words of set F into k classes;
b. update the association matrix M_1 of set O according to the clustering result of set F: the association vector of word o_i with respect to the F clusters is R_i = (r_{i1}, r_{i2}, ..., r_{ik}), where each component corresponds to one of the k classes of F and r_{ix}, the weight between word o_i and the x-th class, is the sum of the co-occurrence frequencies of o_i with all words in that class; the n association vectors form the new n×k association matrix M_1;
c. cluster the data objects of set O into l classes based on the updated association matrix M_1 between set O and set F;
d. update the association matrix M_2 of set F according to the clustering result of set O: the association vector of word f_i with respect to the O clusters is R_i = (r_{i1}, r_{i2}, ..., r_{il}), where each component corresponds to one of the l classes of O and r_{ix}, the weight between word f_i and the x-th class, is the sum of the co-occurrence frequencies of f_i with all words in that class; the m association vectors form the new m×l association matrix M_2;
e. re-cluster the data objects of set F into k classes based on the updated association matrix M_2 between set F and set O;
f. iterate steps b-e until the clustering results of the two word sets converge or the relative error falls below a given level.
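The iteration a-f can be sketched as follows. A tiny deterministic k-means stands in for the clustering step, and the combined distance simply concatenates λ-weighted, row-normalized word vectors and association vectors; this is an approximation of the similarity measure above for illustration, not the patent's exact procedure.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal deterministic k-means (first k points as initial centers);
    returns a class label per point."""
    centers = points[:k].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def assoc_matrix(cooc, labels_other, k):
    """r_ix = sum of co-occurrence frequencies of word i with all words
    of the x-th class of the other word set (cooc is n_self x n_other)."""
    m = np.zeros((cooc.shape[0], k))
    for x in range(k):
        m[:, x] = cooc[:, labels_other == x].sum(axis=1)
    return m

def normed(m):
    """Row-normalize so concatenation weights the two parts comparably."""
    n = np.linalg.norm(m, axis=1, keepdims=True)
    n[n == 0] = 1.0
    return m / n

def mutual_enhanced(vec_f, vec_o, cooc_fo, k, l, lam=0.5, rounds=5):
    labels_f = kmeans(vec_f, k)                        # step a: F by word vectors only
    labels_o = np.zeros(len(vec_o), dtype=int)
    for _ in range(rounds):                            # steps b-e
        m1 = assoc_matrix(cooc_fo.T, labels_f, k)      # b: O's association vectors
        feats_o = np.hstack([lam * normed(vec_o), (1 - lam) * normed(m1)])
        labels_o = kmeans(feats_o, l)                  # c: cluster O
        m2 = assoc_matrix(cooc_fo, labels_o, l)        # d: F's association vectors
        feats_f = np.hstack([lam * normed(vec_f), (1 - lam) * normed(m2)])
        labels_f = kmeans(feats_f, k)                  # e: re-cluster F
    return labels_f, labels_o
```

A fixed number of rounds stands in for the convergence test of step f.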
Finally, the subject word clustering result S_r obtained from the subject-feature mutual clustering is used to re-cluster the feature word clustering result F_r obtained from the feature-viewpoint mutual clustering, ensuring that the final feature word clustering result F_fr contains both subject and viewpoint information. The re-clustering process is as follows:
For the feature word clustering result F_r to be re-clustered, i.e. the feature word set already clustered by the feature-viewpoint mutual enhancement, the association vector of feature word y_i with respect to the subject word clustering result S_r is R_i = (r_{i1}, ..., r_{ip}), where each component corresponds to one of the p classes of S_r and r_{ix} is the association weight between y_i and the x-th subject class. Inside each class of F_r, the association vector similarity is computed for every pair of feature words; feature words whose vector similarity is below the threshold t are split into new classes, finally yielding the re-clustered feature word set F_fr.
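The splitting rule can be sketched as a greedy grouping: within each feature class, a word joins the first subgroup whose representative has a similar enough association vector (cosine ≥ t), otherwise it starts a new subgroup. The greedy representative choice is an assumption; the patent only specifies pairwise similarity against the threshold t.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recluster(feature_classes, assoc, t=0.5):
    """Split each class of F_r so that feature words whose association
    vectors (rows of `assoc`, keyed by word) disagree end up in separate
    classes, yielding F_fr as a new list of classes."""
    result = []
    for cls in feature_classes:
        subgroups = []  # list of (representative_word, members)
        for w in cls:
            for rep, members in subgroups:
                if cosine(assoc[w], assoc[rep]) >= t:
                    members.append(w)  # same subject-association profile
                    break
            else:
                subgroups.append((w, [w]))  # disagrees with all: new class
        result.extend(members for _, members in subgroups)
    return result
```

Two feature words that co-occur with the same viewpoint words but with different subject classes (e.g. a multi-domain class) thus get separated, which is exactly the purpose of this step.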
ST3: and calculating mutual information between classes of two word sets by using the co-occurrence frequency matrix to serve as the association strength between the classes, constructing a bipartite graph between a main body and the feature, between the feature and the viewpoint word set, and forming a main body-feature-viewpoint association network.
Body-feature-perspective association network referring to fig. 2, where words are divided into three parts: a main word set, a characteristic word set and a viewpoint word set. The three word sets are clustered into a plurality of classes through the clustering in the ST2, the part encircled by each dotted line in the graph represents one class, the classes of the main body-feature word set and the feature-viewpoint word set contain correlations, the correlations among the classes are represented in the graph by the dotted line, and the correlations represent that the words in the two classes jointly appear in sentences in the corpus.
In fig. 2, the associations between classes are drawn as dotted lines. The method uses the pointwise mutual information (PMI) between two classes as their association strength, computed as:

PMI(c_1, c_2) = log( P'(c_1, c_2) / (P(c_1) · P(c_2)) )

where P(c_1) and P(c_2) are the occurrence frequencies in the corpus of the words of classes c_1 and c_2, and P'(c_1, c_2) is the sum of the sentence-level co-occurrence frequencies in the corpus of all words of c_1 with all words of c_2.
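A sketch of this inter-class PMI, assuming `freq` maps each word to its (normalized) occurrence frequency in the corpus and `cooc` maps word pairs to sentence-level co-occurrence frequencies; class pairs with no co-occurrence get -inf, i.e. no edge in the network. The container names are illustrative.

```python
import math

def class_pmi(c1, c2, freq, cooc):
    """PMI(c1, c2) = log( P'(c1, c2) / (P(c1) * P(c2)) )."""
    p1 = sum(freq.get(w, 0.0) for w in c1)            # P(c1)
    p2 = sum(freq.get(w, 0.0) for w in c2)            # P(c2)
    p12 = sum(cooc.get((a, b), 0.0)                   # P'(c1, c2)
              for a in c1 for b in c2)
    if not (p1 and p2 and p12):
        return float("-inf")                          # no association edge
    return math.log(p12 / (p1 * p2))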
Referring to fig. 3, a specific construction flow of the body-feature-perspective association network is as follows:
a. Cluster the feature word set F into k classes using only the internal similarity, i.e. the cosine similarity between word vectors, obtaining the preliminarily clustered feature word set F_1;
b. Following the mutually enhanced clustering method of ST2, perform bidirectional enhanced clustering between F_1 and the subject word set S to obtain the clustered subject word set S_1, and between F_1 and the viewpoint word set O to obtain the clustered viewpoint word set O_1 and feature word set F_2;
c. Since some classes of F_2 contain features from multiple domains, F_2 must be re-clustered according to the association weight matrix with the subject word set S_1; the re-clustering method is as described in ST2, finally yielding the re-clustered feature word set F_3;
d. From the subject-feature and feature-viewpoint co-occurrence frequency matrices M_sf and M_fo counted over the corpus, compute the inter-class association strengths between S_1 and F_3 and between F_3 and O_1, represented by the PMI described above. Using PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked, giving the clustering results and association information of the three word sets (the number of classes, the center vector of each class, the class label of each word, and the association strengths between classes), which together constitute the subject-feature-viewpoint association network.
ST4: for sentences needing implicit feature extraction, firstly obtaining main body and viewpoint words in the sentences, then judging the classes in the respective word sets, determining possible implicit feature classes according to a main body-feature-viewpoint association network, and finally obtaining the most possible implicit feature words from the classes. The specific flow is shown in fig. 4:
a. Perform word segmentation, part-of-speech tagging and dependency analysis on the sentence whose implicit feature is to be recognized, taking nouns as subject word candidates and adjectives as viewpoint words. For each noun-adjective pair connected by a specific relation on the dependency tree, check whether the noun is in the feature word set: if so, extract it as an explicit feature; otherwise take it as a subject word;
b. determine the subject class s and viewpoint class o of the recognized subject and viewpoint words, and, according to the inter-class association strengths of the subject-feature and feature-viewpoint word sets stored in the association network, select the feature class with the strongest average association strength to both s and o;
c. extract the most probable word from that feature class as the implicit feature word; here the word of the class with the highest occurrence frequency in the corpus is taken as the implicit feature word w.
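Steps a-c reduce to a lookup over the stored inter-class strengths. In this sketch `sf_pmi` and `fo_pmi` map class-index pairs to PMI values, `feature_classes` lists the words of each feature class, and `corpus_freq` maps words to corpus frequency; all names are illustrative assumptions.

```python
def predict_implicit_feature(s_cls, o_cls, sf_pmi, fo_pmi,
                             feature_classes, corpus_freq):
    """Choose the feature class with the strongest average association to
    the given subject class s_cls and viewpoint class o_cls, then return
    that class's most frequent word as the implicit feature."""
    neg_inf = float("-inf")
    best = max(
        range(len(feature_classes)),
        key=lambda f: (sf_pmi.get((s_cls, f), neg_inf)      # subject-feature edge
                       + fo_pmi.get((f, o_cls), neg_inf)) / 2,  # feature-viewpoint edge
    )
    return max(feature_classes[best], key=lambda w: corpus_freq.get(w, 0))
```

Missing edges default to -inf, so only feature classes associated with both the subject class and the viewpoint class can win, mirroring step b.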
Referring to fig. 4, implicit feature extraction is illustrated on the concrete example sentence "Zhang is still very young, but her performance has already won approval":
a. Perform word segmentation, part-of-speech tagging and dependency analysis on the sentence. The noun "Zhang" is connected to the adjective "young" by the specified relation "nsubj" on the dependency tree; since "Zhang" is not in the feature word set, "Zhang" is taken as the subject word and the adjective "young" as the viewpoint word;
b. For the subject word "Zhang" and the viewpoint word "young" recognized in a, compute the similarity between their word vectors and the class center vectors of the subject classes and viewpoint classes respectively, and select the most similar subject class ("person") and viewpoint class ("size") as their classes. Then, according to the constructed subject-feature-viewpoint association network, among the feature classes associated with both the subject class "person" and the viewpoint class "size", take the one with the strongest average association strength as the most probable feature class;
c. From the most probable feature class obtained in b, select the most probable feature word as the predicted implicit feature; here the feature word of the class with the highest occurrence frequency in the corpus is selected as the implicit feature word.
Claims (5)
1. A computer-readable storage medium having a program stored thereon, the program when executed implementing a hidden association network-based multi-domain text implicit feature extraction method, comprising the steps of:
step 1: word vector training is carried out by using language to obtain word vectors of each word in the corpus, the language is preprocessed to obtain main bodies, characteristics and viewpoint word sets, and co-occurrence frequency matrixes of each word among the word sets in the corpus are obtained through statistics;
step 2: performing bidirectional enhanced clustering on the main body-feature and feature-viewpoint word sets according to the co-occurrence frequency matrix, and then re-clustering to obtain a clustering result in each word set;
step 3: calculating mutual information between classes of two word sets by using the co-occurrence frequency matrix to serve as the association strength between the classes, constructing a two-part graph between a main body and the feature, between the feature and the viewpoint word set, and forming a main body-feature-viewpoint association network;
step 4: for a sentence requiring implicit feature extraction, the subject and viewpoint words in the sentence are obtained, the classes they belong to in their respective word sets are determined, the possible implicit feature class is determined according to the subject-feature-viewpoint association network, and finally the most probable implicit feature word is obtained from the implicit feature class.
2. The computer-readable storage medium of claim 1, wherein, in step 1, word vector training is performed on the corpus to obtain a word vector for each word in the corpus, the corpus is preprocessed to obtain the subject, feature and viewpoint word sets, and the co-occurrence frequency matrices between the word sets are obtained by counting over the corpus, specifically: sentence segmentation and word segmentation are performed on the corpus to obtain training data, and word vector training is performed on the training data to obtain a word vector for each word in the corpus; sentence segmentation, word segmentation, part-of-speech tagging and dependency analysis are performed on the corpus; candidate nouns are selected from the sentences as subject words and added to the subject word set, while adjectives in the sentences are selected as viewpoint word candidates; according to the dependency tree obtained by dependency analysis, candidate feature words and candidate viewpoint words connected by specific dependency relations are added to the feature word set and the viewpoint word set respectively; and the co-occurrence frequency matrices of the words between the subject-feature and feature-viewpoint word sets are counted over the corpus.
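The sentence-level co-occurrence counting in this claim can be sketched roughly as below. The tokenized sentences and word sets are invented for illustration; the actual method would obtain the sets via part-of-speech tagging and dependency analysis rather than hard-coded lists.

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, set_a, set_b):
    # Count how often a word from set_a and a word from set_b
    # appear in the same sentence (sentence-level co-occurrence).
    counts = defaultdict(int)
    for tokens in sentences:
        present_a = [w for w in tokens if w in set_a]
        present_b = [w for w in tokens if w in set_b]
        for a in present_a:
            for b in present_b:
                counts[(a, b)] += 1
    return counts

# Hypothetical pre-segmented sentences and word sets.
sents = [["phone", "screen", "bright"], ["phone", "battery", "large"]]
m = cooccurrence_matrix(sents, {"phone"}, {"screen", "battery"})
print(m[("phone", "screen")])  # -> 1
```

The same counting would be run once for the subject-feature set pair and once for the feature-viewpoint set pair.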
3. The computer-readable storage medium of claim 1, wherein, in step 2, bidirectional enhanced clustering is performed on the subject-feature and feature-viewpoint word set pairs according to the co-occurrence frequency matrices, followed by re-clustering to obtain a clustering result within each word set, specifically: first, preliminary clustering is performed within the three word sets according to the word vectors trained in step 1; then, between the subject-feature and the feature-viewpoint word set pairs, the association between each word of one word set and the fixed clusters of the other word set is considered to obtain an inter-association matrix, and mutually enhanced iterative clustering is performed using both the association similarity and the content similarity between words until convergence, yielding the clustering results of the subject-feature and feature-viewpoint word set pairs; finally, the subject clustering result obtained from the mutually enhanced clustering of the subject-feature word sets is used to re-cluster the feature clustering result obtained from the mutually enhanced clustering of the feature-viewpoint word sets, ensuring that the final feature clusters incorporate both subject and viewpoint information;
in clustering, the similarity measure between two words is defined as:
S(W_i, W_j) = λ · S_content(W_i, W_j) + (1 − λ) · S_rel(W_i, W_j)
where S_content(W_i, W_j), called the content similarity between words W_i and W_j, is the similarity of their word vectors; S_rel(W_i, W_j), called the association similarity between W_i and W_j, is the similarity of their corresponding association vectors in the inter-association matrix; and λ is the weight given to the content similarity. The mutual enhancement clustering procedure between the two word sets F and O is as follows:
a. considering only content similarity, i.e. the cosine similarity between word vectors, the words in set F are clustered into k classes;
b. the inter-association matrix M1 of set O is updated according to the clustering result of set F: for any word O_i in set O, the association vector of O_i over the clustering result of set F has k components, one per class of the clustered set F; its x-th component, the weight between word O_i and the x-th class of set F, is the sum of the co-occurrence frequencies of O_i with all words in the x-th class, x ∈ [1, k]; finally, the association vectors of the n words in set O form the new n×k inter-association matrix M1;
c. the words in set O are clustered into l classes according to the inter-association matrix M1 between set O and set F updated in step b;
d. the inter-association matrix M2 of set F is updated according to the clustering result of set O: for any word F_i in set F, the association vector of F_i over the clustering result of set O has l components, one per class of the clustered set O; its y-th component, the weight between word F_i and the y-th class of set O, is the sum of the co-occurrence frequencies of F_i with all words in the y-th class, y ∈ [1, l]; finally, the association vectors of the m words in set F form the new m×l inter-association matrix M2;
e. the words in set F are re-clustered into k classes according to the inter-association matrix M2 between set F and set O updated in step d;
f. steps b-e are iterated until the clustering results of the two word sets converge;
the subject clustering result S_r obtained from the mutually enhanced clustering of the subject-feature word sets is used to re-cluster the feature clustering result F_r obtained from the mutually enhanced clustering of the feature-viewpoint word sets, as follows:
suppose the subject clustering result S_r contains p classes obtained by bidirectional enhanced clustering, and the feature clustering result F_r contains q such classes; for the feature clustering result F_r to be re-clustered, the association vector between any feature word Y_i in F_r and the subject clustering result S_r is denoted R'_i; each component of R'_i corresponds to one of the p classes of S_r, its z-th component being the weight between feature word Y_i and the z-th class of S_r, z ∈ [1, p]; within each class of F_r, the association-vector similarity is computed for every pair of feature words, and feature words whose association-vector similarity is below a threshold t are split off into new classes, finally yielding the re-clustered feature word set F_fr.
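Step b of the mutual enhancement procedure (building the inter-association matrix from co-occurrence counts and a fixed clustering of the other set) might look like the minimal numpy sketch below; the words, clusters and counts are invented for illustration and are not from the patent's corpus.

```python
import numpy as np

def association_vectors(words_o, clusters_f, cooc):
    # For each word in set O, build a k-dimensional association vector whose
    # x-th component is the sum of its co-occurrence frequencies with every
    # word in the x-th class of set F; rows form the n x k matrix M1.
    k = len(clusters_f)
    M1 = np.zeros((len(words_o), k))
    for i, o in enumerate(words_o):
        for x, cls in enumerate(clusters_f):
            M1[i, x] = sum(cooc.get((o, f), 0) for f in cls)
    return M1

# Hypothetical co-occurrence counts and a 2-class clustering of set F.
cooc = {("bright", "screen"): 3, ("large", "battery"): 2, ("large", "screen"): 1}
clusters_f = [["screen"], ["battery"]]
M1 = association_vectors(["bright", "large"], clusters_f, cooc)
print(M1)  # one association vector per word of set O
```

Clustering the rows of M1 gives the association-similarity side of the iteration; step d is the same construction with the roles of F and O swapped.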
4. The computer-readable storage medium of claim 1, wherein, in step 3, the mutual information between the classes of two word sets is calculated from the co-occurrence frequency matrix as the inter-class association strength, and bipartite graphs are constructed between the subject and feature word sets and between the feature and viewpoint word sets to form the subject-feature-viewpoint association network, specifically:
a. the feature word set F is clustered into k classes according to content similarity only, i.e. the cosine similarity between word vectors, yielding the preliminarily clustered feature word set F1;
b. according to the bidirectional enhanced clustering method of step 2, bidirectional enhanced clustering is performed between set F1 and the subject word set S to obtain the clustered subject word set S1, and between set F1 and the viewpoint word set O to obtain the clustered viewpoint word set O1 and feature word set F2;
c. since some classes of the feature word set F2, obtained by bidirectional enhanced clustering between set F1 and the viewpoint word set O, contain multi-domain features, F2 must be re-clustered according to the inter-association matrix M between the feature word set F2 and the subject word set S1; M is composed of the association vector between each feature word of F2 and the subject word set S1, each component of an association vector representing the weight between the corresponding feature word and one class of S1; F2 is re-clustered according to M by the method described in step 2, finally yielding the re-clustered feature word set F3;
d. according to the subject-feature-viewpoint co-occurrence frequency matrices counted from the corpus, the inter-class association strengths between the subject word set S1 and the feature word set F3, and between the feature word set F3 and the viewpoint word set O1, are constructed; the association strength is measured by pointwise mutual information (PMI):
PMI(c1, c2) = log( P'(c1, c2) / (P(c1) · P(c2)) )
where P(c1) and P(c2) are the occurrence frequencies in the corpus of the words of class c1 and of class c2, and P'(c1, c2) is the sum of the sentence-level co-occurrence frequencies in the corpus between all words of class c1 and all words of class c2; using PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked to construct the subject-feature-viewpoint association network.
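Under the definition above, the inter-class PMI can be computed as in this small sketch; the frequency values passed in are hypothetical.

```python
import math

def class_pmi(p_c1, p_c2, p_joint):
    # PMI between two classes: log of the joint (sentence-level co-occurrence)
    # frequency over the product of the classes' individual frequencies.
    return math.log(p_joint / (p_c1 * p_c2))

# Hypothetical frequencies normalized over the corpus.
print(round(class_pmi(0.1, 0.2, 0.05), 3))  # -> 0.916
```

A positive value means the two classes co-occur more often than independence would predict, which is why PMI serves as the edge weight of the association network.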
5. The computer-readable storage medium of claim 1, wherein, in step 4, for a sentence requiring implicit feature extraction, the subject and viewpoint words in it are obtained, the classes they belong to in their respective word sets are determined, the possible implicit feature class is determined according to the subject-feature-viewpoint association network, and the most probable implicit feature word is finally obtained from the implicit feature class, specifically: word segmentation, part-of-speech tagging and dependency analysis are performed on the sentence whose implicit feature is to be identified, and candidate subject words and viewpoint words are identified in it; the subject class s and the viewpoint class o to which the identified subject and viewpoint words belong are determined, and the feature class f with the strongest average association with the subject class s and the viewpoint class o is selected according to the inter-class association strengths between the subject-feature and feature-viewpoint word sets in the association network; and the word occurring most frequently in the corpus is extracted from the feature class f as the implicit feature word w.
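The inference step of claim 5 (choosing the feature class with the strongest average association to the identified subject and viewpoint classes, then returning its most frequent word) might be sketched as follows; the association-network fragment, class names and frequency counts are invented for illustration.

```python
def most_probable_feature(s_cls, o_cls, strength, feature_classes, freq):
    # strength[(a, b)]: association strength (e.g. PMI) between classes a and b.
    # Pick the feature class maximizing the mean strength to the subject class
    # and the viewpoint class, then return its most frequent word.
    best = max(feature_classes,
               key=lambda f: (strength[(s_cls, f)] + strength[(f, o_cls)]) / 2)
    return max(feature_classes[best], key=lambda w: freq[w])

# Hypothetical fragment of a subject-feature-viewpoint association network.
strength = {("people", "appearance"): 1.2, ("appearance", "size"): 0.9,
            ("people", "price"): 0.1, ("price", "size"): 0.2}
classes = {"appearance": ["height", "figure"], "price": ["cost"]}
freq = {"height": 30, "figure": 12, "cost": 5}
print(most_probable_feature("people", "size", strength, classes, freq))  # -> height
```

Here "appearance" wins with mean strength 1.05 versus 0.15 for "price", and "height" is its most frequent member in the toy counts.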
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010623820.1A CN111859898B (en) | 2019-04-16 | 2019-04-16 | Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910304794.3A CN110020439B (en) | 2019-04-16 | 2019-04-16 | Hidden associated network-based multi-field text implicit feature extraction method |
CN202010623820.1A CN111859898B (en) | 2019-04-16 | 2019-04-16 | Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910304794.3A Division CN110020439B (en) | 2019-04-16 | 2019-04-16 | Hidden associated network-based multi-field text implicit feature extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111859898A CN111859898A (en) | 2020-10-30 |
CN111859898B true CN111859898B (en) | 2024-01-16 |
Family
ID=67191503
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910304794.3A Active CN110020439B (en) | 2019-04-16 | 2019-04-16 | Hidden associated network-based multi-field text implicit feature extraction method |
CN202010623820.1A Active CN111859898B (en) | 2019-04-16 | 2019-04-16 | Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910304794.3A Active CN110020439B (en) | 2019-04-16 | 2019-04-16 | Hidden associated network-based multi-field text implicit feature extraction method |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110020439B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113821587B (en) * | 2021-06-02 | 2024-05-17 | 腾讯科技(深圳)有限公司 | Text relevance determining method, model training method, device and storage medium |
CN115168600B (en) * | 2022-06-23 | 2023-07-11 | 广州大学 | Value chain knowledge discovery method under personalized customization |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006338342A (en) * | 2005-06-02 | 2006-12-14 | Nippon Telegr & Teleph Corp <Ntt> | Word vector generation device, word vector generation method and program |
CN103365999A (en) * | 2013-07-16 | 2013-10-23 | 盐城工学院 | Text clustering integrated method based on similarity degree matrix spectral factorization |
CN103412880A (en) * | 2013-07-17 | 2013-11-27 | 百度在线网络技术(北京)有限公司 | Method and device for determining implicit associated information between multimedia resources |
CN103646097A (en) * | 2013-12-18 | 2014-03-19 | 北京理工大学 | Constraint relationship based opinion objective and emotion word united clustering method |
CN105007262A (en) * | 2015-06-03 | 2015-10-28 | 浙江大学城市学院 | WLAN multi-step attack intention pre-recognition method |
CN106354754A (en) * | 2016-08-16 | 2017-01-25 | 清华大学 | Bootstrap-type implicit characteristic mining method and system based on dispersed independent component analysis |
CN106372117A (en) * | 2016-08-23 | 2017-02-01 | 电子科技大学 | Word co-occurrence-based text classification method and apparatus |
CN107358014A (en) * | 2016-11-02 | 2017-11-17 | 华南师范大学 | The clinical pre-treating method and system of a kind of physiological data |
CN107391575A (en) * | 2017-06-20 | 2017-11-24 | 浙江理工大学 | A kind of implicit features recognition methods of word-based vector model |
CN107562717A (en) * | 2017-07-24 | 2018-01-09 | 南京邮电大学 | A kind of text key word abstracting method being combined based on Word2Vec with Term co-occurrence |
CN108140061A (en) * | 2015-06-05 | 2018-06-08 | 凯撒斯劳滕工业大学 | Network die body automatically determines |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5697202B2 (en) * | 2011-03-08 | 2015-04-08 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | Method, program and system for finding correspondence of terms |
US20140272914A1 (en) * | 2013-03-15 | 2014-09-18 | William Marsh Rice University | Sparse Factor Analysis for Learning Analytics and Content Analytics |
US9594746B2 (en) * | 2015-02-13 | 2017-03-14 | International Business Machines Corporation | Identifying word-senses based on linguistic variations |
2019
- 2019-04-16 CN CN201910304794.3A patent/CN110020439B/en active Active
- 2019-04-16 CN CN202010623820.1A patent/CN111859898B/en active Active
Non-Patent Citations (4)
Title |
---|
Analysis of and Countermeasures for BIM Application Problems; Ren Haoying; 城市住宅 (Urban Housing); vol. 25, no. 07; 37-40 *
Bipartite Graph Filter Banks: Polyphase Analysis and Generalization; David B. H. Tay; IEEE Transactions on Signal Processing; vol. 65, no. 18; 4833-4846 *
A Chinese Semantic Disambiguation Method Based on Sense-Class Co-occurrence Frequency; Zhang Yongkui; Journal of Computer Research and Development; no. 07; 1-5 *
Research on Personalized Recommendation Methods in Multi-source Heterogeneous Environments; Wei Jianzhen; China Masters' Theses Full-text Database; no. 03; I138-2137 *
Also Published As
Publication number | Publication date |
---|---|
CN111859898A (en) | 2020-10-30 |
CN110020439A (en) | 2019-07-16 |
CN110020439B (en) | 2020-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109241524B (en) | Semantic analysis method and device, computer-readable storage medium and electronic equipment | |
Shi et al. | Functional and contextual attention-based LSTM for service recommendation in mashup creation | |
CN110717106B (en) | Information pushing method and device | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN113268995B (en) | Chinese academy keyword extraction method, device and storage medium | |
CN110765260A (en) | Information recommendation method based on convolutional neural network and joint attention mechanism | |
CN111190997B (en) | Question-answering system implementation method using neural network and machine learning ordering algorithm | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN107844533A (en) | A kind of intelligent Answer System and analysis method | |
CN110705247B (en) | Based on x2-C text similarity calculation method | |
CN112989208B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN111222318A (en) | Trigger word recognition method based on two-channel bidirectional LSTM-CRF network | |
CN110807086A (en) | Text data labeling method and device, storage medium and electronic equipment | |
CN114936277A (en) | Similarity problem matching method and user similarity problem matching system | |
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis | |
CN111191031A (en) | Entity relation classification method of unstructured text based on WordNet and IDF | |
CN112347758A (en) | Text abstract generation method and device, terminal equipment and storage medium | |
CN111859898B (en) | Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium | |
CN111737560A (en) | Content search method, field prediction model training method, device and storage medium | |
CN114997288A (en) | Design resource association method | |
CN110569355B (en) | Viewpoint target extraction and target emotion classification combined method and system based on word blocks | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
Swamy et al. | Nit-agartala-nlp-team at semeval-2020 task 8: Building multimodal classifiers to tackle internet humor | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||