CN111859898B - Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium

Info

Publication number: CN111859898B (grant of CN111859898A)
Authority: CN (China)
Prior art keywords: word, feature, clustering, viewpoint, words
Application number: CN202010623820.1A
Other languages: Chinese (zh)
Other versions: CN111859898A
Inventor: name withheld at the inventor's request
Assignee (current and original): Zhongsenyunlian Chengdu Technology Co ltd
Application filed by Zhongsenyunlian Chengdu Technology Co ltd; priority to CN202010623820.1A
Legal status: Active (granted)

Classifications

    • G06F40/205: Handling natural language data; natural language analysis; parsing
    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F40/279, G06F40/289: Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the field of computer natural language processing and provides a computer-readable storage medium storing a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method. The method comprises the following steps: preprocess the corpus to obtain the subject, feature and viewpoint word sets, and count the co-occurrence frequency matrices of subject, feature and viewpoint words in the corpus; perform bidirectional enhanced clustering on the three word sets according to the co-occurrence frequency matrices; calculate the association strengths and construct a subject-feature-viewpoint hidden association network; and extract implicit features using the hidden association network. Aiming at the problem that existing implicit feature extraction methods perform poorly on multi-domain text, the method considers both the associations between features and domain knowledge, and by constructing the subject-feature-viewpoint hidden association network it extracts implicit features from multi-domain text more effectively.

Description

Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium
This application is a divisional application of Chinese patent application No. 201910304794.3, filed on April 16, 2019, entitled "Hidden associated network-based multi-field text implicit feature extraction method".
Technical Field
The invention relates to the field of computer natural language processing, and in particular to a multi-domain text implicit feature extraction method based on a hidden association network.
Background
With the rise of electronic commerce and social networks, user-generated short texts carrying subjective emotion, such as microblogs and product reviews, are growing rapidly. Such user-generated information is a valuable resource: the subjective emotions, opinions and similar information it contains can help people make decisions, so mining the views expressed in text with subjective user emotion has attracted a great deal of research. Increasingly, researchers focus on finer-grained opinion mining, which extracts people's views on a particular aspect of a thing; in these studies such views are called feature-level opinions.
Most research in this field focuses on finding explicit features in text. In many cases, however, feature words are expressed only implicitly through viewpoint words. For example, "this computer is a rip-off" implicitly expresses that a feature ("price") of the subject ("computer") carries a viewpoint ("rip-off"); features that do not appear explicitly in the text are called implicit features. Existing research on implicit features mainly considers only the association between feature words and viewpoint words in text: hidden associations between them are mined from their co-occurrence frequency matrix in the corpus, so that, given a viewpoint word, the possible implicit feature can be predicted from those hidden associations.
Much of today's text is mixed-domain text containing content from multiple domains, such as politics, biology and economics. Previously proposed implicit feature recognition methods consider only the association between feature words and viewpoint words in the text and do not consider application to multi-domain text, so they cannot achieve good results on today's increasingly mixed-domain text.
Disclosure of Invention
The invention aims to solve the problem that existing implicit feature recognition methods perform poorly on multi-domain text, and provides a hidden association network-based multi-domain text implicit feature extraction method. The method adds subject words as prior-knowledge constraints on the domain of the text and uses them in constructing the hidden association network; by considering the hidden associations among subject, feature and viewpoint, the method can be applied well to implicit feature extraction from multi-domain text.
To achieve the object of the present invention, an embodiment of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method, the method comprising:
step 1: perform word vector training on the corpus to obtain a word vector for each word in the corpus, preprocess the corpus to obtain the subject, feature and viewpoint word sets, and count the co-occurrence frequency matrices of the words between these word sets in the corpus;
step 2: perform bidirectional enhanced clustering on the subject-feature and feature-viewpoint word sets according to the co-occurrence frequency matrices, then re-cluster to obtain the clustering result within each word set;
step 3: compute the pointwise mutual information between classes of two word sets from the co-occurrence frequency matrices as the inter-class association strength, construct bipartite graphs between the subject and feature word sets and between the feature and viewpoint word sets, and form the subject-feature-viewpoint association network;
step 4: for a sentence requiring implicit feature extraction, obtain the subject and viewpoint words in the sentence, determine the classes they belong to within their respective word sets, determine the possible implicit feature class according to the subject-feature-viewpoint association network, and finally obtain the most likely implicit feature word from that class.
In step 1, word vector training is performed on the corpus to obtain a word vector for each word in the corpus; the corpus is preprocessed by sentence segmentation, part-of-speech tagging and dependency analysis to obtain the subject words, feature words and viewpoint words of each sentence, finally yielding the subject, feature and viewpoint word sets of the corpus; and the co-occurrence frequency matrices of the words between the subject-feature and feature-viewpoint word sets in the corpus are obtained by counting.
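As an illustration of how the co-occurrence frequency matrices of step 1 could be counted, the following minimal Python sketch assumes the corpus has already been segmented into token lists and the word sets are known; the function name, data layout and toy data are illustrative assumptions, not the patented reference implementation:

```python
from collections import defaultdict

def count_cooccurrence(sentences, set_a, set_b):
    """Sentence-level co-occurrence counts between two word sets.

    sentences: list of token lists; set_a / set_b: sets of words.
    Returns M with M[a][b] = number of sentences where a and b co-occur.
    """
    M = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        present = set(tokens)
        for a in present & set_a:
            for b in present & set_b:
                M[a][b] += 1
    return M

# Toy usage with the subject / feature / viewpoint split:
sentences = [["computer", "price", "cheap"], ["computer", "screen", "bright"]]
subjects, features, opinions = {"computer"}, {"price", "screen"}, {"cheap", "bright"}
M_sf = count_cooccurrence(sentences, subjects, features)
M_fo = count_cooccurrence(sentences, features, opinions)
print(M_sf["computer"]["price"], M_fo["screen"]["bright"])  # 1 1
```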
In step 2, preliminary clustering is first performed within each of the three word sets according to the word vectors trained in step 1. Then, between the subject-feature and feature-viewpoint word sets, the association between each word of one word set and the clusters of the other word set is considered to obtain an inter-association matrix; mutually enhanced clustering is performed between the two word sets using both the association similarity and the content similarity between words, finally converging to the clustering results of the subject-feature and feature-viewpoint word sets. The feature word clusters obtained by mutually enhanced clustering of the feature-viewpoint word sets are then re-clustered using the subject word clusters obtained by mutually enhanced clustering of the subject-feature word sets, ensuring that the final feature word clusters incorporate both subject and viewpoint information.
In clustering, the similarity measure between words is defined as:

S(w_i, w_j) = λ · S_content(w_i, w_j) + (1 - λ) · S_rel(w_i, w_j)

where S_content(w_i, w_j) denotes the content similarity between words w_i and w_j (the similarity of their word vectors), S_rel(w_i, w_j) denotes the association similarity between w_i and w_j (the similarity of their corresponding association vectors in the inter-association matrix), and λ ∈ [0, 1] is the weight given to content similarity.
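As an illustration, this weighted similarity could be computed as in the sketch below, assuming word vectors and association vectors are available as dictionaries and using cosine similarity for both parts; the default λ = 0.5 is purely illustrative:

```python
import numpy as np

def combined_similarity(w_i, w_j, word_vec, assoc_vec, lam=0.5):
    """S(w_i, w_j) = lam * S_content + (1 - lam) * S_rel.

    word_vec:  word -> word vector (content similarity, cosine).
    assoc_vec: word -> association vector (association similarity, cosine).
    lam in [0, 1] is the weight of the content part.
    """
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    s_content = cosine(word_vec[w_i], word_vec[w_j])
    s_rel = cosine(assoc_vec[w_i], assoc_vec[w_j])
    return lam * s_content + (1.0 - lam) * s_rel
```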
The process of bidirectional enhanced clustering between the two word sets F and O is as follows:
a. considering only internal similarity, namely cosine similarity between word vectors, cluster the words in set F into k classes;
b. update the association matrix M_1 of set O according to the clustering result of set F: for each word o_i in O, the association vector of o_i with the clusters of F is R_i = (r_{i,1}, ..., r_{i,k}), where each component corresponds to one of the k classes of F and r_{i,x} is the weight between o_i and the x-th class, namely the sum of the co-occurrence frequencies of o_i with all words in the x-th class; these association vectors finally form a new n×k association matrix M_1;
c. cluster the data objects in set O into l classes based on the updated association matrix M_1 between sets O and F;
d. update the association matrix M_2 of set F according to the clustering result of set O: for each word f_i in F, the association vector of f_i with the clusters of O is R'_i = (r'_{i,1}, ..., r'_{i,l}), where each component corresponds to one of the l classes of O and r'_{i,y} is the weight between f_i and the y-th class, namely the sum of the co-occurrence frequencies of f_i with all words in the y-th class; these association vectors finally form a new m×l association matrix M_2;
e. re-cluster the data objects in set F into k classes based on the updated association matrix M_2 between sets F and O;
f. iterate steps b-e until the clustering results of the two word sets converge.
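A minimal sketch of this alternating procedure is given below. It uses scikit-learn's KMeans in place of the unspecified clustering algorithm and approximates the combined similarity by clustering on a weighted concatenation of word vector and association vector; all names and parameter choices are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def association_matrix(words, other_label, n_classes, cooc):
    """R[i, x] = sum of co-occurrence counts of words[i] with every word of
    the other set that currently sits in class x."""
    R = np.zeros((len(words), n_classes))
    for i, w in enumerate(words):
        for other_word, freq in cooc.get(w, {}).items():
            R[i, other_label[other_word]] += freq
    return R

def mutual_enhanced_clustering(F, O, vec_F, vec_O, cooc_FO, cooc_OF,
                               k, l, lam=0.5, iters=10):
    """Steps a-f: alternate clustering of F and O, re-deriving each set's
    association matrix from the other set's current clusters.  vec_F / vec_O
    are 2-D word-vector arrays aligned with the word lists F / O; clustering
    on [lam * word_vec, (1-lam) * assoc_vec] roughly stands in for the
    combined similarity measure defined above."""
    labels_F = KMeans(n_clusters=k, n_init=10).fit_predict(vec_F)       # step a
    labels_O = None
    for _ in range(iters):                                              # steps b-e
        M1 = association_matrix(O, dict(zip(F, labels_F)), k, cooc_OF)  # step b
        labels_O = KMeans(n_clusters=l, n_init=10).fit_predict(
            np.hstack([lam * vec_O, (1 - lam) * M1]))                   # step c
        M2 = association_matrix(F, dict(zip(O, labels_O)), l, cooc_FO)  # step d
        new_F = KMeans(n_clusters=k, n_init=10).fit_predict(
            np.hstack([lam * vec_F, (1 - lam) * M2]))                   # step e
        if np.array_equal(new_F, labels_F):                             # step f
            break
        labels_F = new_F
    return labels_F, labels_O
```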
The process of re-clustering the feature word clustering result is as follows: for the feature word clustering result F_r to be re-clustered, each feature word y_i has an association vector R_i = (r_{i,1}, ..., r_{i,p}) with the subject word clustering result S_r; each component of R_i corresponds to one of the p classes of S_r, and r_{i,z} is the weight between y_i and the z-th class of S_r. Within each class of F_r, the association-vector similarity is computed for every pair of feature words; feature words whose association-vector similarity is below the threshold t are split into new classes, finally yielding the re-clustered feature word set F_fr.
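The patent specifies only that word pairs whose association-vector similarity falls below t are split apart; the greedy single-pass grouping below is one possible realization (the grouping strategy and default threshold are assumptions):

```python
import numpy as np

def recluster(classes, assoc_vec, t=0.5):
    """Split each class of F_r so that every pair of remaining members has
    association-vector cosine similarity >= t (greedy single pass)."""
    def cosine(u, v):
        return float(np.dot(u, v) /
                     (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    F_fr = []
    for cls in classes:
        groups = []                    # each group becomes one new class
        for word in cls:
            for g in groups:
                if all(cosine(assoc_vec[word], assoc_vec[w]) >= t for w in g):
                    g.append(word)
                    break
            else:
                groups.append([word])  # below threshold with every group
        F_fr.extend(groups)
    return F_fr
```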
In step 3, the association strengths between the clusters of the subject-feature and feature-viewpoint word sets are computed from the co-occurrence frequency matrices according to the clustering result of step 2, and the subject-feature-viewpoint association network is finally constructed. The association strength is represented by the pointwise mutual information (PMI) between two classes, defined as:

PMI(c_1, c_2) = log( P'(c_1, c_2) / (P(c_1) · P(c_2)) )

where P(c_1) and P(c_2) are the occurrence frequencies in the corpus of the words in classes c_1 and c_2, and P'(c_1, c_2) is the sum of the sentence-level co-occurrence frequencies in the corpus of all words in c_1 with all words in c_2. Using the mutual information PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked and the subject-feature-viewpoint association network is constructed.
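The PMI computation translates directly into code. In the sketch below, the normalizing constant (e.g. the number of sentences in the corpus) is an assumption, since the patent does not fix how the frequencies are normalized:

```python
import math

def class_pmi(class1, class2, word_freq, cooc, total):
    """PMI(c1, c2) = log(P'(c1, c2) / (P(c1) * P(c2))).

    word_freq: word -> occurrence count in the corpus.
    cooc:      word -> {word: sentence-level co-occurrence count}.
    total:     normalizing constant (e.g. number of sentences).
    """
    p1 = sum(word_freq.get(w, 0) for w in class1) / total
    p2 = sum(word_freq.get(w, 0) for w in class2) / total
    p12 = sum(cooc.get(w1, {}).get(w2, 0)
              for w1 in class1 for w2 in class2) / total
    if p12 == 0 or p1 == 0 or p2 == 0:
        return float("-inf")           # no evidence of association
    return math.log(p12 / (p1 * p2))
```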
In step 4, possible implicit features in a sentence are extracted using the subject-feature-viewpoint association network. The basic flow is as follows: for a sentence requiring implicit feature extraction, obtain its subject words and viewpoint words using word segmentation, part-of-speech tagging, dependency analysis and similar techniques; consider the subject class and viewpoint class to which they belong; obtain, according to the subject-feature-viewpoint association network, the feature class with the highest weighted association strength to both classes; and finally predict the most likely feature word in that class as the implicit feature. Because the association with subject words is taken into account, this implicit feature recognition also works well for multi-domain text.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a diagram of the subject-feature-viewpoint association network;
FIG. 3 is a flow chart for constructing the subject-feature-viewpoint association network;
FIG. 4 is an example of implicit feature recognition using the subject-feature-viewpoint association network.
Detailed Description
The present invention will now be described in further detail with reference to the drawings and examples, which are not intended to limit the scope of the invention.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed, implements a hidden association network-based multi-domain text implicit feature extraction method, the method comprising:
referring to fig. 1, a multi-domain text implicit feature extraction method based on a hidden association network includes the following steps:
ST1: perform word vector training on the corpus to obtain a word vector for each word in the corpus, preprocess the corpus to obtain the subject, feature and viewpoint word sets, and count the co-occurrence frequency matrices of the words between these word sets in the corpus. The specific flow is as follows:
a. Perform sentence segmentation and word segmentation on the corpus to obtain training data, and perform word vector training on the training data to obtain the word vector corresponding to each word in the corpus.
b. Perform sentence segmentation, word segmentation, part-of-speech tagging and dependency analysis on the corpus. If the word-vector similarity between a noun in a sentence and the labeled subject of the sentence is greater than a threshold T, add the noun to the subject word set as a subject word; otherwise take it as a feature word candidate, and take the adjectives in the sentence as viewpoint word candidates. According to the sentence dependency tree obtained by dependency analysis, select the candidate feature words and candidate viewpoint words connected by specific relations on the dependency tree (such pairs are often connected by 'amod' and 'nsubj' edges) and add them to the feature word set and viewpoint word set, finally obtaining the subject, feature and viewpoint word sets of the corpus. A preprocessing sketch is given after step c below.
c. For each sentence whose subject word s has been determined as above, determine the feature word f and the viewpoint word o in the sentence, and count the co-occurrence frequencies of the subject word s, feature word f and viewpoint word o in the corpus; traversing all sentences of the corpus finally yields the co-occurrence frequency matrices M_sf and M_fo of the words between the subject-feature and feature-viewpoint word sets.
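As an illustration of steps a and b, the sketch below uses gensim for word-vector training and spaCy for tagging and dependency analysis; the tool choices, the English pipeline standing in for a Chinese one, the model parameters and the threshold are all assumptions, and the dependency labels checked ('amod', 'nsubj') depend on the parser's annotation scheme:

```python
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")  # a Chinese pipeline would be used in practice

def build_word_sets(raw_sentences, labeled_subjects, threshold=0.6):
    docs = [nlp(s) for s in raw_sentences]
    tokenized = [[t.text for t in doc] for doc in docs]
    w2v = Word2Vec(tokenized, vector_size=100, window=5, min_count=1)  # step a

    subjects, features, opinions = set(), set(), set()
    for doc in docs:                                                   # step b
        for tok in doc:
            # noun-adjective pairs linked by 'amod' (attributive) or, in
            # UD-style parses, 'nsubj' under an adjectival predicate
            if tok.dep_ == "amod" and tok.pos_ == "ADJ" and tok.head.pos_ == "NOUN":
                noun, adj = tok.head, tok
            elif tok.dep_ == "nsubj" and tok.pos_ == "NOUN" and tok.head.pos_ == "ADJ":
                noun, adj = tok, tok.head
            else:
                continue
            if noun.text in w2v.wv and any(
                    s in w2v.wv and w2v.wv.similarity(noun.text, s) > threshold
                    for s in labeled_subjects):
                subjects.add(noun.text)        # similar to a labeled subject
            else:
                features.add(noun.text)        # feature word candidate
                opinions.add(adj.text)         # viewpoint word candidate
    return w2v, tokenized, subjects, features, opinions
```

The co-occurrence counting of step c can then reuse the count_cooccurrence sketch given earlier.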
ST2: based on the co-occurrence frequency matrices M_sf and M_fo counted in ST1, perform bidirectional enhanced clustering between the subject-feature and feature-viewpoint word sets, then re-cluster to obtain the clustering result within each word set.
Specifically:
First, preliminary clustering is performed within the three word sets according to the word vectors trained in ST1. Then, between the subject-feature and feature-viewpoint word sets, the co-occurrence frequency matrices M_sf and M_fo are used to obtain inter-association matrices by considering the association between each word of one word set and the classes of the other word set. Mutually enhanced clustering is performed between the two word sets using both the association similarity and the content similarity between words, finally converging to the bidirectional enhanced clustering results of the subject-feature and feature-viewpoint word sets.
In clustering, the similarity measure between words is defined as:

S(w_i, w_j) = λ · S_content(w_i, w_j) + (1 - λ) · S_rel(w_i, w_j)

where S_content(w_i, w_j) denotes the content similarity between words w_i and w_j, i.e., the similarity of their word vectors; S_rel(w_i, w_j) denotes the association similarity between w_i and w_j, i.e., the similarity of their corresponding association vectors in the inter-association matrix; and λ ∈ [0, 1] denotes the weight given to content similarity.
The specific flow of mutually enhanced clustering between the two word sets F and O is as follows:
a. considering only internal similarity, namely cosine similarity between word vectors, cluster the words in set F into k classes;
b. update the association matrix M_1 of set O according to the clustering result of set F: for each word o_i in O, the association vector of o_i with the clusters of F is R_i = (r_{i,1}, ..., r_{i,k}), where each component corresponds to one of the k classes of F and r_{i,x} is the weight between o_i and the x-th class, namely the sum of the co-occurrence frequencies of o_i with all words in the x-th class; these association vectors finally form a new n×k association matrix M_1;
c. cluster the data objects in set O into l classes based on the updated association matrix M_1 between sets O and F;
d. update the association matrix M_2 of set F according to the clustering result of set O: for each word f_i in F, the association vector of f_i with the clusters of O is R'_i = (r'_{i,1}, ..., r'_{i,l}), where each component corresponds to one of the l classes of O and r'_{i,y} is the weight between f_i and the y-th class, namely the sum of the co-occurrence frequencies of f_i with all words in the y-th class; these association vectors finally form a new m×l association matrix M_2;
e. re-cluster the data objects in set F into k classes based on the updated association matrix M_2 between sets F and O;
f. iterate steps b-e until the clustering results of the two word sets converge or the relative error falls below a given level.
Finally, the subject word clustering result S_r is obtained by mutually enhanced clustering of the subject-feature word sets and the feature word clustering result F_r by mutually enhanced clustering of the feature-viewpoint word sets; F_r is then re-clustered so that the final feature word clustering result F_fr incorporates both subject and viewpoint information. The re-clustering process is as follows:
For the feature word clustering result F_r to be re-clustered, i.e., the feature word set for which mutually enhanced clustering between the feature-viewpoint word sets has been completed, each feature word y_i has an association vector R_i = (r_{i,1}, ..., r_{i,p}) with the subject word clustering result S_r; each component of R_i corresponds to one of the p classes of S_r, and r_{i,z} is the weight between y_i and the z-th class of S_r. Within each class of F_r, the association-vector similarity is computed for every pair of feature words; feature words whose association-vector similarity is below the threshold t are split into new classes, finally yielding the re-clustered feature word set F_fr.
ST3: compute the pointwise mutual information between classes of two word sets from the co-occurrence frequency matrices as the inter-class association strength, construct bipartite graphs between the subject and feature word sets and between the feature and viewpoint word sets, and form the subject-feature-viewpoint association network.
The subject-feature-viewpoint association network is shown in FIG. 2, where the words are divided into three parts: the subject word set, the feature word set and the viewpoint word set. The three word sets have been clustered into several classes by the clustering in ST2; each part circled by a dotted line in the figure represents one class. Associations exist between the classes of the subject-feature and feature-viewpoint word sets; they are drawn as dotted lines in the figure and indicate that words of the two classes co-occur in sentences of the corpus.
In FIG. 2, the association between classes is represented by the dotted lines between them. The method uses the pointwise mutual information (PMI) between two classes as the inter-class association strength, computed as:

PMI(c_1, c_2) = log( P'(c_1, c_2) / (P(c_1) · P(c_2)) )

where P(c_1) and P(c_2) are the occurrence frequencies in the corpus of the words in classes c_1 and c_2, and P'(c_1, c_2) is the sum of the sentence-level co-occurrence frequencies in the corpus of all words in c_1 with all words in c_2.
Referring to FIG. 3, the specific construction flow of the subject-feature-viewpoint association network is as follows:
a. Cluster the feature word set F into k classes considering only internal relevance, i.e., cosine similarity between word vectors, obtaining the preliminarily clustered feature word set F_1.
b. Using the mutual-enhancement clustering method of ST2, perform bidirectional enhanced clustering between the feature word set F_1 and the subject word set S to obtain the clustered subject word set S_1, and between the feature word set F_1 and the viewpoint word set O to obtain the clustered viewpoint word set O_1 and feature word set F_2.
c. Because some classes of F_2 contain features from multiple domains, F_2 must be re-clustered according to the association weight matrix with the subject word set S_1; the re-clustering method is as described in ST2, finally yielding the re-clustered feature word set F_3.
d. Based on the subject-feature and feature-viewpoint co-occurrence frequency matrices M_sf and M_fo counted from the corpus, construct the inter-class association strengths between the subject word set S_1 and the feature word set F_3 and between the feature word set F_3 and the viewpoint word set O_1, represented by the PMI described above. Using the pointwise mutual information PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked, yielding the clustering results and association information of the three word sets: the number of classes, the class-center vector of each class, the class label of each word, the inter-class association strengths, and so on. Together these form the subject-feature-viewpoint association network.
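Putting step d into code, the network can be represented as the three clustered word sets plus two PMI edge maps. The sketch below reuses the class_pmi function from the earlier PMI sketch; the dictionary layout is an illustrative assumption:

```python
def build_network(subject_classes, feature_classes, opinion_classes,
                  word_freq, cooc_sf, cooc_fo, total):
    """Subject-feature-viewpoint association network: the three clustered
    word sets plus PMI edge maps over the two bipartite graphs."""
    def pmi_edges(classes_a, classes_b, cooc):
        return {(i, j): class_pmi(ca, cb, word_freq, cooc, total)
                for i, ca in enumerate(classes_a)
                for j, cb in enumerate(classes_b)}

    return {
        "subject_classes": subject_classes,   # list of word lists (S_1)
        "feature_classes": feature_classes,   # list of word lists (F_3)
        "opinion_classes": opinion_classes,   # list of word lists (O_1)
        "sf_strength": pmi_edges(subject_classes, feature_classes, cooc_sf),
        "fo_strength": pmi_edges(feature_classes, opinion_classes, cooc_fo),
    }
```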
ST4: for a sentence requiring implicit feature extraction, first obtain the subject and viewpoint words in it, then determine the classes they belong to within their respective word sets, determine the possible implicit feature class according to the subject-feature-viewpoint association network, and finally obtain the most likely implicit feature word from that class. The specific flow is shown in FIG. 4:
a. Perform word segmentation, part-of-speech tagging and dependency analysis on the sentence whose implicit features are to be identified, taking nouns as subject word candidates and adjectives as viewpoint words. Find the nouns and adjectives connected by the specific relations on the dependency tree; if such a noun exists in the feature word set, extract it as an explicit feature, otherwise take it as a subject word;
b. Determine the subject class s and the viewpoint class o to which the identified subject and viewpoint words belong, and select the feature class f with the strongest average association strength to the subject class s and viewpoint class o according to the inter-class association strengths between the subject-feature and feature-viewpoint word sets stored in the association network;
c. Extract the most likely word from the feature class f as the implicit feature word; here the word of the class that occurs most frequently in the corpus is extracted as the implicit feature word w.
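The extraction flow a-c can be sketched as follows, operating on the network structure from the previous sketch; class membership is decided by cosine similarity to class-center vectors, and all names are illustrative assumptions:

```python
import numpy as np

def predict_implicit_feature(subject_word, opinion_word, network,
                             word_vecs, word_freq):
    """Steps a-c: locate the subject and viewpoint classes, pick the feature
    class with the highest average association strength to both, and return
    that class's most frequent word as the implicit feature."""
    def nearest_class(word, classes):
        def center(cls):
            return np.mean([word_vecs[w] for w in cls if w in word_vecs], axis=0)
        def cosine(u, v):
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        return int(np.argmax([cosine(word_vecs[word], center(c)) for c in classes]))

    s = nearest_class(subject_word, network["subject_classes"])    # step b
    o = nearest_class(opinion_word, network["opinion_classes"])
    scores = [(network["sf_strength"][(s, f)] + network["fo_strength"][(f, o)]) / 2
              for f in range(len(network["feature_classes"]))]
    cls = network["feature_classes"][int(np.argmax(scores))]       # step c
    return max(cls, key=lambda w: word_freq.get(w, 0))             # most frequent word
```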
Referring to FIG. 4, implicit feature extraction is illustrated with the sentence "Zhang X is still very young, but her performance has already won approval":
a. Perform word segmentation, part-of-speech tagging and dependency analysis on the sentence. The noun "Zhang X" is connected on the dependency tree to the adjective "young" by the specified relation "nsubj"; "Zhang X" does not exist in the feature word set, so "Zhang X" is taken as the subject word and the adjective "young" as the viewpoint word;
b. For the subject word "Zhang X" and the viewpoint word "young" identified in a, compute the similarity between their word vectors and the class-center vectors of the subject classes and the viewpoint classes respectively, and select the most similar subject class ("person") and viewpoint class ("size") as the classes they belong to. According to the constructed subject-feature-viewpoint association network, among the feature classes associated with both the subject class "person" and the viewpoint class "size", the feature class with the highest average association strength is taken as the most likely feature class;
c. From the most likely feature class obtained in b, select the most likely feature word as the predicted implicit feature; here the feature word of that class with the highest occurrence frequency in the corpus is selected as the implicit feature word.

Claims (5)

1. A computer-readable storage medium having a program stored thereon, the program when executed implementing a hidden association network-based multi-domain text implicit feature extraction method, comprising the steps of:
step 1: perform word vector training on the corpus to obtain a word vector for each word in the corpus, preprocess the corpus to obtain the subject, feature and viewpoint word sets, and count the co-occurrence frequency matrices of the words between these word sets in the corpus;
step 2: perform bidirectional enhanced clustering on the subject-feature and feature-viewpoint word sets according to the co-occurrence frequency matrices, then re-cluster to obtain the clustering result within each word set;
step 3: compute the pointwise mutual information between classes of two word sets from the co-occurrence frequency matrices as the inter-class association strength, construct bipartite graphs between the subject and feature word sets and between the feature and viewpoint word sets, and form the subject-feature-viewpoint association network;
step 4: for a sentence requiring implicit feature extraction, obtain the subject and viewpoint words in the sentence, determine the classes they belong to within their respective word sets, determine the possible implicit feature class according to the subject-feature-viewpoint association network, and finally obtain the most likely implicit feature word from that class.
2. The computer-readable storage medium of claim 1, wherein in step 1, word vector training is performed on the corpus to obtain a word vector for each word, the corpus is preprocessed to obtain the subject, feature and viewpoint word sets, and the co-occurrence frequency matrices of the words between these word sets are counted, specifically: perform sentence segmentation and word segmentation on the corpus to obtain training data, and perform word vector training on the training data to obtain the word vector of each word in the corpus; preprocess the corpus by sentence segmentation, word segmentation, part-of-speech tagging and dependency analysis; add nouns judged to be possible subject words to the subject word set and otherwise take them as feature word candidates, and take adjectives in sentences as viewpoint word candidates; according to the dependency tree obtained by dependency analysis, select the candidate feature words and candidate viewpoint words connected by specific relations and add them to the feature word set and viewpoint word set; and count the co-occurrence frequency matrices of the words between the subject-feature and feature-viewpoint word sets in the corpus.
3. The computer-readable storage medium of claim 1, wherein in step 2, bidirectional enhanced clustering is performed on the subject-feature and feature-viewpoint word sets according to the co-occurrence frequency matrices and re-clustering is then performed to obtain the clustering result within each word set, specifically: preliminary clustering is first performed within the three word sets according to the word vectors trained in step 1; then, between the subject-feature and feature-viewpoint word sets, the association between each word of one word set and the clusters of the other word set is considered to obtain an inter-association matrix; mutually enhanced iterative clustering is performed using both the association similarity and the content similarity between words, finally converging to the clustering results of the subject-feature and feature-viewpoint word sets; the feature word clusters obtained by mutually enhanced clustering of the feature-viewpoint word sets are re-clustered using the subject word clusters obtained by mutually enhanced clustering of the subject-feature word sets, ensuring that the final feature word clusters incorporate both subject and viewpoint information;
in clustering, the similarity measure between words is defined as:

S(W_i, W_j) = λ · S_content(W_i, W_j) + (1 - λ) · S_rel(W_i, W_j)

wherein S_content(W_i, W_j) denotes the word-vector similarity between words W_i and W_j, called the content similarity; S_rel(W_i, W_j) denotes the similarity of the corresponding association vectors of W_i and W_j in the inter-association matrix, called the association similarity; and λ ∈ [0, 1] denotes the weight given to content similarity; the mutually enhanced clustering procedure between the two word sets F and O is as follows:
a. considering only content similarity, namely cosine similarity between word vectors, cluster the words in set F into k classes;
b. update the association matrix M_1 of set O according to the clustering result of set F: for any word O_i in set O, the association vector of O_i with the clustering result of F is denoted R_i = (r_{i,1}, ..., r_{i,k}); each component of R_i corresponds to one of the k classes after set F is clustered, and r_{i,x} is the weight between word O_i and the x-th class of F, namely the sum of the co-occurrence frequencies of O_i with all words in the x-th class, x ∈ [1, k]; finally the association vectors of the n words in set O form a new n×k inter-association matrix M_1;
c. according to the updated inter-association matrix M_1 between set O and set F in b, cluster the data objects in set O into l classes;
d. update the association matrix M_2 of set F according to the clustering result of set O: for any word F_i in set F, the association vector of F_i with the clustering result of O is denoted R'_i = (r'_{i,1}, ..., r'_{i,l}); each component of R'_i corresponds to one of the l classes after set O is clustered, and r'_{i,y} is the weight between word F_i and the y-th class of O, namely the sum of the co-occurrence frequencies of F_i with all words in the y-th class, y ∈ [1, l]; finally the association vectors of the m words in set F form a new m×l inter-association matrix M_2;
e. according to the updated inter-association matrix M_2 between set F and set O in d, re-cluster the data objects in set F into k classes;
f. iterate steps b-e until the clustering results of the two word sets converge;
the subject word clustering result S_r is obtained by mutually enhanced clustering of the subject-feature word sets and the feature word clustering result F_r by mutually enhanced clustering of the feature-viewpoint word sets; the re-clustering process is as follows:
suppose the subject word clustering result S_r comprises p classes obtained by bidirectional enhanced clustering and the feature word clustering result F_r comprises q classes obtained by bidirectional enhanced clustering; for the feature word clustering result F_r to be re-clustered, any feature word Y_i in F_r has an association vector with the subject word clustering result S_r denoted R_i = (r_{i,1}, ..., r_{i,p}); each component of R_i corresponds to one of the p classes of the subject word clustering result S_r, and r_{i,z} is the weight between feature word Y_i and the z-th class of S_r, z ∈ [1, p]; within each class of the feature word clustering result F_r, the association-vector similarity is computed for every pair of feature words, feature words whose association-vector similarity is below the threshold t are split into new classes, and finally the re-clustered feature word set F_fr is obtained.
4. The computer-readable storage medium of claim 1, wherein in step 3, the pointwise mutual information between classes of two word sets is computed from the co-occurrence frequency matrices as the inter-class association strength, bipartite graphs are constructed between the subject and feature word sets and between the feature and viewpoint word sets, and the subject-feature-viewpoint association network is formed, specifically:
a. cluster the feature word set F into k classes according to content similarity only, namely cosine similarity between word vectors, obtaining the preliminarily clustered feature word set F_1;
b. using the bidirectional enhanced clustering method of step 2, perform bidirectional enhanced clustering between set F_1 and the subject word set S to obtain the clustered subject word set S_1, and between set F_1 and the viewpoint word set O to obtain the clustered viewpoint word set O_1 and feature word set F_2;
c. because some classes of the feature word set F_2 obtained by bidirectional enhanced clustering between set F_1 and the viewpoint word set O contain multi-domain features, the feature word set F_2 must be re-clustered according to the inter-association matrix M between the feature word set F_2 and the subject word set S_1, wherein M is composed of the association vectors between each feature word of F_2 and the subject word set S_1, each component of an association vector representing the weight between the corresponding feature word and one class of S_1; the feature word set F_2 is re-clustered according to the inter-association matrix M as described in step 2, finally obtaining the re-clustered feature word set F_3;
d. according to the subject-feature and feature-viewpoint co-occurrence frequency matrices counted from the corpus, construct the inter-class association strengths between the subject word set S_1 and the feature word set F_3 and between the feature word set F_3 and the viewpoint word set O_1, represented by the PMI, computed as:

PMI(c_1, c_2) = log( P'(c_1, c_2) / (P(c_1) · P(c_2)) )

wherein P(c_1) and P(c_2) are the occurrence frequencies in the corpus of the words in classes c_1 and c_2, and P'(c_1, c_2) is the sum of the sentence-level co-occurrence frequencies in the corpus of all words in class c_1 with all words in class c_2; using the pointwise mutual information PMI as the inter-class association strength, the subject-feature and feature-viewpoint word sets are linked and the subject-feature-viewpoint association network is constructed.
5. The computer-readable storage medium of claim 1, wherein in step 4, for a sentence requiring implicit feature extraction, the subject and viewpoint words in it are obtained, the classes they belong to within their respective word sets are then determined, the possible implicit feature class is determined according to the subject-feature-viewpoint association network, and the most likely implicit feature word is finally obtained from that class, specifically: perform word segmentation, part-of-speech tagging and dependency analysis on the sentence whose implicit features are to be identified, and identify the possible subject words and viewpoint words from it; determine the subject class s and the viewpoint class o to which the identified subject and viewpoint words belong, and select the feature class f with the strongest average association strength to the subject class s and viewpoint class o according to the inter-class association strengths between the subject-feature and feature-viewpoint word sets in the association network; and extract the word with the highest occurrence count in the corpus from the feature class f as the implicit feature word w.
CN202010623820.1A (filed 2019-04-16, priority 2019-04-16): Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium (Active, granted as CN111859898B)

Priority Applications (1)

CN202010623820.1A, priority date 2019-04-16: Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium

Applications Claiming Priority (2)

CN201910304794.3A, filed 2019-04-16: Hidden associated network-based multi-field text implicit feature extraction method (CN110020439B)
CN202010623820.1A, filed 2019-04-16: Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium (CN111859898B)

Related Parent Applications (1)

CN201910304794.3A (parent, divided): Hidden associated network-based multi-field text implicit feature extraction method

Publications (2)

CN111859898A, published 2020-10-30
CN111859898B, granted 2024-01-16

Family

Family ID: 67191503

Family Applications (2)

CN201910304794.3A: Active, granted as CN110020439B (Hidden associated network-based multi-field text implicit feature extraction method)
CN202010623820.1A: Active, granted as CN111859898B (Hidden association network-based multi-domain text implicit feature extraction method and computer storage medium)

Country Status (1)

CN: CN110020439B, CN111859898B


Also Published As

CN111859898A, published 2020-10-30
CN110020439A, published 2019-07-16
CN110020439B, granted 2020-07-07


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant