CN112800180A

CN112800180A - Automatic extraction scheme of comment text labels

Info

Publication number: CN112800180A
Application number: CN202110166250.2A
Authority: CN
Inventors: 岑袁京
Original assignee: Beijing Yiche Interconnection Information Technology Co ltd
Current assignee: Beijing Yiche Interconnection Information Technology Co ltd
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2021-05-14

Abstract

The application discloses an automatic extraction scheme for comment text labels, comprising: a word segmentation module, which performs word segmentation on the comment text; a word embedding model, which is trained on massive text to obtain embedding vector representations of words; an emotion polarity model, which judges the emotion classification of the text and marks the text as positive, negative or neutral; and obtaining a comment result. The beneficial effects of the application are as follows: by using machine learning, the text labels of comment texts are extracted automatically, which greatly reduces the workload of manual labeling while maintaining correctness; the internal semantics of words can be mined, and clustering the scattered text labels reduces the number of label categories and improves the accuracy of the data; introducing a text emotion polarity model allows texts to be classified intuitively by emotion; and judging the emotion polarity of both the comment texts and the label texts improves the matching between them.

Description

Automatic extraction scheme of comment text labels

Technical Field

The application relates to an automatic extraction scheme, in particular to an automatic extraction scheme for comment text labels.

Background

Text refers to the written form of language. From a grammatical point of view, it is usually a sentence or a combination of sentences with a complete, systematic meaning; a text can be a sentence, a paragraph or a chapter. In the broad sense, "text" is any discourse fixed by writing. In the narrow sense, "text" is a literary entity composed of language and characters, usually referred to as a "work", which constitutes an independent and self-sufficient system relative to the author and the world.

Technical solutions known in the art include manual labeling of texts by operators, rule-based text matching, and the tf-idf scheme based on word frequency statistics. When massive texts are published continuously, these solutions have the following defects: the workload is large, the cost is high, manual rules must be added continually, the results depend heavily on individual subjective judgment, and the internal semantic relations between words in the text cannot be discovered. Therefore, an automatic extraction scheme for comment text labels is proposed to address the above problems.

Disclosure of Invention

The purpose of the present application is to provide an automatic extraction scheme for comment text labels to solve the above problems.

This purpose is achieved through the following technical scheme. The automatic extraction scheme for comment text labels comprises the following steps:

step one, a word segmentation module: performing word segmentation on the comment text;

step two, a word embedding model: training on massive text to obtain embedding vector representations of words;

step three, an emotion polarity model: judging the emotion classification of the text, and marking the text as positive, negative or neutral;

and step four, obtaining a comment result.

Preferably, in step one, word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification, and word segmentation is divided into English word segmentation and Chinese word segmentation.

Preferably, the Chinese word segmentation technologies include mechanical word segmentation, statistics-based sequence labeling and the hidden Markov model, and the hidden Markov model is preferably used as the main engine of the word segmentation module.

Preferably, the basic idea of the hidden Markov model is to find the true hidden state sequence from the observed value sequence; in addition, a set of specific words is collected manually, and a conditional random field is used to correct the sequence after word segmentation.
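The idea of recovering a hidden state sequence from an observed character sequence can be illustrated with Viterbi decoding. The following is a minimal sketch, not the implementation of the application: it assumes the common BMES tagging convention (B = word begin, M = word middle, E = word end, S = single-character word), and all probabilities are hand-picked toy values rather than values estimated from a corpus. The manually collected word set and the conditional random field correction are not modeled here.

```python
import math

STATES = "BMES"  # assumed tag set: Begin, Middle, End, Single
NEG_INF = float("-inf")
FLOOR = math.log(1e-3)  # log-probability for characters a state has never emitted

def logs(table):
    """Convert a table of probabilities into log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}

# Toy, hand-picked parameters (hypothetical, not trained on any data).
START = logs({"B": 0.5, "S": 0.5})
TRANS = {
    "B": logs({"M": 0.2, "E": 0.8}),
    "M": logs({"M": 0.3, "E": 0.7}),
    "E": logs({"B": 0.5, "S": 0.5}),
    "S": logs({"B": 0.5, "S": 0.5}),
}
EMIT = {
    "B": logs({"省": 0.6}),
    "M": {},
    "E": logs({"油": 0.6}),
    "S": logs({"这": 0.3, "车": 0.3, "很": 0.3}),
}

def viterbi(text):
    """Return the most probable BMES tag sequence for a non-empty string."""
    # Each row maps a state to (best log-probability of a path ending there, previous state).
    rows = [{s: (START.get(s, NEG_INF) + EMIT[s].get(text[0], FLOOR), None) for s in STATES}]
    for ch in text[1:]:
        row = {}
        for s in STATES:
            score, prev = max((rows[-1][p][0] + TRANS[p].get(s, NEG_INF), p) for p in STATES)
            row[s] = (score + EMIT[s].get(ch, FLOOR), prev)
        rows.append(row)
    # A word must be complete at the last character, so the path ends in E or S.
    state = max("ES", key=lambda s: rows[-1][s][0])
    tags = [state]
    for row in reversed(rows[1:]):
        state = row[state][1]  # follow the back-pointer to the previous state
        tags.append(state)
    return tags[::-1]

def segment(text, tags):
    """Cut the text after every character tagged E or S."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

sentence = "这车很省油"  # roughly: "this car is very fuel-efficient"
tags = viterbi(sentence)
words = segment(sentence, tags)
print(tags, words)
```

The observed values are the characters and the hidden states are the tags; once the tags are decoded, the word boundaries follow directly from them.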

Preferably, the word embedding model mechanism is as follows:

(1) firstly, acquiring a large amount of text data;

(2) then building a window that can slide along the text;

(3) using this sliding window to generate a large amount of sample data for training the model.
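The sliding-window mechanism above can be sketched as follows. This is a minimal illustration that assumes skip-gram-style (center word, context word) pairs; the application does not state which Word2Vec training variant is used, and the function name `skipgram_pairs` and the `window` parameter are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Slide a window over the tokens and emit (center, context) training pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["this", "car", "saves", "fuel"]  # output of the word segmentation step
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is one training sample, so a large corpus yields a large number of samples without any manual labeling.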

Preferably, the word embedding model in step two represents each word in the natural language as a short vector with a unified meaning and a unified dimension, and if a rarely used word is encountered, Word2Vec is used to capture and obtain that word.

Preferably, the word embeddings trained by Word2Vec have the following two characteristics:

(1) They reflect semantic similarity: for example, the word embeddings closest to "red" are those of other color words, such as "white" and "black".

(2) They reflect semantic translation relations: for example, the word embedding closest to 'woman' - 'man' + 'king' is that of 'queen'.
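Both characteristics can be illustrated with cosine similarity over word vectors. The sketch below uses hand-made three-dimensional toy vectors (axes loosely standing for royalty, gender and color), which are assumptions for illustration only; real Word2Vec vectors are learned from text and have many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy embeddings, not trained vectors.
VECTORS = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "red":   [0.1, 0.0, 1.0],
    "white": [0.0, 0.1, 0.9],
    "black": [0.0, -0.1, 0.9],
}

def nearest(target, exclude=(), k=1):
    """Return the k words whose vectors are most similar to the target vector."""
    candidates = [w for w in VECTORS if w not in exclude]
    return sorted(candidates, key=lambda w: cosine(VECTORS[w], target), reverse=True)[:k]

# (1) Semantic similarity: the neighbours of "red" are other color words.
colors = nearest(VECTORS["red"], exclude={"red"}, k=2)

# (2) Semantic translation: 'woman' - 'man' + 'king' lands on 'queen'.
target = [w - m + k for w, m, k in zip(VECTORS["woman"], VECTORS["man"], VECTORS["king"])]
answer = nearest(target, exclude={"king", "man", "woman"})
print(colors, answer)
```

Excluding the query words from the candidates follows the usual convention for analogy queries, since the input words themselves are often close to the target vector.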

Preferably, the emotion polarity model in step three can be divided, according to the type of text processed, into emotion analysis based on news comments and emotion analysis based on product comments. Emotion analysis based on news comments is used for public opinion monitoring and information prediction, while emotion analysis based on product comments helps a user understand the public reputation of a product.

Preferably, the emotion polarity analysis methods are divided into emotion-dictionary-based methods and machine-learning-based methods; the machine-learning-based method is used, with a bidirectional long short-term memory (BiLSTM) neural network as the main engine for emotion classification.

Preferably, the bidirectional long short-term memory neural network comprises a forward LSTM and a backward LSTM. The two parts model context information in natural language processing tasks, and their output vectors are spliced and then used for emotion classification.
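The forward/backward structure and the vector splicing can be sketched in plain Python as follows. This is an illustrative forward pass only, under several assumptions: the weights are random and untrained (so the predicted probabilities carry no meaning), the layer sizes are arbitrary, and the helper names (`make_lstm`, `lstm_last_hidden`, `bilstm_classify`) are hypothetical. A practical system would use a deep learning framework and train the weights on labeled comments.

```python
import math
import random

LABELS = ["positive", "negative", "neutral"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(input_size, hidden_size, rng):
    """Random weights for the input (i), forget (f), candidate (g) and output (o) gates."""
    cols = input_size + hidden_size + 1  # +1 for the bias term
    return {gate: [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(hidden_size)]
            for gate in "ifgo"}

def lstm_last_hidden(params, sequence, hidden_size):
    """Run one LSTM over the sequence and return its final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        z = list(x) + h + [1.0]  # current input, previous hidden state, bias
        pre = {gate: [sum(w * v for w, v in zip(row, z)) for row in params[gate]]
               for gate in "ifgo"}
        # Cell state: keep part of the old state, add part of the new candidate.
        c = [sigmoid(f) * c_prev + sigmoid(i) * math.tanh(g)
             for f, c_prev, i, g in zip(pre["f"], c, pre["i"], pre["g"])]
        h = [sigmoid(o) * math.tanh(c_t) for o, c_t in zip(pre["o"], c)]
    return h

def bilstm_classify(fwd, bwd, out_w, sequence, hidden_size):
    """Splice the forward and backward final states and apply a softmax classifier."""
    h_fwd = lstm_last_hidden(fwd, sequence, hidden_size)        # reads left to right
    h_bwd = lstm_last_hidden(bwd, sequence[::-1], hidden_size)  # reads right to left
    features = h_fwd + h_bwd                                    # vector splicing
    logits = [sum(w * v for w, v in zip(row, features)) for row in out_w]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
EMBED, HIDDEN = 3, 4
fwd = make_lstm(EMBED, HIDDEN, rng)
bwd = make_lstm(EMBED, HIDDEN, rng)
out_w = [[rng.uniform(-0.5, 0.5) for _ in range(2 * HIDDEN)] for _ in LABELS]

# Two toy word vectors standing in for an embedded, segmented comment.
sequence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
probs = bilstm_classify(fwd, bwd, out_w, sequence, HIDDEN)
print(dict(zip(LABELS, probs)))
```

Because the backward LSTM reads the sequence in reverse, the spliced vector summarizes the context on both sides of each word, which is what the classifier then maps to the three polarity marks.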

The beneficial effects of this application are as follows: by using machine learning, the text labels of comment texts are extracted automatically, which greatly reduces the workload of manual labeling while maintaining correctness; the internal semantics of words can be mined, and clustering the scattered text labels reduces the number of label categories and improves the accuracy of the data; introducing a text emotion polarity model allows texts to be classified intuitively by emotion; and judging the emotion polarity of both the comment texts and the label texts improves the matching between them.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic diagram of a design architecture of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below, and it should be apparent that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Example one:

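As shown in Fig. 1, an automatic extraction scheme for comment text labels includes the following steps:

step one, a word segmentation module: performing word segmentation on the comment text;

step two, a word embedding model: training on massive text to obtain embedding vector representations of words;

step three, an emotion polarity model: judging the emotion classification of the text, and marking the text as positive, negative or neutral;

and step four, obtaining a comment result.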

Furthermore, in step one, word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain standard, and word segmentation is divided into English word segmentation and Chinese word segmentation.

Furthermore, the Chinese word segmentation technologies include mechanical word segmentation, statistics-based sequence labeling and the hidden Markov model, and the hidden Markov model is preferably used as the main engine of the word segmentation module.

Further, the basic idea of the hidden Markov model is to find the true hidden state sequence from the observed value sequence; in addition, a set of specific words is collected manually, and a conditional random field is used to correct the sequence after word segmentation.

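Further, the word embedding model mechanism is as follows:

(1) firstly, acquiring a large amount of text data;

(2) then building a window that can slide along the text;

(3) using this sliding window to generate a large amount of sample data for training the model.

Furthermore, the word embedding model in step two represents each word in the natural language as a short vector with a unified meaning and a unified dimension, and if a rarely used word is encountered, Word2Vec is used to capture and obtain that word.

Further, the word embeddings trained by Word2Vec have the following two characteristics:

(1) They reflect semantic similarity: for example, the word embeddings closest to "red" are those of other color words, such as "white" and "black".

(2) They reflect semantic translation relations: for example, the word embedding closest to 'woman' - 'man' + 'king' is that of 'queen'.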

Furthermore, the emotion polarity model in step three can be divided, according to the type of text processed, into emotion analysis based on news comments and emotion analysis based on product comments. Emotion analysis based on news comments is used for public opinion monitoring and information prediction, while emotion analysis based on product comments helps a user understand the public reputation of a product.

Furthermore, the emotion polarity analysis methods are divided into emotion-dictionary-based methods and machine-learning-based methods; the machine-learning-based method is adopted, with a bidirectional long short-term memory (BiLSTM) neural network as the main engine for emotion classification.

Furthermore, the bidirectional long short-term memory neural network comprises a forward LSTM and a backward LSTM. The two parts model context information in natural language processing tasks, and their output vectors are spliced and then used for emotion classification.

The automatic extraction scheme for comment text labels has the following advantages: the hidden Markov model is used as the main engine of the word segmentation module, its basic idea being to find the true hidden state sequence from the observed value sequence; a set of specific words is collected manually to improve the accuracy of word segmentation; and a conditional random field is used to correct the sequence after word segmentation, which improves the handling of ambiguous segmentations.

Example two:

The scheme comprises steps one to four as set out above, ending with step four, obtaining a comment result.

Further, the word embedding model mechanism is as follows:

(1) firstly, acquiring a large amount of text data;

(2) then building a window that can slide along the text;

(3) using this sliding window to generate a large amount of sample data for training the model.

The automatic extraction scheme for comment text labels has the following advantages: by using machine learning, the text labels of comment texts are extracted automatically, which greatly reduces the workload of manual labeling while maintaining correctness.

Example three:

The scheme comprises steps one to four as set out above, ending with step four, obtaining a comment result.

The automatic extraction scheme for comment text labels has the following advantages: the internal semantics of words can be mined, and clustering the scattered text labels reduces the number of label categories and improves the accuracy of the data.
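One way to picture how clustering reduces the number of label categories is a greedy pass over the label embeddings. The application does not name a clustering algorithm, so the single-pass threshold method below is an assumption chosen for brevity; the two-dimensional vectors, the label strings and the `threshold` value are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_labels(label_vectors, threshold=0.9):
    """Greedy clustering: a label joins the first cluster whose representative is similar enough."""
    clusters = []  # each entry is (representative vector, list of member labels)
    for label, vec in label_vectors.items():
        for rep, members in clusters:
            if cosine(rep, vec) >= threshold:
                members.append(label)
                break
        else:
            # No existing cluster is close enough, so this label starts a new one.
            clusters.append((vec, [label]))
    return [members for _, members in clusters]

# Hypothetical label embeddings; real ones would come from the word embedding model.
label_vectors = {
    "fuel-efficient": [0.9, 0.1],
    "low fuel consumption": [0.95, 0.05],
    "spacious": [0.1, 0.9],
    "roomy interior": [0.05, 0.95],
}
groups = cluster_labels(label_vectors)
print(groups)
```

Four scattered labels collapse into two categories here because labels with nearly parallel embeddings are treated as the same category.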

Example four:

The scheme comprises steps one to four as set out above, ending with step four, obtaining a comment result.

The automatic extraction scheme for comment text labels has the following advantages: introducing a text emotion polarity model allows texts to be classified intuitively by emotion, and judging the emotion polarity of both the comment texts and the label texts improves the matching between them.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is presented this way for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims

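1. An automatic extraction scheme for comment text labels, characterized by comprising the following steps:

step one, a word segmentation module: performing word segmentation on the comment text;

step two, a word embedding model: training on massive text to obtain embedding vector representations of words;

step three, an emotion polarity model: judging the emotion classification of the text, and marking the text as positive, negative or neutral;

and step four, obtaining a comment result.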

2. The automatic extraction scheme for comment text labels as claimed in claim 1, wherein: in step one, word segmentation recombines a continuous character sequence into a word sequence according to a certain standard, and word segmentation is divided into English word segmentation and Chinese word segmentation.

3. The automatic extraction scheme for comment text labels as claimed in claim 2, wherein: the Chinese word segmentation technologies include mechanical word segmentation, statistics-based sequence labeling and the hidden Markov model, and the hidden Markov model is preferably used as the main engine of the word segmentation module.

4. The automatic extraction scheme for comment text labels as claimed in claim 3, wherein: the basic idea of the hidden Markov model is to find the true hidden state sequence from the observed value sequence; in addition, a set of specific words is collected manually, and a conditional random field is used to correct the sequence after word segmentation.

5. The automatic extraction scheme for comment text labels as claimed in claim 1, wherein: the word embedding model mechanism is as follows:

(1) firstly, acquiring a large amount of text data;

(2) then building a window that can slide along the text;

(3) using this sliding window to generate a large amount of sample data for training the model.

6. The automatic extraction scheme for comment text labels as claimed in claim 1, wherein: in step two, the word embedding model represents each word in the natural language as a short vector with a unified meaning and a unified dimension, and if a rarely used word is encountered, Word2Vec is used to capture and obtain that word.

7. The automatic extraction scheme for comment text labels as claimed in claim 6, wherein: the word embeddings trained by Word2Vec have the following two characteristics:

(1) They reflect semantic similarity: for example, the word embeddings closest to "red" are those of other color words, such as "white" and "black".

(2) They reflect semantic translation relations: for example, the word embedding closest to 'woman' - 'man' + 'king' is that of 'queen'.

8. The automatic extraction scheme for comment text labels as claimed in claim 1, wherein: the emotion polarity model in step three can be divided, according to the type of text processed, into emotion analysis based on news comments and emotion analysis based on product comments; emotion analysis based on news comments is used for public opinion monitoring and information prediction, and emotion analysis based on product comments helps a user understand the public reputation of a product.

9. The automatic extraction scheme for comment text labels as claimed in claim 1, wherein: the emotion polarity analysis methods of the emotion polarity model comprise emotion-dictionary-based methods and machine-learning-based methods, wherein the machine-learning-based method is adopted, with a bidirectional long short-term memory (BiLSTM) neural network as the main engine for emotion classification.

10. The automatic extraction scheme for comment text labels as claimed in claim 9, wherein: the bidirectional long short-term memory neural network comprises a forward LSTM and a backward LSTM, wherein the two parts model context information in natural language processing tasks, and their output vectors are spliced and then used for emotion classification.