CN114861027A - Multi-dimensional public opinion recommendation method based on big data and natural language processing

Info

Publication number: CN114861027A
Application number: CN202210483561.6A
Authority: CN
Inventors: 夏超; 贺鹏; 周嘉宜; 张�杰; 黄友汉; 倪安
Original assignee: Shenzhen Dongsheng Data Co ltd
Current assignee: Shenzhen Dongsheng Data Co ltd
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2022-08-05
Anticipated expiration: 2042-04-29
Also published as: CN114861027B

Abstract

The invention discloses a multidimensional public opinion recommendation method based on big data and natural language processing, which comprises data acquisition, data access, data cleaning, public opinion scoring, public opinion recommendation and public opinion display.

Description

Multi-dimensional public opinion recommendation method based on big data and natural language processing

Technical Field

The invention relates to the technical field of data processing, in particular to a multidimensional public opinion recommendation method based on big data and natural language processing.

Background

In the big data era, massive amounts of data exist on the internet, and how to recommend relevant public opinion information from this data is currently an important research topic for enterprises. Common public opinion recommendation techniques include collection and filtering with simple regular-expression rules, text pattern matching, sentiment analysis, and text similarity, but conventional recommendation techniques based on rule matching or pure keyword matching have low accuracy.

Accordingly, the prior art is deficient and needs improvement.

Disclosure of Invention

The invention mainly aims to provide a multidimensional public opinion recommendation method based on big data and natural language processing, aiming at enabling a user to quickly obtain high-quality public opinion information meeting requirements and improving the efficiency of public opinion analysis.

In order to achieve the above object, the invention provides a multidimensional public opinion recommendation method based on big data and natural language processing, which comprises the following steps:

S1: crawling internet public opinion data using web crawler technology and storing the crawled data in a MySQL database;

S2: using Flink CDC, a real-time big data capture technology, to read full and incremental data from MySQL in real time, extracting the topic, content, and release date of each web page from the web page content, and storing them in a Hive database on the big data cluster;

S3: reading the keyword matching rules set by the user from a keyword table, parsing each rule according to the pattern matching method, and matching it against the content of each record in the Hive data; if the content satisfies any one of the rules, it is considered a keyword match, and the matched data is stored in a cleaned result database;

S4: scoring the public opinion, including public opinion classification scoring, public opinion keyword scoring, and public opinion media scoring, and combining the classification score, keyword score, and media score through an algorithm formula to obtain the total public opinion score; the formula is $S = (\lambda_1 S_c + \lambda_2 S_{ky}) \cdot S_m$,

where $S$ is the total public opinion score, $S_c$ is the public opinion classification score, $S_{ky}$ is the public opinion keyword score, $S_m$ is the public opinion media score, $\lambda_1$ is the weight coefficient of the classification score, and $\lambda_2$ is the weight coefficient of the keyword score; the score is divided into step intervals, with $S_1$ and $S_2$ as the public opinion importance thresholds;

S5: screening and sorting the results along dimensions such as total public opinion score and public opinion classification category: the recommended data is filtered by classification category, sorted by total score, and sent to the front end for display.
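Step S5 can be sketched as a filter followed by a sort. The Python below is a minimal illustration under stated assumptions: the `OpinionRecord` type, its field names, and the `top_n` cutoff are hypothetical and are not specified by the method.

```python
from dataclasses import dataclass


@dataclass
class OpinionRecord:
    # Hypothetical record layout; the method only requires a category and a total score.
    title: str
    category: str
    total_score: float


def recommend(records, categories, top_n=10):
    """Keep records in the requested categories, highest total score first."""
    selected = [r for r in records if r.category in categories]
    selected.sort(key=lambda r: r.total_score, reverse=True)
    return selected[:top_n]


if __name__ == "__main__":
    sample = [
        OpinionRecord("a", "finance", 10.0),
        OpinionRecord("b", "sports", 99.0),
        OpinionRecord("c", "finance", 30.0),
    ]
    for record in recommend(sample, {"finance"}):
        print(record.title, record.total_score)
```

Filtering before sorting keeps the sort limited to the records the user actually asked for.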

Preferably, in step S1, the stored data structure includes date, URL of the web page, and web page content.

Preferably, the public opinion classification score specifically comprises:

performing multi-classification operation on the text content through a deep learning technology;

labeling classified data by using data labeling software, and performing data labeling on each piece of data to obtain classified training data;

selecting a classification model, setting different parameters and carrying out model training on classification training data;

deploying the classification model as an inference interface that predicts on public opinion texts and returns the predicted category and its probability;

selecting, from the predicted categories and probabilities of the public opinion texts, the texts and labels whose probability is higher than a set threshold as training data for later optimization of the classification model;

calculating the public opinion classification score using $S_c = S_{ci} \cdot P_c$,

where $S_c$ is the public opinion classification score, $S_{ci}$ is the score assigned to the predicted category, and $P_c$ is the probability that the classification model predicts for that category.

Preferably, the public opinion keyword scoring specifically comprises:

giving a keyword list, and performing keyword matching on public sentiment texts;

and after acquiring all matched keywords of the public opinion text, calculating the scores of the keywords.

Preferably, calculating the keyword score specifically comprises: defining a keyword inverse density as the text length divided by the sum of the keyword scores:

$\mu = \mathrm{len}(\mathrm{text}) \,/\, (\text{sum of keyword scores})$

where $\mu$ is the keyword inverse density and $\mathrm{len}(\mathrm{text})$ is the text length;

setting two thresholds $(\mu_{min}, \mu_{max})$ by analyzing the keyword inverse density, with normal public opinion texts lying between the two;

for the analysis, a scatter plot can be drawn with the keyword inverse density on the ordinate and the rank after sorting on the abscissa; after the outliers are removed, the remaining section of normal texts is taken, and its two boundaries serve as the thresholds;

where $\mu_t$ is the factor indicating whether a text is normal, and $\mu_{min}$, $\mu_{max}$ are the boundary thresholds for normal text;

setting step thresholds to apply a penalty coefficient to the keyword score, the thresholds $\mu_1$, $\mu_2$ being set by analyzing the distribution of the keyword inverse density;

for the analysis, a histogram can be drawn with the keyword inverse density on the abscissa and, on the ordinate, the number of texts in each bucket after the inverse density values are binned;

multiplying the score by the penalty coefficient $\mu_n$ determined by the inverse density, so that texts with a higher keyword density are recommended more strongly;

where $\mu_n$ is the inverse density penalty coefficient, $\mu_{min}$, $\mu_{max}$ are the boundary thresholds for normal text, and $\mu_1$, $\mu_2$ are the step penalty thresholds;

the total keyword score $S_{ky}$, the public opinion keyword score, is calculated from the sum of the keyword scores, the normal-text factor $\mu_t$, and the inverse density penalty coefficient $\mu_n$.

Preferably, the public opinion media scoring specifically comprises: $S_m = S_{mi}$,

where $S_m$ is the public opinion media score and $S_{mi}$ is the media confidence value.

Compared with the prior art, the invention has the following beneficial effects: using big data and natural language processing technology, and building on a keyword-based scoring algorithm, it evaluates and recommends public opinion information comprehensively across multiple dimensions, including text classification. The method enables the user to quickly obtain high-quality public opinion information that meets their requirements, with higher recommendation accuracy than the prior art, thereby improving the efficiency of public opinion analysis.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.

FIG. 1 is a schematic flow diagram of the overall process of the present invention;

FIG. 2 is a schematic diagram of a public opinion classification scoring process according to the present invention;

FIG. 3 is a schematic diagram of a public opinion keyword scoring process according to the present invention;

FIG. 4 is a schematic diagram of a general public opinion scoring process according to the present invention;

the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

The multidimensional public opinion recommendation method based on big data and natural language processing provided by the embodiment comprises the following steps:

S1: crawling internet public opinion data using web crawler technology and storing the crawled data in a MySQL database; the stored data structure includes the date, the URL of the web page, the web page content, and so on.

S2: using Flink CDC, a real-time big data capture technology, to read full and incremental data from MySQL in real time, extracting from the web page content the topic (from the HTML <h> and <title> tags), the content (from the HTML <body>, <text>, <textarea> tags, and the like), and the release date of each web page, and storing them in a Hive database on the big data cluster;

S3: reading the keyword matching rules set by the user from a keyword table, parsing each rule according to the pattern matching method (a rule may require the content to contain several terms at once, or only one of them; for example, a+b-c matches content that contains both a and b but does not contain c), and matching it against the content of each record in the Hive data (the matcher checks whether the content contains the one or more keywords required by the pattern); if the content satisfies any one of the rules, it is considered a keyword match, and the matched data is stored in a cleaned result database;

S4: scoring the public opinion, including public opinion classification scoring, public opinion keyword scoring, and public opinion media scoring, and combining the classification score, keyword score, and media score through an algorithm formula to obtain the total public opinion score; the formula is $S = (\lambda_1 S_c + \lambda_2 S_{ky}) \cdot S_m$.
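The pattern-matching rule described in step S3 can be illustrated with a short sketch. It assumes a rule syntax in which `+` joins required terms and `-` introduces excluded terms, as in the a+b-c example; the function names are hypothetical, and keywords that themselves contain `+` or `-` are not handled.

```python
import re


def parse_rule(rule: str):
    """Split a rule such as "a+b-c" into (required terms, excluded terms)."""
    required, excluded = [], []
    # Each term is preceded by "+" or "-"; a leading term without a sign is required.
    for sign, term in re.findall(r"([+-]?)([^+-]+)", rule):
        term = term.strip()
        if term:
            (excluded if sign == "-" else required).append(term)
    return required, excluded


def rule_matches(rule: str, content: str) -> bool:
    """True if content contains every required term and no excluded term."""
    required, excluded = parse_rule(rule)
    return (
        bool(required)
        and all(term in content for term in required)
        and not any(term in content for term in excluded)
    )


def content_matches(rules, content: str) -> bool:
    """A record is kept if it satisfies any one of the user's rules."""
    return any(rule_matches(rule, content) for rule in rules)
```

A record would be written to the cleaned result database only when `content_matches` returns `True` for it.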

Specifically, the purpose of public opinion classification scoring is to classify text content into categories by topic. Each category is assigned its own score, and the category and its score are stored as two fields in the database. The predicted category and the probability of that prediction serve as the scoring basis.

Text content is classified into multiple categories using deep learning, as shown in FIG. 2. Deep learning multi-class text classification is supervised learning, so a large amount of labeled data must be prepared in advance, with each item carrying one label. For a multi-class text classification task, the data item is the text content and the label is one of the categories of the task.

And labeling the classified data by using data labeling software, and labeling each piece of data to obtain classified training data.

Many classification models can be chosen as the classifier, for example TextCNN, TextRNN, TextRCNN, FastText, BERT, or ALBERT. Each text classifier is trained on the classification training data under different parameter settings, and the model and parameters are selected according to the F1 evaluation metric, model size, and inference time; the TextCNN model is finally selected for multi-class text classification. The deep learning model is built with the PyTorch framework: the text is encoded, torch.nn.Embedding performs word embedding, torch.nn.Conv1d performs the convolution operations of the CNN layer, torch.cat concatenates the convolution outputs, torch.nn.Linear performs the fully connected operation, and torch.nn.Dropout randomly drops tensor elements. The model can be trained by gradient descent on a GPU.

The TextCNN model can be deployed as an inference interface for predicting the public opinion text, and the interface returns a classified category and the probability of the category.

From the predicted categories and probabilities of the public opinion texts, the texts whose predicted probability is higher than a threshold (which can be set to 80%) are selected, together with their labels, as training data for later optimization of the classification model. This data can be manually reviewed to improve accuracy. This positive feedback loop can improve the accuracy of the classification model.

Finally, the public opinion classification score is calculated using $S_c = S_{ci} \cdot P_c$.
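As a small worked illustration of the classification score and the 80% feedback threshold, consider the sketch below. The category names and their scores are invented for the example; only the product $S_{ci} \cdot P_c$ and the 80% threshold come from the description.

```python
# Hypothetical category scores (S_ci); real values would come from the database.
CATEGORY_SCORES = {
    "safety_incident": 90.0,
    "finance": 60.0,
    "entertainment": 20.0,
}

# The description suggests 80% as a possible threshold for reusing predictions.
CONFIDENCE_THRESHOLD = 0.8


def classification_score(category, probability, category_scores=CATEGORY_SCORES):
    """S_c = S_ci * P_c for the predicted category and its probability."""
    return category_scores[category] * probability


def is_retraining_candidate(probability, threshold=CONFIDENCE_THRESHOLD):
    """Predictions strictly above the threshold are kept as future training data."""
    return probability > threshold
```

For instance, a text predicted as "finance" with probability 0.5 scores 60.0 * 0.5 = 30.0 and is not kept for retraining.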

For public opinion keyword scoring, a keyword list is given first, containing the fields "keyword" and "score". All keywords are found in the text, and the matched keywords and their scores serve as the scoring basis. The keyword scoring flow is shown in FIG. 3.

The keyword matching algorithm uses an Aho-Corasick (AC) automaton, which combines a trie (dictionary tree) with the idea of the KMP algorithm. Its key idea is to trade space for time: the common prefixes of the keywords are shared to reduce query time. All keywords are inserted into the trie, and the text is then scanned through the automaton character by character from the start of the target string. Each match is counted, and on a mismatch the automaton follows the failure pointer to the next candidate position and continues, until the whole text has been processed.
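A compact version of such an automaton can be written with the standard library only. This is a sketch for illustration, not the implementation used by the method; the function names `build_automaton` and `find_keywords` are hypothetical, and empty keywords are ignored.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie transitions, failure links, and per-node keyword outputs."""
    goto, fail, out = [{}], [0], [[]]
    for word in keywords:
        if not word:
            continue  # skip empty keywords
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)
    # Breadth-first pass: depth-1 nodes fail to the root; deeper nodes reuse
    # the failure link of their parent, as in KMP.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target (proper suffixes).
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_keywords(text, automaton):
    """Scan text once and count every occurrence of every keyword."""
    goto, fail, out = automaton
    counts = {}
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]  # mismatch: follow the failure pointer
        node = goto[node].get(ch, 0)
        for word in out[node]:
            counts[word] = counts.get(word, 0) + 1
    return counts
```

The text is read once regardless of how many keywords are in the list, which is the space-for-time trade described above.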

After all matched keywords of the public opinion text have been obtained, the keyword score must be calculated. A long text may match more keywords, and if the keyword score is a linear or otherwise monotonically increasing function of the matches, the scoring is biased toward long texts. An algorithm is therefore needed to balance text length against keyword score.

The concept of keyword inverse density is introduced, defined as the text length divided by the sum of the keyword scores:

$\mu = \mathrm{len}(\mathrm{text}) \,/\, (\text{sum of keyword scores})$

where $\mu$ is the keyword inverse density and $\mathrm{len}(\mathrm{text})$ is the text length.

The keyword inverse density reflects whether a text is normal. Two thresholds $(\mu_{min}, \mu_{max})$ are set by analyzing the keyword inverse density, and normal public opinion texts lie between them.

For the analysis, a scatter plot can be drawn with the keyword inverse density on the ordinate and the rank after sorting on the abscissa; after the outliers are removed, the remaining section of normal texts is taken, and its two boundaries serve as the thresholds.

Here $\mu_t$ is the factor indicating whether a text is normal, and $\mu_{min}$, $\mu_{max}$ are the boundary thresholds for normal text.
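One possible way to obtain the two boundaries without reading them off a plot is to trim a fixed share of the sorted values at each end. The sketch below does this; the 5th and 95th percentile cut-offs and the function name `normal_range` are assumptions for illustration, not values given by the method.

```python
def normal_range(inverse_densities, lower_pct=5, upper_pct=95):
    """Return (mu_min, mu_max) after trimming the tails of the sorted values.

    lower_pct and upper_pct are integer percentiles (assumed, not prescribed).
    """
    ordered = sorted(inverse_densities)
    if not ordered:
        raise ValueError("at least one inverse density value is required")
    last = len(ordered) - 1
    # Integer arithmetic avoids floating-point surprises when computing ranks.
    mu_min = ordered[last * lower_pct // 100]
    mu_max = ordered[last * upper_pct // 100]
    return mu_min, mu_max
```

Sorting first mirrors the scatter plot described above, where the abscissa is the rank after sorting.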

Step thresholds are set to apply a penalty coefficient to the keyword score: the thresholds $\mu_1$, $\mu_2$ are set by analyzing the distribution of the keyword inverse density.

For the analysis, a histogram can be drawn with the keyword inverse density on the abscissa and, on the ordinate, the number of texts in each bucket after the inverse density values are binned.
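The bucketing behind such a histogram can be sketched in a few lines. The fixed bucket width and the function name `bucket_counts` are assumptions made for the example, and the values are assumed to be non-negative.

```python
from collections import Counter


def bucket_counts(inverse_densities, bucket_width):
    """Map each bucket's lower edge to the number of texts falling in it."""
    counts = Counter(int(mu // bucket_width) for mu in inverse_densities)
    return {index * bucket_width: counts[index] for index in sorted(counts)}
```

Plotting the returned mapping as bars gives the distribution from which the step thresholds can be chosen.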

The score is multiplied by the penalty coefficient $\mu_n$ determined by the inverse density; the effect is that the denser the keywords in a text, the more strongly it is recommended.

Here $\mu_n$ is the inverse density penalty coefficient, $\mu_{min}$, $\mu_{max}$ are the boundary thresholds for normal text, and $\mu_1$, $\mu_2$ are the step penalty thresholds.

The total keyword score $S_{ky}$, the public opinion keyword score, is calculated from the sum of the keyword scores, the normal-text factor $\mu_t$, and the inverse density penalty coefficient $\mu_n$.

For public opinion texts, if the inverse density of the keywords is in a normal range, the more keywords are matched, the larger the score is.
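The keyword scoring steps can be combined in a short sketch. It assumes that $\mu_t$ is 1 for texts whose inverse density lies in $[\mu_{min}, \mu_{max}]$ and 0 otherwise, that $\mu_n$ is a three-step coefficient, and that $S_{ky}$ is the product of the keyword score sum with both factors. The step values 1.0, 0.8, 0.5 and all function names are hypothetical.

```python
def inverse_density(text, keyword_scores):
    """mu = len(text) / sum of the matched keyword scores."""
    total = sum(keyword_scores)
    if total <= 0:
        raise ValueError("the sum of keyword scores must be positive")
    return len(text) / total


def normal_text_factor(mu, mu_min, mu_max):
    """Assumed form of mu_t: 1 inside the normal range, 0 outside."""
    return 1.0 if mu_min <= mu <= mu_max else 0.0


def density_penalty(mu, mu_1, mu_2, factors=(1.0, 0.8, 0.5)):
    """Assumed form of mu_n: sparser keywords (larger mu) get a smaller factor."""
    if mu <= mu_1:
        return factors[0]
    if mu <= mu_2:
        return factors[1]
    return factors[2]


def keyword_score(text, keyword_scores, mu_min, mu_max, mu_1, mu_2):
    """Assumed combination: S_ky = (sum of keyword scores) * mu_t * mu_n."""
    if not keyword_scores:
        return 0.0
    mu = inverse_density(text, keyword_scores)
    return (
        sum(keyword_scores)
        * normal_text_factor(mu, mu_min, mu_max)
        * density_penalty(mu, mu_1, mu_2)
    )
```

With thresholds (2, 50) and steps (20, 40), a 100-character text with two keywords of score 5 has inverse density 10 and keeps its full score of 10, while the same keywords in a 300-character text give inverse density 30 and a reduced score of 8.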

Public opinion media scoring uses the media information obtained during data acquisition as the scoring basis. Different media have different confidence levels: news from authoritative media sources has higher confidence, and news from non-authoritative sources has lower confidence. Based on this dimension, a media database is designed, with fields such as media source and media confidence value.

$S_m = S_{mi}$

In overall public opinion scoring, the scores of multiple dimensions, such as the classification score, keyword score, and media score, are combined into a final score through an algorithm formula.

As shown in FIG. 4, the overall scoring process first analyzes the value range of each dimension. For example, the expected value of the classification dimension is the average of the scores of all classification categories, and the expected value of the keyword dimension is the expected keyword score of a text of median length. The weight coefficients are then used to balance the expected values of the dimensions.

Overall public opinion scoring algorithm:

$S = (\lambda_1 S_c + \lambda_2 S_{ky}) \cdot S_m$

where $S$ is the total public opinion score, $S_c$ is the public opinion classification score, $S_{ky}$ is the public opinion keyword score, $S_m$ is the public opinion media score, $\lambda_1$ is the weight coefficient of the classification score, and $\lambda_2$ is the weight coefficient of the keyword score.

The score is divided into step intervals, with $S_1$ and $S_2$ as the public opinion importance thresholds.
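The total score and its step intervals can be sketched as follows. The default weights of 0.5, the level names, and the convention that $S_1 < S_2$ with inclusive lower bounds are assumptions for illustration only.

```python
def total_score(s_c, s_ky, s_m, lambda_1=0.5, lambda_2=0.5):
    """S = (lambda_1 * S_c + lambda_2 * S_ky) * S_m."""
    return (lambda_1 * s_c + lambda_2 * s_ky) * s_m


def importance_level(score, s_1, s_2):
    """Map a total score to a step interval, assuming s_1 < s_2."""
    if score >= s_2:
        return "high"
    if score >= s_1:
        return "medium"
    return "low"
```

Because the media score multiplies the weighted sum, a source with low confidence scales down both the classification and keyword contributions together.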

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A multidimensional public opinion recommendation method based on big data and natural language processing is characterized by comprising the following steps:

S2: using Flink CDC, a real-time big data capture technology, to read full and incremental data from MySQL in real time, extracting the topic, content, and release date of each web page from the web page content, and storing them in a Hive database on the big data cluster;

S4: scoring the public opinion, including public opinion classification scoring, public opinion keyword scoring, and public opinion media scoring, and combining the classification score, keyword score, and media score through an algorithm formula to obtain the total public opinion score; the formula is $S = (\lambda_1 S_c + \lambda_2 S_{ky}) \cdot S_m$,

where $S$ is the total public opinion score, $S_c$ is the public opinion classification score, $S_{ky}$ is the public opinion keyword score, $S_m$ is the public opinion media score, $\lambda_1$ is the weight coefficient of the classification score, and $\lambda_2$ is the weight coefficient of the keyword score; the score is divided into step intervals, with $S_1$ and $S_2$ as the public opinion importance thresholds;

2. The method for multi-dimensional public opinion recommendation based on big data and natural language processing as claimed in claim 1, wherein in step S1, the stored data structure includes date, URL of web page, web page content.

3. The method as claimed in claim 1, wherein the public opinion classification scoring specifically includes:

4. The method as claimed in claim 1, wherein the multidimensional public opinion recommendation method based on big data and natural language processing specifically comprises:

5. The method as claimed in claim 4, wherein calculating the keyword score specifically comprises: defining a keyword inverse density as the text length divided by the sum of the keyword scores:

$\mu = \mathrm{len}(\mathrm{text}) \,/\, (\text{sum of keyword scores})$

where $\mu$ is the keyword inverse density and $\mathrm{len}(\mathrm{text})$ is the text length;

setting two thresholds $(\mu_{min}, \mu_{max})$ by analyzing the keyword inverse density, with normal public opinion texts lying between the two;

for the analysis, a scatter plot can be drawn with the keyword inverse density on the ordinate and the rank after sorting on the abscissa; after the outliers are removed, the remaining section of normal texts is taken, and its two boundaries serve as the thresholds;

where $\mu_t$ is the factor indicating whether a text is normal, and $\mu_{min}$, $\mu_{max}$ are the boundary thresholds for normal text;

multiplying the score by the penalty coefficient $\mu_n$ determined by the inverse density, so that texts with a higher keyword density are recommended more strongly;

where $\mu_n$ is the inverse density penalty coefficient, $\mu_{min}$, $\mu_{max}$ are the boundary thresholds for normal text, and $\mu_1$, $\mu_2$ are the step penalty thresholds;

the total keyword score $S_{ky}$, the public opinion keyword score, is calculated from the sum of the keyword scores, the normal-text factor $\mu_t$, and the inverse density penalty coefficient $\mu_n$.

6. The method as claimed in claim 1, wherein the public opinion media scoring specifically includes: $S_m = S_{mi}$