CN117875309B - Public opinion analysis method, device and medium based on big data and deep learning - Google Patents


Info

Publication number
CN117875309B
CN117875309B (Application CN202311496623.8A)
Authority
CN
China
Prior art keywords
word
text data
sequence
probability
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311496623.8A
Other languages
Chinese (zh)
Other versions
CN117875309A (en)
Inventor
陈斌
陈茹铭
杨婷婷
朵瑞雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou University
Original Assignee
Lanzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou University filed Critical Lanzhou University
Priority to CN202311496623.8A priority Critical patent/CN117875309B/en
Publication of CN117875309A publication Critical patent/CN117875309A/en
Application granted granted Critical
Publication of CN117875309B publication Critical patent/CN117875309B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a public opinion analysis method, device and medium based on big data and deep learning. The method comprises the following steps: acquiring text data related to public opinion; extracting word features from the text data; performing part-of-speech tagging on the word features, wherein word features with the same part-of-speech tag are treated as one class of word features; and counting the occurrence frequency of each class of word features and outputting the positive or negative degree of public opinion. The invention not only provides a more comprehensive and accurate data source, but also improves the sentiment analysis and word segmentation methods and the efficiency and accuracy of natural language processing, so that the analysis of public sentiment and demand gains depth and credibility, giving the invention broad application prospects in the field of public opinion analysis.

Description

Public opinion analysis method, device and medium based on big data and deep learning
Technical Field
The invention belongs to the field of computer science, in particular to the intersection of big data analysis, deep learning, natural language processing, social media data analysis and public opinion analysis, and more particularly relates to a public opinion analysis method, device and medium based on big data and deep learning.
Background
Public opinion refers to the social attitudes the public forms toward the generation, propagation and evolution of social events within a certain social space. With the rapid development and popularization of Internet multimedia, the public can participate widely in the discussion of network communities, producing communication-behavior data from hundreds of millions of users. Communication in this form not only promotes information transmission and knowledge sharing, but also strengthens the public's capacity to participate in social affairs. Through such social network discussion, the public freely expresses views, debates important issues and proposes suggestions; the resulting exchange of ideas enhances public participation and provides a broader voice and richer thinking for social development.
Some prior art in the field of public opinion analysis can provide guidance. These techniques mainly perform sentiment analysis and opinion mining on social media data to learn the public's emotions, needs and opinions. Their main advantage is the use of large-scale data sources to understand public attitudes, but they also suffer from several drawbacks.
Some traditional prior art relies on questionnaires to collect data, a method that suffers from low response rates, data redundancy and similar problems. In the Internet era, collecting data through social media and other channels is more convenient, but the unstructured, large-scale nature of the data must then be handled. Furthermore, text in social media is often unstructured and colloquial, which limits traditional text analysis methods such as LDA topic models when processing text with these characteristics; more advanced text analysis methods are therefore needed to understand such unstructured data. Some prior art is limited by one-sided data sources or insufficient sample sizes, which can make analysis results incomplete or strongly biased. Sentiment analysis is a key aspect of public opinion analysis, but emotion is subjective, so quantifying it is challenging; traditional sentiment analysis models may perform poorly on complex emotions, leaving the prior art deficient in interpreting analysis results. The interpretive ability of the models therefore needs to be improved.
These defects stem mainly from objective causes such as inadequate data acquisition, unstructured text and the difficulty of sentiment analysis. New solutions therefore aim to overcome these problems, so as to better exploit big data and deep learning techniques to address the challenges of public opinion analysis and provide more comprehensive, accurate and interpretable analysis results.
Disclosure of Invention
The present invention has been made to solve the above problems in the prior art. A public opinion analysis method, device and medium based on big data and deep learning are therefore needed, which overcome the prior-art deficiencies of inadequate data acquisition, unstructured and colloquial text, and insufficient samples through big data and deep learning technology.
According to a first aspect of the present invention, there is provided a public opinion analysis method based on big data and deep learning, the method comprising:
Acquiring text data related to public opinion;
extracting word characteristics based on the text data;
Performing part-of-speech tagging on the word features, wherein the word features with the same part-of-speech tagging are used as one type of word features;
And counting the occurrence frequency of various word characteristics, and outputting the positive and negative degrees of public opinion.
Further, the obtaining text data related to public opinion specifically includes:
According to set keywords, crawling text data from the Internet by manual downloading and/or scripted crawlers,
And manually cleaning the text data to obtain the text data related to public opinion.
Further, the extracting the word features based on the text data specifically includes:
Constructing a character-based generation model, extracting word features from the text data by using the character-based generation model, wherein the character-based generation model is expressed as:
$$W^{seq} = \arg\max_{W^{seq'}} P\big(W^{seq'} \mid C^{seq}\big)$$
$$IV_{Recall} = \frac{N_{IV\ in\ specific\ segmentation}}{N_{IV\ in\ all\ segmentations}}$$
$$P\big([c,t]_1^n\big) = \prod_{i=1}^{n} P\big([c,t]_i \mid [c,t]_{i-k+1}^{i-1}\big)$$
wherein $W^{seq}$ denotes the output sequence of the character-based generative model, i.e., the word sequence finally extracted from the text data; $\arg\max$ selects the sequence with maximum likelihood probability among a set of candidate sequences; $P$ denotes the probability distribution over sequences given the character sequence; $W^{seq'}$ denotes a candidate word sequence; $C^{seq}$ denotes the given character sequence; $IV_{Recall}$ is an indicator used to evaluate the performance of word-feature extraction, defined as the ratio of the number of in-vocabulary (IV) words correctly extracted in a particular segmentation to the number of IV words in all possible segmentations, with $N_{IV\ in\ all\ segmentations}$ and $N_{IV\ in\ specific\ segmentation}$ the corresponding counts; $[c,t]_1^n$ denotes the entire character-tag sequence in the text, covering all words; $P([c,t]_1^n)$ is the probability of the whole sequence; and $[c,t]_{i-k+1}^{i-1}$ denotes the context window comprising the preceding $k-1$ units, used to compute the conditional probability.
Further, the performing part-of-speech tagging on the word features specifically includes:
Based on the word features, generating a current state from past states, generating the current word from the current state, and finally generating a future state from the past states and the current state, until a complete state sequence and word sequence are generated.
Further, a hidden Markov model is used to perform part-of-speech tagging on the word features, wherein the hidden Markov model comprises a Markov chain and a set of output distributions; the Markov chain is used to characterize a short-time stationary sequence of neuron evolution; the set of output distributions conceals the state sequence from the observer; and the hidden Markov model performs part-of-speech tagging by the following formula:
$$p\big(x_1,\ldots,x_m,\ y_1,\ldots,y_{m+1}\big) = \prod_{i=1}^{m+1} q\big(y_i \mid y_{i-2}, y_{i-1}\big)\ \prod_{i=1}^{m} e\big(x_i \mid y_i\big)$$
wherein $p(x, y, y_{m+1})$ is the joint probability of a given observation sequence $x$ and hidden-state sequence $y$ together with $y_{m+1}$; $q(y_i \mid y_{i-2}, y_{i-1})$ is the transition probability that the state following $y_{i-2}, y_{i-1}$ is $y_i$; $e(x_i \mid y_i)$ is the emission probability, with any $x_i \in V$ and $y_i \in S$; $x_i$ is the $i$-th observation; $y_i$ is the $i$-th hidden state; $y_{m+1}$ is the final state of the hidden-state sequence; $x$ is the given observation sequence; and $y$ is the hidden-state sequence.
Further, the statistics of the occurrence frequency of various word features specifically includes:
Text classification tasks are performed using a naive bayes classifier,
The text classification task includes:
calculating hit probability of each specific category;
Calculating conditional probabilities of all partitions for each attribute;
calculating the conditional probability under each category, taking the maximum term as the category of the corresponding sentence, and judging the category of the text data according to the calculated P value, wherein the P value is calculated according to the following formula:
$$P(\text{category} \mid \text{features}) = \frac{P(\text{category})\ \prod_{i=1}^{n} P(\text{feature}_i \mid \text{category})}{\prod_{i=1}^{n} P(\text{feature}_i)}$$
$$P_\lambda\big(X^{(j)} = a_{jl} \mid Y = c_k\big) = \frac{\sum_{i=1}^{N} I\big(x_i^{(j)} = a_{jl},\ y_i = c_k\big) + \lambda}{\sum_{i=1}^{N} I\big(y_i = c_k\big) + S_j \lambda}$$
wherein $P(\text{category} \mid \text{features})$ is the conditional probability of the category to which the text data belongs given its features; $P(\text{feature}_1 \mid \text{category}), P(\text{feature}_2 \mid \text{category}), \ldots, P(\text{feature}_n \mid \text{category})$ are the conditional probabilities of each feature given the category, used to compute the class of the text data; $P(\text{feature}_1), P(\text{feature}_2), \ldots, P(\text{feature}_n)$ are the marginal probabilities of each feature, i.e., the probability of each feature occurring regardless of category; $P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k)$ is the conditional probability computed with Laplace ($\lambda$) smoothing given the condition $X^{(j)} = a_{jl}$ and class $Y = c_k$; $X^{(j)}$ is the $j$-th feature; $a_{jl}$ is a value of feature $X^{(j)}$; $Y$ is the class and $c_k$ one of the classes; $x_i^{(j)}$ is the $j$-th feature value of the $i$-th sample and $y_i$ the class of the $i$-th sample; $I(x_i^{(j)} = a_{jl}, y_i = c_k)$ is an indicator function equal to 1 when both conditions hold and 0 otherwise, used to determine whether a sample satisfies a particular feature and class; $\lambda$ is a smoothing parameter preventing zero-probability problems in probability estimation; $I(y_i = c_k)$ is another indicator function, equal to 1 when $y_i = c_k$ and 0 otherwise; and $S_j$ is the number of possible values of the $j$-th feature.
Further, the method further comprises: constructing an index Z by the following formula to quantify the emotion influence capacity of each piece of text data:
wherein $x_{ri}$, $x_{li}$ and $x_{ci}$ denote the repost, like and comment counts of the $i$-th piece of text data, $norm$ denotes a normalization operation on the data, and $Senti_i$ denotes the sentiment score obtained by natural language processing of the $i$-th piece of text data.
Further, after constructing the index Z to quantify the emotion influence capacity of each piece of text data, the method further includes:
A second-order Almon distributed lag model with a lag of 3 periods is adopted to build a regression equation between the emotion influence capacity of the current text data and that of earlier text data:
$$Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \beta_3 X_{t-3} + u_t, \qquad \beta_i = \alpha_0 + \alpha_1 i + \alpha_2 i^2$$
$$Y = -0.194 + 0.991\,Z_{0t} - 1.076\,Z_{1t} + 0.25\,Z_{2t}$$
wherein $Y_t$ denotes the emotion influence of the current text data; $\alpha$ is the intercept term of the regression equation; $\beta_0, \beta_1, \beta_2, \beta_3$ are the coefficients of $X_t, X_{t-1}, X_{t-2}, X_{t-3}$, i.e., the weights of the different time lags; $\alpha_0, \alpha_1, \alpha_2$ are auxiliary parameters, intermediate quantities used in computing the $\beta$ coefficients; $X_t, X_{t-1}, X_{t-2}, X_{t-3}$ denote the emotion influence capacity of the current text data and of text data lagged by three different periods; $u_t$ is the error term of the model, representing random variation the model fails to explain; $Y$ denotes the emotion influence of the current text data obtained from the big data of this study; and $Z_{0t}, Z_{1t}, Z_{2t}$ denote the time-lag variables obtained from the big data of this study, equivalent in meaning to $X_t, X_{t-1}, X_{t-2}$.
According to a second aspect of the present invention, there is provided a public opinion analysis device based on big data and deep learning, the device comprising:
A data acquisition unit configured to acquire text data related to public opinion;
A feature extraction unit configured to extract word features based on the text data;
the part-of-speech tagging unit is configured to perform part-of-speech tagging on the word features, and the word features with the same part-of-speech tagging are used as one type of word features;
and the text classification unit is configured to count the occurrence frequency of various word characteristics and output the positive and negative degrees of public opinion.
According to a third aspect of the present invention, there is provided a readable storage medium storing one or more programs executable by one or more processors to implement the method as described above.
The invention has at least the following beneficial effects:
1) The invention adopts deep learning technology, in particular sentiment analysis of text data based on a character-based generative model, a hidden Markov model and a naive Bayes classifier. The invention can therefore better understand unstructured and colloquial text and quantify emotional tendencies more accurately, which helps analyze public emotions and emotional changes more comprehensively.
2) The character-based generative model provided by the invention excels at word segmentation: it can efficiently find or retrieve vocabulary in the dictionary and has good disambiguation capability, which increases segmentation accuracy and thereby improves subsequent natural language processing. Because this segmentation approach is not limited by unregistered words, it attains a high IV recall value, indicating that the segmenter can efficiently identify or retrieve words in the dictionary, which benefits the accuracy of subsequent analysis tasks.
3) According to the invention, irrelevant information such as spam posts, spam comments and unrelated video or picture links is removed from the acquired data through manual cleaning. This improves data quality and ensures a more accurate analysis process.
4) The invention not only provides a more comprehensive and accurate data source, but also improves the sentiment analysis and word segmentation methods and the efficiency and accuracy of natural language processing, so that the analysis of public emotion and demand gains depth and credibility, giving the invention broad application prospects in the field of public opinion analysis.
Drawings
FIG. 1 shows a flowchart of a big data and deep learning based public opinion analysis method according to an embodiment of the present invention;
FIG. 2 illustrates a schematic of emotion score for a breakpoint regression design in accordance with an embodiment of the present invention;
FIG. 3 shows a time series of public opinion emotion scores and emotional tendencies according to an embodiment of the present invention;
FIG. 4 shows a time series of the public opinion emotion influence capability, its 95% quantile upper and lower bounds, variance according to an embodiment of the present invention;
FIG. 5 illustrates a three-phase public opinion focus and its quantification score according to an embodiment of the present invention;
FIG. 6 illustrates a related topic emotion scoring heat map according to an embodiment of the present invention;
Fig. 7 shows a block diagram of a public opinion analysis device based on big data and deep learning according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings and specific embodiments so that those skilled in the art can better understand its technical scheme. Embodiments are described in further detail with reference to the drawings and specific examples, by way of illustration and not limitation. Where steps have no necessary dependence on one another, the order in which they are described herein should not be construed as limiting; those skilled in the art will understand that such steps may be reordered without disrupting the overall logic of the process.
Fig. 1 shows a flowchart of a public opinion analysis method based on big data and deep learning according to an embodiment of the present invention. As shown in FIG. 1, the embodiment of the invention provides a public opinion analysis method based on big data and deep learning, which comprises the following steps S1-S4.
The method starts with step S1, obtaining text data related to public opinion.
By way of example, a combination of manual downloading and scripted crawlers is used to crawl text data from Sina Weibo and Toutiao, collecting public opinion data from January 2020 to August 2020. The fields collected from Weibo include UID, user nickname, posting terminal, posting topic, posting address, post text, posting time, and like, repost and comment counts; 384 pieces of related data were collected. The fields collected from Toutiao include user id, news headline content, news release date, news like, repost and comment counts, commenting user id and news comment content; over 93 pieces of related data were collected. The collected data was then manually cleaned, selectively removing redundant information such as spam posts and comments, irrelevant Weibo videos and picture URLs.
Sentiment analysis is performed on the Weibo posts and comment content using natural language processing, quantifying the text content in three steps (corresponding to steps S2-S4 below): first, the text data is segmented with a character-based generative model (Character-Based Generative Model), and stop words that are meaningless but frequent, such as modal particles, conjunctions and adverbs, are deleted; second, part-of-speech tagging is performed based on a hidden Markov model (Hidden Markov Model, HMM); third, the occurrence frequency of each feature is counted with a naive Bayes classifier (Naive Bayes Model). This completes the task of judging the tendency of public opinion sentiment and quantifying and outputting its positive or negative degree.
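As an illustration of the stop-word removal step above, a minimal sketch is given below; the stop-word set and sample tokens are hypothetical examples, not the patent's actual lexicon:

```python
# Minimal sketch of the stop-word filtering applied between word
# segmentation and part-of-speech tagging. The stop-word set and the
# sample tokens below are illustrative placeholders only.
STOP_WORDS = {"的", "了", "吗", "和", "非常"}  # particles, conjunctions, adverbs

def remove_stop_words(tokens):
    """Drop tokens that are meaningless but occur frequently."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["疫情", "的", "消息", "非常", "重要"]))  # → ['疫情', '消息', '重要']
```

In practice the stop-word list would be a curated Chinese lexicon rather than a hand-written set.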
And step S2, extracting word characteristics based on the text data.
Traditional vocabulary-based word segmentation is limited by out-of-vocabulary (OOV) words. As the first step of a natural language processing application, segmentation errors propagate into all subsequent work. IV recall is a comprehensive indicator of a segmenter's disambiguation ability: a high IV recall value means the segmenter can efficiently find or retrieve words in the dictionary and has good disambiguation capability. Unlike English, Chinese characters and words are not separated by spaces, so choosing a suitable Chinese word segmentation method is the primary task when processing Chinese. To solve the recognition problem of unregistered vocabulary in traditional segmentation, the technical scheme provides a character-based generative model that treats the character as the basic unit, exploits the generative model's strength at handling dependencies between adjacent words, combines a language model with statistical machine-learning sequence labeling, and has been shown to achieve a high IV recall value. The character-based generative model is expressed as:
$$W^{seq} = \arg\max_{W^{seq'}} P\big(W^{seq'} \mid C^{seq}\big)$$
$$IV_{Recall} = \frac{N_{IV\ in\ specific\ segmentation}}{N_{IV\ in\ all\ segmentations}}$$
$$P\big([c,t]_1^n\big) = \prod_{i=1}^{n} P\big([c,t]_i \mid [c,t]_{i-k+1}^{i-1}\big)$$
where $W^{seq}$ on the left of the equation denotes the output sequence of the character-based generative model, typically a series of words extracted from the text data, i.e., the word sequence finally selected by the generative model; $\arg\max$ selects the sequence with maximum likelihood probability among a set of candidate sequences; $P$ denotes the probability distribution over sequences given the character sequence; $W^{seq'}$ on the right denotes a candidate word sequence; $C^{seq}$ denotes the given character sequence; $IV_{Recall}$ is the ratio of the number of in-vocabulary (IV) words correctly extracted in a particular segmentation to the number of IV words in all possible segmentations, commonly used to evaluate word-feature extraction, particularly in information retrieval and natural language processing, with $N_{IV\ in\ all\ segmentations}$ and $N_{IV\ in\ specific\ segmentation}$ the corresponding counts; $[c,t]_1^n$ denotes the entire character-tag sequence, covering all units in the text; $P([c,t]_1^n)$ is the probability of the whole text sequence; the product terms compute the conditional probability of each unit given its context; and $[c,t]_{i-k+1}^{i-1}$ denotes the context window comprising the preceding $k-1$ units, used to compute the conditional probability.
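The segmentation-by-maximum-probability idea and the IV recall indicator described above can be sketched as follows. For brevity this toy version scores candidate segmentations with a unigram word model rather than the (k-1)-unit context of the patent's generative model, and the vocabulary and probabilities are invented for illustration:

```python
import math

def segmentations(chars):
    """Enumerate every way of splitting a character string into words."""
    if not chars:
        yield []
        return
    for i in range(1, len(chars) + 1):
        for rest in segmentations(chars[i:]):
            yield [chars[:i]] + rest

def seg_log_prob(seg, unigram):
    """Score a candidate segmentation; unseen words get a small floor
    probability. (The patent's model conditions each unit on the
    preceding k-1 units instead of using unigrams.)"""
    return sum(math.log(unigram.get(w, 1e-6)) for w in seg)

def best_segmentation(chars, unigram):
    """W_seq = argmax over candidate segmentations of P(W_seq' | C_seq)."""
    return max(segmentations(chars), key=lambda s: seg_log_prob(s, unigram))

def iv_recall(specific_seg, all_segs, vocab):
    """IV recall: in-vocabulary words in the chosen segmentation over
    in-vocabulary words across all candidate segmentations."""
    n_specific = sum(w in vocab for w in specific_seg)
    n_all = sum(w in vocab for seg in all_segs for w in seg)
    return n_specific / n_all if n_all else 0.0

unigram = {"公众": 0.5, "舆情": 0.5}  # toy lexicon probabilities
print(best_segmentation("公众舆情", unigram))  # → ['公众', '舆情']
```

A real segmenter would use dynamic programming rather than enumerating all segmentations, which grows exponentially with sentence length.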
And step S3, marking the word characteristics in parts of speech, wherein the word characteristics marked in the same parts of speech are used as one type of word characteristics.
Part-of-speech tagging is the procedure of labeling each word in the segmentation result with its correct part of speech. Conventional tagging methods can be categorized as rule-based, statistics-based and deep-learning-based. The technical scheme uses a statistics-based hidden Markov model (Hidden Markov Model, HMM) to complete part-of-speech tagging. The HMM is a generative sequence-labeling model that assigns a label or class to each unit in a sequence: it first generates the current state from past states, then generates the current word from the current state, and finally generates a future state from several past states and the current state, and so on, until a complete state sequence and word sequence are generated. The HMM consists of two parts: the first is a Markov chain, used to characterize a short-time stationary sequence of neuron evolution; the second is a set of output distributions that conceal the state sequence from the observer. The trigram hidden Markov model (3-gram HMM) is defined over a finite vocabulary V and a finite set of states S. The hidden Markov model performs part-of-speech tagging by the following formula:
$$p\big(x_1,\ldots,x_m,\ y_1,\ldots,y_{m+1}\big) = \prod_{i=1}^{m+1} q\big(y_i \mid y_{i-2}, y_{i-1}\big)\ \prod_{i=1}^{m} e\big(x_i \mid y_i\big)$$
where $p(x, y, y_{m+1})$ is the joint probability of a given observation sequence $x$ and hidden-state sequence $y$ together with $y_{m+1}$; $q(y_i \mid y_{i-2}, y_{i-1})$ is the transition probability that the state following $y_{i-2}, y_{i-1}$ is $y_i$; $e(x_i \mid y_i)$ is the emission probability, with any $x_i \in V$ and $y_i \in S$; $x_i$ is the $i$-th observation; $y_i$ is the $i$-th hidden state; $y_{m+1}$ is the final state of the hidden-state sequence; $x$ is the given observation sequence; and $y$ is the hidden-state sequence.
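A minimal sketch of evaluating this trigram-HMM joint probability follows; the tag set and the transition table q and emission table e are toy assumptions, not trained parameters:

```python
import math

START, STOP = "*", "STOP"  # boundary symbols padding the tag sequence

def joint_log_prob(words, tags, q, e):
    """log p(x_1..x_m, y_1..y_m, y_{m+1}=STOP) for a trigram HMM:
    one transition term q(y_i | y_{i-2}, y_{i-1}) for i = 1..m+1 and
    one emission term e(x_i | y_i) per observed word."""
    padded = [START, START] + list(tags) + [STOP]
    lp = 0.0
    for i in range(2, len(padded)):  # transitions, including into STOP
        lp += math.log(q[(padded[i - 2], padded[i - 1], padded[i])])
    for w, t in zip(words, tags):    # emissions
        lp += math.log(e[(t, w)])
    return lp

# Toy tables: a single noun-verb sentence with probability 1 throughout.
q = {("*", "*", "N"): 1.0, ("*", "N", "V"): 1.0, ("N", "V", "STOP"): 1.0}
e = {("N", "舆情"): 1.0, ("V", "升温"): 1.0}
print(joint_log_prob(["舆情", "升温"], ["N", "V"], q, e))  # → 0.0
```

Actual tagging would select the tag sequence maximizing this probability, typically with the Viterbi algorithm rather than brute force.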
And step S4, counting the occurrence frequency of various word features, and outputting the positive and negative degrees of public opinion.
Automatic text classification is an important branch of natural language processing in which many models and algorithms have been developed; naive Bayes remains one of the continuing research hotspots in this field. The technical scheme uses a corpus of shopping-review comments as model training samples and runs a naive Bayes classifier for the text classification task. The specific flow is as follows: first, compute the hit probability of each specific category; then, compute the conditional probabilities of all partitions for each attribute; finally, compute the conditional probability under each category, take the maximum term as the category of the corresponding sentence, and judge the category of the text data according to the computed P value. To avoid zero probability estimates caused by insufficient training samples at test time, Laplace smoothing is introduced into the naive Bayes classifier. The P value is calculated by the following formula:
$$P(\text{category} \mid \text{features}) = \frac{P(\text{category})\ \prod_{i=1}^{n} P(\text{feature}_i \mid \text{category})}{\prod_{i=1}^{n} P(\text{feature}_i)}$$
$$P_\lambda\big(X^{(j)} = a_{jl} \mid Y = c_k\big) = \frac{\sum_{i=1}^{N} I\big(x_i^{(j)} = a_{jl},\ y_i = c_k\big) + \lambda}{\sum_{i=1}^{N} I\big(y_i = c_k\big) + S_j \lambda}$$
where $P(\text{category} \mid \text{features})$ is the conditional probability of the category to which the text data belongs given its features; $P(\text{feature}_1 \mid \text{category}), P(\text{feature}_2 \mid \text{category}), \ldots, P(\text{feature}_n \mid \text{category})$ are the conditional probabilities of each feature given the category, used to compute the class of the text data; $P(\text{feature}_1), P(\text{feature}_2), \ldots, P(\text{feature}_n)$ are the marginal probabilities of each feature, i.e., the probability of each feature occurring regardless of category; $P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k)$ is the conditional probability computed with Laplace ($\lambda$) smoothing given the condition $X^{(j)} = a_{jl}$ and class $Y = c_k$; $X^{(j)}$ is the $j$-th feature; $a_{jl}$ is a value of feature $X^{(j)}$; $Y$ is the class and $c_k$ one of the classes; $x_i^{(j)}$ is the $j$-th feature value of the $i$-th sample and $y_i$ the class of the $i$-th sample; $I(x_i^{(j)} = a_{jl}, y_i = c_k)$ is an indicator function equal to 1 when both conditions hold and 0 otherwise, used to determine whether a sample satisfies a particular feature and class; $\lambda$ is a smoothing parameter preventing zero-probability problems in probability estimation; $I(y_i = c_k)$ is another indicator function, equal to 1 when $y_i = c_k$ and 0 otherwise; and $S_j$ is the number of possible values of the $j$-th feature.
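The Laplace-smoothed estimate and the take-the-maximum classification rule can be sketched as a toy implementation; the one-feature training data below is invented for illustration, whereas the patent trains on a shopping-review corpus:

```python
from collections import Counter, defaultdict

class LaplaceNB:
    """Toy naive Bayes with Laplace (lambda) smoothing:
    P_lambda(X_j = a | Y = c) = (count(a, c) + lam) / (count(c) + S_j * lam).
    The denominator P(features) is omitted in predict() since it is the
    same for every class."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def fit(self, X, y):
        self.labels = sorted(set(y))
        self.prior = {c: y.count(c) / len(y) for c in self.labels}
        # S_j: number of possible values of feature j
        self.values = [sorted({row[j] for row in X}) for j in range(len(X[0]))]
        self.counts = defaultdict(int)        # (j, value, label) -> count
        self.label_counts = Counter(y)
        for row, c in zip(X, y):
            for j, v in enumerate(row):
                self.counts[(j, v, c)] += 1
        return self

    def cond_prob(self, j, v, c):
        s_j = len(self.values[j])
        return (self.counts[(j, v, c)] + self.lam) / (self.label_counts[c] + s_j * self.lam)

    def predict(self, row):
        def score(c):  # P(c) * prod_j P_lambda(x_j | c)
            p = self.prior[c]
            for j, v in enumerate(row):
                p *= self.cond_prob(j, v, c)
            return p
        return max(self.labels, key=score)

# Invented one-feature sentiment toy: "好" (good) vs "差" (bad).
model = LaplaceNB().fit([("好",), ("差",), ("好",)], ["pos", "neg", "pos"])
print(model.predict(("好",)))  # → pos
```

With lam > 0 every (value, class) pair receives non-zero probability, which is exactly the zero-probability fix the description motivates.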
In some embodiments, the inventors consider that the highly self-expressive and individual-centered nature of microblogs makes the emotional expression of individuals and the social transmission of emotion worth studying, so it is necessary to establish an index measuring the influence of each post on other users in the same public opinion environment. Making full use of the like, repost and comment counts acquired from the microblogs, an index Z is constructed to quantify the emotion influence capacity of each post.
wherein $x_{ri}$, $x_{li}$ and $x_{ci}$ denote the repost, like and comment counts of the $i$-th piece of text data, $norm$ denotes a normalization operation on the data, and $Senti_i$ denotes the sentiment score obtained by natural language processing of the $i$-th piece of text data.
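Since the extracted text does not reproduce the formula for Z itself, the sketch below assumes one plausible combination: min-max-normalised engagement counts summed per post and weighted by its sentiment score. The actual combination in the patent's formula may differ:

```python
def norm(values):
    """Min-max normalisation of a list of counts to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def influence_index(reposts, likes, comments, senti):
    """Assumed form of the influence index Z: normalised repost, like and
    comment counts combined per post and weighted by its sentiment score.
    This combination is an illustration, not the patent's exact formula."""
    r, l, c = norm(reposts), norm(likes), norm(comments)
    return [(ri + li + ci) * si for ri, li, ci, si in zip(r, l, c, senti)]

print(influence_index([0, 10], [0, 10], [0, 10], [1.0, 0.5]))  # → [0.0, 1.5]
```

Normalisation keeps posts with very large counts from dominating the index, matching the $norm$ operation named in the description.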
In an exemplary embodiment, FIG. 2 shows a schematic diagram of emotion scores in a breakpoint regression design according to an embodiment of the present invention. As shown in FIG. 2, the breakpoint regression design identifies an obvious breakpoint on 23 January 2020; after the second breakpoint, attention drops suddenly. The two breakpoints divide the study period of this patent into three stages: the first stage (PrLD, 1 January 2020 to 23 January 2020); the second stage (LD, 23 January 2020 to 19 March 2020); and the third stage (LiftLD, 19 March 2020 to 8 April 2020).
As shown in fig. 3, macroscopically, individual expressions characterized by sporadic and group expressions characterized by aggregated are maintained, and a unique network mass emotion profile is formed at this stage. The public, after meeting the basic demands of the self level, has the remaining power to change the focus of attention to the mental demands of the higher level. The emotion score distribution of the related topics shows that the user attention of part of topics is high, the posting frequency is high, and the emotion sharing is high. In the specific implementation, public opinion topics focused by the public at each stage can be obtained by using the method provided by the embodiment, the public major questions focused at different stages are positively responded, and future challenges are reasonably met.
In order to know the 'push wave-assisted' type fluctuation of the network public opinion field, whether the current microblog emotion is influenced by the traditional microblog emotion is discussed. As shown in fig. 4, the generation of the emotion-linked effect is related to four factors, first, there is a first driving force to start the whole public opinion field; secondly, the public with low pushing resistance is easily affected by authorities; the timeliness of news is satisfied again, namely a reasonable time distance exists between the microblog utterances which are mutually influenced; finally, a series system is formed between the network public opinion plants for the public emotion fermentation.
A second-order Almon distributed lag model with a lag of 3 periods is adopted to build a regression equation between the emotion influence capacity of the current microblog and that of earlier microblogs. Expressed with standardized regression coefficients, the dependent variable is explained by its own values in the previous three periods; that is, for a given event, a lag effect exists in the microblog public opinion field, and earlier emotion influences the direction of later emotion.
The general form of the distributed lag regression is:

Y_t = α + β_0·X_t + β_1·X_{t-1} + β_2·X_{t-2} + β_3·X_{t-3} + u_t

and the fitted equation obtained from the big data of the present study is:

Y = -0.194 + 0.991·Z_{0t} - 1.076·Z_{1t} + 0.25·Z_{2t}

where Y_t represents the emotion influence of the current text data; α represents the intercept term of the regression equation; β_0, β_1, β_2, β_3 represent the coefficients on X_t, X_{t-1}, X_{t-2}, X_{t-3} in the regression equation, i.e. the weights of the different time lags; α_0, α_1, α_2 are auxiliary parameters, intermediate parameters used in calculating the β coefficients; X_t, X_{t-1}, X_{t-2}, X_{t-3} represent the emotion influence capacity of the current text data and of the text data lagged by one, two, and three periods; u_t is the error term of the model, representing random variation the model fails to explain; Y represents the emotion influence of the current text data obtained from the big data of the present study; and Z_{0t}, Z_{1t}, Z_{2t} represent the time-lag regressors obtained from the big data of the present study, equivalent in meaning to X_t, X_{t-1}, X_{t-2}.
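The Z regressors of a second-order Almon model can be sketched as follows. This is a minimal illustration of the textbook polynomial-lag construction, in which each lag weight is modeled as β_i = α_0 + α_1·i + α_2·i², collapsing the four lagged regressors into three constructed series; the function names and toy series are illustrative, and it assumes the patent follows the standard construction.

```python
# Sketch of the second-order Almon transform with a lag of 3 periods:
# beta_i = a0 + a1*i + a2*i**2, so the lagged regression collapses to
# three constructed regressors Z0, Z1, Z2.

def almon_z(x, t, max_lag=3):
    """Build (Z_0t, Z_1t, Z_2t) from x[t], x[t-1], ..., x[t-max_lag]."""
    z0 = sum(x[t - i] for i in range(max_lag + 1))
    z1 = sum(i * x[t - i] for i in range(max_lag + 1))
    z2 = sum(i * i * x[t - i] for i in range(max_lag + 1))
    return z0, z1, z2

def recover_betas(a0, a1, a2, max_lag=3):
    """Map fitted alpha coefficients back to the lag weights beta_i."""
    return [a0 + a1 * i + a2 * i * i for i in range(max_lag + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(almon_z(x, 4))  # → (14.0, 16.0, 34.0)

# Plugging in fitted alpha coefficients (here the values from the fitted
# equation, purely for illustration) recovers implied lag weights.
print(recover_betas(0.991, -1.076, 0.25))
```

Regressing Y on the constructed Z series yields the α coefficients, from which the β lag weights are recovered as above.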
The final three-stage public opinion focuses, their quantified scores, and the related topic emotion-score heatmaps are shown in figs. 5 and 6.
This embodiment provides an objective public opinion analysis method for major public health events. Multi-source mass data and machine learning are used to conduct quantitative and qualitative evaluation from the perspective of public emotion in multiple aspects; the conclusions help research on public emotion and provide constructive opinions for insight into, prevention of, and control of the public opinion field around public health events. To support the patent, 108 pieces of Sina Weibo microblog data covering 1 January 2020 to 8 April 2020 were acquired by combining manual downloading with script crawlers, and public opinion emotion in different stages was analyzed objectively and quantitatively based on algorithms such as a naive Bayes classifier, breakpoint regression design, K-means clustering, and a second-order Almon distributed lag model. Subsequently, keywords with potential public opinion conditions are output through statistical feature analysis, trend analysis, word frequency statistics, and other means; the statistical significance of the public opinion scores is initially measured; and deep features of users, such as emotion influence capacity, are further mined across multiple aspects and indexes.
It should be noted that the public opinion analysis exemplified above is only a specific example for better describing the method according to the present invention; when implemented in practice, the method may be applied to other public opinion analysis tasks. The present embodiment is only an example and not a limitation of the invention.
The embodiment of the invention provides a public opinion analysis device based on big data and deep learning, as shown in fig. 7, the device 700 comprises:
a data acquisition unit 701 configured to acquire text data related to public opinion;
a feature extraction unit 702 configured to extract word features based on the text data;
A part-of-speech tagging unit 703 configured to perform part-of-speech tagging on the word features, where word features labeled with the same part of speech are used as a class of word features;
The text classification unit 704 is configured to count occurrence frequencies of various word features and output positive and negative degrees of public opinion.
In some embodiments, the data acquisition unit is further configured to:
According to the set keywords, the text data is crawled from the Internet by manual downloading and/or by writing script crawlers,
and the text data is manually cleaned to obtain the text data related to public opinion.
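A hypothetical cleaning pass echoing the unit's description might keep only posts that mention a tracked keyword and drop empty strings and exact duplicates; the function and variable names below are illustrative, not the patent's implementation.

```python
# Illustrative cleaning pass: keyword filter plus dedup, as a stand-in
# for the manual cleaning step described above.

def clean_posts(posts, keywords):
    seen, kept = set(), []
    for post in posts:
        text = post.strip()
        if not text or text in seen:
            continue  # skip empty strings and verbatim duplicates
        if any(k in text for k in keywords):
            seen.add(text)
            kept.append(text)
    return kept

raw = ["lockdown update today", "", "cake recipe", "lockdown update today"]
print(clean_posts(raw, ["lockdown"]))  # → ['lockdown update today']
```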
In some embodiments, the feature extraction unit is further configured to:
Constructing a character-based generation model, and extracting word features from the text data by using the character-based generation model, wherein the character-based generation model is expressed as:

W_Seq = arg max_{W_Seq1} P(W_Seq1 | C_Seq)

IV_Recall = N_{IV in specific word segmentation} / N_{IV in all word segmentation}

P(W_1^n) = ∏_{i=1}^{n} P(w_i | w_{i-k+1}, …, w_{i-1})

where W_Seq on the left of the equation represents the output sequence of the character-based generation model, typically a series of words extracted from the text data; it is the word sequence or text sequence finally extracted from the text data, as selected by the generation model. arg max denotes selecting, among a set of alternative sequences, the sequence with the maximum likelihood probability; P represents the probability distribution over sequences given the character sequence; W_Seq1 on the right of the equation represents an alternative word sequence or text sequence; and C_Seq represents the given character sequence. IV_Recall is an indicator representing the ratio of the number of words of interest correctly extracted in a particular word segmentation to the number of words of interest in all possible word segmentations; it is generally used to evaluate the performance of word feature extraction, particularly in information retrieval and natural language processing. N_{IV in all word segmentation} is the number of words of interest in all possible word segmentations, and N_{IV in specific word segmentation} is the number of words of interest correctly extracted in a particular segmentation. W_1^n represents a sequence of words or terms, typically a sequence of terms in text; P(W_1^n) is the probability of the given word sequence, i.e. of the entire text sequence; and P(w_i | w_{i-k+1}, …, w_{i-1}) is the conditional probability of a word given its context, used to calculate the conditional probability of each term. [c, t]_1 represents the whole word sequence, including all words in W_1^n, and the context window comprises the first k-1 words of the word sequence used to calculate the conditional probability.
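The selection rule and the IV-recall indicator described above can be illustrated with a toy sketch: among candidate segmentations of a character sequence, pick the one with the highest probability under a simple unigram word model (a stand-in for the patent's generative model, not its actual formulation), then score recall. All names, probabilities, and counts are made up.

```python
# Toy illustration of arg-max segmentation selection and IV recall.

def best_segmentation(candidates, word_prob):
    """arg max over candidate word sequences by product of unigram probs."""
    def prob(seq):
        p = 1.0
        for w in seq:
            p *= word_prob.get(w, 1e-6)  # tiny probability for unknown words
        return p
    return max(candidates, key=prob)

def iv_recall(n_iv_specific, n_iv_all):
    """Share of in-segmentation words of interest this segmentation recovered."""
    return n_iv_specific / n_iv_all if n_iv_all else 0.0

word_prob = {"public": 0.02, "opinion": 0.02, "pub": 0.001, "licopinion": 1e-9}
cands = [["public", "opinion"], ["pub", "licopinion"]]
print(best_segmentation(cands, word_prob))  # → ['public', 'opinion']
print(iv_recall(8, 10))                     # → 0.8
```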
In some embodiments, the part-of-speech tagging unit is further configured to:
Based on the word characteristics, generating a current state according to the past states, generating a current word according to the current state, and finally generating a future state according to the past states and the current state until a complete state sequence and a word sequence are generated.
In some embodiments, the part-of-speech tagging unit is further configured to perform part-of-speech tagging on the word features using a hidden Markov model, the hidden Markov model comprising a Markov chain and a set of output distributions, the Markov chain being used to characterize a short-time stationary sequence of neuronal evolution; the set of output distributions conceals the state sequence from the observer, and the hidden Markov model performs part-of-speech tagging by the following formula:

p(x_1 … x_m, y_1 … y_{m+1}) = ∏_{i=1}^{m+1} q(y_i | y_{i-2}, y_{i-1}) · ∏_{i=1}^{m} e(x_i | y_i)

where p(x, y, y_{m+1}) is the joint probability distribution of the given observation sequence x, the hidden state sequence y, and y_{m+1}; q(y_i | y_{i-2}, y_{i-1}) is the transition probability that the state following y_{i-2}, y_{i-1} is y_i; e(x_i | y_i) is the emission probability, with any x_i ∈ V and y_i ∈ S; x_i in e(x_i | y_i) is the i-th observation and y_i is the i-th hidden state; y_{m+1} is the last state of the hidden state sequence; x is the given observation sequence; and y is the hidden state sequence.
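The joint probability above is the product of trigram transition probabilities and per-word emission probabilities, which the following toy computation makes concrete. The tags, words, and probability tables are invented; the padding symbol `*` and STOP state reflect the standard trigram-HMM convention, assumed here to match the patent's model.

```python
# Toy computation of the trigram-HMM joint probability: product of
# transition probabilities q(y_i | y_{i-2}, y_{i-1}) (including STOP)
# and emission probabilities e(x_i | y_i).

def joint_prob(words, tags, q, e, stop="STOP", pad="*"):
    ys = [pad, pad] + tags + [stop]
    p = 1.0
    for i in range(2, len(ys)):                     # transitions incl. STOP
        p *= q.get((ys[i - 2], ys[i - 1], ys[i]), 0.0)
    for w, t in zip(words, tags):                   # emissions
        p *= e.get((t, w), 0.0)
    return p

q = {("*", "*", "N"): 1.0, ("*", "N", "V"): 0.5, ("N", "V", "STOP"): 1.0}
e = {("N", "dog"): 0.4, ("V", "runs"): 0.3}
print(joint_prob(["dog", "runs"], ["N", "V"], q, e))  # ≈ 0.06
```

In practice the arg max over tag sequences would be found with the Viterbi algorithm rather than by enumerating joint probabilities.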
In some embodiments, the text classification unit is further configured to:
Text classification tasks are performed using a naive bayes classifier,
The text classification task includes:
calculating hit probability of each specific category;
Calculating conditional probabilities of all partitions for each attribute;
calculating the conditional probability under each category, taking the maximum term as the category of the corresponding sentence, and judging the category of the text data according to the calculated P value, wherein the P value is calculated according to the following formulas:

P(category | features) = P(feature_1 | category) · P(feature_2 | category) · … · P(feature_n | category) · P(category) / (P(feature_1) · P(feature_2) · … · P(feature_n))

P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j·λ)

where P(category | features) is the conditional probability of the category to which the text data belongs given the features; P(feature_1 | category), P(feature_2 | category), …, P(feature_n | category) are the conditional probabilities of each feature given the category, and these probabilities are used to calculate the class of the text data; P(feature_1), P(feature_2), …, P(feature_n) are the marginal probabilities of each feature, i.e. the probability of each feature occurring regardless of category; P_λ(X^(j) = a_jl | Y = c_k) is the conditional probability calculated using Laplace smoothing (λ smoothing) given the condition X^(j) = a_jl and class Y = c_k; X^(j) is the j-th feature; a_jl is a value of feature X^(j); Y is the class and c_k is one of the classes; x_i^(j) is the j-th feature value of the i-th sample and y_i is the class of the i-th sample; I(x_i^(j) = a_jl, y_i = c_k) is an indicator function that equals 1 when both conditions are satisfied and 0 otherwise, used to determine whether a sample satisfies a particular feature value and class; λ is a smoothing parameter used to prevent zero-probability problems in probability estimation; I(y_i = c_k) is another indicator function that equals 1 when y_i = c_k and 0 otherwise; and S_j represents the number of possible values of the j-th feature.
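The Laplace-smoothed estimate can be sketched in a few lines: counts of (feature value, class) pairs plus λ in the numerator, class counts plus S_j·λ in the denominator. The sample data and function name below are illustrative only.

```python
# Minimal sketch of the Laplace-smoothed estimate P_lambda(X^(j)=a | Y=c):
# (count(feature=a and class=c) + lam) / (count(class=c) + S_j * lam).

def smoothed_cond_prob(samples, j, a, c, s_j, lam=1.0):
    """samples: list of (features_tuple, label) pairs."""
    num = sum(1 for feats, y in samples if feats[j] == a and y == c) + lam
    den = sum(1 for _, y in samples if y == c) + s_j * lam
    return num / den

data = [(("sunny",), "pos"), (("rainy",), "neg"), (("sunny",), "pos")]
# Feature 0 takes 2 possible values here, so S_0 = 2.
print(smoothed_cond_prob(data, 0, "sunny", "pos", s_j=2))  # → (2+1)/(2+2) = 0.75
```

With λ = 1 an unseen feature value still receives a small nonzero probability, which is exactly the zero-probability problem the smoothing parameter prevents.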
In some embodiments, the big data and deep learning based public opinion analysis device further comprises an influence quantification unit configured to quantify the emotional influence capacity of each piece of text data by constructing an index Z by the following formula:
wherein x_ri, x_li, x_ci respectively represent the numbers of forwards, likes (praise), and comments of the i-th text data, norm represents a normalization operation on the data, and Senti_i represents the emotion score obtained by natural language processing of the i-th text data.
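The formula image for the index Z is not reproduced here, so the sketch below is only a guessed composition consistent with the variable descriptions: min-max normalize forwards, likes, and comments across the corpus, sum them, and weight by the post's sentiment score. The combination rule, function names, and data are assumptions, not the patent's actual formula.

```python
# Assumed composition of the influence index Z (the exact formula is not
# recoverable from the text): normalized engagement weighted by sentiment.

def influence_index(posts):
    """posts: list of dicts with 'rt', 'like', 'cmt', 'senti' keys."""
    def norm(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    rts = norm([p["rt"] for p in posts])
    likes = norm([p["like"] for p in posts])
    cmts = norm([p["cmt"] for p in posts])
    return [(r + l + c) * p["senti"]
            for r, l, c, p in zip(rts, likes, cmts, posts)]

posts = [{"rt": 0, "like": 0, "cmt": 0, "senti": 0.5},
         {"rt": 10, "like": 20, "cmt": 5, "senti": -0.8}]
print(influence_index(posts))  # first entry → 0.0
```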
In some embodiments, the big data and deep learning based public opinion analysis device further comprises an equation building unit configured to:
A second-order Almon distributed lag model with a lag of 3 periods is adopted to build a regression equation between the emotion influence capacity of the current text data and the emotion influence capacity of earlier text data:

Y_t = α + β_0·X_t + β_1·X_{t-1} + β_2·X_{t-2} + β_3·X_{t-3} + u_t

The fitted equation obtained from the big data of the present study is:

Y = -0.194 + 0.991·Z_{0t} - 1.076·Z_{1t} + 0.25·Z_{2t}

where Y_t represents the emotion influence of the current text data; α represents the intercept term of the regression equation; β_0, β_1, β_2, β_3 represent the coefficients on X_t, X_{t-1}, X_{t-2}, X_{t-3} in the regression equation, i.e. the weights of the different time lags; α_0, α_1, α_2 are auxiliary parameters, intermediate parameters used in calculating the β coefficients; X_t, X_{t-1}, X_{t-2}, X_{t-3} represent the emotion influence capacity of the current text data and of the text data lagged by one, two, and three periods; u_t is the error term of the model, representing random variation the model fails to explain; Y represents the emotion influence of the current text data obtained from the big data of the present study; and Z_{0t}, Z_{1t}, Z_{2t} represent the time-lag regressors obtained from the big data of the present study, equivalent in meaning to X_t, X_{t-1}, X_{t-2}.
It should be noted that, the device in this embodiment and the method described in the foregoing belong to the same technical idea, and the same technical effects can be achieved, which are not repeated here.
Embodiments of the present invention provide a readable storage medium storing one or more programs executable by one or more processors to implement the methods described in the above embodiments.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. For example, other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the invention. This is not to be interpreted as an intention that the features of the claimed invention are essential to any of the claims. Rather, inventive subject matter may lie in less than all features of a particular inventive embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (8)

1. The public opinion analysis method based on big data and deep learning is characterized by comprising the following steps:
Acquiring text data related to public opinion;
extracting word characteristics based on the text data;
performing part-of-speech tagging on the word features by using a hidden Markov model, wherein the word features with the same part-of-speech tagging are used as one type of word features;
counting the occurrence frequency of various word characteristics, and outputting the positive and negative degrees of public opinion;
the statistics of the occurrence frequency of various word characteristics specifically comprises the following steps:
Text classification tasks are performed using a naive bayes classifier,
The text classification task includes:
calculating hit probability of each specific category;
Calculating conditional probabilities of all partitions for each attribute;
calculating the conditional probability of each category, taking the maximum item of the conditional probability as the category of the corresponding sentence, and judging the category of the text data according to the calculated probability value;
the method further comprises the steps of: the emotion influence capacity of each piece of text data is quantified by constructing an index Z by the following formula:
wherein x_ri, x_li, x_ci respectively represent the numbers of forwards, likes (praise), and comments of the i-th text data, norm represents a normalization operation on the data, and Senti_i represents the emotion score obtained by natural language processing of the i-th text data;
After constructing the index Z quantifies the mood influence capability of each piece of text data, the method further comprises:
A second-order Almon distributed lag model with a lag of 3 periods is adopted to build a regression equation between the emotion influence capacity of the current text data and the emotion influence capacity of earlier text data:

Y_t = α + β_0·X_t + β_1·X_{t-1} + β_2·X_{t-2} + β_3·X_{t-3} + u_t

wherein Y_t represents the emotion influence of the current text data; α represents the intercept term of the regression equation; β_0, β_1, β_2, β_3 represent the coefficients on X_t, X_{t-1}, X_{t-2}, X_{t-3} in the regression equation, i.e. the weights of the different time lags; α_0, α_1, α_2 are auxiliary parameters, intermediate parameters used in calculating the β coefficients; X_t, X_{t-1}, X_{t-2}, X_{t-3} represent the emotion influence capacity of the current text data and of the text data lagged by three different periods; u_t is the error term of the model, representing random variation the model fails to explain.
2. The method of claim 1, wherein the obtaining text data related to public opinion specifically comprises:
According to the set keywords, the text data is crawled from the Internet by manual downloading and/or by writing script crawlers,
and the text data is manually cleaned to obtain the text data related to public opinion.
3. The method according to claim 1, wherein the extracting word features based on the text data specifically comprises:
Constructing a character-based generation model, and extracting word features from the text data by using the character-based generation model, wherein the character-based generation model is expressed as:

W_Seq = arg max_{W_Seq1} P(W_Seq1 | C_Seq)

IV_Recall = N_{IV in specific word segmentation} / N_{IV in all word segmentation}

P(W_1^n) = ∏_{i=1}^{n} P(w_i | w_{i-k+1}, …, w_{i-1})

wherein W_Seq represents the output sequence of the character-based generation model, representing a series of words extracted from the text data, and is the word sequence or text sequence finally extracted from the text data; arg max represents selecting the sequence with the maximum likelihood probability among a set of alternative sequences; P represents the probability distribution over sequences given the character sequence; W_Seq1 represents an alternative word sequence or text sequence; C_Seq represents the given character sequence; IV_Recall is an indicator representing the ratio of the number of words of interest correctly extracted in a particular word segmentation to the number of words of interest in all possible word segmentations, for evaluating the performance of word feature extraction; N_{IV in all word segmentation} represents the number of words of interest in all possible word segmentations; N_{IV in specific word segmentation} represents the number of words of interest correctly extracted in a particular word segmentation; W_1^n represents a sequence of words or terms in text; P(W_1^n) represents the probability of the given word sequence; P(w_i | w_{i-k+1}, …, w_{i-1}) represents the conditional probability of a word given its context; and [c, t]_1 represents the whole word sequence, including all words in W_1^n, with the context window comprising the first k-1 words of the word sequence used to calculate the conditional probability.
4. The method according to claim 1, wherein the part-of-speech tagging of the word features specifically comprises:
Based on the word characteristics, generating a current state according to the past states, generating a current word according to the current state, and finally generating a future state according to the past states and the current state until a complete state sequence and a word sequence are generated.
5. The method of claim 1, wherein the hidden Markov model comprises a Markov chain and a set of output distributions, the Markov chain being used to characterize a short-time stationary sequence of neuronal evolution; the set of output distributions conceals the state sequence from the observer, and the hidden Markov model performs part-of-speech tagging by the following formula:

p(x_1 … x_m, y_1 … y_{m+1}) = ∏_{i=1}^{m+1} q(y_i | y_{i-2}, y_{i-1}) · ∏_{i=1}^{m} e(x_i | y_i)

where p(x, y, y_{m+1}) is the joint probability distribution of the given observation sequence x, the hidden state sequence y, and y_{m+1}; q(y_i | y_{i-2}, y_{i-1}) is the transition probability that the state following y_{i-2}, y_{i-1} is y_i; e(x_i | y_i) is the emission probability, with any x_i ∈ V and y_i ∈ S; x_i in e(x_i | y_i) is the i-th observation and y_i is the i-th hidden state; y_{m+1} is the last state of the hidden state sequence; x is the given observation sequence; and y is the hidden state sequence.
6. The method of claim 1, wherein the probability value is calculated by the formulas:

P(category | features) = P(feature_1 | category) · P(feature_2 | category) · … · P(feature_n | category) · P(category) / (P(feature_1) · P(feature_2) · … · P(feature_n))

P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j·λ)

where P(category | features) is the conditional probability of the category to which the text data belongs given the features; P(feature_1 | category), P(feature_2 | category), …, P(feature_n | category) represent the conditional probabilities of each feature given the category; P(feature_1), P(feature_2), …, P(feature_n) are the marginal probabilities of each feature, i.e. the probability of each feature appearing regardless of category; P_λ(X^(j) = a_jl | Y = c_k) is the conditional probability calculated using Laplace smoothing given the condition X^(j) = a_jl and class Y = c_k; X^(j) is the j-th feature; a_jl is a value of feature X^(j); Y is the class and c_k is one of the classes; x_i^(j) is the j-th feature value of the i-th sample and y_i is the class of the i-th sample; I(x_i^(j) = a_jl, y_i = c_k) is an indicator function that equals 1 when both conditions are satisfied and 0 otherwise; λ is a smoothing parameter used to prevent zero-probability problems in probability estimation; I(y_i = c_k) is another indicator function that equals 1 when y_i = c_k and 0 otherwise; and S_j represents the number of possible values of the j-th feature.
7. A public opinion analysis device based on big data and deep learning, characterized by comprising:
A data acquisition unit configured to acquire text data related to public opinion;
A feature extraction unit configured to extract word features based on the text data;
The part-of-speech tagging unit is configured to tag the part of speech of the word features by using a hidden Markov model, and the word features with the same part of speech tag are used as a class of word features;
The text classification unit is configured to count the occurrence frequency of various word characteristics and output the positive and negative degrees of public opinion;
An influence quantization unit configured to quantize emotion influence capability of each piece of text data by constructing an index Z by the following formula:
wherein x_ri, x_li, x_ci respectively represent the numbers of forwards, likes (praise), and comments of the i-th text data, norm represents a normalization operation on the data, and Senti_i represents the emotion score obtained by natural language processing of the i-th text data;
An equation building unit configured to:
A second-order Almon distributed lag model with a lag of 3 periods is adopted to build a regression equation between the emotion influence capacity of the current text data and the emotion influence capacity of earlier text data:

Y_t = α + β_0·X_t + β_1·X_{t-1} + β_2·X_{t-2} + β_3·X_{t-3} + u_t

wherein Y_t represents the emotion influence of the current text data; α represents the intercept term of the regression equation; β_0, β_1, β_2, β_3 represent the coefficients on X_t, X_{t-1}, X_{t-2}, X_{t-3} in the regression equation, i.e. the weights of the different time lags; α_0, α_1, α_2 are auxiliary parameters, intermediate parameters used in calculating the β coefficients; X_t, X_{t-1}, X_{t-2}, X_{t-3} represent the emotion influence capacity of the current text data and of the text data lagged by three different periods; u_t is the error term of the model, representing random variation the model fails to explain;
the text classification unit is further configured to:
Text classification tasks are performed using a naive bayes classifier,
The text classification task includes:
calculating hit probability of each specific category;
Calculating conditional probabilities of all partitions for each attribute;
and calculating the conditional probability under each category, taking the maximum term as the category of the corresponding sentence, and judging the category of the text data according to the calculated probability value.
8. A readable storage medium storing one or more programs executable by one or more processors to implement the method of any of claims 1-6.
CN202311496623.8A 2023-11-10 2023-11-10 Public opinion analysis method, device and medium based on big data and deep learning Active CN117875309B (en)

Publications (2)

Publication Number Publication Date
CN117875309A (en) 2024-04-12
CN117875309B (en) 2024-06-14




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant