CN113343118A - Hot event discovery method under mixed new media

Info

Publication number: CN113343118A
Application number: CN202110444596.4A
Authority: CN
Inventors: 曹玖新; 洪智高; 刘佳
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2021-04-23
Filing date: 2021-04-23
Publication date: 2021-09-03

Abstract

The invention discloses a method for discovering hot events in a mixed new media environment, comprising the following steps: first, word segmentation and time slicing are performed on online news portal data from a specific time period, and topic events are discovered and mined with a probabilistic topic model; then, based on each event's topic, keywords, named entities and similar information, social content related to the event and the associated user behavior and relationship data are retrieved from social network media; finally, whether the event is a hot event is judged from the volume of reports on the event in news portals and the scale of its propagation in the social network. The results of the algorithm provide important support for practical applications such as network event retrieval, online public opinion monitoring, emergency detection, and related security decision-making.

Description

Hot event discovery method under mixed new media

Technical Field

The invention relates to a method for discovering social hot events in a mixed media environment, belonging to the technical field of internet monitoring.

Background

Currently, social networks (such as Weibo and WeChat) are the most active, content-rich and influential form of new social media, and together with online news portals they form a mixed online new media environment. Some social events become known through news portal reports, are then reposted and amplified across social media, provoke heated discussion among netizens and contests of online public opinion, and finally become social hot events on the Internet.

The invention constructs a mixed new media environment by comprehensively considering the functions of, and interactions between, new social media and news portal websites on the Internet. On this basis, event topics are discovered by mining news portal websites, event-oriented news corpus data and social media data are obtained, and social hot events are identified, thereby helping people understand and grasp the current state and future development trend of social hot events in the network environment. The results of the invention provide important support for practical applications such as network event retrieval, online public opinion monitoring, emergency detection, and related security decision-making.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: to provide a model that can effectively extract the latent topic information in documents and judge whether a topic is a hot event.

To solve this technical problem, the technical scheme adopted by the invention is that, after the data are preprocessed, the documents are vectorized and modeled with a neural topic model, and the topics obtained by the modeling are then merged.

To achieve this purpose, the technical scheme of the invention is as follows: a method for discovering hot events under mixed new media, which comprehensively considers the functions of, and relationships between, new social media and news portal websites on the Internet, constructs a mixed new media environment, obtains news corpus data and social media data oriented to social hot events, and discovers topics from the mixed media data, thereby helping people understand and grasp the current state and future development trend of social hot events in the network environment. The method comprises the following steps:

step 1) preprocessing the collected news data, including removing hyperlinks, stop words, punctuation marks, digits and other useless information, and performing word segmentation with the HanLP natural language processing tool;

step 2) distributing the documents into time slices in chronological order, with a time interval of 1 day, to facilitate subsequent evolution analysis; for every event, the documents within 30 days of the event's occurrence are examined, i.e., 30 time slices;

step 3) vectorizing the text, using document representations pre-trained by BERT to improve topic coherence;

step 4) performing topic modeling with a neural topic model, in which the input bag-of-words representation is replaced by contextual embeddings;

step 5) merging the topics obtained by the modeling in step 4);

step 6) after event detection on the news portal websites is completed, associating each event with its microblog content and user social relationships in the social network;

step 7) evaluating each event against a set criterion, and judging the event to be a hot event when a set threshold is exceeded.

The division of time slices in step 2) has an important influence on analyzing how an event evolves over a period of time and how its heat changes. In the invention the window is fixed at 30 days, and it can also be set adaptively according to the time span of the crawled news content.
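Steps 1) and 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `str.split` stands in for the HanLP segmenter, and the stop-word set, the regular expressions and the function names `preprocess` and `slice_by_day` are hypothetical names introduced here.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical stop-word list; a real run would load a full stop-word file.
STOP_WORDS = {"the", "a", "of", "and", "in"}

URL_RE = re.compile(r"https?://\S+")
NON_WORD_RE = re.compile(r"[\W\d_]+")  # punctuation, whitespace, digits, underscores


def preprocess(text, tokenize=str.split):
    """Strip links, punctuation and digits, then tokenize and drop stop words.

    `tokenize` is a stand-in for the HanLP segmenter named in step 1).
    """
    text = URL_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return [tok for tok in tokenize(text.lower()) if tok not in STOP_WORDS]


def slice_by_day(docs, start, n_slices=30):
    """Assign (publish_date, text) pairs to daily time slices 0..n_slices-1."""
    slices = defaultdict(list)
    for published, text in docs:
        index = (published - start).days
        if 0 <= index < n_slices:  # documents outside the window are skipped
            slices[index].append(preprocess(text))
    return dict(slices)


docs = [
    (date(2021, 4, 1), "Flood hits the city, see https://example.com/a1 for 24 photos!"),
    (date(2021, 4, 2), "Rescue teams arrive in the flooded city."),
    (date(2021, 6, 1), "Published too late to fall in the 30-day window."),
]
slices = slice_by_day(docs, start=date(2021, 4, 1))
```

Keeping the window length as the `n_slices` parameter reflects the statement that the 30-day window may also be set adaptively.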

The text vectorization in step 3) replaces the bag-of-words representation fed to the topic model with contextual embeddings, i.e., a neural encoding layer of document representations pre-trained by the BERT language model is introduced before the topic modeling process. First, a dictionary of the topic corpus is built by calling the BERT_serving package, and a BERT word vector model is trained. Each document is then converted into a matrix of word vectors, and the resulting data are stored for the subsequent topic modeling task.

In step 4), the vectorized text data from step 3) are used as the contextual embedding input for topic modeling. The neural topic model used in the invention is a generative model based on the neural variational inference framework and inspired by the variational autoencoder; it uses a Gaussian distribution whose parameters can be obtained by linear computation.

In step 5), after topic modeling, the topics are merged. A threshold ζ is set on the distance between two topics: if the distance between two topics is smaller than the threshold, the two topics are judged to be the same topic and are merged; otherwise, the two topics are different and are not merged.

In steps 6) and 7), the microblog platform provides rich topic classification and content tag information. The time, named entity and keyword information obtained during event detection is combined and used to search the microblog platform for content related to the event's key information; the cosine distance between the event's key information and the content, classification and tags of each search result is then calculated to measure the similarity between the event and the microblog, and the event-news-microblog association is established. To judge hot events, the invention takes the social network attributes of the event into account and calculates the heat value of each topic obtained in step 5) with formula (1):

H_e＝α·N_e/N+β·S_e/S+γ·C_e/C (1)

where N_e, S_e and C_e respectively denote the number of news reports, the number of user forwards and the number of comments for event e, and N, S and C respectively denote the totals of the corresponding indicators; α, β and γ are weighting coefficients (e.g., 0.6, 0.2, 0.2) set according to the importance of these factors. When the integrated heat value H_e (range [0,1]) exceeds 0.4, i.e., when the weighted share of reports and discussion attributable to event e exceeds 40%, event e is judged to be a hot event.
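The heat judgment can be illustrated with a short sketch. It assumes the heat value is the weighted sum of the event's shares of news reports, forwards and comments, which is the form implied by the symbol definitions and the [0,1] range; the function names `heat_value` and `is_hot` and the sample counts are hypothetical.

```python
def heat_value(n_e, s_e, c_e, n_total, s_total, c_total,
               alpha=0.6, beta=0.2, gamma=0.2):
    """Weighted share of reports, forwards and comments attributable to event e.

    Assumed form: alpha * N_e/N + beta * S_e/S + gamma * C_e/C.
    """
    def share(part, total):
        # Guard against an empty corpus so the ratio is always defined.
        return part / total if total else 0.0

    return (alpha * share(n_e, n_total)
            + beta * share(s_e, s_total)
            + gamma * share(c_e, c_total))


def is_hot(heat, threshold=0.4):
    """An event is judged hot when its heat value exceeds the threshold."""
    return heat > threshold


# Event e accounts for 60% of reports, 30% of forwards and 20% of comments.
heat = heat_value(n_e=120, s_e=3000, c_e=900,
                  n_total=200, s_total=10000, c_total=4500)
hot = is_hot(heat)
```

With the example coefficients (0.6, 0.2, 0.2), news coverage dominates the score, so an event with little press coverage needs a very large social footprint to cross the 0.4 threshold.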

Compared with the prior art, the invention has the following advantages:

1. The invention improves topic modeling in a mixed media setting: it comprehensively considers the functions of, and relationships between, new social media and news portal websites on the Internet, obtains news corpus data and social media data oriented to social hot events, and discovers current hot topics from the mixed media data.

2. The NTM neural topic model is built on the variational autoencoder framework. Because the encoder and decoder of a variational autoencoder can be trained jointly through back-propagation, the mathematical derivation needed to train the NTM model is less complex than for traditional probabilistic models, and the model is easy to extend.

3. The NTM model used by the invention takes BERT-based document representations as input, and its topic modeling part consists of an encoder and a decoder, so that topic generation resembles a data reconstruction process. The bag-of-words representation fed to the topic model is replaced by contextual embeddings, i.e., a neural encoding layer of document representations pre-trained by the BERT language model is introduced before topic modeling, which improves the interpretability and coherence of the topics.

4. The method crawls news reports from mainstream news media over a period of time using given keywords, so the evolution of the news over that period can be tracked; the time slices of news evolution are divided adaptively, and the stage changes of a hot event are judged according to whether topics are merged.

Drawings

Fig. 1 is a flowchart illustrating a hot event determination process according to the present invention.

Fig. 2 is a diagram of the topic model of the present invention.

Detailed Description

The invention is further illustrated by the following description of specific embodiments, which are intended to be illustrative only and not to be limiting of the scope of the invention, and various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the limits of the appended claims.

Example 1: referring to fig. 1 and 2, a method for discovering a hotspot event under a mixed new media includes the following steps:

step 5) merging the topics obtained by the modeling in step 4).

Application example 1: referring to fig. 2, the method for topic modeling of a document based on a neural topic model according to the present invention includes the following steps:

Step 1. encoding procedure

The encoder generates, for document d, the Gaussian distribution from which the topic distribution θ is derived:

1) A document representation s is obtained by processing document d with BERT.

s＝BERT(d) (1)

2) The document representation s is projected to the hidden layer, where it is concatenated with the bag-of-words representation BoW of document d.

h＝[s，BoW] (2)

3) μ and log σ, the parameters that define the mean and variance of the Gaussian distribution, are obtained through two independent feed-forward layers, where f(·) denotes a neural perceptron with a ReLU activation function, and the weights W₁, W₂ and biases b₁, b₂ are learnable parameters shared across different inputs.

μ＝W₁f(h)+b₁ (3)

logσ＝W₂f(h)+b₂ (4)

4) Sample the latent variable z～N(μ, σ²), where N(μ, σ²) is a multidimensional Gaussian distribution whose components are mutually independent Gaussian random variables. The latent variable z can be expressed as:

z＝μ+σ⊙ε (5)

where ε can be regarded as an auxiliary noise variable sampled from the normal distribution N(0, I).
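The encoding procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the trained model: the weights are small random values, the vectors `s` and `bow` are toy stand-ins for the BERT representation and the word counts, the hidden layer `W0`, `b0` realizes f(·), and the sample is drawn with the standard reparameterization z = μ + σ·ε. All names are hypothetical.

```python
import math
import random


def linear(weights, x, bias):
    """Affine map W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init(rows, cols, rng):
    """Small random weight matrix; a trained model would learn these values."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode(s, bow, params, rng):
    """Concatenate, apply the hidden layer, compute mu and log sigma, sample z."""
    h = list(s) + list(bow)                                  # h = [s, BoW]
    hidden = relu(linear(params["W0"], h, params["b0"]))     # f(h)
    mu = linear(params["W1"], hidden, params["b1"])          # mu = W1 f(h) + b1
    log_sigma = linear(params["W2"], hidden, params["b2"])   # log sigma = W2 f(h) + b2
    eps = [rng.gauss(0.0, 1.0) for _ in mu]                  # eps ~ N(0, I)
    z = [m + math.exp(ls) * e for m, ls, e in zip(mu, log_sigma, eps)]
    return mu, log_sigma, z


rng = random.Random(0)
embed_dim, vocab_size, hidden_dim, latent_dim = 4, 6, 5, 3
params = {
    "W0": init(hidden_dim, embed_dim + vocab_size, rng), "b0": [0.0] * hidden_dim,
    "W1": init(latent_dim, hidden_dim, rng), "b1": [0.0] * latent_dim,
    "W2": init(latent_dim, hidden_dim, rng), "b2": [0.0] * latent_dim,
}
s = [0.2, -0.1, 0.4, 0.3]   # toy stand-in for the BERT document vector
bow = [1, 0, 2, 0, 0, 1]    # toy word counts over a 6-word vocabulary
mu, log_sigma, z = encode(s, bow, params, rng)
```

Because the randomness enters only through `eps`, the sample z remains a deterministic function of μ and log σ, which is what lets gradients flow back to the encoder weights during training.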

Step 2. decoding procedure

Assume that a given corpus C contains K topics, that each topic k is represented by a topic-word distribution φ^(k), and that each document d in C corresponds to a topic mixture represented by the variable θ, where θ is a K-dimensional distribution vector constructed by Gaussian softmax. The decoder therefore simulates the generation of each document d through the following steps:

1) Derive the topic distribution θ from the latent variable z, where w_θ is a trainable parameter.

θ＝softmax(w_θz) (6)

2) Infer each word w in document d from the variable θ, where f_φ is the weight matrix of the topic-word distributions φ^(k):

w～softmax(f_φ(θ)) (7)
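The decoding procedure can be sketched as two softmax layers. The sketch assumes the word distribution is obtained by applying a softmax to a linear map of θ, which is the usual form in neural topic models; the matrices `w_theta` and `w_phi` hold hand-picked toy values rather than trained weights, and all names are hypothetical.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Matrix-vector product for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def decode(z, w_theta, w_phi):
    """Map a latent sample z to a topic mixture and a word distribution."""
    theta = softmax(matvec(w_theta, z))         # theta = softmax(w_theta z)
    word_probs = softmax(matvec(w_phi, theta))  # assumed: softmax(f_phi(theta))
    return theta, word_probs


z = [0.5, -1.0, 0.3]            # latent sample from the encoder
w_theta = [[1.0, 0.0, 0.0],     # K = 2 topics, latent dimension 3
           [0.0, 1.0, 0.0]]
w_phi = [[2.0, 0.0],            # vocabulary of 4 words, K = 2 topics
         [0.0, 2.0],
         [1.0, 1.0],
         [0.0, 0.0]]
theta, word_probs = decode(z, w_theta, w_phi)
```

Each column of `w_phi` plays the role of one topic's word weights, so inspecting the largest entries of a column is one way to read off the top words of that topic.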

In summary, based on the variational lower bound, the objective function of the NTM model defined by the present invention is:

L_NTM＝E_q(z|d)[log p(d|z)]-D_KL[q(z|d)||p(z|μ，σ)] (8)

The first term in equation (8) is the reconstruction loss, the second term is the Kullback-Leibler divergence loss, and p(z|μ, σ) represents the standard normal prior. q(z|d) and p(d|z) denote the encoding process and the decoding process, respectively.
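The two terms of the objective can be computed as in the sketch below. It assumes the reconstruction term is the expected log-likelihood of the document's words, estimated from a single sample of z, and uses the closed-form KL divergence between a diagonal Gaussian and the standard normal prior; the function names and toy inputs are hypothetical.

```python
import math


def reconstruction_log_likelihood(bow, word_probs):
    """Sum of count * log p(word): a one-sample estimate of E_q[log p(d|z)]."""
    return sum(c * math.log(p) for c, p in zip(bow, word_probs) if c > 0)


def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL[N(mu, sigma^2) || N(0, I)] for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(2.0 * ls) - 1.0 - 2.0 * ls
                     for m, ls in zip(mu, log_sigma))


def elbo(bow, word_probs, mu, log_sigma):
    """Variational lower bound: reconstruction term minus KL term."""
    return (reconstruction_log_likelihood(bow, word_probs)
            - kl_to_standard_normal(mu, log_sigma))


bow = [2, 0, 1]                   # toy word counts for one document
word_probs = [0.5, 0.25, 0.25]    # toy decoder output for that document
value = elbo(bow, word_probs, mu=[0.0, 0.0], log_sigma=[0.0, 0.0])
```

Training maximizes this quantity, or equivalently minimizes its negative, which is why the KL term acts as a penalty that keeps the encoder's Gaussian close to the prior.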

To enable back-propagation during model training, the reparameterization technique shown in equation (5) is used: the noise ε is sampled from the normal distribution N(0, I) to obtain z, from which θ is derived. To optimize L_NTM, the model adopts the Adam algorithm as the gradient descent algorithm.

Step 3. merging of the same topics

Identical topics are usually identified by calculating the distance between distributions. Because the topics obtained from modeling are distributions over the same dimensions, and because the distance between two topics is fixed and independent of the order of the topics, the similarity between topics can be measured by the symmetric Kullback-Leibler distance.

Let w_i be the i-th word in the vocabulary and let φ^(k)(w_i) be its probability under the topic-word distribution φ^(k) of the k-th topic; then the KL distance between topics k₁ and k₂ can be calculated by equation (9):

D_KL(k₁||k₂)＝Σ_i φ^(k₁)(w_i)·log[φ^(k₁)(w_i)/φ^(k₂)(w_i)] (9)

The symmetric KL distance can then be calculated from the KL distance:

D_S(k₁，k₂)＝½[D_KL(k₁||k₂)+D_KL(k₂||k₁)] (10)

As can be seen from equations (9) and (10), the smaller the KL distance between two topics, i.e., the closer it is to 0, the closer the two probability distributions are and the higher the similarity between the two topics; the larger the KL distance, the greater the difference between their probability distributions. A threshold ζ is therefore set: if the symmetric KL distance between two topics is smaller than the threshold, the two topics are judged to be the same topic and are merged; otherwise, the two topics are different and are not merged.
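Topic merging can be sketched as follows. The sketch assumes the symmetric distance is the average of the two directed KL divergences and that two topics are merged when that distance falls below ζ, in line with the statement that a smaller KL distance means higher similarity. The greedy grouping strategy, the smoothing constant `eps` and the function names are assumptions made for illustration.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of the two directed KL divergences (assumed symmetric form)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def merge_topics(topics, zeta):
    """Greedily group topics whose distance to a group's first member is below zeta."""
    groups = []
    for index, topic in enumerate(topics):
        for group in groups:
            if symmetric_kl(topics[group[0]], topic) < zeta:
                group.append(index)
                break
        else:
            groups.append([index])  # no close group found: start a new one
    return groups


topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.68, 0.22, 0.05, 0.05],   # nearly identical to topic 0
    [0.05, 0.05, 0.20, 0.70],   # clearly different
]
groups = merge_topics(topics, zeta=0.1)
```

Comparing each topic only with the first member of a group keeps the sketch simple; a stricter variant could require the distance to every member of the group to be below ζ.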

It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and all equivalent substitutions or substitutions made on the basis of the above-mentioned technical solutions belong to the scope of the present invention.

Claims

1. A hot event discovery method under a mixed new media is characterized by comprising the following steps:

step 5) modeling the topics obtained in the step 4), and then merging the topics;

step 6), after event detection of the news portal website is completed, associating microblog content of each event in the social network and the social relation of the event to a user;

and 7) calculating the heat value of the topic, and judging that the topic is a hot event when the heat value exceeds a certain threshold value.

2. The method for discovering social hotspot events in the mixed media environment according to claim 1, wherein the division of time slices in step 2) has an important influence on analyzing how an event evolves over a period of time and how its heat changes, and the window may be fixed at 30 days or set adaptively according to the time span of the crawled news content.

3. The method for discovering social hotspot events in a mixed media environment as claimed in claim 1, wherein in step 3) text vectorization replaces the bag-of-words representation fed to the topic model with contextual embeddings, that is, a neural encoding layer of document representations pre-trained by the BERT language model is introduced before the topic modeling process; first, a dictionary of the self-built topic corpus is constructed by calling the BERT_serving package and a BERT word vector model is trained; each document is then converted into a matrix of word vectors, and the resulting data are stored for the subsequent topic modeling task.

4. The method for discovering social hotspot events in the mixed media environment according to claim 1, wherein during topic modeling in step 4), the vectorized text data from step 3) are used as the contextual embedding input, and the neural topic model is a generative model based on the neural variational inference framework and inspired by the variational autoencoder, which uses a Gaussian distribution whose parameters can be obtained by linear computation.

5. The method for discovering social hotspot events in the mixed media environment according to claim 1, wherein in step 5), after topic modeling, the topics are merged: a threshold ζ is set on the distance between two topics, and if the distance between two topics is smaller than the threshold, the two topics are judged to be the same topic and are merged; otherwise, the two topics are different and are not merged.

6. The method for discovering social hotspot events in the mixed media environment according to claim 1, wherein in steps 6) and 7), the microblog platform provides rich topic classification and content tag information, integrates the time, named entity and keyword information obtained in the event detection process, searches microblog content related to the event key information from the microblog, then calculates the cosine distance between the event key information and the content, classification and tag of the search result to detect the similarity between the event and the microblog, establishes the event-news-microblog association relationship, and for the discrimination of the hotspot events, the social network attribute of the event is combined, and the heat value of the topic obtained in step 5) is calculated by using a formula (1):