CN117670571B - Incremental social media event detection method based on heterogeneous message graph relation embedding - Google Patents

Incremental social media event detection method based on heterogeneous message graph relation embedding

Info

Publication number
CN117670571B
CN117670571B
Authority
CN
China
Prior art keywords
message
messages
vector
prefix
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410125597.6A
Other languages
Chinese (zh)
Other versions
CN117670571A (en)
Inventor
线岩团
李溥
王红斌
余正涛
黄于欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202410125597.6A priority Critical patent/CN117670571B/en
Publication of CN117670571A publication Critical patent/CN117670571A/en
Application granted granted Critical
Publication of CN117670571B publication Critical patent/CN117670571B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to an incremental social media event detection method based on heterogeneous message graph relation embedding, and relates to the technical field of real-time event detection. To address the limitations of existing graph-based approaches, the invention maps the complex relations between events to a prefix relationship sequence and obtains prefix relationship embedding vectors through an independent embedding layer, while obtaining message-pair sentence embedding vectors through the embedding layer of a pre-trained language model; the prefix relationship embedding vector and the message-pair sentence embedding vector are used together as the input of the pre-trained language model encoding layer, so that they fully interact and information utilization is maximized; incremental detection of social media events is achieved by introducing pairwise loss, intra-cluster loss and inter-cluster loss to constrain and guide model training. Compared with all traditional baseline models, the evaluation indexes of the experimental group are significantly improved.

Description

Incremental social media event detection method based on heterogeneous message graph relation embedding
Technical Field
The invention discloses an incremental social media event detection method based on heterogeneous message graph relation embedding, and relates to the technical field of real-time event detection.
Background
Social media platforms play a key role in the current information age, providing abundant real-time information resources that span various data types, such as text, images and video, and cover events, trends and topics in many fields. Efficiently detecting events from social media platforms and making cross-domain decisions is therefore critical for businesses, governments and individuals.
Graph neural networks (GNNs) have brought significant progress to social media event detection and have successfully addressed many of the problems of conventional approaches. Traditional methods, such as those based on keywords, machine learning or topic modeling, are often limited by topic drift, poor generalization, low real-time detection efficiency and poor interpretability. Advanced graph neural network methods are now widely used for social media event detection: they represent social media messages as nodes, build heterogeneous social media message graphs by establishing relationships through additional elements, and then convert these into homogeneous social media message graphs. Although these approaches have made significant progress in improving efficiency, some drawbacks remain:
Limited semantic relationships: GNN methods still struggle to capture the complex semantic relationships between messages, because they typically rely on the graph structure and sometimes lack a comprehensive understanding of the meaning shared between messages.
Sparse data: social media message graphs are typically sparse, meaning that the relationships between many messages are not explicitly modeled, which can affect the accuracy of event detection.
Insufficient consideration of cluster structure: GNN methods may not adequately consider the cluster structure of social media messages and may erroneously connect messages from different clusters, thereby introducing noise.
Isolated nodes: in some cases there are large numbers of isolated nodes whose features are difficult to update effectively.
Information loss when converting heterogeneous message graphs into homogeneous message graphs: the conversion may lose edge information, which limits a comprehensive understanding of the relationships between messages.
Disclosure of Invention
In order to solve the above problems, the invention provides an incremental social media event detection method based on heterogeneous message graph relation embedding. The invention obtains prefix relationship embedding vectors by mapping the complex relations between events to a prefix relationship sequence and passing it through an independent embedding layer, while obtaining message-pair sentence embedding vectors through the embedding layer of a pre-trained language model; the two are then used together as the input of the encoding layer of the pre-trained language model, so that they fully interact and information utilization is maximized. By introducing the pairwise loss, intra-cluster loss and inter-cluster loss, the method can effectively constrain and guide model training, realize incremental detection of social media events, adapt to continuously changing data patterns and numbers of event categories, and improve event detection performance.
The technical scheme of the invention is as follows: the incremental social media event detection method based on heterogeneous message graph relation embedding comprises the following steps:
mapping the complex relations between events to a prefix relationship sequence, obtaining prefix relationship embedding vectors through an independent embedding layer, and obtaining message-pair sentence embedding vectors through the embedding layer of a pre-trained language model;
using the prefix relationship embedding vector and the message-pair sentence embedding vector together as the input of the encoding layer of the pre-trained language model, so that they fully interact and information utilization is maximized;
achieving incremental detection of social media events by introducing pairwise loss, intra-cluster loss and inter-cluster loss to constrain and guide model training.
Further, the method comprises the following steps:
S1, dividing a social media message stream into different message blocks; for each message block, dividing a training set, a test set and a validation set according to a certain proportion and constructing a heterogeneous message graph among the messages, with the message contents as message nodes and the topic labels, entities, users and posting times shared between messages as relation nodes; then sampling a fixed number of other message content nodes to construct message pairs; mapping the co-occurrence of relation nodes between each message and its sampled messages to a prefix sequence, and introducing a new label to represent whether the two messages of a message pair belong to the same category;
S2, mapping the prefix relationship sequence of each message pair to different discrete values and obtaining the sequence embedding vector through an embedding layer; at the same time, concatenating the two messages of the message pair and tokenizing and embedding the text with the pre-trained language model PLM to obtain the sentence embedding vector, mask and type identifiers; then concatenating the prefix relationship embedding vector and the message-pair sentence embedding vector together as the input of the encoding layer of the pre-trained language model PLM, thereby obtaining the encoding vector;
S3, dividing the encoding vector into a prefix encoding vector, a current-message encoding vector and a sampled-message encoding vector by means of the mask and the type identifiers; averaging each of the three to obtain the averaged prefix encoding vector, current-message encoding vector and sampled-message encoding vector; averaging the averaged prefix encoding vector and the averaged current-message encoding vector again as the updated current-message encoding vector, and concatenating the updated current-message encoding vector with the averaged sampled-message encoding vector; then mapping the concatenated vector to a scalar through a linear layer to obtain the similarity score of the message pair;
S4, applying the sigmoid activation function to the similarity score of each message pair and calculating the cross-entropy loss against the true label of the message pair to construct the pairwise loss; then creating a central feature matrix with as many rows as there are message categories in the message block and calculating the distances between message features and central features to construct the intra-cluster loss; for the central feature matrix, calculating the distances between the central features to construct the inter-cluster loss; and guiding and constraining the training of the model by jointly considering the pairwise loss, the intra-cluster loss and the inter-cluster loss;
S5, when predicting the messages of the next message block, predicting the test set data of the next message block; for the group of updated encoding vectors obtained for each message, processing the corresponding similarity scores with the sigmoid activation function and retaining the message feature representations whose similarity scores are higher than a threshold; for a message none of whose similarities reaches the threshold, selecting from its group of feature representations the one with the highest similarity score as its final feature representation; and using the obtained feature representations of the messages as the input of a clustering algorithm to realize the detection of social media events;
S6, given the social media message stream and the processed message blocks, continuously training and updating the model through S1-S4 while continuously predicting the social media messages through S5.
Further, the specific implementation of S1 includes:
Define the social stream S = {M_0, M_1, …, M_t, …} as a set of consecutive, time-ordered blocks of social media messages, where M_i is the social media message block composed of all messages falling in the time block T_i = [t_i, t_{i+1}); M_i is expressed as {m_j | j = 1, …, N_i}, where N_i is the number of messages in M_i and m_j is a single message; m_j is expressed as (d_j, u_j, τ_j), where d_j, u_j and τ_j represent the associated text document, user and timestamp, respectively; the social media messages of the same class are expressed as e_k = {m_1, …, m_|e_k|}, where |e_k| represents the total number of messages in e_k, and each social media message belongs to only one category;
Each message block is divided into a training set, a test set and a validation set in a 7:2:1 ratio, and a heterogeneous message graph is constructed among the messages of each block, with the message contents as message nodes and the topic labels, entities, users and posting times shared between messages as relation nodes; for each message in the training set, other messages in the same message block are sampled to construct message pairs, X/2 positive samples and X/2 negative samples being drawn from the training set to construct X message pairs, while for each message in the validation set and the test set Y messages are randomly sampled to construct message pairs; the co-occurrence of relation nodes between each message and its sampled messages is then mapped to a prefix sequence:

r_i = (r_h, r_e, r_u, r_t)

For each message pair, the cluster relationship is:

y_i = 1 if m_i and m_s belong to the same class e_k, and y_i = 0 otherwise;

the cluster relationship represents whether the two messages of the message pair belong to the same category;
where p_i = (m_i, m_s) is the message pair constructed for message m_i and m_s is a sampled message; r_i is the prefix relationship sequence between the messages, and r_h, r_e, r_u, r_t are respectively the topic-label, entity, user and posting-time relations between message m_i and its sampled message m_s; y_i represents the cluster relationship of the two messages in the message pair, and e_k represents the set of class-k messages.
Further, the specific implementation of S2 includes:
The batch size is defined as B, and the message block after the sampling and relation-mapping operations is defined as D = {(p_i, r_i, y_i) | i = 1, …, N_p}; for each datum in D, each relation in the prefix relationship sequence r_i is first mapped to a distinct discrete value according to the following rules: if a topic-label relationship exists between the messages, r_h is 1, otherwise 0; if an entity relationship exists between the messages, r_e is 3, otherwise 2; if a user relationship exists between the messages, r_u is 5, otherwise 4; if the posting times of the messages are within 4 hours of each other, r_t is 7, otherwise 6; in each batch, the updated prefix relationship sequences are passed through an independent embedding layer to obtain the prefix relationship embedding vectors between messages:

E_r = Embed(r_i)

where Embed represents the embedding operation; r_h, r_e, r_u, r_t are respectively the topic-label, entity, user and posting-time relations between message m_i and its sampled message m_s; p_i is the message pair constructed for message m_i; y_i represents the cluster relationship of the two messages in the message pair; |r_i| represents the total number of relations in r_i; N_p is the total number of message pairs in the message block; r_i[j] represents the j-th relation in r_i;
The message pair is then tokenized by the pre-trained language model PLM to obtain the sentence tokens, sentence attention mask and type identifiers (T, A_s, I), and the sentence embedding vector is obtained through the embedding operation:

(T, A_s, I) = Token(m_i, m_s)
E_s = Embed(T)

where Token represents the tokenization (word segmentation) operation; Embed represents the pre-trained-model embedding operation; T, A_s and I are respectively the obtained sentence tokens, sentence attention mask and type identifiers; E_s is the message-pair sentence embedding vector;
The obtained prefix relationship embedding vector E_r is concatenated with the message-pair sentence embedding vector E_s; at the same time, so that the prefix relationship embedding vector E_r participates in the attention score calculation, an all-ones mask A_r of length |r_i| is generated for it and concatenated with the sentence attention mask A_s; the encoding vector is then obtained through the encoding layer of the pre-trained model:

H = Encoder([E_r ; E_s], [A_r ; A_s])

where Encoder represents the encoding operation of the PLM; [ ; ] represents the concatenation operation; A_r represents the attention mask of the prefix relationship embedding vector E_r.
Further, the specific implementation of S3 includes:
In each batch, for the encoding vector H, the prefix encoding vector is extracted from the encoding vector H by means of the length n_r of the prefix relationship sequence:

H_r = H[: n_r]

The current-message encoding vector and the sampled-message encoding vector are extracted from the encoding vector H through the sentence attention mask A_s and the type identifier I:

H_c = H[n_r :](A_s = 1, I = 0), H_s = H[n_r :](A_s = 1, I = 1)

The prefix encoding vector H_r, the current-message encoding vector H_c and the sampled-message encoding vector H_s are then averaged:

h_r = mean(H_r), h_c = mean(H_c), h_s = mean(H_s)

where [: n] represents taking the first n rows and [n :] represents taking all rows after n; n_r represents the length of the prefix and H_r[n] represents the n-th row of the prefix encoding vector; n_s represents the length of the sentence; H_c[n] and H_s[n] represent the n-th rows of the sentence encoding vectors H_c and H_s, respectively; A_r represents the attention mask of the relationship embedding vector E_r;
h_r and h_c are averaged as the updated encoding vector h_m of the current message m_i: h_m = (h_r + h_c) / 2; h_m and h_s are concatenated and then mapped to a scalar through a linear layer to obtain the similarity score s_i of the message pair:

s_i = Linear([h_m ; h_s], 1)

where Linear(x, 1) represents mapping the vector x to a scalar.
Further, the specific implementation of S4 includes:
For the construction of the pairwise loss, in each batch the cross-entropy loss is calculated from the similarity score of each message pair of the current batch and the cluster relationship of the two messages in the pair, minimizing the difference between the similarity score and the cluster relationship of the messages:

L_pair = -(1/N) Σ_{n=1}^{N} [ y_n · log σ(s_n) + (1 − y_n) · log(1 − σ(s_n)) ]

where D = {(p_i, r_i, y_i)} is the result of the message block after the sampling and relation-mapping operations, r_i is a prefix relationship sequence, p_i is the message pair constructed for message m_i, y_i represents the cluster relationship of the two messages in a message pair, N is the batch size, s_n represents the similarity score of the n-th message pair in the batch, y_n represents the cluster relationship of the n-th message pair in the batch, and σ is the sigmoid function used to convert s_n into a similarity between 0 and 1;
The intra-cluster loss is constructed through the central feature representation c_k:

c_k = α · c′_k + (1 − α) · mean_{m_j ∈ e_k}(h_j)
L_intra = mean_k( mean_{m_j ∈ e_k}( D(h_j, c_k) ) )

where c_k is the updated central feature representation of e_k and e_k represents the set of class-k messages; the central feature representation is initialized as an all-zero vector, and when the first message arrives that message is taken as the central feature vector; c′_k is the central feature vector before updating; α is a parameter in the range (0, 1); M_b represents the message set contained in the message block; e_k represents the set of class-k messages in M_b; h_j is the feature representation of the j-th message in e_k; mean represents the averaging operation; D(h_j, c_k) computes the Euclidean distance between h_j and c_k;
The inter-cluster loss is constructed through the central feature matrix C assembled from the central feature representations c_k:

L_inter = mean( max(β − D(C, shuffled(C)), 0) )

where C represents the central feature matrix of M_b; shuffled(C) represents shuffling C, i.e. randomly permuting its row order; mean is the averaging operation; max(x, y) represents taking the maximum of x and y, with x = β − D(C, shuffled(C)) and y = 0; D(C, shuffled(C)) computes the Euclidean distances between corresponding rows of the two matrices; β is a distance threshold;
The final loss is constructed from the pairwise loss, intra-cluster loss and inter-cluster loss as:

L = L_pair + γ · L_intra + λ · L_inter

where the γ and λ parameters take values in the range (0, 1).
Further, the specific implementation of S5 includes:
The similarity scores relating to message m_i are processed to obtain the set of similarities σ(s_1), …, σ(s_Y) corresponding to its message pairs and the set of feature representations H_m = {h_1, …, h_Y}; the feature representations whose similarity is greater than a specific threshold t are selected from these and averaged to obtain the final feature representation h_i of m_i:

h_i = mean( Select(H_m, σ(s) ≥ t) )

But in the case where all similarities in σ(s_1), …, σ(s_Y) are less than t, the feature representation at the position of maximum similarity in H_m is selected as the final feature representation of message m_i:

h_i = Max(H_m, σ(s_1), …, σ(s_Y))

where σ(s_s) represents the similarity of the message and the s-th sampled message; h_s is cut from the encoding vector of m_i and its s-th sampled message; Y is the number of sampled messages in the prediction stage; σ is the sigmoid function; Select(H_m, σ(s) ≥ t) represents selecting from H_m the vectors whose similarity is greater than or equal to t; H′_m represents the selected set of feature representations, |H′_m| represents the number of features in H′_m, and H′_m[k] represents the k-th feature representation in H′_m; Max(H_m, σ(s_1), …, σ(s_Y)) represents selecting from H_m the feature vector with the largest similarity. The final feature representation of each message thus obtained is used as the input of the clustering algorithm, thereby realizing event detection.
Further, the specific implementation of S6 includes:
Given a message block M_0, the model is learned through steps S1-S4:

θ_0 = Train(M_0; θ)

where M_0 is the message set contained in the message block and θ represents the model parameters;
Given the social stream S, a set of models {θ_0, θ_1, …, θ_t} is learned through S1-S4; the model is progressively updated and learned over time t while predicting the next message block:

θ_t = Train(M_t; θ_{t−1}), and M_{t+1} is predicted with θ_t

where M_t is the message set contained in the message block M_t; θ_t and θ_{t−1} represent the current message-block model parameters and the previous message-block model parameters, respectively; for M_0, its model training does not inherit other message-block model parameters, and this model is referred to as the initial model.
The invention has the beneficial effects that:
1. The invention adopts a heterogeneous message graph relation embedding technique that maps the complex relations between social media messages into discrete values, generating a unique relation-sequence representation for each pair of messages, which is used as a prefix sequence. Through sufficient interaction between the prefix sequence and the message pair, combined with the pre-trained model, the maximum utilization of information is realized. This technique effectively improves the efficiency of social media event detection and realizes incremental event detection.
2. The invention is tested on the large publicly available social event dataset Event2012; compared with all traditional baseline models, the evaluation indexes of the experimental group are significantly improved.
Drawings
FIG. 1 is an overall frame diagram of an incremental social media event detection method based on heterogeneous message graph relationship embedding provided by an embodiment of the invention;
FIG. 2 is a life cycle chart of the incremental social media event detection method based on heterogeneous message graph relation embedding provided by an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described below with reference to the accompanying drawings. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 shows the overall framework of the incremental social media event detection method based on heterogeneous message graph relation embedding provided by the embodiment of the invention. The method comprises the following steps:
S1, dividing a social media message stream into different message blocks; for each message block, dividing a training set, a test set and a validation set according to a certain proportion and constructing a heterogeneous message graph among the messages, with the message contents as message nodes and the topic labels, entities, users (senders and mentioned users) and posting times shared between messages as relation nodes; then sampling a fixed number of other message content nodes to construct message pairs; mapping the co-occurrence of relation nodes between each message and its sampled messages to a prefix sequence, and introducing a new label to represent whether the two messages of a message pair belong to the same category;
Further, the specific implementation of S1 includes:
Define the social stream S = {M_0, M_1, …, M_t, …} as a set of consecutive, time-ordered blocks of social media messages, where M_i is the social media message block composed of all messages falling in the time block T_i = [t_i, t_{i+1}); M_i is expressed as {m_j | j = 1, …, N_i}, where N_i is the number of messages in M_i and m_j is a single message (as shown in FIG. 1, m2, m3, m4 are actual social media message examples); m_j is expressed as (d_j, u_j, τ_j), where d_j, u_j and τ_j represent the associated text document, the user (sender and mentioned users) and the timestamp, respectively; the social media messages of the same class are expressed as e_k = {m_1, …, m_|e_k|}, where |e_k| represents the total number of messages in e_k, and each social media message belongs to only one category;
Each message block is divided into a training set, a test set and a validation set in a 7:2:1 ratio, and a heterogeneous message graph is constructed among the messages of each block, with the message contents as message nodes and the topic labels, entities, users and posting times shared between messages as relation nodes; for each message in the training set, other messages in the same message block are sampled to construct message pairs, X/2 positive samples and X/2 negative samples being drawn from the training set to construct X message pairs, while for each message in the validation set and the test set Y messages are randomly sampled to construct message pairs; the co-occurrence of relation nodes between each message and its sampled messages is then mapped to a prefix sequence:

r_i = (r_h, r_e, r_u, r_t)

For each message pair, the cluster relationship is:

y_i = 1 if m_i and m_s belong to the same class e_k, and y_i = 0 otherwise;

the cluster relationship represents whether the two messages of the message pair belong to the same category;
where p_i = (m_i, m_s) is the message pair constructed for message m_i and m_s is a sampled message; r_i is the prefix relationship sequence between the messages, and r_h, r_e, r_u, r_t are respectively the topic-label, entity, user and posting-time relations between message m_i and its sampled message m_s; y_i represents the cluster relationship of the two messages in the message pair, and e_k represents the set of class-k messages.
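By way of illustration only, the following Python sketch builds message pairs and prefix relationship sequences in the manner described above, using the discrete values that S2 assigns to each relation; the Message class, its field names and the helper functions are assumptions for exposition, not part of the patented method.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Message:
    text: str
    hashtags: set = field(default_factory=set)   # topic labels
    entities: set = field(default_factory=set)
    users: set = field(default_factory=set)      # sender and mentioned users
    timestamp: float = 0.0                       # posting time, in hours
    event_id: int = -1                           # ground-truth event class

def build_prefix_sequence(m_i: Message, m_s: Message) -> list:
    """Map relation-node co-occurrence to the discrete values of S2:
    topic label 1/0, entity 3/2, user 5/4, posting time within 4 h 7/6."""
    r_h = 1 if m_i.hashtags & m_s.hashtags else 0
    r_e = 3 if m_i.entities & m_s.entities else 2
    r_u = 5 if m_i.users & m_s.users else 4
    r_t = 7 if abs(m_i.timestamp - m_s.timestamp) <= 4.0 else 6
    return [r_h, r_e, r_u, r_t]

def sample_pairs(block: list, m_i: Message, x: int) -> list:
    """For one training message, sample X/2 positive and X/2 negative partners."""
    pos = [m for m in block if m is not m_i and m.event_id == m_i.event_id]
    neg = [m for m in block if m.event_id != m_i.event_id]
    pairs = []
    for m_s in (random.sample(pos, min(x // 2, len(pos))) +
                random.sample(neg, min(x // 2, len(neg)))):
        y = int(m_i.event_id == m_s.event_id)    # cluster relationship label
        pairs.append((m_i, m_s, build_prefix_sequence(m_i, m_s), y))
    return pairs
```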
S2, mapping the prefix relationship sequence of each message pair to different discrete values and obtaining the sequence embedding vector through an embedding layer; at the same time, concatenating the two messages of the message pair and tokenizing and embedding the text with the pre-trained language model PLM to obtain the sentence embedding vector, mask and type identifiers; then concatenating the prefix relationship embedding vector and the message-pair sentence embedding vector together as the input of the encoding layer of the pre-trained language model PLM, thereby obtaining the encoding vector;
Further, the specific implementation of S2 includes:
The batch size is defined as B, and the message block after the sampling and relation-mapping operations is defined as D = {(p_i, r_i, y_i) | i = 1, …, N_p}; for each datum in D, each relation in the prefix relationship sequence r_i is first mapped to a distinct discrete value according to the following rules: if a topic-label relationship exists between the messages, r_h is 1, otherwise 0; if an entity relationship exists between the messages, r_e is 3, otherwise 2; if a user relationship exists between the messages, r_u is 5, otherwise 4; if the posting times of the messages are within 4 hours of each other, r_t is 7, otherwise 6; in each batch, the updated prefix relationship sequences are passed through an independent embedding layer to obtain the prefix relationship embedding vectors between messages:

E_r = Embed(r_i)

where Embed represents the embedding operation; r_h, r_e, r_u, r_t are respectively the topic-label, entity, user and posting-time relations between message m_i and its sampled message m_s; p_i is the message pair constructed for message m_i; y_i represents the cluster relationship of the two messages in the message pair; |r_i| represents the total number of relations in r_i; N_p is the total number of message pairs in the message block; r_i[j] represents the j-th relation in r_i;
The message pair is then tokenized by the pre-trained language model PLM to obtain the sentence tokens, sentence attention mask and type identifiers (T, A_s, I), and the sentence embedding vector is obtained through the embedding operation:

(T, A_s, I) = Token(m_i, m_s)
E_s = Embed(T)

where Token represents the tokenization (word segmentation) operation; Embed represents the pre-trained-model embedding operation; T, A_s and I are respectively the obtained sentence tokens, sentence attention mask and type identifiers; E_s is the message-pair sentence embedding vector;
The obtained prefix relationship embedding vector E_r is concatenated with the message-pair sentence embedding vector E_s; at the same time, so that the prefix relationship embedding vector E_r participates in the attention score calculation, an all-ones mask A_r of length |r_i| is generated for it and concatenated with the sentence attention mask A_s; the encoding vector is then obtained through the encoding layer of the pre-trained model:

H = Encoder([E_r ; E_s], [A_r ; A_s])

where Encoder represents the encoding operation of the PLM; [ ; ] represents the concatenation operation; A_r represents the attention mask of the prefix relationship embedding vector E_r.
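One possible realization of S2 with the Hugging Face transformers library is sketched below, feeding the concatenated prefix and sentence embeddings to the encoder through inputs_embeds; the model name bert-base-uncased, the 8-entry relation vocabulary and all variable names are assumptions, not prescribed by the patent.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
plm = AutoModel.from_pretrained("bert-base-uncased")
rel_embed = nn.Embedding(8, plm.config.hidden_size)  # independent embedding layer, relation ids 0..7

def encode_pair(text_i: str, text_s: str, prefix_ids: list):
    """Return hidden states for [prefix ; sentence pair] plus the sentence mask and type ids."""
    enc = tokenizer(text_i, text_s, return_tensors="pt",
                    truncation=True, return_token_type_ids=True)
    sent_emb = plm.embeddings.word_embeddings(enc["input_ids"])   # (1, n_s, d)
    pre_emb = rel_embed(torch.tensor([prefix_ids]))               # (1, n_r, d)
    pre_mask = torch.ones(1, pre_emb.size(1), dtype=enc["attention_mask"].dtype)
    pre_type = torch.zeros(1, pre_emb.size(1), dtype=torch.long)
    out = plm(inputs_embeds=torch.cat([pre_emb, sent_emb], dim=1),
              attention_mask=torch.cat([pre_mask, enc["attention_mask"]], dim=1),
              token_type_ids=torch.cat([pre_type, enc["token_type_ids"]], dim=1))
    return out.last_hidden_state, enc["attention_mask"], enc["token_type_ids"]
```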
S3, dividing the encoding vector into a prefix encoding vector, a current-message encoding vector and a sampled-message encoding vector by means of the mask and the type identifiers; averaging each of the three to obtain the averaged prefix encoding vector, current-message encoding vector and sampled-message encoding vector; averaging the averaged prefix encoding vector and the averaged current-message encoding vector again as the updated current-message encoding vector, and concatenating the updated current-message encoding vector with the averaged sampled-message encoding vector; then mapping the concatenated vector to a scalar through a linear layer to obtain the similarity score of the message pair;
Further, the specific implementation of S3 includes:
In each batch, for the encoding vector H, the prefix encoding vector is extracted from the encoding vector H by means of the length n_r of the prefix relationship sequence:

H_r = H[: n_r]

The current-message encoding vector and the sampled-message encoding vector are extracted from the encoding vector H through the sentence attention mask A_s and the type identifier I:

H_c = H[n_r :](A_s = 1, I = 0), H_s = H[n_r :](A_s = 1, I = 1)

The prefix encoding vector H_r, the current-message encoding vector H_c and the sampled-message encoding vector H_s are then averaged:

h_r = mean(H_r), h_c = mean(H_c), h_s = mean(H_s)

where [: n] represents taking the first n rows and [n :] represents taking all rows after n; n_r represents the length of the prefix and H_r[n] represents the n-th row of the prefix encoding vector; n_s represents the length of the sentence; H_c[n] and H_s[n] represent the n-th rows of the sentence encoding vectors H_c and H_s, respectively; A_r represents the attention mask of the relationship embedding vector E_r;
h_r and h_c are averaged as the updated encoding vector h_m of the current message m_i: h_m = (h_r + h_c) / 2; h_m and h_s are concatenated and then mapped to a scalar through a linear layer to obtain the similarity score s_i of the message pair:

s_i = Linear([h_m ; h_s], 1)

where Linear(x, 1) represents mapping the vector x to a scalar.
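A minimal sketch of S3, assuming the hidden states, sentence attention mask and type identifiers returned by the previous sketch; the hidden size of 768 and the name score_head are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden_size = 768                            # assumed PLM hidden size
score_head = nn.Linear(2 * hidden_size, 1)   # maps the concatenated vector to a scalar

def pair_score(hidden, sent_mask, type_ids, n_r):
    """hidden: (1, n_r + n_s, d); sent_mask, type_ids: (1, n_s); n_r: prefix length."""
    h_pre = hidden[0, :n_r].mean(dim=0)                     # averaged prefix encoding
    sent = hidden[0, n_r:]                                  # sentence positions
    valid = sent_mask[0].bool()
    cur = valid & (type_ids[0] == 0)                        # current-message tokens
    smp = valid & (type_ids[0] == 1)                        # sampled-message tokens
    h_cur, h_smp = sent[cur].mean(dim=0), sent[smp].mean(dim=0)
    h_msg = (h_pre + h_cur) / 2                             # updated current-message vector
    score = score_head(torch.cat([h_msg, h_smp], dim=-1))   # similarity score of the pair
    return score.squeeze(-1), h_msg
```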
S4, applying the sigmoid activation function to the similarity score of each message pair and calculating the cross-entropy loss against the true label of the message pair to construct the pairwise loss; then creating a central feature matrix with as many rows as there are message categories in the message block and calculating the distances between message features and central features to construct the intra-cluster loss; for the central feature matrix, calculating the distances between the central features to construct the inter-cluster loss; and guiding and constraining the training of the model by jointly considering the pairwise loss, the intra-cluster loss and the inter-cluster loss;
further, the specific implementation of S4 includes:
For the construction of the pairwise loss, in each batch the cross-entropy loss is calculated from the similarity score of each message pair of the current batch and the cluster relationship of the two messages in the pair, minimizing the difference between the similarity score and the cluster relationship of the messages:

L_pair = -(1/N) Σ_{n=1}^{N} [ y_n · log σ(s_n) + (1 − y_n) · log(1 − σ(s_n)) ]

where D = {(p_i, r_i, y_i)} is the result of the message block after the sampling and relation-mapping operations, r_i is a prefix relationship sequence, p_i is the message pair constructed for message m_i, y_i represents the cluster relationship of the two messages in a message pair, N is the batch size, s_n represents the similarity score of the n-th message pair in the batch, y_n represents the cluster relationship of the n-th message pair in the batch, and σ is the sigmoid function used to convert s_n into a similarity between 0 and 1;
The goal of the intra-cluster loss is to bring messages in the same cluster closer to the cluster center, thereby increasing the compactness of the cluster. The intra-cluster loss is constructed through the central feature representation c_k:

c_k = α · c′_k + (1 − α) · mean_{m_j ∈ e_k}(h_j)
L_intra = mean_k( mean_{m_j ∈ e_k}( D(h_j, c_k) ) )

where c_k is the updated central feature representation of e_k and e_k represents the set of class-k messages; the central feature representation is initialized as an all-zero vector, and when the first message arrives that message is taken as the central feature vector; c′_k is the central feature vector before updating; α is a parameter in the range (0, 1); M_b represents the message set contained in the message block; e_k represents the set of class-k messages in M_b; h_j is the feature representation of the j-th message in e_k; mean represents the averaging operation; D(h_j, c_k) computes the Euclidean distance between h_j and c_k;
The goal of the inter-cluster loss is to make clusters more distinct by pulling apart the distance between clusters. The inter-cluster loss is constructed through the central feature matrix C assembled from the central feature representations c_k:

L_inter = mean( max(β − D(C, shuffled(C)), 0) )

where C represents the central feature matrix of M_b; shuffled(C) represents shuffling C, i.e. randomly permuting its row order; mean is the averaging operation; max(x, y) represents taking the maximum of x and y, with x = β − D(C, shuffled(C)) and y = 0; D(C, shuffled(C)) computes the Euclidean distances between corresponding rows of the two matrices; β is a distance threshold;
The final loss is constructed from the pairwise loss, intra-cluster loss and inter-cluster loss as:

L = L_pair + γ · L_intra + λ · L_inter

where the γ and λ parameters take values in the range (0, 1).
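The three losses of S4 might be assembled as follows; the exponential-moving-average form of the center update and the example values of α, β, γ and λ are assumptions consistent with the ranges stated in the text.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(scores, labels):
    # cross entropy between sigmoid(similarity score) and the cluster relationship
    return F.binary_cross_entropy_with_logits(scores, labels.float())

def update_center(center, feats, alpha=0.9):
    # update one class center; an all-zero center simply adopts the batch mean
    batch_mean = feats.mean(dim=0)
    if center.abs().sum() == 0:
        return batch_mean
    return alpha * center + (1 - alpha) * batch_mean

def intra_cluster_loss(feats_per_class, centers):
    # mean Euclidean distance of each message feature to its class center
    dists = [torch.norm(f - c, dim=-1).mean() for f, c in zip(feats_per_class, centers)]
    return torch.stack(dists).mean()

def inter_cluster_loss(centers, beta=10.0):
    # hinge loss: push randomly paired centers at least beta apart
    C = torch.stack(centers)
    C_shuffled = C[torch.randperm(C.size(0))]
    return torch.clamp(beta - torch.norm(C - C_shuffled, dim=-1), min=0).mean()

def total_loss(scores, labels, feats_per_class, centers, gamma=0.5, lam=0.5):
    return (pairwise_loss(scores, labels)
            + gamma * intra_cluster_loss(feats_per_class, centers)
            + lam * inter_cluster_loss(centers))
```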
S5, when predicting the messages of the next message block, predicting the test set data of the next message block; for the group of updated encoding vectors obtained for each message, processing the corresponding similarity scores with the sigmoid activation function and retaining the message feature representations whose similarity scores are higher than a threshold; for a message none of whose similarities reaches the threshold, selecting from its group of feature representations the one with the highest similarity score as its final feature representation; and using the obtained feature representations of the messages as the input of a clustering algorithm to realize the detection of social media events;
further, the specific implementation of S5 includes:
The similarity scores relating to message m_i are processed to obtain the set of similarities σ(s_1), …, σ(s_Y) corresponding to its message pairs and the set of feature representations H_m = {h_1, …, h_Y}; the feature representations whose similarity is greater than a specific threshold t are selected from these and averaged to obtain the final feature representation h_i of m_i:

h_i = mean( Select(H_m, σ(s) ≥ t) )

But in the case where all similarities in σ(s_1), …, σ(s_Y) are less than t, the feature representation at the position of maximum similarity in H_m is selected as the final feature representation of message m_i:

h_i = Max(H_m, σ(s_1), …, σ(s_Y))

where σ(s_s) represents the similarity of the message and the s-th sampled message; h_s is cut from the encoding vector of m_i and its s-th sampled message; Y is the number of sampled messages in the prediction stage; σ is the sigmoid function; Select(H_m, σ(s) ≥ t) represents selecting from H_m the vectors whose similarity is greater than or equal to t; H′_m represents the selected set of feature representations, |H′_m| represents the number of features in H′_m, and H′_m[k] represents the k-th feature representation in H′_m; Max(H_m, σ(s_1), …, σ(s_Y)) represents selecting from H_m the feature vector with the largest similarity. The final feature representation of each message thus obtained is used as the input of the clustering algorithm, thereby realizing event detection.
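A sketch of the S5 prediction rule followed by k-means clustering; the threshold value of 0.5 and the KMeans settings are assumptions.

```python
import torch
from sklearn.cluster import KMeans

def final_feature(sims: torch.Tensor, feats: torch.Tensor, t: float = 0.5):
    """sims: (Y,) sigmoid similarities; feats: (Y, d) candidate representations."""
    keep = sims >= t
    if keep.any():
        return feats[keep].mean(dim=0)   # average the confident representations
    return feats[sims.argmax()]          # fall back to the most similar one

def detect_events(features: torch.Tensor, num_events: int):
    # cluster the final message features into events
    return KMeans(n_clusters=num_events, n_init=10).fit_predict(features.detach().numpy())
```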
S6, given the social media message stream and the processed message blocks, continuously training and updating the model through S1-S4 while continuously predicting the social media messages through S5.
Further, as shown in FIG. 2, which is a life cycle chart of the incremental social media event detection method based on heterogeneous message graph relation embedding provided by an embodiment of the invention, the specific implementation of S6 includes:
Given a message block M_0, the model is learned through steps S1-S4:

θ_0 = Train(M_0; θ)

where M_0 is the message set contained in the message block and θ represents the model parameters;
Given the social stream S, a set of models {θ_0, θ_1, …, θ_t} is learned through S1-S4; the model is progressively updated and learned over time t while predicting the next message block:

θ_t = Train(M_t; θ_{t−1}), and M_{t+1} is predicted with θ_t

where M_t is the message set contained in the message block M_t; θ_t and θ_{t−1} represent the current message-block model parameters and the previous message-block model parameters, respectively; for M_0, its model training does not inherit other message-block model parameters, and this model is referred to as the initial model.
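The incremental life cycle of S6 reduces to a simple loop, sketched below; train_one_block and predict_block are hypothetical stand-ins for steps S1-S4 and S5.

```python
def incremental_detection(blocks, initial_model):
    """Train on block t (warm-started from block t-1), then predict block t+1."""
    model = initial_model            # the M0 model is trained from scratch
    predictions = []
    for t in range(len(blocks) - 1):
        model = train_one_block(blocks[t], model)                 # S1-S4, inherits theta_{t-1}
        predictions.append(predict_block(blocks[t + 1], model))   # S5 on the next block
    return predictions
```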
To illustrate the effectiveness of the invention, it is compared with existing methods on the large publicly available social event dataset Event2012, which contains 68,841 labeled tweets collected over 4 weeks and covering 503 event classes. The clustering algorithm is uniformly k-means (Kmeans), and the evaluation indexes are consistent with the comparison methods: normalized mutual information (Nmi), adjusted mutual information (Ami) and the adjusted Rand index (Ari) are adopted to evaluate the clustering results. Nmi measures the quality of clustering by evaluating the similarity between the clustering results generated by the model and the ground-truth categories. Ami is a corrected version of Nmi that accounts for the error introduced by random clustering and provides a more robust metric. Ari is another metric for evaluating the similarity of clustering results; it likewise accounts for the error introduced by random clustering and provides an adjusted evaluation. The experimental results on the dataset are shown in tables 1-3:
TABLE 1 Event2012 dataset Nmi score (where the optimal score is bolded)
Table 2 Event2012 dataset Ami score (where the optimal score is bolded)
Table 3 Event2012 dataset Ari score (where the optimal score is bolded)
As can be seen from tables 1-3, the evaluation indexes of the experimental group of the invention are significantly improved over all baseline models. This is attributed to mapping the complex relations between social media messages into discrete values through heterogeneous message graph relation embedding, and to the sufficient interaction between the prefix sequence and the message pair, combined with the pre-trained model, which realizes the maximum utilization of information and thereby improves social media event detection.
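For reference, the Nmi, Ami and Ari scores discussed above correspond to scikit-learn's standard implementations; a minimal evaluation sketch:

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_mutual_info_score,
                             adjusted_rand_score)

def evaluate_clustering(true_labels, pred_labels):
    # compare predicted cluster assignments against ground-truth event classes
    return {"Nmi": normalized_mutual_info_score(true_labels, pred_labels),
            "Ami": adjusted_mutual_info_score(true_labels, pred_labels),
            "Ari": adjusted_rand_score(true_labels, pred_labels)}
```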
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims (2)

1. The incremental social media event detection method based on heterogeneous message graph relation embedding is characterized by comprising the following steps of:
mapping the complex relations between events to a prefix relationship sequence, obtaining prefix relationship embedding vectors through an independent embedding layer, and obtaining message-pair sentence embedding vectors through the embedding layer of a pre-trained language model;
using the prefix relationship embedding vector and the message-pair sentence embedding vector together as the input of the encoding layer of the pre-trained language model, so that they fully interact and information utilization is maximized;
achieving incremental detection of social media events by introducing pairwise loss, intra-cluster loss and inter-cluster loss to constrain and guide model training;
The method comprises the following steps:
S1, dividing a social media message stream into different message blocks; for each message block, dividing a training set, a test set and a validation set according to a certain proportion and constructing a heterogeneous message graph among the messages, with the message contents as message nodes and the topic labels, entities, users and posting times shared between messages as relation nodes; then sampling a fixed number of other message content nodes to construct message pairs; mapping the co-occurrence of relation nodes between each message and its sampled messages to a prefix sequence, and introducing a new label to represent whether the two messages of a message pair belong to the same category;
S2, mapping the prefix relationship sequence of each message pair to different discrete values and obtaining the sequence embedding vector through an embedding layer; at the same time, concatenating the two messages of the message pair and tokenizing and embedding the text with the pre-trained language model PLM to obtain the sentence embedding vector, mask and type identifiers; then concatenating the prefix relationship embedding vector and the message-pair sentence embedding vector together as the input of the encoding layer of the pre-trained language model PLM, thereby obtaining the encoding vector;
S3, dividing the encoding vector into a prefix encoding vector, a current-message encoding vector and a sampled-message encoding vector by means of the mask and the type identifiers; averaging each of the three to obtain the averaged prefix encoding vector, current-message encoding vector and sampled-message encoding vector; averaging the averaged prefix encoding vector and the averaged current-message encoding vector again as the updated current-message encoding vector, and concatenating the updated current-message encoding vector with the averaged sampled-message encoding vector; then mapping the concatenated vector to a scalar through a linear layer to obtain the similarity score of the message pair;
S4, applying the sigmoid activation function to the similarity score of each message pair and calculating the cross-entropy loss against the true label of the message pair to construct the pairwise loss; then creating a central feature matrix with as many rows as there are message categories in the message block and calculating the distances between message features and central features to construct the intra-cluster loss; for the central feature matrix, calculating the distances between the central features to construct the inter-cluster loss; and guiding and constraining the training of the model by jointly considering the pairwise loss, the intra-cluster loss and the inter-cluster loss;
S5, when predicting the messages of the next message block, predicting the test set data of the next message block; for the group of updated encoding vectors obtained for each message, processing the corresponding similarity scores with the sigmoid activation function and retaining the message feature representations whose similarity scores are higher than a threshold; for a message none of whose similarities reaches the threshold, selecting from its group of feature representations the one with the highest similarity score as its final feature representation; and using the obtained feature representations of the messages as the input of a clustering algorithm to realize the detection of social media events;
S6, given the social media message stream and the processed message blocks, continuously training and updating the model through S1-S4 while continuously predicting the social media messages through S5;
the specific implementation of the S1 comprises the following steps:
Define the social stream S = {M_0, M_1, …, M_t, …} as a set of consecutive, time-ordered blocks of social media messages, where M_i is the social media message block composed of all messages falling in the time block T_i = [t_i, t_{i+1}); M_i is expressed as {m_j | j = 1, …, N_i}, where N_i is the number of messages in M_i and m_j is a single message; m_j is expressed as (d_j, u_j, τ_j), where d_j, u_j and τ_j represent the associated text document, user and timestamp, respectively; the social media messages of the same class are expressed as e_k = {m_1, …, m_|e_k|}, where |e_k| represents the total number of messages in e_k, and each social media message belongs to only one category;
Each message block is divided into a training set, a test set and a validation set in a 7:2:1 ratio, and a heterogeneous message graph is constructed among the messages of each block, with the message contents as message nodes and the topic labels, entities, users and posting times shared between messages as relation nodes; for each message in the training set, other messages in the same message block are sampled to construct message pairs, X/2 positive samples and X/2 negative samples being drawn from the training set to construct X message pairs, while for each message in the validation set and the test set Y messages are randomly sampled to construct message pairs; the co-occurrence of relation nodes between each message and its sampled messages is then mapped to a prefix sequence:

r_i = (r_h, r_e, r_u, r_t)

For each message pair, the cluster relationship is:

y_i = 1 if m_i and m_s belong to the same class e_k, and y_i = 0 otherwise;

the cluster relationship represents whether the two messages of the message pair belong to the same category;
where p_i = (m_i, m_s) is the message pair constructed for message m_i and m_s is a sampled message; r_i is the prefix relationship sequence between the messages, and r_h, r_e, r_u, r_t are respectively the topic-label, entity, user and posting-time relations between message m_i and its sampled message m_s; y_i represents the cluster relationship of the two messages in the message pair, and e_k represents the set of class-k messages;
the specific implementation of the S2 comprises the following steps:
The batch size is defined as B, and the message block after the sampling and relation-mapping operations is defined as D = {(p_i, r_i, y_i) | i = 1, …, N_p}; for each datum in D, each relation in the prefix relationship sequence r_i is first mapped to a distinct discrete value according to the following rules: if a topic-label relationship exists between the messages, r_h is 1, otherwise 0; if an entity relationship exists between the messages, r_e is 3, otherwise 2; if a user relationship exists between the messages, r_u is 5, otherwise 4; if the posting times of the messages are within 4 hours of each other, r_t is 7, otherwise 6; in each batch, the updated prefix relationship sequences are passed through an independent embedding layer to obtain the prefix relationship embedding vectors between messages:

E_r = Embed(r_i)

where Embed represents the embedding operation; r_h, r_e, r_u, r_t are respectively the topic-label, entity, user and posting-time relations between message m_i and its sampled message m_s; p_i is the message pair constructed for message m_i; y_i represents the cluster relationship of the two messages in the message pair; |r_i| represents the total number of relations in r_i; N_p is the total number of message pairs in the message block; r_i[j] represents the j-th relation in r_i;
The message pair is then tokenized by the pre-trained language model PLM to obtain the sentence tokens, sentence attention mask and type identifiers (T, A_s, I), and the sentence embedding vector is obtained through the embedding operation:

(T, A_s, I) = Token(m_i, m_s)
E_s = Embed(T)

where Token represents the tokenization (word segmentation) operation; Embed represents the pre-trained-model embedding operation; T, A_s and I are respectively the obtained sentence tokens, sentence attention mask and type identifiers; E_s is the message-pair sentence embedding vector;
The obtained prefix relationship embedding vector E_r is concatenated with the message-pair sentence embedding vector E_s; at the same time, so that the prefix relationship embedding vector E_r participates in the attention score calculation, an all-ones mask A_r of length |r_i| is generated for it and concatenated with the sentence attention mask A_s; the encoding vector is then obtained through the encoding layer of the pre-trained model:

H = Encoder([E_r ; E_s], [A_r ; A_s])

where Encoder represents the encoding operation of the PLM; [ ; ] represents the concatenation operation; A_r represents the attention mask of the prefix relationship embedding vector E_r;
The specific implementation of the S3 comprises the following steps:
In each batch, for the encoding vector H, the prefix encoding vector is extracted from the encoding vector H by means of the length n_r of the prefix relationship sequence:

H_r = H[: n_r]

The current-message encoding vector and the sampled-message encoding vector are extracted from the encoding vector H through the sentence attention mask A_s and the type identifier I:

H_c = H[n_r :](A_s = 1, I = 0), H_s = H[n_r :](A_s = 1, I = 1)

The prefix encoding vector H_r, the current-message encoding vector H_c and the sampled-message encoding vector H_s are then averaged:

h_r = mean(H_r), h_c = mean(H_c), h_s = mean(H_s)

where [: n] represents taking the first n rows and [n :] represents taking all rows after n; n_r represents the length of the prefix and H_r[n] represents the n-th row of the prefix encoding vector; n_s represents the length of the sentence; H_c[n] and H_s[n] represent the n-th rows of the sentence encoding vectors H_c and H_s, respectively; A_r represents the attention mask of the relationship embedding vector E_r;
h_r and h_c are averaged as the updated encoding vector h_m of the current message m_i: h_m = (h_r + h_c) / 2; h_m and h_s are concatenated and then mapped to a scalar through a linear layer to obtain the similarity score s_i of the message pair:

s_i = Linear([h_m ; h_s], 1)

where Linear(x, 1) represents mapping the vector x to a scalar;
the specific implementation of S4 includes:
For the construction of the pairwise loss, in each batch the cross-entropy loss is calculated from the similarity score of each message pair of the current batch and the cluster relationship of the two messages in the pair, minimizing the difference between the similarity score and the cluster relationship of the messages:

L_pair = -(1/N) Σ_{n=1}^{N} [ y_n · log σ(s_n) + (1 − y_n) · log(1 − σ(s_n)) ]

where N is the batch size, s_n represents the similarity score of the n-th message pair in the batch, y_n represents the cluster relationship of the n-th message pair in the batch, and σ is the sigmoid function used to convert s_n into a similarity between 0 and 1;
The intra-cluster loss is constructed through the central feature representation c_k:

c_k = α · c′_k + (1 − α) · mean_{m_j ∈ e_k}(h_j)
L_intra = mean_k( mean_{m_j ∈ e_k}( D(h_j, c_k) ) )

where c_k is the updated central feature representation of e_k and e_k represents the set of class-k messages; the central feature representation is initialized as an all-zero vector, and when the first message arrives that message is taken as the central feature vector; c′_k is the central feature vector before updating; α is a parameter in the range (0, 1); M_b represents the message set contained in the message block; e_k represents the set of class-k messages in M_b; h_j is the feature representation of the j-th message in e_k; mean represents the averaging operation; D(h_j, c_k) computes the Euclidean distance between h_j and c_k;
The inter-cluster loss is constructed through the central feature matrix C assembled from the central feature representations c_k:

L_inter = mean( max(β − D(C, shuffled(C)), 0) )

where C represents the central feature matrix of M_b; shuffled(C) represents shuffling C, i.e. randomly permuting its row order; mean is the averaging operation; max(x, y) represents taking the maximum of x and y, with x = β − D(C, shuffled(C)) and y = 0; D(C, shuffled(C)) computes the Euclidean distances between corresponding rows of the two matrices; β is a distance threshold;
The final loss is constructed from the pairwise loss, intra-cluster loss and inter-cluster loss as:

L = L_pair + γ · L_intra + λ · L_inter

where the γ and λ parameters take values in the range (0, 1);
The specific implementation of S5 includes:
The similarity scores relating to message m_i are processed to obtain the set of similarities σ(s_1), …, σ(s_Y) corresponding to its message pairs and the set of feature representations H_m = {h_1, …, h_Y}; the feature representations whose similarity is greater than a specific threshold t are selected from these and averaged to obtain the final feature representation h_i of m_i:

h_i = mean( Select(H_m, σ(s) ≥ t) )

But in the case where all similarities in σ(s_1), …, σ(s_Y) are less than t, the feature representation at the position of maximum similarity in H_m is selected as the final feature representation of message m_i:

h_i = Max(H_m, σ(s_1), …, σ(s_Y))

where σ(s_s) represents the similarity of the message and the s-th sampled message; h_s is cut from the encoding vector of m_i and its s-th sampled message; Y is the number of sampled messages in the prediction stage; σ is the sigmoid function; Select(H_m, σ(s) ≥ t) represents selecting from H_m the vectors whose similarity is greater than or equal to t; H′_m represents the selected set of feature representations, |H′_m| represents the number of features in H′_m, and H′_m[k] represents the k-th feature representation in H′_m; Max(H_m, σ(s_1), …, σ(s_Y)) represents selecting from H_m the feature vector with the largest similarity; and the final feature representation of each message thus obtained is used as the input of the clustering algorithm, thereby realizing event detection.
2. The incremental social media event detection method based on heterogeneous message graph relation embedding of claim 1, wherein the specific implementation of S6 includes:
Given a message block M_0, the model is learned through steps S1-S4:

θ_0 = Train(M_0; θ)

where M_0 is the message set contained in the message block and θ represents the model parameters;
Given the social stream S, a set of models {θ_0, θ_1, …, θ_t} is learned through S1-S4; the model is progressively updated and learned over time t while predicting the next message block:

θ_t = Train(M_t; θ_{t−1}), and M_{t+1} is predicted with θ_t

where M_t is the message set contained in the message block M_t; θ_t and θ_{t−1} represent the current message-block model parameters and the previous message-block model parameters, respectively; for M_0, its model training does not inherit other message-block model parameters, and this model is referred to as the initial model.
CN202410125597.6A 2024-01-30 2024-01-30 Incremental social media event detection method based on heterogeneous message graph relation embedding Active CN117670571B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410125597.6A CN117670571B (en) 2024-01-30 2024-01-30 Incremental social media event detection method based on heterogeneous message graph relation embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410125597.6A CN117670571B (en) 2024-01-30 2024-01-30 Incremental social media event detection method based on heterogeneous message graph relation embedding

Publications (2)

Publication Number Publication Date
CN117670571A (en) 2024-03-08
CN117670571B (en) 2024-04-19

Family

ID=90064359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410125597.6A Active CN117670571B (en) 2024-01-30 2024-01-30 Incremental social media event detection method based on heterogeneous message graph relation embedding

Country Status (1)

Country Link
CN (1) CN117670571B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974340B (en) * 2024-03-29 2024-06-18 昆明理工大学 Social media event detection method combining deep learning classification and graph clustering

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296422A (en) * 2016-07-29 2017-01-04 重庆邮电大学 A kind of social networks junk user detection method merging many algorithms
CN111598710A (en) * 2020-05-11 2020-08-28 北京邮电大学 Method and device for detecting social network events
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CN112395539A (en) * 2020-11-26 2021-02-23 格美安(北京)信息技术有限公司 Public opinion risk monitoring method and system based on natural language processing
CN112949281A (en) * 2021-01-28 2021-06-11 北京航空航天大学 Incremental social event detection method for graph neural network
CN113688203A (en) * 2021-08-12 2021-11-23 北京航空航天大学 Multi-language event detection method based on migratable heteromorphic graph
CN114528479A (en) * 2022-01-20 2022-05-24 华南理工大学 Event detection method based on multi-scale different composition embedding algorithm
CN114861004A (en) * 2022-04-27 2022-08-05 哈尔滨工业大学(深圳) Social event detection method, device and system
CN115510236A (en) * 2022-11-23 2022-12-23 中国人民解放军国防科技大学 Chapter-level event detection method based on information fusion and data enhancement
CN116319003A (en) * 2023-03-22 2023-06-23 南京理工大学 Network security event detection method based on knowledge graph and incremental learning
CN117172253A (en) * 2023-09-18 2023-12-05 昆明理工大学 Label information guiding-based social media multi-modal named entity recognition method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11579958B2 (en) * 2021-04-23 2023-02-14 Capital One Services, Llc Detecting system events based on user sentiment in social media messages
CN113254803B (en) * 2021-06-24 2021-10-22 暨南大学 Social recommendation method based on multi-feature heterogeneous graph neural network

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296422A (en) * 2016-07-29 2017-01-04 重庆邮电大学 A kind of social networks junk user detection method merging many algorithms
CN111598710A (en) * 2020-05-11 2020-08-28 北京邮电大学 Method and device for detecting social network events
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CN112395539A (en) * 2020-11-26 2021-02-23 格美安(北京)信息技术有限公司 Public opinion risk monitoring method and system based on natural language processing
CN112949281A (en) * 2021-01-28 2021-06-11 北京航空航天大学 Incremental social event detection method for graph neural network
CN113688203A (en) * 2021-08-12 2021-11-23 北京航空航天大学 Multi-language event detection method based on migratable heteromorphic graph
CN114528479A (en) * 2022-01-20 2022-05-24 华南理工大学 Event detection method based on multi-scale different composition embedding algorithm
CN114861004A (en) * 2022-04-27 2022-08-05 哈尔滨工业大学(深圳) Social event detection method, device and system
CN115510236A (en) * 2022-11-23 2022-12-23 中国人民解放军国防科技大学 Chapter-level event detection method based on information fusion and data enhancement
CN116319003A (en) * 2023-03-22 2023-06-23 南京理工大学 Network security event detection method based on knowledge graph and incremental learning
CN117172253A (en) * 2023-09-18 2023-12-05 昆明理工大学 Label information guiding-based social media multi-modal named entity recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-Semantics Learning for Social Event Detection via Heterogeneous GNNs; Yutao Huang et al.; 2022 International Joint Conference on Neural Networks (IJCNN); 2022-09-30; pp. 1-9 *

Also Published As

Publication number Publication date
CN117670571A (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN109299342B (en) Cross-modal retrieval method based on cycle generation type countermeasure network
CN113128229B (en) Chinese entity relation joint extraction method
CN117670571B (en) Incremental social media event detection method based on heterogeneous message graph relation embedding
CN111709518A (en) Method for enhancing network representation learning based on community perception and relationship attention
CN110751188B (en) User label prediction method, system and storage medium based on multi-label learning
CN111753024A (en) Public safety field-oriented multi-source heterogeneous data entity alignment method
CN112328859B (en) False news detection method based on knowledge-aware attention network
CN115409018B (en) Corporate public opinion monitoring system and method based on big data
CN110263164A (en) A kind of Sentiment orientation analysis method based on Model Fusion
CN113157886A (en) Automatic question and answer generating method, system, terminal and readable storage medium
Lai et al. Transconv: Relationship embedding in social networks
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN115130538A (en) Training method of text classification model, text processing method, equipment and medium
CN111259264B (en) Time sequence scoring prediction method based on generation countermeasure network
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system
CN112668633B (en) Adaptive graph migration learning method based on fine granularity field
CN112084319B (en) Relational network video question-answering system and method based on actions
CN115600602B (en) Method, system and terminal device for extracting key elements of long text
CN117354207A (en) Reverse analysis method and device for unknown industrial control protocol
CN115334179B (en) Unknown protocol reverse analysis method based on named entity recognition
CN113159976B (en) Identification method for important users of microblog network
CN113254688A (en) Trademark retrieval method based on deep hash
Wang et al. Inter-intra information preserving attributed network embedding
CN113707293A (en) Chinese medicine principal symptom selection method based on feature selection
Xu et al. A structure-characteristic-aware network embedding model via differential evolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant