CN115017299A - Unsupervised social media summarization method based on a denoising graph autoencoder - Google Patents

Unsupervised social media summarization method based on a denoising graph autoencoder

Info

Publication number
CN115017299A
Authority
CN
China
Prior art keywords
post
posts
network
relationship
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210393787.7A
Other languages
Chinese (zh)
Inventor
He Ruifang (贺瑞芳)
Liu Huanyu (刘焕宇)
Wang Haocheng (王浩成)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202210393787.7A
Publication of CN115017299A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unsupervised social media summarization method based on a denoising graph autoencoder. The method constructs a post-level social relationship network according to sociological theory and obtains content encodings of the posts with a pre-trained BERT model, used as the initial content representations of the posts. Two noise relationship types are defined, and corresponding noise functions are set to construct a pseudo social relationship network containing noise relationships. A sampled pseudo social relationship network instance and the initial content representations of the posts serve together as the input of a residual graph attention network encoder, which encodes each post according to its initial content representation and its social relationships to obtain a vector representation of the post. A decoder is constructed; the residual graph attention network encoder and the decoder together form a denoising graph autoencoder, which can learn to remove the noise relationships in the post-level social relationship network and finally obtain accurate post representations. A summary extractor based on sparse reconstruction selects the final summary.

Description

Unsupervised social media summarization method based on a denoising graph autoencoder
Technical Field
The invention relates to the technical fields of natural language processing and social media data mining, and in particular to an unsupervised social media summarization method based on a denoising graph autoencoder.
Background
With the development and popularization of Internet technology, social media platforms have gradually become a new medium for producing and spreading information and occupy an increasingly important position in social production, daily life, and other domains. However, the volume of content on social media has grown sharply, causing a serious information overload problem and posing a severe challenge to efficient information retrieval: ordinary users often find it difficult to locate useful and interesting information in the mass of noisy content, which greatly reduces retrieval efficiency.
Automatic text summarization technology can effectively alleviate the information overload problem on social media and help improve the efficiency with which users retrieve useful information. Mainstream summarization methods generally fall into two categories: extractive summarization and abstractive summarization. Extractive summarization selects the most representative text units (words, sentences, or segments) with high information content, low redundancy, and wide coverage from the input text to form the final summary. Abstractive summarization involves a text generation process: it understands the semantics of the original input text and produces a corresponding summary description with text generation techniques. In recent years, both extractive and abstractive automatic text summarization have advanced significantly thanks to new techniques such as the sequence-to-sequence framework (Seq2Seq), the Transformer model, contrastive learning, and large-scale pre-trained models.
However, existing methods usually rely on large-scale labeled, paired training data (i.e., text-summary pairs), and such labeled data must generally be annotated manually, so the construction cost is enormous and prohibitive for large-scale training scenarios. In the social media domain, constructing annotated data is even harder. On the one hand, to write a summary of the content under a specific topic, an annotator must read all posts related to that topic before composing the summary; the number of posts on social media is so large that manual reading incurs an unaffordable labor cost. On the other hand, because social media content is highly time-sensitive and topic-specific, annotations made under one topic cannot be transferred to other topics, so annotation has to be repeated for every topic, consuming substantial manpower and resources. Furthermore, when traditional text summarization methods are transferred to social media data, they generally struggle to obtain satisfactory results, because text on social media differs greatly from traditional long documents: it is shorter, more diverse in expression, and informal.
Existing social media summarization research mainly extracts features from each post independently based on its content and then selects the most important posts as the summary using graph-ranking or clustering algorithms. These methods have two shortcomings: (1) because posts on social media are usually short, a single post often contains incomplete or ambiguous information and cannot provide sufficient signal, so post features are sparse and inaccurate; (2) social media depends on users actively spreading and receiving information through social interaction, a process that effectively promotes information diffusion, so posts on social media are embedded in a social network structure and are not independent of one another; previous methods focus only on the textual content features of posts and ignore their social structure features, losing the social relationship information of the posts.
Some work has attempted to facilitate the analysis of social media content by exploiting simple social signals available on the platforms, such as an author's number of followers, a post's number of reposts, or its number of likes. Further work has verified, from the perspective of sociological theory, the influence of social relationships on content relevance in social networks, proposing that posts connected by social relationships tend to contain similar content and viewpoints within a short time window. Sociological theory thus indicates the association between social relationships and text content at the macro level; at the micro level, however, there are often noise relationships that do not conform to the theory, falling into two cases: (1) two posts have a social relationship but low content relevance; such a noise relationship is defined as a false relationship; (2) two posts have no directly connected social relationship but high content relevance; such a noise relationship is defined as a potential relationship. The existence of these two kinds of relationships poses further challenges to the effective use of social relationships.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide an unsupervised extractive social media summarization method that is more robust to noise relationships.
The purpose of the invention is realized by the following technical scheme:
An unsupervised social media summarization method based on a denoising graph autoencoder comprises the following steps:
S1. Construct a post-level social relationship network according to sociological theory; the noise-free relationships in this network define the real social relationship network. Obtain content encodings of the posts with a pre-trained BERT model as the initial content representations of the posts.
S2. Define two noise relationship types, false relationships and potential relationships, according to users' social behaviors and habits. By applying corresponding noise functions, add instances of false and potential relationships to the original post-level social relationship network, constructing a post-level network containing noise relationships, i.e., a pseudo social relationship network. Sample several generated pseudo social relationship networks, and feed a sampled pseudo social relationship network instance together with the initial content representations of the posts into a residual graph attention network encoder. The encoder contains a multi-head attention mechanism and encodes each post according to its initial content representation and its social relationships, obtaining a vector representation of the post.
S3. Construct a decoder; the decoder and the residual graph attention network encoder together form a denoising graph autoencoder. The decoder reconstructs the real social relationship network from the post vector representations to capture the social relationship information among posts, and simultaneously reconstructs the semantic relationships between posts and the words they contain to capture the textual content information of the posts. Because the reconstruction target is the real social relationship network without noise relationships, the encoder and decoder learn to exclude the noise relationships in the post-level social relationship network, finally obtaining accurate post representations.
S4. Using the post representations obtained in step S3, select the final summary with a summary extractor based on sparse reconstruction: iteratively select the post with the highest reconstruction coefficient and add it to the final summary set, repeating until the summary length limit is reached.
Further, step S1 is specifically as follows: the post-level social relationship network consists of a node set and an edge set, where each node represents a post and each edge represents a social relationship between the corresponding posts. Posts have two kinds of social relationships: expression consistency and expression contagion. An expression consistency relationship holds between posts published by the same user; when building the post-level social relationship network, an edge is created between post nodes with an expression consistency relationship. An expression contagion relationship holds between posts published by users with a direct interaction relationship, where a direct interaction relationship means following, reposting, or commenting between users; when building the network, an edge is created between post nodes with an expression contagion relationship.
(101) Formally, the post-level social relationship network is described as follows. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ denote the set of posts, where N is the number of posts and $s_i$ ($1 \le i \le N$) is the i-th post; let $\mathcal{U} = \{u_1, u_2, \ldots, u_M\}$ denote the set of users, containing M users in total, where $u_i$ ($1 \le i \le M$) is the i-th user. For a user $u_i$, let $\mathcal{N}(u_i)$ denote the set of neighbor users of $u_i$, i.e., the set of users having a direct social relationship with $u_i$, and let $\mathcal{S}(u_i)$ denote the set of all posts published by $u_i$. The post-level social relationship network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is built according to the following rules, where $\mathcal{V}$ is the node set, each node corresponding to one post, and $\mathcal{E}$ is the set of edges between nodes, each edge corresponding to a social relationship between posts. Expression consistency social relationship: if posts $s_i, s_j \in \mathcal{S}(u_k)$, where $u_k$ denotes the k-th user, an edge $e_{ij} \in \mathcal{E}$ is established between posts $s_i$ and $s_j$. Expression contagion social relationship: if posts $s_i \in \mathcal{S}(u_k)$ and $s_j \in \mathcal{S}(u_m)$ with $u_m \in \mathcal{N}(u_k)$ or $u_k \in \mathcal{N}(u_m)$, an edge $e_{ij} \in \mathcal{E}$ is established between $s_i$ and $s_j$. The post-level social relationship network built by these two rules contains only the post node set $\mathcal{V}$ and the relationships between nodes, i.e., the edge set $\mathcal{E} = \{e_{11}, e_{12}, \ldots, e_{NN}\}$. The adjacency matrix of the constructed post-level social relationship network is denoted $A \in \mathbb{R}^{N \times N}$, where $A_{ij} > 0$ indicates that post nodes $s_i$ and $s_j$ are connected by a social relationship, and otherwise $A_{ij} = 0$.
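For illustration only, the two construction rules can be sketched in a few lines of Python; the data layout and the function name below are hypothetical and not part of the patent:

```python
import numpy as np

def build_post_network(posts_by_user, neighbors):
    """Build the post-level adjacency matrix A from the two rules.

    posts_by_user: dict user_id -> list of post indices, i.e. S(u).
    neighbors:     dict user_id -> set of neighbor user ids, i.e. N(u).
    """
    n = sum(len(p) for p in posts_by_user.values())
    A = np.zeros((n, n))
    # Rule 1 (expression consistency): connect posts published by the same user.
    for posts in posts_by_user.values():
        for i in posts:
            for j in posts:
                if i != j:
                    A[i, j] = 1.0
    # Rule 2 (expression contagion): connect posts of directly interacting users.
    for u, posts in posts_by_user.items():
        for v in neighbors.get(u, ()):
            for i in posts:
                for j in posts_by_user.get(v, ()):
                    A[i, j] = A[j, i] = 1.0
    return A
```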
(102) The content encoding of each post is obtained with the pre-trained BERT model and used as the post's initial content representation, as follows: each post $s_i$ is fed into the pre-trained BERT model, and the representation of the sentence-initial symbol ([CLS]) at the last layer of the model is taken as the initial content representation of the post, as shown in equation (1):

$x_i = \mathrm{BERT}(s_i)$ (1)

where $x_i$ denotes the initial content representation of post $s_i$. The initial content representations of all N posts are finally collected as $X = [x_1, \ldots, x_N]$.
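For concreteness, this encoding step might look as follows with the Hugging Face transformers library; the library choice and model name are assumptions, since the patent only specifies a pre-trained BERT model:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # model name is an assumption
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_posts(posts):
    """Return the last-layer [CLS] vector of each post as its initial representation x_i."""
    enc = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    return bert(**enc).last_hidden_state[:, 0, :]   # X, shape (N, 768)

X = encode_posts(["first post ...", "second post ..."])
```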
Further, step S2 specifically includes:
(201) The two noise relationships, false and potential, are defined as follows.
In social networks, posts connected by social relationships tend to have similar content or opinions. In practice, however, most social networks on social media are pseudo social relationship networks containing noise relationships. Based on observation of real social media data, two types of noise relationship are defined:
(a) False relationship: if two posts have a social relationship but their content relevance is below a set threshold, the social relationship between them is defined as a false relationship.
(b) Potential relationship: if two posts have no social relationship but their content relevance is above a set threshold, a potential relationship is defined between them.
The noise function corresponding to false relationships is relation insertion, and the noise function corresponding to potential relationships is relation removal, specifically:
(c) Relation insertion: randomly add an edge between two unconnected post nodes in the post-level social relationship network, connecting the two nodes.
(d) Relation removal: randomly remove the edge between two connected post nodes in the post-level social relationship network.
A pseudo social relationship network is constructed as training data by adding instances of noise relationships to the real social relationship network.
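A sketch of the two noise functions applied to an adjacency matrix is given below. Treating the reported probabilities as independent per-pair corruption probabilities is an interpretation, not something the patent states; the default values mirror the Twitter settings reported later in the embodiments:

```python
import numpy as np

def make_pseudo_network(A, p_insert=0.4, p_remove=0.1, seed=0):
    """Corrupt the real adjacency matrix A into a pseudo social relationship network.

    Relation insertion adds edges between unconnected pairs (creating false
    relationships); relation removal drops existing edges (leaving pairs that
    behave like potential relationships).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    upper = np.triu(A, k=1) > 0                          # existing edges, upper triangle
    insert = (~upper) & (rng.random((n, n)) < p_insert)  # add edge between unconnected nodes
    remove = upper & (rng.random((n, n)) < p_remove)     # drop edge between connected nodes
    noisy = np.where(insert, 1.0, np.where(remove, 0.0, np.triu(A, k=1)))
    noisy = np.triu(noisy, k=1)
    return noisy + noisy.T                               # keep the network symmetric
```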
(202) Encode the posts with the residual graph attention network encoder.
After the pseudo social relationship network and the initial content representations of the posts are obtained, a residual graph attention network encoder is adopted to model the social relationships between posts: it encodes each post according to the initial content representations and the social relationships, integrating the social relationship information and textual content information of the posts. The residual graph attention network encoder can be viewed as an information propagation model: it learns the representation of each node in the post-level social relationship network by aggregating information from the neighboring nodes connected to it by edges. Compared with a traditional graph convolutional network encoder (GCN), the residual graph attention network encoder can assign different weights to different neighbors of the same node, raising the attention weight of important neighbors and lowering the weight of weakly related neighbors, and thus learns more accurate node representations.
Formally, the residual graph attention network encoder takes the initial content representations of the nodes $X \in \mathbb{R}^{N \times D}$ and the adjacency matrix $A \in \mathbb{R}^{N \times N}$ of the post-level social relationship network as input, where D is the dimension of the node feature representation and N is the number of nodes. The propagation rules of the encoder are given by equations (2) and (3):

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big)$ (2)

$\hat{A}_{ij} = \dfrac{\exp\big(e_{ij}^{(l)}\big)}{\sum_{m:\,(A+I)_{im} > 0} \exp\big(e_{im}^{(l)}\big)}$ (3)

where $H^{(l)}$ is the hidden representation of the encoder at layer l; A is the adjacency matrix of the post-level social relationship network, with $A_{ij}$ representing the relationship between posts $s_i$ and $s_j$; I is the identity matrix; $\hat{A}$ is the adjacency matrix after the attention weights have been applied, with $\hat{A}_{ij}$ the relationship weight between $s_i$ and $s_j$ after weighting; $\sigma(\cdot)$ denotes a nonlinear activation function; $e_{ij}^{(l)}$ is the attention score between posts $s_i$ and $s_j$ at layer l; and $W^{(l)}$ and $b^{(l)}$ are the learnable parameters of the encoder at layer l. To further integrate the initial content representations of the posts, $X = [x_1, \ldots, x_N]$ is used as the encoder input, i.e., $H^{(0)} = X$. The attention weights are computed with scaled dot-product attention [1], and the ordinary attention mechanism is extended to a multi-head attention mechanism by mapping the latent representations into K different subspaces, where K denotes the total number of attention heads and each subspace is called an attention head; the attention weight is computed separately in each subspace:

$e_{ij}^{head_k} = \dfrac{\big(W_q^{head_k} h_i\big)^{T}\big(W_k^{head_k} h_j\big)}{\sqrt{d_h}}$ (4)

$\alpha_{ij}^{head_k} = \dfrac{\exp\big(e_{ij}^{head_k}\big)}{\sum_{m:\,(A+I)_{im} > 0} \exp\big(e_{im}^{head_k}\big)}$ (5)

where $h_i$ and $h_j$ denote the vector representations of posts $s_i$ and $s_j$ produced by the encoder; $e_{ij}^{head_k}$ and $\alpha_{ij}^{head_k}$ denote, respectively, the attention score and the normalized attention weight between $s_i$ and $s_j$ in the k-th attention head; $(\cdot)^{T}$ denotes transposition; $d_h$ is the dimension of the implicit representation in the attention computation (the layer superscript (l) is omitted here, and the superscript $head_k$ indicates the k-th attention head); and $W_q^{head_k}$ and $W_k^{head_k}$ are the corresponding learnable parameters of the k-th attention head. Equations (4) and (5) yield K attention weights, one per head. A max-pooling operation automatically selects the strongest relationship across all subspaces as the real relationship between two post nodes, unifying the attention weights of the K heads into a final attention score:

$\alpha_{ij} = \max\big(\alpha_{ij}^{head_1}, \alpha_{ij}^{head_2}, \ldots, \alpha_{ij}^{head_K}\big)$ (6)

where $\alpha_{ij}$ is the final attention weight between posts $s_i$ and $s_j$. The connections between layers of an ordinary graph attention network are replaced with residual connections to form the residual graph attention network, which can pass input information directly to the output layer; the encoding rule of the encoder is therefore modified to the following form:

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big) + f\big(H^{(l)}\big)$ (7)

where $f(\cdot)$ is a mapping function implemented by a feed-forward neural network with a nonlinear activation:

$f\big(H^{(l)}\big) = \sigma\big(W_f H^{(l)} + b_f\big)$ (8)

where $W_f$ and $b_f$ are the corresponding learnable parameters of the mapping function and $\sigma(\cdot)$ denotes a nonlinear activation function. During encoding, the depth L of the residual graph attention network encoder determines the distance over which information propagates in the post-level social relationship network. The encoder encodes the posts according to the rules of equations (7) and (8), and the output of its last layer, $H = H^{(L)} = [h_1, h_2, \ldots, h_N]$, is the vector representation of the encoded posts, where $h_i$ denotes the vector representation of post $s_i$ produced by the residual graph attention network encoder.
further, step S3 is specifically as follows:
(301) Reconstruct the real social relationship network and the post content with a reconstruction-based decoder.
So that the learned post vector representations contain both textual content information and the social relationship information between posts, a decoder with two reconstruction objectives is designed. On one hand, the decoder reconstructs the real social relationship network without noise relationships to capture the social relationship information among posts; on the other hand, it reconstructs the textual content of the posts, capturing their textual content information and enriching the post vector representations.
For reconstruction of the real social relationship network, the decoder predicts whether a social relationship exists between two post nodes from their vector representations; specifically, it predicts the probability that a social relationship exists using the inner product of the two vector representations:

$\tilde{A}_{ij} = \sigma\big(h_i^{T} h_j\big)$ (9)

where $(\cdot)^{T}$ denotes transposition of a vector representation. For each pair of posts $s_i$ and $s_j$, the decoder predicts the probability of a social relationship between them, where $\tilde{A} \in \mathbb{R}^{N \times N}$ is the adjacency matrix of the post-level social relationship network output by the decoder, $\tilde{A}_{ij}$ denotes the predicted probability that a social relationship exists between $s_i$ and $s_j$, $h_i$ and $h_j$ denote the vector representations of posts $s_i$ and $s_j$ produced by the residual graph attention network encoder, and $\sigma(\cdot)$ denotes a nonlinear activation function.
For text content reconstruction, the relationship between posts and words is reconstructed, preserving the textual content information of each post by reconstructing the words it contains. Since each post typically contains several words, the text content reconstruction process is modeled as a multi-label classification task:

$\hat{s}_i = \sigma\big(W_d h_i + b_d\big)$ (10)

where $W_d \in \mathbb{R}^{V \times d}$ and $b_d \in \mathbb{R}^{V}$ are learnable parameters of the decoder and V denotes the vocabulary size; $\hat{S} \in \mathbb{R}^{N \times V}$ is the prediction result of the decoder, where $\hat{s}_{ij}$ denotes the probability that post $s_i$ contains word $w_j$.
Corresponding loss functions are designed for the two reconstruction objectives, and the overall training objective comprises two parts. The first part is the loss of reconstructing the real social relationship network, denoted $L_g$, computed as the binary cross-entropy between the prediction $\tilde{A}$ and the adjacency matrix A of the real post-level social relationship network:

$L_g = -\dfrac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\Big[A_{ij}\log \tilde{A}_{ij} + (1 - A_{ij})\log\big(1 - \tilde{A}_{ij}\big)\Big]$ (11)

The second part is the loss of reconstructing the post content, denoted $L_c$, computed as the binary cross-entropy between the decoder prediction $\hat{s}_i$ and the true result $s_i$:

$L_c = -\dfrac{1}{NV}\sum_{i=1}^{N}\sum_{j=1}^{V}\Big[s_{ij}\log \hat{s}_{ij} + (1 - s_{ij})\log\big(1 - \hat{s}_{ij}\big)\Big]$ (12)

where $s_{ij}$ is the real training label indicating whether post $s_i$ contains word $w_j$: $s_{ij} = 1$ if $s_i$ contains $w_j$, otherwise $s_{ij} = 0$. Finally, the two partial losses are combined with a balance parameter λ to obtain the final loss function L:

$L = \lambda L_g + (1 - \lambda) L_c$ (13)

The residual graph attention network encoder and the decoder are trained according to this loss function; after training, accurate post representations $H = [h_1, h_2, \ldots, h_N]$ are obtained that fuse social relationship information with textual content information and from which the noise relationships have been removed.
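A compact sketch of the decoder and the combined loss (equations (9) to (13) above) might look as follows; λ = 0.8 follows the balance parameter reported in the embodiments, and the rest is illustrative:

```python
import torch
import torch.nn as nn

class ReconstructionDecoder(nn.Module):
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.word_head = nn.Linear(dim, vocab_size)   # W_d and b_d in Eq. 10

    def forward(self, H):
        A_rec = torch.sigmoid(H @ H.T)                # Eq. 9: inner-product edge probabilities
        S_rec = torch.sigmoid(self.word_head(H))      # Eq. 10: per-post word probabilities
        return A_rec, S_rec

def total_loss(A_rec, S_rec, A_true, S_true, lam=0.8):
    """Eqs. 11-13: binary cross-entropy against the *real* (noise-free)
    network A_true and the binary post-word membership matrix S_true."""
    bce = nn.BCELoss()
    return lam * bce(A_rec, A_true) + (1.0 - lam) * bce(S_rec, S_true)
```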
Further, step S4 specifically includes:
(401) Extract the summary from the post vector representations with a summary extractor based on sparse reconstruction.
To extract representative, important posts as the final summary, a sparse-reconstruction-based summary extractor is adopted. Formally, given the accurate post representations $H = [h_1, h_2, \ldots, h_N]$ encoded by the residual graph attention network encoder with the noise relationships removed, the summary extraction process is modeled as a sparse reconstruction process:

$\min_{V}\ \big\| H - (S \odot V) H \big\|^2 + \beta \|V\|_{2,1} + \gamma \|V\|^2, \quad \text{s.t. } V_{ii} = 0$ (14)

where $\|\cdot\|$ denotes the Frobenius norm and $V \in \mathbb{R}^{N \times N}$ is the reconstruction coefficient matrix, each element $V_{ij}$ of which denotes the contribution of post $s_j$ to reconstructing post $s_i$. To prevent extracting repeated, redundant content, a similarity matrix $S \in \mathbb{R}^{N \times N}$ is introduced to remove redundant information: if the cosine similarity between posts $s_i$ and $s_j$ is higher than a specific threshold η, then $S_{ij} = 0$, otherwise $S_{ij} = 1$; $\odot$ denotes the Hadamard product. To prevent a post from being reconstructed by itself, the diagonal elements of the reconstruction coefficient matrix V are constrained to 0 during reconstruction. β and γ are hyperparameters controlling the weights of the corresponding regularization terms; H is the accurate post representation; and $\|\cdot\|_{2,1}$ denotes the L2,1 norm, defined as follows:

$\|V\|_{2,1} = \sum_{i=1}^{N}\sqrt{\sum_{j=1}^{N} V_{ij}^2}$ (15)

Adding the L2,1 constraint to the reconstruction coefficient matrix V gives each of its rows a sparse character, i.e., most elements of each row are 0, which means each post can be reconstructed by only a limited number of posts, limiting the length of the summary. The final score of each post is defined as the sum of the post's contributions to reconstructing all other posts:

$\mathrm{score}(s_i) = \sum_{j=1}^{N} V_{ji}$ (16)

where $\mathrm{score}(s_i)$ denotes the final score of post $s_i$. Finally, all posts are ranked by their final scores, the highest-scoring post is iteratively selected and added to the final summary set, and the process repeats until the summary length limit is reached.
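Since every term in objective (14) is (sub)differentiable, one plausible way to solve it is plain gradient descent with automatic differentiation; the patent does not specify an optimizer, so the sketch below is an assumption. Computing the redundancy mask from the encoded representations H, and simplifying the iterative selection to a top-k cut, are also assumptions:

```python
import torch

def summarize(H, eta=0.1, beta=1.0, gamma=1.0, steps=500, k=5):
    """Score posts via the sparse reconstruction objective (Eqs. 14-16)."""
    N = H.shape[0]
    Hn = H / H.norm(dim=1, keepdim=True).clamp_min(1e-8)
    S = (Hn @ Hn.T <= eta).float()   # block overly similar (redundant) pairs
    S.fill_diagonal_(0.0)            # a post may not reconstruct itself
    V = torch.zeros(N, N, requires_grad=True)
    opt = torch.optim.Adam([V], lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        R = (S * V) @ H              # masked reconstruction of H
        l21 = V.pow(2).sum(dim=1).clamp_min(1e-8).sqrt().sum()   # Eq. 15
        loss = (H - R).pow(2).sum() + beta * l21 + gamma * V.pow(2).sum()
        loss.backward()
        opt.step()
    scores = (S * V).detach().sum(dim=0)   # Eq. 16: contribution of s_i to all other posts
    return scores.argsort(descending=True)[:k]
```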
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
1. The invention can extract summaries without any labeled data. By introducing social relationship information among posts, it captures the social relationship characteristics between posts to alleviate the feature sparsity caused by the short content of a single post.
2. The invention proposes a denoising graph autoencoder structure that can automatically identify and remove unreliable noise relationships in a social relationship network without labeled data, alleviating the errors caused by noise relationships and thereby improving the reliability and accuracy of the post representations. After learning accurate post representations, a sparse reconstruction framework identifies the importance and redundancy of each post, and posts with high importance and low redundancy are finally extracted to form the summary.
3. Compared with existing summarization models, the social media summaries obtained by the invention improve performance on the ROUGE evaluation metrics; moreover, according to the experimental results, the denoising model can effectively reduce the proportion of noise relationships in the social network, improving the network structure and the accuracy of the summary.
4. Compared with a traditional graph convolutional network encoder, the residual graph attention network encoder used in the invention can assign different weights to different neighbors of the same node, raising the attention weight of important neighbors, lowering the weight of irrelevant neighbors, and learning more accurate post representations. Since the post vectors obtained by the encoder contain both textual content information and social relationship information, the summary extraction process can identify the importance and novelty of posts from both content and social relationships, producing higher-quality summaries.
5. Since different attention heads capture relationship information between nodes from different subspaces, and the relationships of two nodes may differ greatly across heads, the invention employs a max-pooling operation to automatically select the strongest relationship across all subspaces as the real relationship between two nodes, unifying the attention weights of the K heads into a final attention score.
6. An ordinary graph attention network often suffers from over-smoothing; the invention further replaces the connections between layers of the ordinary graph attention network with residual connections to form the residual graph attention network, which can pass input information directly to the output layer.
7. The invention designs a decoder with two reconstruction objectives, so that the learned post vector representations contain both textual content information and the social relationship information among posts. Because the feature representation of each post contains both its content information and its social relationship information, the summarization process can identify the importance and novelty of posts from both textual content and social structure, producing summary content with high information volume, high diversity, and wide coverage.
Drawings
FIG. 1 is a schematic diagram of the overall architecture of the unsupervised social media summarization method based on a denoising graph autoencoder provided by the invention.
FIG. 2 shows the performance achieved by the invention on the social network under each topic in the two data sets.
FIG. 3 shows the distribution of false relationships, potential relationships, and their sum in the social network.
FIGS. 4a and 4b show the influence of different denoising methods in the noise function and of different noise ratios on the results.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it.
The invention provides an unsupervised social media summarization method based on a denoising graph autoencoder, and two mainstream social media data sets are adopted to evaluate its performance. The overall framework of the method is shown in FIG. 1. In FIG. 1, the post-level social relationship network and the post set are the inputs of the model. The noise function adds noise relationship instances to the post-level social relationship network to obtain a pseudo social relationship network with noise relationships; the post set is encoded by the BERT model to obtain the initial content representations of the posts; the initial content representations and the noisy pseudo social relationship network are then fed together into the residual graph attention network encoder, which finally outputs post representations fusing social relationships and textual content. The whole model is trained on the real social relationship network reconstruction and the text content reconstruction objectives, and after training, accurate post representations with the noise relationships removed are obtained. The learned accurate post representations are input to the sparse-reconstruction-based summary extractor, which extracts the final summary.
(1) Post-level social relationship network construction
In this embodiment, data from mainstream social media platforms at home and abroad are selected for experimental verification: Twitter [11] and Sina Weibo [12]. The posting time of the posts under each topic in the data falls within a range of 5 days. For the Twitter data, the main language of the text content is English; it contains 44,034 posts and 11,240 users, where each user has published at least one post and has at least one social relationship, and each topic has 4 standard reference summaries for evaluating the results. In the experiments, links, user names, and other special characters in the posts are removed, stemming and stop-word removal are performed, and posts shorter than 3 words are filtered out. For the microblog data, Sina Weibo is one of the most popular social media platforms in China, so data collected from Sina Weibo are used in this embodiment, comprising 130k posts and 126k users across 10 different topics; the posts are organized in a tree structure according to interaction relationships (such as replies and reposts), and each topic has 3 standard reference summaries for evaluating the results. The statistics of the two data sets after preprocessing are shown in Table 1. The experiments adopt the ROUGE evaluation standard, mainly reporting the four results ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU.
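The text preprocessing described above can be sketched as follows; the regular expressions are illustrative, and stemming (e.g. with nltk) is omitted for brevity:

```python
import re

def preprocess(post, stopwords=frozenset()):
    """Strip links, user names and special characters, drop stop words,
    and discard posts shorter than 3 words."""
    post = re.sub(r"https?://\S+", " ", post)   # remove links
    post = re.sub(r"@\w+", " ", post)           # remove user names
    post = re.sub(r"[^a-z0-9\s]", " ", post.lower())
    tokens = [t for t in post.split() if t not in stopwords]
    return tokens if len(tokens) >= 3 else None
```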
Formally, let $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ denote the set of posts, where N is the number of posts and $s_i$ ($1 \le i \le N$) is the i-th post; let $\mathcal{U} = \{u_1, u_2, \ldots, u_M\}$ denote the set of users, containing M users in total, where $u_i$ ($1 \le i \le M$) is the i-th user. For a user $u_i$, let $\mathcal{N}(u_i)$ denote the set of neighbor users of $u_i$, i.e., the set of users having a direct social relationship with $u_i$, and let $\mathcal{S}(u_i)$ denote the set of all posts published by $u_i$. The post-level social relationship network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is built according to the following rules, where $\mathcal{V}$ is the node set, each node representing a post, and $\mathcal{E}$ is the set of edges between nodes, each edge representing a social relationship between posts. Expression consistency social relationship: if posts $s_i, s_j \in \mathcal{S}(u_k)$, where $u_k$ denotes the k-th user, an edge $e_{ij} \in \mathcal{E}$ is established between posts $s_i$ and $s_j$. Expression contagion social relationship: if posts $s_i \in \mathcal{S}(u_k)$ and $s_j \in \mathcal{S}(u_m)$ with $u_m \in \mathcal{N}(u_k)$ or $u_k \in \mathcal{N}(u_m)$, an edge $e_{ij} \in \mathcal{E}$ is established between $s_i$ and $s_j$. The network built by these two rules contains only the node set $\mathcal{V}$ and the relationships between nodes, i.e., the edge set $\mathcal{E} = \{e_{11}, e_{12}, \ldots, e_{NN}\}$. The adjacency matrix of the constructed post-level social relationship network is denoted $A \in \mathbb{R}^{N \times N}$, where $A_{ij} > 0$ indicates that post nodes $s_i$ and $s_j$ have a social relationship connection and otherwise $A_{ij} = 0$.
Table 1. Social media data set details
(The contents of Table 1 appear only as an image in the original publication.)
(2) Noise distribution observation
To analyze the distribution of noise relationships in the constructed post-level social relationship network, this embodiment provides a simple method for estimating that distribution. Generally, two posts with a social relationship are considered to have a false relationship if their content relevance is lower than a set threshold θ; two posts without a social relationship are considered to have a potential relationship if their content relevance is higher than θ. The cosine similarity between the TFIDF representations of two posts is taken as their content relevance. Let $A \in \mathbb{R}^{N \times N}$ be the adjacency matrix of the constructed post-level social relationship network, where $A_{ij} > 0$ indicates that post nodes $s_i$ and $s_j$ are connected by a social relationship and otherwise $A_{ij} = 0$. For each node pair $(s_i, s_j)$ in the network: if they have a social relationship ($A_{ij} > 0$) and their content relevance $\Phi_{ij}$ is below the threshold θ, posts $s_i$ and $s_j$ are considered to have a false relationship; if they have no social relationship ($A_{ij} = 0$) and their content relevance $\Phi_{ij}$ is above θ, they are considered to have a potential relationship. In the experiments, the TFIDF representation of each post is computed, and the cosine similarity between the TFIDF representations of two posts is used as their content relevance: the higher the cosine similarity, the higher the content relevance, and vice versa. These statistics preliminarily reflect the distribution of noise relationships in the post-level social relationship network. Since the strength of social relationships can differ greatly across networks, the average content relevance over all social relationships is used as the threshold θ:

$\theta = \dfrac{\sum_{i,j:\,A_{ij}>0} \Phi_{ij}}{\big|\{(i,j): A_{ij} > 0\}\big|}$

where $\Phi_{ij}$ is the content relevance of posts $s_i$ and $s_j$. The final statistics on both the Twitter and microblog data sets are shown in Table 2. The results show the average percentage of noise relationships in the post-level social relationship network over all topics, where the average noise ratio in the table covers both false and potential relationships.
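This estimation procedure is straightforward to sketch with scikit-learn's TFIDF vectorizer (the library choice is an assumption):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def noise_ratios(posts, A):
    """Estimate the false and potential relationship ratios with the TFIDF heuristic."""
    phi = cosine_similarity(TfidfVectorizer().fit_transform(posts))
    iu = np.triu_indices(len(posts), k=1)      # each unordered pair once
    connected = A[iu] > 0
    theta = phi[iu][connected].mean()          # threshold: mean relevance over all edges
    false_ratio = (phi[iu][connected] < theta).mean()
    potential_ratio = (phi[iu][~connected] > theta).mean()
    return false_ratio, potential_ratio
```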
Table 2. Noise relationship distribution statistics in the social relationship networks of the Twitter and microblog data sets

Data set             False-relationship ratio   Potential-relationship ratio   Average noise ratio
Twitter data set     38.61%                     55.79%                         55.37%
Microblog data set   83.17%                     52.66%                         52.67%
(3) Denoising graph autoencoder
First, the pre-trained BERT model is used to extract features of the post text content:

$x_i = \mathrm{BERT}(s_i)$ (1)

where $s_i$ denotes the i-th post and $x_i$ is the initial content representation of the post. After all N posts are encoded, the matrix $X \in \mathbb{R}^{N \times D}$ is obtained, where D is the dimension of each post feature vector. For the input post-level social relationship network $\mathcal{G}$, the noise function adds noise relationship instances to $\mathcal{G}$ to construct the pseudo social relationship network $\tilde{\mathcal{G}}$. The real social relationship network $\mathcal{G}$ and the corresponding pseudo social relationship network $\tilde{\mathcal{G}}$ form the paired training data $(\tilde{\mathcal{G}}, \mathcal{G})$.
After the initial content representations of the posts are obtained and the pseudo social relationship network is constructed, the initial content representation X and the pseudo social relationship network $\tilde{\mathcal{G}}$ are fed together into the residual graph attention network encoder. Formally, the encoder performs information propagation according to the following rules:

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big)$ (2)

$\hat{A}_{ij} = \dfrac{\exp\big(e_{ij}^{(l)}\big)}{\sum_{m:\,(A+I)_{im} > 0} \exp\big(e_{im}^{(l)}\big)}$ (3)

where $H^{(l)}$ is the hidden representation of the encoder at layer l; A is the adjacency matrix of the post-level social relationship network, with $A_{ij}$ the relationship between posts $s_i$ and $s_j$; I is the identity matrix; $\hat{A}$ is the adjacency matrix after the attention weights have been applied, with $\hat{A}_{ij}$ the relationship weight between $s_i$ and $s_j$ after weighting; $\sigma(\cdot)$ denotes a nonlinear activation function; $e_{ij}^{(l)}$ is the attention score between $s_i$ and $s_j$ at layer l; and $W^{(l)}$ and $b^{(l)}$ are the learnable parameters of the encoder at layer l. To further integrate the initial content representations of the posts, $X = [x_1, \ldots, x_N]$ is used as the encoder input, i.e., $H^{(0)} = X$. The attention weights are computed with scaled dot-product attention [1], and the ordinary attention mechanism is extended to a multi-head attention mechanism by mapping the latent representations into K different subspaces, each called an attention head, computing the attention weight separately in each subspace:

$e_{ij}^{head_k} = \dfrac{\big(W_q^{head_k} h_i\big)^{T}\big(W_k^{head_k} h_j\big)}{\sqrt{d_h}}$ (4)

$\alpha_{ij}^{head_k} = \dfrac{\exp\big(e_{ij}^{head_k}\big)}{\sum_{m:\,(A+I)_{im} > 0} \exp\big(e_{im}^{head_k}\big)}$ (5)

where $h_i$ and $h_j$ denote the encoded vector representations of posts $s_i$ and $s_j$; $e_{ij}^{head_k}$ and $\alpha_{ij}^{head_k}$ denote, respectively, the attention score and the normalized attention weight between $s_i$ and $s_j$ in the k-th attention head; $(\cdot)^{T}$ denotes transposition; $d_h$ is the dimension of the implicit representation in the attention computation (the layer superscript (l) is omitted, and the superscript $head_k$ indicates the k-th head); and $W_q^{head_k}$ and $W_k^{head_k}$ are the corresponding learnable parameters of the k-th attention head. Equations (4) and (5) yield K attention weights, where K denotes the total number of attention heads and k indexes them. A max-pooling operation automatically selects the strongest relationship across all subspaces as the real relationship between two post nodes, unifying the attention weights of the K heads into a final attention score:

$\alpha_{ij} = \max\big(\alpha_{ij}^{head_1}, \ldots, \alpha_{ij}^{head_K}\big)$ (6)

where $\alpha_{ij}$ is the final attention weight between posts $s_i$ and $s_j$. The connections between layers of an ordinary graph attention network are replaced with residual connections to form the residual graph attention network, which can pass input information directly to the output layer, so the encoding rule of the encoder is modified to the following form:

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big) + f\big(H^{(l)}\big)$ (7)

where $f(\cdot)$ is a mapping function implemented by a feed-forward neural network with a nonlinear activation:

$f\big(H^{(l)}\big) = \sigma\big(W_f H^{(l)} + b_f\big)$ (8)

where $W_f$ and $b_f$ are the corresponding learnable parameters of the mapping function and $\sigma(\cdot)$ denotes a nonlinear activation function. During encoding, the depth L of the residual graph attention network encoder determines the distance over which information propagates in the post-level social relationship network. The encoder encodes the posts according to the rules of equations (7) and (8), and the output of its last layer, $H = H^{(L)} = [h_1, h_2, \ldots, h_N]$, is the vector representation of the encoded posts, where $h_i$ denotes the vector representation of post $s_i$.
After the post vector representations are obtained by the residual graph attention network encoder, the decoder decodes them and reconstructs the real social relationship network that contains no noise relationships, so that the model learns to identify and remove the noise relationships in the pseudo social relationship network. During training, the model is trained with the following loss functions:

$L_g = -\dfrac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\Big[A_{ij}\log \tilde{A}_{ij} + (1 - A_{ij})\log\big(1 - \tilde{A}_{ij}\big)\Big]$ (9)

$L_c = -\dfrac{1}{NV}\sum_{i=1}^{N}\sum_{j=1}^{V}\Big[s_{ij}\log \hat{s}_{ij} + (1 - s_{ij})\log\big(1 - \hat{s}_{ij}\big)\Big]$ (10)

where $\tilde{A}_{ij}$ denotes the predicted probability of a social relationship between posts $s_i$ and $s_j$, and $A_{ij}$ is the true social relationship between them; $\hat{S}$ is the decoder prediction, with $\hat{s}_{ij}$ the probability that post $s_i$ contains word $w_j$; and $s_{ij}$ is the real training label indicating whether $s_i$ contains $w_j$ ($s_{ij} = 1$ if it does, otherwise $s_{ij} = 0$). $L_g$ is the loss of reconstructing the original network structure and $L_c$ the loss of reconstructing the original post content; finally the two losses are combined with the balance parameter λ to obtain the final loss L:

$L = \lambda L_g + (1 - \lambda) L_c$ (11)

The whole model is trained with this loss function. After training, in the test stage, the real social relationship network and the initial content representations of the posts are taken as input and encoded by the residual graph attention network encoder, yielding accurate post vector representations $H = [h_1, h_2, \ldots, h_N]$ that fuse social relationship information and textual content information with the noise relationships removed.
(4) Sparse-reconstruction-based summary extraction
After the accurate post vector representations $H = [h_1, h_2, \ldots, h_N]$ are obtained, the importance of the posts is identified with a sparse reconstruction framework, where the reconstruction process is modeled as:

$\min_{V}\ \big\| H - (S \odot V) H \big\|^2 + \beta \|V\|_{2,1} + \gamma \|V\|^2, \quad \text{s.t. } V_{ii} = 0$ (12)

where the symbols are as described above. The final importance score of each post is computed as:

$\mathrm{score}(s_i) = \sum_{j=1}^{N} V_{ji}$ (13)

The posts are then ranked by importance score, and the highest-scoring posts are iteratively extracted and added to the candidate summary set until the summary length limit is reached.
In the specific implementation, the hyperparameters are set in advance. The post representation dimension D is set to 768. Because the probability distributions of the two kinds of noise often differ between social networks, the probabilities of relation insertion and relation removal in the noise function are set to 0.4 and 0.1 on the Twitter data; on the microblog data, both probabilities are set to 0.3. The balance parameter λ between the two losses in the final loss function is set to 0.8. In the summary selection stage, the hyperparameters are β = γ = 1, and the redundancy threshold η is set to 0.1.
To verify the effectiveness of the method of the invention (DSNSum), it is compared with two groups of methods. The first group uses only the text content on social media to extract the summary:
Centroid [2] uses centrality-based features to identify sentences highly similar to the cluster center as the summary.
LSA [3] decomposes the feature matrix with the SVD technique and identifies the importance of posts according to the magnitude of the singular values after decomposition.
LexRank [4] is a PageRank-like graph ranking algorithm: it first builds a similarity network according to the content similarity between posts, then applies the ranking algorithm on this network to identify the importance of each post node, extracting the most important posts as the summary.
DSDR [5] treats the summarization process as a reconstruction task and extracts the most representative posts as the summary by minimizing the reconstruction loss.
MDS-Sparse [6] extracts multi-document summaries with a sparse-coding-based technique, minimizing the loss of reconstructing the original documents under a sparsity constraint to guarantee the conciseness and importance of the summary.
PacSum [8] is a graph-based summarization method that extracts sentence features with BERT and models documents as a directed graph structure while taking the relative position information between sentences into account.
Spectral [9] proposes a spectrum-based hypothesis, defines a concept of spectral importance, and extracts the sentences with higher spectral importance as the summary.
The second group uses not only the textual content features of posts but also introduces the social relationship information among posts:
SNSR [7], based on sociological theory, models the social relationships among posts as a regularization term introduced into a sparse reconstruction framework, so that the social relationships additionally guide the summary extraction process.
SCMGR [10] uses a graph convolutional network to encode, over the social relationship network among posts, post representations fusing text content and social structure, and feeds the learned fused representations into a sparse reconstruction framework to extract important posts.
The experimental performance is evaluated with the ROUGE standard, specifically the four metrics ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU. ROUGE-N measures the N-gram overlap between the output summary and the reference summary; the experiments adopt the ROUGE-1 and ROUGE-2 variants. ROUGE-L measures the longest common subsequence between the output summary and the reference summary. ROUGE-SU measures the matching of unigrams and skip-bigrams between the output summary and the reference summary, allowing gaps between words. In the following experiments these four metrics are denoted R-1, R-2, R-L, and R-SU, respectively.
Table 3 shows the experimental results of the model and all comparison methods on both data sets; higher ROUGE scores indicate better model performance. Tables 4 and 5 show the ablation results of the model on the Twitter [11] and microblog data, respectively, where DSNSum is the result of the complete model, w/o denoising denotes the performance after removing the denoising module, and w/o GAT denotes the performance after removing the residual graph attention encoder.
TABLE 3: Performance of the proposed method and the comparison methods on the Twitter and microblog data sets
(Table 3 appears only as an image in the original filing; its values are not reproduced here.)
TABLE 4: Ablation results of the proposed method on the Twitter data

Twitter data     R-1    R-2    R-L    R-SU*
DSNSum           46.51  14.29  44.16  20.76
w/o denoising    45.02  13.33  42.72  19.83
w/o GAT          44.14  12.65  41.68  19.10
TABLE 5: Ablation results of the proposed method on the microblog data

Microblog data   R-1    R-2    R-L    R-SU*
DSNSum           37.01  10.98  14.22  13.06
w/o denoising    35.31   9.76  13.43  12.07
w/o GAT          34.36   8.93  13.29  11.12
As Table 3 shows, the proposed method achieves the best performance on the Twitter data, surpassing all comparison methods; on the microblog data it is slightly below the SCMGR model under the R-L metric but exceeds all other comparison models on the remaining metrics. These results confirm the effectiveness of the proposed method. In the ablation experiments (Tables 4 and 5), removing either module degrades performance, showing that each module contributes to the overall model. Removing the denoising module lowers performance, confirming that noisy relations inject extraneous information into the summarization process and harm the summary; by identifying and removing noisy relations in the network, the denoising module reduces their influence and improves summary quality. Removing the graph attention network causes an even larger drop, indicating that social relationship information in the post-level social relationship network effectively supports content analysis in the social media setting: on the one hand, the graph attention network alleviates the sparse content of individual posts by aggregating relevant background information from neighboring nodes; on the other hand, the topology of the post-level social relationship network offers an additional, sociologically grounded cue for identifying important posts.
To further examine whether the proposed denoising graph auto-encoder module actually removes noisy relations and improves the network structure, an additional experiment was conducted. Keeping the post representations fixed (encoded with the same pre-trained BERT model), the proportion of noisy relations was computed on the network after denoising; the results are shown in Table 6.
TABLE 6: Proportion of noisy relations in the denoised network on the Twitter and microblog data; values in parentheses give the drop relative to the network before denoising

Data set         False-relation rate   Potential-relation rate   Overall noise rate
Twitter data     13.60% (↓25.01%)      54.93% (↓0.86%)           54.50% (↓0.87%)
Microblog data   45.29% (↓37.88%)      49.48% (↓3.18%)           46.57% (↓6.10%)
As the table shows, with the textual content representations of the posts held fixed, the overall noise ratio in the network after denoising decreases, confirming the effectiveness of the denoising process. The false-relation rate drops by 25.01% on the Twitter data and by 37.88% on the microblog data after denoising, indicating that the denoising module is especially effective at removing false relations from the network.
To verify whether the post representations learned by the denoising graph auto-encoder (DGAE) are better than the original BERT representations, the distribution of noisy relations in the network was compared between the DGAE representations learned by the proposed method and the BERT representations, with the post-level social relationship network held fixed. Because the value of the threshold θ strongly affects the measured distribution of noisy relations, the experiment reports the noise distribution under different values of θ. Specifically, the threshold θ is computed according to the following formula:
θ = min Φ + δ · (max Φ − min Φ)
where Φ is the semantic similarity matrix between posts and δ is a tuning parameter. The experimental results are shown in Fig. 3.
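For illustration, the threshold computation above can be sketched as follows in Python; the toy similarity matrix and all variable names are assumptions for demonstration only.

```python
import numpy as np

def noise_threshold(phi: np.ndarray, delta: float) -> float:
    """theta = min(Phi) + delta * (max(Phi) - min(Phi))."""
    return phi.min() + delta * (phi.max() - phi.min())

# Toy cosine-similarity matrix between three posts.
phi = np.array([[1.0, 0.3, 0.7],
                [0.3, 1.0, 0.2],
                [0.7, 0.2, 1.0]])

for delta in (0.1, 0.5, 0.9):
    print(delta, noise_threshold(phi, delta))
```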
As Fig. 3 shows, as the threshold θ increases, the potential-relation rate decreases while the false-relation rate increases, and the overall noise rate remains at a generally high level. After DGAE denoising, the potential-relation rate drops sharply and the false-relation rate also stays at a lower level. Most importantly, the overall noise rate shows a marked downward trend compared with the network before denoising, confirming that DGAE effectively removes noisy relations from the network.
The x-axis in Fig. 3 is the tuning parameter δ. Subfigures (a) and (c) correspond to representations encoded with the BERT model; subfigures (b) and (d) correspond to representations learned by the denoising graph auto-encoder.
Additional experiments analyze how the order and the proportion of the two noise relations in the noise function affect model performance. The trend of model performance is observed while adjusting the probabilities of the two noise relations in the noise function. To further test whether the order in which the two noise relations are added affects performance, the two orders are denoted insertion-then-loss and loss-then-insertion, and the change in model performance is observed. The results are shown in Fig. 4a and Fig. 4b.
Fig. 4a and Fig. 4b show the influence of the two noise-addition orders and of the different noise-relation probabilities in the noise function on the experimental results. Fig. 4a shows the case where false relations are inserted first and potential relations are removed afterwards; Fig. 4b shows the reverse order. The horizontal axis is the insertion probability of the noise relations and the vertical axis is the loss probability.
References:
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010.
[2] Dragomir Radev, Sasha Blair-Goldensohn, and Zhu Zhang. 2001. Experiments in Single and Multi-Document Summarization Using MEAD. In First Document Understanding Conference, 1–8.
[3] Yihong Gong and Xin Liu. 2001. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 19–25.
[4] Gunes Erkan and Dragomir Radev. 2004. LexRank: Graph-based Lexical Centrality As Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457–479.
[5] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, and X. He. 2012. Document Summarization Based on Data Reconstruction. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 620–626.
[6] He Liu, Hongliang Yu, and Zhi-Hong Deng. 2015. Multi-Document Summarization Based on Two-Level Sparse Representation Model. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 196–202.
[7] Ruifang He and Xingyi Duan. 2018. Twitter Summarization Based on Social Network and Sparse Reconstruction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 5787–5794.
[8] Hao Zheng and Mirella Lapata. 2019. Sentence Centrality Revisited for Unsupervised Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6236–6247.
[9] Kexiang Wang, Baobao Chang, and Zhifang Sui. 2020. A Spectral Method for Unsupervised Multi-Document Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 435–445.
[10] Huanyu Liu, Ruifang He, Liangliang Zhao, Haocheng Wang, and Ruifang Wang. 2021. SCMGR: Using Social Context and Multi-Granularity Relations for Unsupervised Social Summarization. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management, 1058–1068.
[11] Ruifang He, Liangliang Zhao, and Huanyu Liu. 2020. TWEETSUM: Event-oriented Social Summarization Dataset. In Proceedings of the 28th International Conference on Computational Linguistics, 5731–5736.
[12] Jing Li, Wei Gao, Zhongyu Wei, Baolin Peng, and Kam-Fai Wong. 2015. Using Content-level Structures for Summarizing Microblog Repost Trees. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2168–2178.
the present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make various changes in form and details without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. An unsupervised social media summarization method based on a denoising graph auto-encoder, characterized by comprising the following steps:
S1, constructing a post-level social relationship network according to sociological theory; defining the noise-free relationships in the post-level social relationship network, i.e. the real social relationship network; and obtaining the content encoding of each post with a pre-trained BERT model as the post's initial content representation;
S2, defining two types of noisy relationship, false relationships and potential relationships, according to users' social behaviors and habits; adding instances of false and potential relationships into the original post-level social relationship network through corresponding noise functions, thereby constructing a post-level social relationship network containing noisy relationships, i.e. a pseudo social relationship network; sampling a plurality of the generated pseudo social relationship networks, and feeding the sampled pseudo social relationship network instances together with the posts' initial content representations into a residual graph attention network encoder, which contains a multi-head attention mechanism and encodes each post from its initial content representation and its social relationships to obtain the post's vector representation;
S3, constructing a decoder which, together with the residual graph attention network encoder, forms the denoising graph auto-encoder; the decoder reconstructs the real social relationship network from the posts' vector representations so as to capture the social relationship information among posts, and simultaneously reconstructs the semantic relationship between each post and the words it contains so as to capture the posts' textual content information; because the reconstruction target is the real social relationship network without noisy relationships, the residual graph attention network encoder and the decoder learn to exclude the noisy relationships in the post-level social relationship network, finally yielding accurate post representations;
S4, from the post representations obtained in step S3, selecting the final summary with a sparse-reconstruction-based summary extractor: iteratively selecting the highest-scoring post and adding it to the final summary set, and repeating this process until the summary length limit is reached.
2. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 1, wherein step S1 is as follows: the post-level social relationship network consists of a node set and an edge set, where each node represents a post and each edge represents a social relationship between the corresponding posts; posts carry two kinds of social relationship, the expression-consistency relationship and the expression-infectivity relationship; the expression-consistency relationship holds among posts published by the same user, and when building the post-level social relationship network an edge is established between post nodes with this relationship; the expression-infectivity relationship holds among posts published by users with a direct interaction relationship, where a direct interaction refers to a follow, retweet or comment interaction between users, and when building the post-level social relationship network an edge is established between post nodes with this relationship.
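A minimal Python sketch of the two construction rules in claim 2 follows, assuming the networkx package and toy input structures (posts_by_user mapping each user to the posts they published, neighbors mapping each user to the users they directly interact with); all names are illustrative, not from the original filing.

```python
import itertools
import networkx as nx

posts_by_user = {"u1": ["s1", "s2"], "u2": ["s3"], "u3": ["s4"]}
neighbors = {"u1": {"u2"}, "u2": {"u1"}, "u3": set()}

G = nx.Graph()
G.add_nodes_from(itertools.chain.from_iterable(posts_by_user.values()))

# Expression-consistency: connect posts published by the same user.
for posts in posts_by_user.values():
    G.add_edges_from(itertools.combinations(posts, 2))

# Expression-infectivity: connect posts of users with a direct interaction.
for u, nbrs in neighbors.items():
    for v in nbrs:
        for si in posts_by_user.get(u, []):
            for sj in posts_by_user.get(v, []):
                G.add_edge(si, sj)

print(sorted(G.edges()))  # e.g. [('s1', 's2'), ('s1', 's3'), ('s2', 's3')]
```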
3. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 2, wherein in step S1:
(101) the post-level social relationship network is formally described as follows: let S = {s_1, s_2, …, s_N} denote the set of posts, N being the number of posts, where s_i (1 ≤ i ≤ N) is the i-th post; let U = {u_1, u_2, …, u_M} denote the set of users, containing M users in total, where u_i (1 ≤ i ≤ M) is the i-th user; for user u_i, let N(u_i) denote the set of u_i's neighbor users, i.e. the users having a direct social relationship with u_i, and let S(u_i) denote the set of all posts published by user u_i; the post-level social relationship network G = (V, ε) is built according to the following rules, where V is the node set, each node corresponding to one post, and ε is the edge set between nodes, each edge corresponding to a social relationship between posts; expression-consistency social relationship: if posts s_i, s_j ∈ S(u_k), where u_k is the k-th user, then an edge e_ij ∈ ε is established between posts s_i and s_j; expression-infectivity social relationship: if s_i ∈ S(u_k) and s_j ∈ S(u_l) with u_l ∈ N(u_k) or u_k ∈ N(u_l), then an edge e_ij ∈ ε is established between s_i and s_j; the post-level social relationship network G = (V, ε) built from these two rules contains only the post node set V and the relationships between nodes, i.e. the edge set ε = {e_11, e_12, …, e_NN}; the adjacency matrix of the constructed network is denoted A ∈ R^{N×N}, where A_ij > 0 indicates a social relationship connection between post nodes s_i and s_j, and otherwise A_ij = 0;
(102) the content encoding of each post is obtained with the pre-trained BERT model as the post's initial content representation, as follows: each post s_i is fed into the pre-trained BERT model, and the last-layer representation of the sentence-start symbol is taken as the post's initial content representation, as in equation (1):
x_i = BERT(s_i)    (1)
where x_i is the initial content representation of post s_i; the initial content representations of all N posts are finally collected as X = [x_1, …, x_N].
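A hedged sketch of equation (1) follows, using the Hugging Face transformers package to take the last-layer [CLS] (sentence-start) vector of a pre-trained BERT model as the initial post representation; the checkpoint name and batch handling are assumptions, since the filing does not specify them.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the filing only says "pre-trained BERT model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

posts = ["示例帖子一", "示例帖子二"]  # toy posts
with torch.no_grad():
    enc = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # [CLS] position of the last layer: one vector x_i per post.
    X = out.last_hidden_state[:, 0, :]

print(X.shape)  # (num_posts, hidden_dim), e.g. (2, 768)
```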
4. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 1, wherein in step S2:
(201) the two noisy relationships, false and potential, are defined as follows:
(a) false relationship: if two posts have a social relationship between them but their content relevance is below a set threshold, the social relationship between them is defined as a false relationship;
(b) potential relationship: if two posts have no social relationship between them but their content relevance is above a set threshold, a potential relationship is defined between them;
the noise function corresponding to the false relationship is relation insertion, and the noise function corresponding to the potential relationship is relation loss, as follows:
(c) relation insertion: randomly add an edge between any two unconnected post nodes in the post-level social relationship network, connecting the two nodes;
(d) relation loss: randomly remove an edge between any two connected post nodes in the post-level social relationship network;
a pseudo social relationship network is constructed as training data by adding instances of the noisy relationships to the real social relationship network.
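The two noise functions of claim 4 can be sketched as below, applied directly to an adjacency matrix A; the probabilities p_ins and p_del and all names are illustrative assumptions.

```python
import numpy as np

def add_noise(A: np.ndarray, p_ins: float, p_del: float, seed: int = 0):
    """Corrupt a symmetric adjacency matrix to build a pseudo network."""
    rng = np.random.default_rng(seed)
    noisy = A.copy()
    n = A.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] == 0 and rng.random() < p_ins:   # relation insertion
                noisy[i, j] = noisy[j, i] = 1
            elif A[i, j] > 0 and rng.random() < p_del:  # relation loss
                noisy[i, j] = noisy[j, i] = 0
    return noisy

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(add_noise(A, p_ins=0.5, p_del=0.5))
```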
5. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 4, wherein in step S2 a residual graph attention network encoder encodes each post from its initial content representation and its social relationships, so as to integrate the posts' textual content information with their social relationship information; the residual graph attention network encoder is regarded as an information propagation model that learns node representations in the post-level social relationship network by aggregating information from neighboring nodes, where neighboring nodes are the nodes connected to a node by an edge in the post-level social relationship network; specifically:
the residual graph attention network encoder takes the initial content representations of the nodes X ∈ R^{N×D} and the adjacency matrix A ∈ R^{N×N} of the post-level social relationship network as input, where D is the dimension of the node feature representation and N is the number of posts; the propagation rules of the residual graph attention network encoder are given by equations (2) and (3):
H^{(l+1)} = σ(Ã H^{(l)} W^{(l)} + b^{(l)})    (2)
Ã_ij = α_ij^{(l)} · (A + I)_ij    (3)
where H^{(l)} is the hidden representation of the residual graph attention network encoder at layer l, A is the adjacency matrix of the post-level social relationship network with A_ij representing the relation between posts s_i and s_j, I is the identity matrix, Ã is the adjacency matrix after the attention weights have been added, with Ã_ij denoting the attention-weighted relation weight between posts s_i and s_j, σ(·) is a nonlinear activation function, α_ij^{(l)} is the attention score between posts s_i and s_j at layer l, and W^{(l)} and b^{(l)} are the learning parameters of the residual graph attention network encoder at layer l; to further integrate the posts' initial content representations, X = [x_1, …, x_N] is used as the input of the residual graph attention network encoder, i.e. H^{(0)} = X; the attention weights are computed with scaled dot-product attention [1], and the ordinary attention mechanism is extended to a multi-head attention mechanism by mapping the latent representations into K different subspaces, where K is the total number of heads of the multi-head attention mechanism and each subspace is called an attention head; the attention weight is computed separately in each subspace:
e_ij^{head_k} = ((W_Q^{head_k} h_i)^T (W_K^{head_k} h_j)) / sqrt(d_h)    (4)
α_ij^{head_k} = exp(e_ij^{head_k}) / Σ_{t∈N(i)} exp(e_it^{head_k})    (5)
where h_i and h_j are the vector representations of posts s_i and s_j encoded by the residual graph attention network encoder; e_ij^{head_k} and α_ij^{head_k} are respectively the attention score and the normalized attention weight between posts s_i and s_j in the k-th attention head; N(i) denotes the neighbors of post s_i in the network; (·)^T is the transpose operation; d_h is the dimension of the hidden representation in the attention computation; the layer superscript (l) is omitted here and the superscript head_k indicates the k-th attention head; W_Q^{head_k} and W_K^{head_k} are the corresponding learning parameters of the k-th attention head; equations (4) and (5) yield K attention weights; a max-pooling operation then automatically selects the strongest relation over all subspaces as the real relation between two post nodes, unifying the attention weights of the K attention heads into a final attention score:
α_ij = max(α_ij^{head_1}, …, α_ij^{head_K})    (6)
where α_ij is the final attention weight between posts s_i and s_j; the connections between the layers of an ordinary graph attention network are replaced with residual connections to form the residual graph attention network encoder, so that input information can be passed directly to the output layer; the encoding rule of the residual graph attention network encoder is therefore modified into the following form:
H^{(l+1)} = σ(Ã H^{(l)} W^{(l)} + b^{(l)}) + f(H^{(l)})    (7)
where f(·) is a mapping function implemented by a feed-forward neural network with a nonlinear activation function:
f(H^{(l)}) = σ(W_f H^{(l)} + b_f)    (8)
where W_f and b_f are the learning parameters of the mapping function and σ(·) is a nonlinear activation function; during encoding, the depth L of the residual graph attention network encoder determines the information propagation distance in the post-level social relationship network; the residual graph attention network encoder encodes the posts according to the rules of equations (7) and (8), and the output of the last layer H^{(L)} = [h_1, …, h_N] is the vector representation of the encoded posts, where h_i is the representation of post s_i encoded by the residual graph attention network encoder, used in the subsequent sparse-reconstruction-based summary extraction.
6. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 1, wherein step S3 is as follows:
a decoder is set up with two reconstruction targets: on the one hand the decoder reconstructs the real social relationship network without noisy relationships to capture the social relationship information among posts, and on the other hand it reconstructs the textual content contained in the posts, thereby capturing the posts' textual content information and further enriching the posts' vector representations;
for the reconstruction of the real social relationship network, the decoder predicts whether a social relationship exists between two post nodes from their vector representations; specifically, the probability that a social relationship exists between two nodes is predicted from the inner product of their vector representations:
Â_ij = σ(h_i^T h_j)    (9)
where (·)^T is the transpose operation on a vector representation; for each pair of posts s_i and s_j the decoder predicts the probability of a social relationship between them, where Â ∈ R^{N×N} is the adjacency matrix of the post-level social relationship network output by the decoder, Â_ij is the predicted probability that a social relationship exists between posts s_i and s_j, h_i and h_j are the vector representations of posts s_i and s_j encoded by the residual graph attention network encoder, and σ(·) is a nonlinear activation function;
for the reconstruction of textual content, the relationship between posts and words is reconstructed, and the textual content information of each post is preserved by reconstructing the words it contains; since each post typically contains several words, the content reconstruction is modeled as a multi-label classification task:
ŝ_i = σ(W_c h_i + b_c)    (10)
where W_c ∈ R^{V×Z} and b_c ∈ R^V are learning parameters of the decoder, Z is the dimension of the post vector representation obtained from the encoder, and V is the vocabulary size; ŝ_i is the prediction of the decoder, whose element ŝ_ij is the probability that post s_i contains word w_j;
a loss function is designed for each of the two reconstruction targets, so the overall training objective contains two parts; the first part is the loss of reconstructing the real social relationship network, denoted L_g, the binary cross-entropy between the prediction Â and the adjacency matrix A of the real social relationship network:
L_g = −Σ_i Σ_j [ A_ij log Â_ij + (1 − A_ij) log(1 − Â_ij) ]    (11)
the second part is the loss of reconstructing the posts' content, denoted L_c, the binary cross-entropy between the decoder prediction ŝ_i and the true result s_i:
L_c = −Σ_i Σ_j [ s_ij log ŝ_ij + (1 − s_ij) log(1 − ŝ_ij) ]    (12)
where s_ij is the real training label indicating whether post s_i contains word w_j: s_ij = 1 if post s_i contains word w_j, otherwise s_ij = 0; finally the two losses are combined with the balance parameter λ into the final loss function L:
L = λ L_g + (1 − λ) L_c    (13)
the residual graph attention network encoder and the decoder are trained with this loss function; after training, accurate post representations H = [h_1, h_2, …, h_N] are obtained that fuse social relationship information with textual content information and exclude the noisy relationships.
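The joint objective of equations (9)-(13) can be sketched as follows; the tensor shapes, the averaging inside the binary cross-entropy, and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dgae_loss(H, A, bow, W_c, b_c, lam=0.5):
    A_hat = torch.sigmoid(H @ H.T)              # eq. (9): inner-product decoder
    s_hat = torch.sigmoid(H @ W_c.T + b_c)      # eq. (10): multi-label word prediction
    L_g = F.binary_cross_entropy(A_hat, A)      # eq. (11): network reconstruction
    L_c = F.binary_cross_entropy(s_hat, bow)    # eq. (12): content reconstruction
    return lam * L_g + (1 - lam) * L_c          # eq. (13): balanced combination

N, Z, V = 4, 8, 20
H = torch.randn(N, Z)                           # post representations
A = torch.randint(0, 2, (N, N)).float()         # clean adjacency matrix
bow = torch.randint(0, 2, (N, V)).float()       # s_ij: post i contains word j
W_c, b_c = torch.randn(V, Z), torch.zeros(V)
print(dgae_loss(H, A, bow, W_c, b_c).item())
```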
7. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 1, wherein step S4 is as follows:
given the accurate, noise-free post representations H = [h_1, h_2, …, h_N] encoded by the residual graph attention network encoder, the summary extraction is modeled as a sparse reconstruction process:
min_V ||H − (M ⊙ V) H||_F^2 + β ||V||_{2,1} + γ ||V||_F^2    (14)
where ||·||_F is the Frobenius norm and V ∈ R^{N×N} is the reconstruction coefficient matrix, whose element V_{i,j} is the contribution of post s_j to reconstructing post s_i; to avoid extracting repeated, redundant content, a similarity matrix M ∈ R^{N×N} is introduced to remove redundant information: if the cosine similarity of posts s_i and s_j is above the threshold η then M_{i,j} = 0, otherwise M_{i,j} = 1; ⊙ is the Hadamard product; to prevent a post from reconstructing itself, the diagonal elements of the reconstruction coefficient matrix V are fixed to 0 during reconstruction; β and γ are hyper-parameters controlling the weights of the corresponding regularization terms; H is the accurate post representation; ||·||_{2,1} is the L21 norm, defined as:
||V||_{2,1} = Σ_{i=1}^{N} sqrt( Σ_{j=1}^{N} V_{i,j}^2 )    (15)
adding the L21 constraint to the reconstruction coefficient matrix V makes each of its rows sparse, i.e. most elements of each row are 0, which means each post can be reconstructed by only a limited number of posts, thereby limiting the summary length; the final score of each post is defined as the sum of its contributions to the reconstruction of all other posts:
score(s_i) = Σ_{j=1}^{N} V_{j,i}    (16)
where score(s_i) is the final score of post s_i; finally all posts are ranked by their final scores, the highest-scoring post is iteratively selected into the final summary set, and the process is repeated until the summary length limit is reached.
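A hedged sketch of the sparse-reconstruction extractor of equations (14)-(16) follows, solving for the coefficient matrix V by gradient-based optimization (Adam); the solver, step count, the small epsilon inside the L21 norm, and the use of absolute contributions in the final score are assumptions, since the filing does not prescribe an optimization procedure.

```python
import torch

def extract_scores(H, M, beta=0.1, gamma=0.1, steps=500, lr=0.01):
    N = H.size(0)
    V = torch.zeros(N, N, requires_grad=True)
    opt = torch.optim.Adam([V], lr=lr)
    off_diag = 1.0 - torch.eye(N)                      # force V_ii = 0
    for _ in range(steps):
        opt.zero_grad()
        Vm = V * M * off_diag                          # masked coefficients
        recon = torch.norm(H - Vm @ H) ** 2            # Frobenius reconstruction
        l21 = (Vm.pow(2).sum(dim=1) + 1e-8).sqrt().sum()  # eq. (15): row sparsity
        loss = recon + beta * l21 + gamma * Vm.pow(2).sum()
        loss.backward()
        opt.step()
    Vm = (V * M * off_diag).detach()
    return Vm.abs().sum(dim=0)                         # eq. (16): column sums

H = torch.randn(6, 8)                                  # toy post representations
M = torch.ones(6, 6)                                   # no redundancy mask here
scores = extract_scores(H, M)
print(scores.argsort(descending=True))                 # post indices by importance
```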
CN202210393787.7A 2022-04-15 2022-04-15 Unsupervised social media summarization method based on de-noised image self-encoder Pending CN115017299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210393787.7A CN115017299A (en) 2022-04-15 2022-04-15 Unsupervised social media summarization method based on de-noised image self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210393787.7A CN115017299A (en) 2022-04-15 2022-04-15 Unsupervised social media summarization method based on de-noised image self-encoder

Publications (1)

Publication Number Publication Date
CN115017299A true CN115017299A (en) 2022-09-06

Family

ID=83066492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210393787.7A Pending CN115017299A (en) 2022-04-15 2022-04-15 Unsupervised social media summarization method based on de-noised image self-encoder

Country Status (1)

Country Link
CN (1) CN115017299A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240004907A1 (en) * 2022-06-30 2024-01-04 International Business Machines Corporation Knowledge graph question answering with neural machine translation
US12013884B2 (en) * 2022-06-30 2024-06-18 International Business Machines Corporation Knowledge graph question answering with neural machine translation
CN115545349A (en) * 2022-11-24 2022-12-30 天津师范大学 Time sequence social media popularity prediction method and device based on attribute sensitive interaction
CN115545349B (en) * 2022-11-24 2023-04-07 天津师范大学 Time sequence social media popularity prediction method and device based on attribute sensitive interaction
CN115934933A (en) * 2023-03-09 2023-04-07 合肥工业大学 Text abstract generation method and system based on double-end comparison learning
CN115934933B (en) * 2023-03-09 2023-07-04 合肥工业大学 Text abstract generation method and system based on double-end contrast learning
CN117131187A (en) * 2023-10-26 2023-11-28 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117131187B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117372631A (en) * 2023-12-07 2024-01-09 之江实验室 Training method and application method of multi-view image generation model
CN117372631B (en) * 2023-12-07 2024-03-08 之江实验室 Training method and application method of multi-view image generation model

Similar Documents

Publication Publication Date Title
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN108197111B (en) Text automatic summarization method based on fusion semantic clustering
CN107766324B (en) Text consistency analysis method based on deep neural network
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
CN110134946B (en) Machine reading understanding method for complex data
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN111651198B (en) Automatic code abstract generation method and device
CN112395393B (en) Remote supervision relation extraction method based on multitask and multiple examples
CN109840324B (en) Semantic enhancement topic model construction method and topic evolution analysis method
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
CN115329088B (en) Robustness analysis method of graph neural network event detection model
CN113378573A (en) Content big data oriented small sample relation extraction method and device
CN113988075A (en) Network security field text data entity relation extraction method based on multi-task learning
CN116127099A (en) Combined text enhanced table entity and type annotation method based on graph rolling network
CN113158659B (en) Case-related property calculation method based on judicial text
CN114742069A (en) Code similarity detection method and device
Sandhan et al. Evaluating neural word embeddings for Sanskrit
CN114218921A (en) Problem semantic matching method for optimizing BERT
CN113901813A (en) Event extraction method based on topic features and implicit sentence structure
CN115629800A (en) Code abstract generation method and system based on multiple modes
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
CN113312903B (en) Method and system for constructing word stock of 5G mobile service product
CN115617981A (en) Information level abstract extraction method for short text of social network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination