CN115017299A - Unsupervised social media summarization method based on a denoising graph autoencoder - Google Patents

Unsupervised social media summarization method based on a denoising graph autoencoder

Info

Publication number
CN115017299A
Authority
CN
China
Prior art keywords
post
posts
network
relationship
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210393787.7A
Other languages
Chinese (zh)
Inventor
He Ruifang (贺瑞芳)
Liu Huanyu (刘焕宇)
Wang Haocheng (王浩成)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202210393787.7A
Publication of CN115017299A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unsupervised social media summarization method based on a denoising graph autoencoder. The method constructs a post-level social relationship network according to sociological theory and obtains content encodings of the posts with a pre-trained BERT model, used as the initial content representations of the posts. Two noise relationship types are defined, and corresponding noise functions are set to construct a pseudo social relationship network containing noise relationships. A sampled pseudo social relationship network instance and the initial content representations of the posts serve together as the input of a residual graph attention network encoder, which encodes each post according to its initial content representation and its social relationships to obtain a vector representation of the post. A decoder is constructed; the residual graph attention network encoder and the decoder together form a denoising graph autoencoder, which can learn to remove the noise relationships in the post-level social relationship network and finally obtain accurate post representations. A summary extractor based on sparse reconstruction selects the final summary.

Description

Unsupervised social media summarization method based on a denoising graph autoencoder
Technical Field
The invention relates to the technical fields of natural language processing and social media data mining, and in particular to an unsupervised social media summarization method based on a denoising graph autoencoder.
Background
With the development and popularization of Internet technology, social media platforms have gradually become a new medium for producing and spreading information and occupy an increasingly important position in social production, daily life, and other domains. However, the volume of content on social media has grown sharply, causing a serious information overload problem and posing a severe challenge to efficient information retrieval: ordinary users often find it difficult to locate useful and interesting information in the mass of noisy content, which greatly reduces retrieval efficiency.
Automatic text summarization technology can effectively alleviate the information overload problem on social media and help improve the efficiency with which users retrieve useful information. Mainstream summarization methods generally fall into two categories: extractive summarization and abstractive summarization. Extractive summarization selects the most representative text units (words, sentences, or segments) with high information content, low redundancy, and wide coverage from the input text to form the final summary. Abstractive summarization involves a text generation process: it understands the semantics of the original input text and produces a corresponding summary description with text generation techniques. In recent years, both extractive and abstractive automatic text summarization have advanced significantly thanks to new techniques such as the sequence-to-sequence framework (Seq2Seq), the Transformer model, contrastive learning, and large-scale pre-trained models.
However, existing methods usually rely on large-scale labeled, paired training data (i.e., text-summary pairs), and such labeled data must generally be annotated manually, so the construction cost is enormous and prohibitive for large-scale training scenarios. In the social media domain, constructing annotated data is even harder. On the one hand, to write a summary of the content under a specific topic, an annotator must read all posts related to that topic before composing the summary; the number of posts on social media is so large that manual reading incurs an unaffordable labor cost. On the other hand, because social media content is highly time-sensitive and topic-specific, annotations made under one topic cannot be transferred to other topics, so annotation has to be repeated for every topic, consuming substantial manpower and resources. Furthermore, when traditional text summarization methods are transferred to social media data, they generally struggle to obtain satisfactory results, because text on social media differs greatly from traditional long documents: it is shorter, more diverse in expression, and informal.
Existing social media summarization research mainly extracts features from each post independently based on its content and then selects the most important posts as the summary using graph-ranking or clustering algorithms. These methods have two shortcomings: (1) because posts on social media are usually short, a single post often contains incomplete or ambiguous information and cannot provide sufficient signal, so post features are sparse and inaccurate; (2) social media depends on users actively spreading and receiving information through social interaction, a process that effectively promotes information diffusion, so posts on social media are embedded in a social network structure and are not independent of one another; previous methods focus only on the textual content features of posts and ignore their social structure features, losing the social relationship information of the posts.
Some work has attempted to facilitate the analysis of social media content by exploiting simple social signals available on the platforms, such as an author's number of followers, a post's number of reposts, or its number of likes. Further work has verified, from the perspective of sociological theory, the influence of social relationships on content relevance in social networks, proposing that posts connected by social relationships tend to contain similar content and viewpoints within a short time window. Sociological theory thus indicates the association between social relationships and text content at the macro level; at the micro level, however, there are often noise relationships that do not conform to the theory, falling into two cases: (1) two posts have a social relationship but low content relevance; such a noise relationship is defined as a false relationship; (2) two posts have no directly connected social relationship but high content relevance; such a noise relationship is defined as a potential relationship. The existence of these two kinds of relationships poses further challenges to the effective use of social relationships.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide an unsupervised extractive social media summarization method that is more robust to noise relationships.
The purpose of the invention is realized by the following technical scheme:
An unsupervised social media summarization method based on a denoising graph autoencoder comprises the following steps:
S1. Construct a post-level social relationship network according to sociological theory; the noise-free relationships in this network define the real social relationship network. Obtain content encodings of the posts with a pre-trained BERT model as the initial content representations of the posts.
S2. Define two noise relationship types, false relationships and potential relationships, according to users' social behaviors and habits. By applying corresponding noise functions, add instances of false and potential relationships to the original post-level social relationship network, constructing a post-level network containing noise relationships, i.e., a pseudo social relationship network. Sample several generated pseudo social relationship networks, and feed a sampled pseudo social relationship network instance together with the initial content representations of the posts into a residual graph attention network encoder. The encoder contains a multi-head attention mechanism and encodes each post according to its initial content representation and its social relationships, obtaining a vector representation of the post.
S3. Construct a decoder; the decoder and the residual graph attention network encoder together form a denoising graph autoencoder. The decoder reconstructs the real social relationship network from the post vector representations to capture the social relationship information among posts, and simultaneously reconstructs the semantic relationships between posts and the words they contain to capture the textual content information of the posts. Because the reconstruction target is the real social relationship network without noise relationships, the encoder and decoder learn to exclude the noise relationships in the post-level social relationship network, finally obtaining accurate post representations.
S4. Using the post representations obtained in step S3, select the final summary with a summary extractor based on sparse reconstruction: iteratively select the post with the highest reconstruction coefficient and add it to the final summary set, repeating until the summary length limit is reached.
Further, step S1 is specifically as follows: the post-level social relationship network consists of a node set and an edge set, where each node represents a post and each edge represents a social relationship between the corresponding posts. Posts have two kinds of social relationships: expression consistency and expression contagion. An expression consistency relationship holds between posts published by the same user; when building the post-level social relationship network, an edge is created between post nodes with an expression consistency relationship. An expression contagion relationship holds between posts published by users with a direct interaction relationship, where a direct interaction relationship means following, reposting, or commenting between users; when building the network, an edge is created between post nodes with an expression contagion relationship.
(101) Formally, the post-level social relationship network is described as follows. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ denote the set of posts, where N is the number of posts and $s_i$ ($1 \le i \le N$) is the i-th post; let $\mathcal{U} = \{u_1, u_2, \ldots, u_M\}$ denote the set of users, containing M users in total, where $u_i$ ($1 \le i \le M$) is the i-th user. For a user $u_i$, let $\mathcal{N}(u_i)$ denote the set of neighbor users of $u_i$, i.e., the set of users having a direct social relationship with $u_i$, and let $\mathcal{S}(u_i)$ denote the set of all posts published by $u_i$. The post-level social relationship network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is built according to the following rules, where $\mathcal{V}$ is the node set, each node corresponding to one post, and $\mathcal{E}$ is the set of edges between nodes, each edge corresponding to a social relationship between posts. Expression consistency social relationship: if posts $s_i, s_j \in \mathcal{S}(u_k)$, where $u_k$ denotes the k-th user, an edge $e_{ij} \in \mathcal{E}$ is established between posts $s_i$ and $s_j$. Expression contagion social relationship: if posts $s_i \in \mathcal{S}(u_k)$ and $s_j \in \mathcal{S}(u_m)$ with $u_m \in \mathcal{N}(u_k)$ or $u_k \in \mathcal{N}(u_m)$, an edge $e_{ij} \in \mathcal{E}$ is established between $s_i$ and $s_j$. The post-level social relationship network built by these two rules contains only the post node set $\mathcal{V}$ and the relationships between nodes, i.e., the edge set $\mathcal{E} = \{e_{11}, e_{12}, \ldots, e_{NN}\}$. The adjacency matrix of the constructed post-level social relationship network is denoted $A \in \mathbb{R}^{N \times N}$, where $A_{ij} > 0$ indicates that post nodes $s_i$ and $s_j$ are connected by a social relationship, and otherwise $A_{ij} = 0$.
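For illustration only, the two construction rules can be sketched in a few lines of Python; the data layout and the function name below are hypothetical and not part of the patent:

```python
import numpy as np

def build_post_network(posts_by_user, neighbors):
    """Build the post-level adjacency matrix A from the two rules.

    posts_by_user: dict user_id -> list of post indices, i.e. S(u).
    neighbors:     dict user_id -> set of neighbor user ids, i.e. N(u).
    """
    n = sum(len(p) for p in posts_by_user.values())
    A = np.zeros((n, n))
    # Rule 1 (expression consistency): connect posts published by the same user.
    for posts in posts_by_user.values():
        for i in posts:
            for j in posts:
                if i != j:
                    A[i, j] = 1.0
    # Rule 2 (expression contagion): connect posts of directly interacting users.
    for u, posts in posts_by_user.items():
        for v in neighbors.get(u, ()):
            for i in posts:
                for j in posts_by_user.get(v, ()):
                    A[i, j] = A[j, i] = 1.0
    return A
```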
(102) The content encoding of each post is obtained with the pre-trained BERT model and used as the post's initial content representation, as follows: each post $s_i$ is fed into the pre-trained BERT model, and the representation of the sentence-initial symbol ([CLS]) at the last layer of the model is taken as the initial content representation of the post, as shown in equation (1):

$x_i = \mathrm{BERT}(s_i)$ (1)

where $x_i$ denotes the initial content representation of post $s_i$. The initial content representations of all N posts are finally collected as $X = [x_1, \ldots, x_N]$.
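For concreteness, this encoding step might look as follows with the Hugging Face transformers library; the library choice and model name are assumptions, since the patent only specifies a pre-trained BERT model:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # model name is an assumption
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_posts(posts):
    """Return the last-layer [CLS] vector of each post as its initial representation x_i."""
    enc = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    return bert(**enc).last_hidden_state[:, 0, :]   # X, shape (N, 768)

X = encode_posts(["first post ...", "second post ..."])
```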
Further, step S2 specifically includes:
(201) The two noise relationships, false and potential, are defined as follows.
In social networks, posts connected by social relationships tend to have similar content or opinions. In practice, however, most social networks on social media are pseudo social relationship networks containing noise relationships. Based on observation of real social media data, two types of noise relationship are defined:
(a) False relationship: if two posts have a social relationship but their content relevance is below a set threshold, the social relationship between them is defined as a false relationship.
(b) Potential relationship: if two posts have no social relationship but their content relevance is above a set threshold, a potential relationship is defined between them.
The noise function corresponding to false relationships is relation insertion, and the noise function corresponding to potential relationships is relation removal, specifically:
(c) Relation insertion: randomly add an edge between two unconnected post nodes in the post-level social relationship network, connecting the two nodes.
(d) Relation removal: randomly remove the edge between two connected post nodes in the post-level social relationship network.
A pseudo social relationship network is constructed as training data by adding instances of noise relationships to the real social relationship network.
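A sketch of the two noise functions applied to an adjacency matrix is given below. Treating the reported probabilities as independent per-pair corruption probabilities is an interpretation, not something the patent states; the default values mirror the Twitter settings reported later in the embodiments:

```python
import numpy as np

def make_pseudo_network(A, p_insert=0.4, p_remove=0.1, seed=0):
    """Corrupt the real adjacency matrix A into a pseudo social relationship network.

    Relation insertion adds edges between unconnected pairs (creating false
    relationships); relation removal drops existing edges (leaving pairs that
    behave like potential relationships).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    upper = np.triu(A, k=1) > 0                          # existing edges, upper triangle
    insert = (~upper) & (rng.random((n, n)) < p_insert)  # add edge between unconnected nodes
    remove = upper & (rng.random((n, n)) < p_remove)     # drop edge between connected nodes
    noisy = np.where(insert, 1.0, np.where(remove, 0.0, np.triu(A, k=1)))
    noisy = np.triu(noisy, k=1)
    return noisy + noisy.T                               # keep the network symmetric
```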
(202) Encode the posts with the residual graph attention network encoder.
After the pseudo social relationship network and the initial content representations of the posts are obtained, a residual graph attention network encoder is adopted to model the social relationships between posts: it encodes each post according to the initial content representations and the social relationships, integrating the social relationship information and textual content information of the posts. The residual graph attention network encoder can be viewed as an information propagation model: it learns the representation of each node in the post-level social relationship network by aggregating information from the neighboring nodes connected to it by edges. Compared with a traditional graph convolutional network encoder (GCN), the residual graph attention network encoder can assign different weights to different neighbors of the same node, raising the attention weight of important neighbors and lowering the weight of weakly related neighbors, and thus learns more accurate node representations.
Formally, the residual graph attention network encoder takes the initial content representations of the nodes $X \in \mathbb{R}^{N \times D}$ and the adjacency matrix $A \in \mathbb{R}^{N \times N}$ of the post-level social relationship network as input, where D is the dimension of the node feature representation and N is the number of nodes. The propagation rules of the encoder are given by equations (2) and (3):

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big)$ (2)

$\hat{A}_{ij} = \dfrac{\exp\big(e_{ij}^{(l)}\big)}{\sum_{m:\,(A+I)_{im} > 0} \exp\big(e_{im}^{(l)}\big)}$ (3)

where $H^{(l)}$ is the hidden representation of the encoder at layer l; A is the adjacency matrix of the post-level social relationship network, with $A_{ij}$ representing the relationship between posts $s_i$ and $s_j$; I is the identity matrix; $\hat{A}$ is the adjacency matrix after the attention weights have been applied, with $\hat{A}_{ij}$ the relationship weight between $s_i$ and $s_j$ after weighting; $\sigma(\cdot)$ denotes a nonlinear activation function; $e_{ij}^{(l)}$ is the attention score between posts $s_i$ and $s_j$ at layer l; and $W^{(l)}$ and $b^{(l)}$ are the learnable parameters of the encoder at layer l. To further integrate the initial content representations of the posts, $X = [x_1, \ldots, x_N]$ is used as the encoder input, i.e., $H^{(0)} = X$. The attention weights are computed with scaled dot-product attention [1], and the ordinary attention mechanism is extended to a multi-head attention mechanism by mapping the latent representations into K different subspaces, where K denotes the total number of attention heads and each subspace is called an attention head; the attention weight is computed separately in each subspace:

$e_{ij}^{head_k} = \dfrac{\big(W_q^{head_k} h_i\big)^{T}\big(W_k^{head_k} h_j\big)}{\sqrt{d_h}}$ (4)

$\alpha_{ij}^{head_k} = \dfrac{\exp\big(e_{ij}^{head_k}\big)}{\sum_{m:\,(A+I)_{im} > 0} \exp\big(e_{im}^{head_k}\big)}$ (5)

where $h_i$ and $h_j$ denote the vector representations of posts $s_i$ and $s_j$ produced by the encoder; $e_{ij}^{head_k}$ and $\alpha_{ij}^{head_k}$ denote, respectively, the attention score and the normalized attention weight between $s_i$ and $s_j$ in the k-th attention head; $(\cdot)^{T}$ denotes transposition; $d_h$ is the dimension of the implicit representation in the attention computation (the layer superscript (l) is omitted here, and the superscript $head_k$ indicates the k-th attention head); and $W_q^{head_k}$ and $W_k^{head_k}$ are the corresponding learnable parameters of the k-th attention head. Equations (4) and (5) yield K attention weights, one per head. A max-pooling operation automatically selects the strongest relationship across all subspaces as the real relationship between two post nodes, unifying the attention weights of the K heads into a final attention score:

$\alpha_{ij} = \max\big(\alpha_{ij}^{head_1}, \alpha_{ij}^{head_2}, \ldots, \alpha_{ij}^{head_K}\big)$ (6)

where $\alpha_{ij}$ is the final attention weight between posts $s_i$ and $s_j$. The connections between layers of an ordinary graph attention network are replaced with residual connections to form the residual graph attention network, which can pass input information directly to the output layer; the encoding rule of the encoder is therefore modified to the following form:

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big) + f\big(H^{(l)}\big)$ (7)

where $f(\cdot)$ is a mapping function implemented by a feed-forward neural network with a nonlinear activation:

$f\big(H^{(l)}\big) = \sigma\big(W_f H^{(l)} + b_f\big)$ (8)

where $W_f$ and $b_f$ are the corresponding learnable parameters of the mapping function and $\sigma(\cdot)$ denotes a nonlinear activation function. During encoding, the depth L of the residual graph attention network encoder determines the distance over which information propagates in the post-level social relationship network. The encoder encodes the posts according to the rules of equations (7) and (8), and the output of its last layer, $H = H^{(L)} = [h_1, h_2, \ldots, h_N]$, is the vector representation of the encoded posts, where $h_i$ denotes the vector representation of post $s_i$ produced by the residual graph attention network encoder.
further, step S3 is specifically as follows:
(301) Reconstruct the real social relationship network and the post content with a reconstruction-based decoder.
So that the learned post vector representations contain both textual content information and the social relationship information between posts, a decoder with two reconstruction objectives is designed. On one hand, the decoder reconstructs the real social relationship network without noise relationships to capture the social relationship information among posts; on the other hand, it reconstructs the textual content of the posts, capturing their textual content information and enriching the post vector representations.
For reconstruction of the real social relationship network, the decoder predicts whether a social relationship exists between two post nodes from their vector representations; specifically, it predicts the probability that a social relationship exists using the inner product of the two vector representations:

$\tilde{A}_{ij} = \sigma\big(h_i^{T} h_j\big)$ (9)

where $(\cdot)^{T}$ denotes transposition of a vector representation. For each pair of posts $s_i$ and $s_j$, the decoder predicts the probability of a social relationship between them, where $\tilde{A} \in \mathbb{R}^{N \times N}$ is the adjacency matrix of the post-level social relationship network output by the decoder, $\tilde{A}_{ij}$ denotes the predicted probability that a social relationship exists between $s_i$ and $s_j$, $h_i$ and $h_j$ denote the vector representations of posts $s_i$ and $s_j$ produced by the residual graph attention network encoder, and $\sigma(\cdot)$ denotes a nonlinear activation function.
For text content reconstruction, the relationship between posts and words is reconstructed, preserving the textual content information of each post by reconstructing the words it contains. Since each post typically contains several words, the text content reconstruction process is modeled as a multi-label classification task:

$\hat{s}_i = \sigma\big(W_d h_i + b_d\big)$ (10)

where $W_d \in \mathbb{R}^{V \times d}$ and $b_d \in \mathbb{R}^{V}$ are learnable parameters of the decoder and V denotes the vocabulary size; $\hat{S} \in \mathbb{R}^{N \times V}$ is the prediction result of the decoder, where $\hat{s}_{ij}$ denotes the probability that post $s_i$ contains word $w_j$.
Corresponding loss functions are designed for the two reconstruction objectives, and the overall training objective comprises two parts. The first part is the loss of reconstructing the real social relationship network, denoted $L_g$, computed as the binary cross-entropy between the prediction $\tilde{A}$ and the adjacency matrix A of the real post-level social relationship network:

$L_g = -\dfrac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\Big[A_{ij}\log \tilde{A}_{ij} + (1 - A_{ij})\log\big(1 - \tilde{A}_{ij}\big)\Big]$ (11)

The second part is the loss of reconstructing the post content, denoted $L_c$, computed as the binary cross-entropy between the decoder prediction $\hat{s}_i$ and the true result $s_i$:

$L_c = -\dfrac{1}{NV}\sum_{i=1}^{N}\sum_{j=1}^{V}\Big[s_{ij}\log \hat{s}_{ij} + (1 - s_{ij})\log\big(1 - \hat{s}_{ij}\big)\Big]$ (12)

where $s_{ij}$ is the real training label indicating whether post $s_i$ contains word $w_j$: $s_{ij} = 1$ if $s_i$ contains $w_j$, otherwise $s_{ij} = 0$. Finally, the two partial losses are combined with a balance parameter λ to obtain the final loss function L:

$L = \lambda L_g + (1 - \lambda) L_c$ (13)

The residual graph attention network encoder and the decoder are trained according to this loss function; after training, accurate post representations $H = [h_1, h_2, \ldots, h_N]$ are obtained that fuse social relationship information with textual content information and from which the noise relationships have been removed.
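A compact sketch of the decoder and the combined loss (equations (9) to (13) above) might look as follows; λ = 0.8 follows the balance parameter reported in the embodiments, and the rest is illustrative:

```python
import torch
import torch.nn as nn

class ReconstructionDecoder(nn.Module):
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.word_head = nn.Linear(dim, vocab_size)   # W_d and b_d in Eq. 10

    def forward(self, H):
        A_rec = torch.sigmoid(H @ H.T)                # Eq. 9: inner-product edge probabilities
        S_rec = torch.sigmoid(self.word_head(H))      # Eq. 10: per-post word probabilities
        return A_rec, S_rec

def total_loss(A_rec, S_rec, A_true, S_true, lam=0.8):
    """Eqs. 11-13: binary cross-entropy against the *real* (noise-free)
    network A_true and the binary post-word membership matrix S_true."""
    bce = nn.BCELoss()
    return lam * bce(A_rec, A_true) + (1.0 - lam) * bce(S_rec, S_true)
```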
Further, step S4 specifically includes:
(401) Extract the summary from the post vector representations with a summary extractor based on sparse reconstruction.
To extract representative, important posts as the final summary, a sparse-reconstruction-based summary extractor is adopted. Formally, given the accurate post representations $H = [h_1, h_2, \ldots, h_N]$ encoded by the residual graph attention network encoder with the noise relationships removed, the summary extraction process is modeled as a sparse reconstruction process:

$\min_{V}\ \big\| H - (S \odot V) H \big\|^2 + \beta \|V\|_{2,1} + \gamma \|V\|^2, \quad \text{s.t. } V_{ii} = 0$ (14)

where $\|\cdot\|$ denotes the Frobenius norm and $V \in \mathbb{R}^{N \times N}$ is the reconstruction coefficient matrix, each element $V_{ij}$ of which denotes the contribution of post $s_j$ to reconstructing post $s_i$. To prevent extracting repeated, redundant content, a similarity matrix $S \in \mathbb{R}^{N \times N}$ is introduced to remove redundant information: if the cosine similarity between posts $s_i$ and $s_j$ is higher than a specific threshold η, then $S_{ij} = 0$, otherwise $S_{ij} = 1$; $\odot$ denotes the Hadamard product. To prevent a post from being reconstructed by itself, the diagonal elements of the reconstruction coefficient matrix V are constrained to 0 during reconstruction. β and γ are hyperparameters controlling the weights of the corresponding regularization terms; H is the accurate post representation; and $\|\cdot\|_{2,1}$ denotes the L2,1 norm, defined as follows:

$\|V\|_{2,1} = \sum_{i=1}^{N}\sqrt{\sum_{j=1}^{N} V_{ij}^2}$ (15)

Adding the L2,1 constraint to the reconstruction coefficient matrix V gives each of its rows a sparse character, i.e., most elements of each row are 0, which means each post can be reconstructed by only a limited number of posts, limiting the length of the summary. The final score of each post is defined as the sum of the post's contributions to reconstructing all other posts:

$\mathrm{score}(s_i) = \sum_{j=1}^{N} V_{ji}$ (16)

where $\mathrm{score}(s_i)$ denotes the final score of post $s_i$. Finally, all posts are ranked by their final scores, the highest-scoring post is iteratively selected and added to the final summary set, and the process repeats until the summary length limit is reached.
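Since every term in objective (14) is (sub)differentiable, one plausible way to solve it is plain gradient descent with automatic differentiation; the patent does not specify an optimizer, so the sketch below is an assumption. Computing the redundancy mask from the encoded representations H, and simplifying the iterative selection to a top-k cut, are also assumptions:

```python
import torch

def summarize(H, eta=0.1, beta=1.0, gamma=1.0, steps=500, k=5):
    """Score posts via the sparse reconstruction objective (Eqs. 14-16)."""
    N = H.shape[0]
    Hn = H / H.norm(dim=1, keepdim=True).clamp_min(1e-8)
    S = (Hn @ Hn.T <= eta).float()   # block overly similar (redundant) pairs
    S.fill_diagonal_(0.0)            # a post may not reconstruct itself
    V = torch.zeros(N, N, requires_grad=True)
    opt = torch.optim.Adam([V], lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        R = (S * V) @ H              # masked reconstruction of H
        l21 = V.pow(2).sum(dim=1).clamp_min(1e-8).sqrt().sum()   # Eq. 15
        loss = (H - R).pow(2).sum() + beta * l21 + gamma * V.pow(2).sum()
        loss.backward()
        opt.step()
    scores = (S * V).detach().sum(dim=0)   # Eq. 16: contribution of s_i to all other posts
    return scores.argsort(descending=True)[:k]
```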
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
1. The invention can extract summaries without any labeled data. By introducing social relationship information among posts, it captures the social relationship characteristics between posts to alleviate the feature sparsity caused by the short content of a single post.
2. The invention proposes a denoising graph autoencoder structure that can automatically identify and remove unreliable noise relationships in a social relationship network without labeled data, alleviating the errors caused by noise relationships and thereby improving the reliability and accuracy of the post representations. After learning accurate post representations, a sparse reconstruction framework identifies the importance and redundancy of each post, and posts with high importance and low redundancy are finally extracted to form the summary.
3. Compared with existing summarization models, the social media summaries obtained by the invention improve performance on the ROUGE evaluation metrics; moreover, according to the experimental results, the denoising model can effectively reduce the proportion of noise relationships in the social network, improving the network structure and the accuracy of the summary.
4. Compared with a traditional graph convolutional network encoder, the residual graph attention network encoder used in the invention can assign different weights to different neighbors of the same node, raising the attention weight of important neighbors, lowering the weight of irrelevant neighbors, and learning more accurate post representations. Since the post vectors obtained by the encoder contain both textual content information and social relationship information, the summary extraction process can identify the importance and novelty of posts from both content and social relationships, producing higher-quality summaries.
5. Since different attention heads capture relationship information between nodes from different subspaces, and the relationships of two nodes may differ greatly across heads, the invention employs a max-pooling operation to automatically select the strongest relationship across all subspaces as the real relationship between two nodes, unifying the attention weights of the K heads into a final attention score.
6. An ordinary graph attention network often suffers from over-smoothing; the invention further replaces the connections between layers of the ordinary graph attention network with residual connections to form the residual graph attention network, which can pass input information directly to the output layer.
7. The invention designs a decoder with two reconstruction objectives, so that the learned post vector representations contain both textual content information and the social relationship information among posts. Because the feature representation of each post contains both its content information and its social relationship information, the summarization process can identify the importance and novelty of posts from both textual content and social structure, producing summary content with high information volume, high diversity, and wide coverage.
Drawings
FIG. 1 is a schematic diagram of the overall architecture of the unsupervised social media summarization method based on a denoising graph autoencoder provided by the invention.
FIG. 2 shows the performance achieved by the invention on the social network under each topic in the two data sets.
FIG. 3 shows the distribution of false relationships, potential relationships, and their sum in the social network.
FIGS. 4a and 4b show the influence of different denoising methods in the noise function and of different noise ratios on the results.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it.
The invention provides an unsupervised social media summarization method based on a denoising graph autoencoder, and two mainstream social media data sets are adopted to evaluate its performance. The overall framework of the method is shown in FIG. 1. In FIG. 1, the post-level social relationship network and the post set are the inputs of the model. The noise function adds noise relationship instances to the post-level social relationship network to obtain a pseudo social relationship network with noise relationships; the post set is encoded by the BERT model to obtain the initial content representations of the posts; the initial content representations and the noisy pseudo social relationship network are then fed together into the residual graph attention network encoder, which finally outputs post representations fusing social relationships and textual content. The whole model is trained on the real social relationship network reconstruction and the text content reconstruction objectives, and after training, accurate post representations with the noise relationships removed are obtained. The learned accurate post representations are input to the sparse-reconstruction-based summary extractor, which extracts the final summary.
(1) Post-level social relationship network construction
In this embodiment, data from mainstream social media platforms at home and abroad are selected for experimental verification: Twitter [11] and Sina Weibo [12]. The posting time of the posts under each topic in the data falls within a range of 5 days. For the Twitter data, the main language of the text content is English; it contains 44,034 posts and 11,240 users, where each user has published at least one post and has at least one social relationship, and each topic has 4 standard reference summaries for evaluating the results. In the experiments, links, user names, and other special characters in the posts are removed, stemming and stop-word removal are performed, and posts shorter than 3 words are filtered out. For the microblog data, Sina Weibo is one of the most popular social media platforms in China, so data collected from Sina Weibo are used in this embodiment, comprising 130k posts and 126k users across 10 different topics; the posts are organized in a tree structure according to interaction relationships (such as replies and reposts), and each topic has 3 standard reference summaries for evaluating the results. The statistics of the two data sets after preprocessing are shown in Table 1. The experiments adopt the ROUGE evaluation standard, mainly reporting the four results ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU.
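The text preprocessing described above can be sketched as follows; the regular expressions are illustrative, and stemming (e.g. with nltk) is omitted for brevity:

```python
import re

def preprocess(post, stopwords=frozenset()):
    """Strip links, user names and special characters, drop stop words,
    and discard posts shorter than 3 words."""
    post = re.sub(r"https?://\S+", " ", post)   # remove links
    post = re.sub(r"@\w+", " ", post)           # remove user names
    post = re.sub(r"[^a-z0-9\s]", " ", post.lower())
    tokens = [t for t in post.split() if t not in stopwords]
    return tokens if len(tokens) >= 3 else None
```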
Formally, let $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ denote the set of posts, where N is the number of posts and $s_i$ ($1 \le i \le N$) is the i-th post; let $\mathcal{U} = \{u_1, u_2, \ldots, u_M\}$ denote the set of users, containing M users in total, where $u_i$ ($1 \le i \le M$) is the i-th user. For a user $u_i$, let $\mathcal{N}(u_i)$ denote the set of neighbor users of $u_i$, i.e., the set of users having a direct social relationship with $u_i$, and let $\mathcal{S}(u_i)$ denote the set of all posts published by $u_i$. The post-level social relationship network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is built according to the following rules, where $\mathcal{V}$ is the node set, each node representing a post, and $\mathcal{E}$ is the set of edges between nodes, each edge representing a social relationship between posts. Expression consistency social relationship: if posts $s_i, s_j \in \mathcal{S}(u_k)$, where $u_k$ denotes the k-th user, an edge $e_{ij} \in \mathcal{E}$ is established between posts $s_i$ and $s_j$. Expression contagion social relationship: if posts $s_i \in \mathcal{S}(u_k)$ and $s_j \in \mathcal{S}(u_m)$ with $u_m \in \mathcal{N}(u_k)$ or $u_k \in \mathcal{N}(u_m)$, an edge $e_{ij} \in \mathcal{E}$ is established between $s_i$ and $s_j$. The network built by these two rules contains only the node set $\mathcal{V}$ and the relationships between nodes, i.e., the edge set $\mathcal{E} = \{e_{11}, e_{12}, \ldots, e_{NN}\}$. The adjacency matrix of the constructed post-level social relationship network is denoted $A \in \mathbb{R}^{N \times N}$, where $A_{ij} > 0$ indicates that post nodes $s_i$ and $s_j$ have a social relationship connection and otherwise $A_{ij} = 0$.
Table 1. Social media data set details
(The contents of Table 1 appear only as an image in the original publication.)
(2) Noise distribution observation
To analyze the distribution of noise relationships in the constructed post-level social relationship network, this embodiment provides a simple method for estimating that distribution. Generally, two posts with a social relationship are considered to have a false relationship if their content relevance is lower than a set threshold θ; two posts without a social relationship are considered to have a potential relationship if their content relevance is higher than θ. The cosine similarity between the TFIDF representations of two posts is taken as their content relevance. Let $A \in \mathbb{R}^{N \times N}$ be the adjacency matrix of the constructed post-level social relationship network, where $A_{ij} > 0$ indicates that post nodes $s_i$ and $s_j$ are connected by a social relationship and otherwise $A_{ij} = 0$. For each node pair $(s_i, s_j)$ in the network: if they have a social relationship ($A_{ij} > 0$) and their content relevance $\Phi_{ij}$ is below the threshold θ, posts $s_i$ and $s_j$ are considered to have a false relationship; if they have no social relationship ($A_{ij} = 0$) and their content relevance $\Phi_{ij}$ is above θ, they are considered to have a potential relationship. In the experiments, the TFIDF representation of each post is computed, and the cosine similarity between the TFIDF representations of two posts is used as their content relevance: the higher the cosine similarity, the higher the content relevance, and vice versa. These statistics preliminarily reflect the distribution of noise relationships in the post-level social relationship network. Since the strength of social relationships can differ greatly across networks, the average content relevance over all social relationships is used as the threshold θ:

$\theta = \dfrac{\sum_{i,j:\,A_{ij}>0} \Phi_{ij}}{\big|\{(i,j): A_{ij} > 0\}\big|}$

where $\Phi_{ij}$ is the content relevance of posts $s_i$ and $s_j$. The final statistics on both the Twitter and microblog data sets are shown in Table 2. The results show the average percentage of noise relationships in the post-level social relationship network over all topics, where the average noise ratio in the table covers both false and potential relationships.
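This estimation procedure is straightforward to sketch with scikit-learn's TFIDF vectorizer (the library choice is an assumption):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def noise_ratios(posts, A):
    """Estimate the false and potential relationship ratios with the TFIDF heuristic."""
    phi = cosine_similarity(TfidfVectorizer().fit_transform(posts))
    iu = np.triu_indices(len(posts), k=1)      # each unordered pair once
    connected = A[iu] > 0
    theta = phi[iu][connected].mean()          # threshold: mean relevance over all edges
    false_ratio = (phi[iu][connected] < theta).mean()
    potential_ratio = (phi[iu][~connected] > theta).mean()
    return false_ratio, potential_ratio
```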
Table 2. Noise relationship distribution statistics in the social relationship networks of the Twitter and microblog data sets

Data set             False-relationship ratio   Potential-relationship ratio   Average noise ratio
Twitter data set     38.61%                     55.79%                         55.37%
Microblog data set   83.17%                     52.66%                         52.67%
(3) Denoising graph autoencoder
First, the pre-trained BERT model is used to extract features of the post text content:

$x_i = \mathrm{BERT}(s_i)$ (1)

where $s_i$ denotes the i-th post and $x_i$ is the initial content representation of the post. After all N posts are encoded, the matrix $X \in \mathbb{R}^{N \times D}$ is obtained, where D is the dimension of each post feature vector. For the input post-level social relationship network $\mathcal{G}$, the noise function adds noise relationship instances to $\mathcal{G}$ to construct the pseudo social relationship network $\tilde{\mathcal{G}}$. The real social relationship network $\mathcal{G}$ and the corresponding pseudo social relationship network $\tilde{\mathcal{G}}$ form the paired training data $(\tilde{\mathcal{G}}, \mathcal{G})$.
After the initial content representations of the posts are obtained and the pseudo social relationship network is constructed, the initial content representation X and the pseudo social relationship network $\tilde{\mathcal{G}}$ are fed together into the residual graph attention network encoder. Formally, the encoder performs information propagation according to the following rules:

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big)$ (2)

$\hat{A}_{ij} = \dfrac{\exp\big(e_{ij}^{(l)}\big)}{\sum_{m:\,(A+I)_{im} > 0} \exp\big(e_{im}^{(l)}\big)}$ (3)

where $H^{(l)}$ is the hidden representation of the encoder at layer l; A is the adjacency matrix of the post-level social relationship network, with $A_{ij}$ the relationship between posts $s_i$ and $s_j$; I is the identity matrix; $\hat{A}$ is the adjacency matrix after the attention weights have been applied, with $\hat{A}_{ij}$ the relationship weight between $s_i$ and $s_j$ after weighting; $\sigma(\cdot)$ denotes a nonlinear activation function; $e_{ij}^{(l)}$ is the attention score between $s_i$ and $s_j$ at layer l; and $W^{(l)}$ and $b^{(l)}$ are the learnable parameters of the encoder at layer l. To further integrate the initial content representations of the posts, $X = [x_1, \ldots, x_N]$ is used as the encoder input, i.e., $H^{(0)} = X$. The attention weights are computed with scaled dot-product attention [1], and the ordinary attention mechanism is extended to a multi-head attention mechanism by mapping the latent representations into K different subspaces, each called an attention head, computing the attention weight separately in each subspace:

$e_{ij}^{head_k} = \dfrac{\big(W_q^{head_k} h_i\big)^{T}\big(W_k^{head_k} h_j\big)}{\sqrt{d_h}}$ (4)

$\alpha_{ij}^{head_k} = \dfrac{\exp\big(e_{ij}^{head_k}\big)}{\sum_{m:\,(A+I)_{im} > 0} \exp\big(e_{im}^{head_k}\big)}$ (5)

where $h_i$ and $h_j$ denote the encoded vector representations of posts $s_i$ and $s_j$; $e_{ij}^{head_k}$ and $\alpha_{ij}^{head_k}$ denote, respectively, the attention score and the normalized attention weight between $s_i$ and $s_j$ in the k-th attention head; $(\cdot)^{T}$ denotes transposition; $d_h$ is the dimension of the implicit representation in the attention computation (the layer superscript (l) is omitted, and the superscript $head_k$ indicates the k-th head); and $W_q^{head_k}$ and $W_k^{head_k}$ are the corresponding learnable parameters of the k-th attention head. Equations (4) and (5) yield K attention weights, where K denotes the total number of attention heads and k indexes them. A max-pooling operation automatically selects the strongest relationship across all subspaces as the real relationship between two post nodes, unifying the attention weights of the K heads into a final attention score:

$\alpha_{ij} = \max\big(\alpha_{ij}^{head_1}, \ldots, \alpha_{ij}^{head_K}\big)$ (6)

where $\alpha_{ij}$ is the final attention weight between posts $s_i$ and $s_j$. The connections between layers of an ordinary graph attention network are replaced with residual connections to form the residual graph attention network, which can pass input information directly to the output layer, so the encoding rule of the encoder is modified to the following form:

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big) + f\big(H^{(l)}\big)$ (7)

where $f(\cdot)$ is a mapping function implemented by a feed-forward neural network with a nonlinear activation:

$f\big(H^{(l)}\big) = \sigma\big(W_f H^{(l)} + b_f\big)$ (8)

where $W_f$ and $b_f$ are the corresponding learnable parameters of the mapping function and $\sigma(\cdot)$ denotes a nonlinear activation function. During encoding, the depth L of the residual graph attention network encoder determines the distance over which information propagates in the post-level social relationship network. The encoder encodes the posts according to the rules of equations (7) and (8), and the output of its last layer, $H = H^{(L)} = [h_1, h_2, \ldots, h_N]$, is the vector representation of the encoded posts, where $h_i$ denotes the vector representation of post $s_i$.
After the post vector representations are obtained by the residual graph attention network encoder, the decoder decodes them and reconstructs the real social relationship network that contains no noise relationships, so that the model learns to identify and remove the noise relationships in the pseudo social relationship network. During training, the model is trained with the following loss functions:

$L_g = -\dfrac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\Big[A_{ij}\log \tilde{A}_{ij} + (1 - A_{ij})\log\big(1 - \tilde{A}_{ij}\big)\Big]$ (9)

$L_c = -\dfrac{1}{NV}\sum_{i=1}^{N}\sum_{j=1}^{V}\Big[s_{ij}\log \hat{s}_{ij} + (1 - s_{ij})\log\big(1 - \hat{s}_{ij}\big)\Big]$ (10)

where $\tilde{A}_{ij}$ denotes the predicted probability of a social relationship between posts $s_i$ and $s_j$, and $A_{ij}$ is the true social relationship between them; $\hat{S}$ is the decoder prediction, with $\hat{s}_{ij}$ the probability that post $s_i$ contains word $w_j$; and $s_{ij}$ is the real training label indicating whether $s_i$ contains $w_j$ ($s_{ij} = 1$ if it does, otherwise $s_{ij} = 0$). $L_g$ is the loss of reconstructing the original network structure and $L_c$ the loss of reconstructing the original post content; finally the two losses are combined with the balance parameter λ to obtain the final loss L:

$L = \lambda L_g + (1 - \lambda) L_c$ (11)

The whole model is trained with this loss function. After training, in the test stage, the real social relationship network and the initial content representations of the posts are taken as input and encoded by the residual graph attention network encoder, yielding accurate post vector representations $H = [h_1, h_2, \ldots, h_N]$ that fuse social relationship information and textual content information with the noise relationships removed.
(4) Sparse-reconstruction-based summary extraction
After the accurate post vector representations $H = [h_1, h_2, \ldots, h_N]$ are obtained, the importance of the posts is identified with a sparse reconstruction framework, where the reconstruction process is modeled as:

$\min_{V}\ \big\| H - (S \odot V) H \big\|^2 + \beta \|V\|_{2,1} + \gamma \|V\|^2, \quad \text{s.t. } V_{ii} = 0$ (12)

where the symbols are as described above. The final importance score of each post is computed as:

$\mathrm{score}(s_i) = \sum_{j=1}^{N} V_{ji}$ (13)

The posts are then ranked by importance score, and the highest-scoring posts are iteratively extracted and added to the candidate summary set until the summary length limit is reached.
In the specific implementation, the hyperparameters are set in advance. The post representation dimension D is set to 768. Because the probability distributions of the two kinds of noise often differ between social networks, the probabilities of relation insertion and relation removal in the noise function are set to 0.4 and 0.1 on the Twitter data; on the microblog data, both probabilities are set to 0.3. The balance parameter λ between the two losses in the final loss function is set to 0.8. In the summary selection stage, the hyperparameters are β = γ = 1, and the redundancy threshold η is set to 0.1.
To verify the effectiveness of the method of the invention (DSNSum), it is compared with two groups of methods. The first group uses only the text content on social media to extract the summary:
Centroid [2] uses centrality-based features to identify sentences highly similar to the cluster center as the summary.
LSA [3] decomposes the feature matrix with the SVD technique and identifies the importance of posts according to the magnitude of the singular values after decomposition.
LexRank [4] is a PageRank-like graph ranking algorithm: it first builds a similarity network according to the content similarity between posts, then applies the ranking algorithm on this network to identify the importance of each post node, extracting the most important posts as the summary.
DSDR [5] treats the summarization process as a reconstruction task and extracts the most representative posts as the summary by minimizing the reconstruction loss.
MDS-Sparse [6] extracts multi-document summaries with a sparse-coding-based technique, minimizing the loss of reconstructing the original documents under a sparsity constraint to guarantee the conciseness and importance of the summary.
PacSum [8] is a graph-based summarization method that extracts sentence features with BERT and models documents as a directed graph structure while taking the relative position information between sentences into account.
Spectral [9] proposes a spectrum-based hypothesis, defines a concept of spectral importance, and extracts the sentences with higher spectral importance as the summary.
The second group uses not only the textual content features of posts but also introduces the social relationship information among posts:
SNSR [7], based on sociological theory, models the social relationships among posts as a regularization term introduced into a sparse reconstruction framework, so that the social relationships additionally guide the summary extraction process.
SCMGR [10] uses a graph convolutional network to encode, over the social relationship network among posts, post representations fusing text content and social structure, and feeds the learned fused representations into a sparse reconstruction framework to extract important posts.
The experimental performance is evaluated with the ROUGE standard, specifically the four metrics ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU. ROUGE-N measures the N-gram overlap between the output summary and the reference summary; the experiments adopt the ROUGE-1 and ROUGE-2 variants. ROUGE-L measures the longest common subsequence between the output summary and the reference summary. ROUGE-SU measures the matching of unigrams and skip-bigrams between the output summary and the reference summary, allowing gaps between words. In the following experiments these four metrics are denoted R-1, R-2, R-L, and R-SU, respectively.
Table 3 shows the experimental results of the model and all comparison methods on both data sets; higher ROUGE scores indicate better model performance. Tables 4 and 5 show the ablation results of the model on the Twitter [11] and microblog data, respectively, where DSNSum is the result of the complete model, w/o denoising denotes the performance after removing the denoising module, and w/o GAT denotes the performance after removing the residual graph attention encoder.
TABLE 3: Performance of the proposed method and the comparison methods on the Twitter and microblog data sets
(Table 3 appears only as an image in the original filing; its values are not reproduced here.)
TABLE 4: Ablation results of the proposed method on the Twitter data

Twitter data     R-1    R-2    R-L    R-SU*
DSNSum           46.51  14.29  44.16  20.76
w/o denoising    45.02  13.33  42.72  19.83
w/o GAT          44.14  12.65  41.68  19.10
TABLE 5: Ablation results of the proposed method on the microblog data

Microblog data   R-1    R-2    R-L    R-SU*
DSNSum           37.01  10.98  14.22  13.06
w/o denoising    35.31   9.76  13.43  12.07
w/o GAT          34.36   8.93  13.29  11.12
As Table 3 shows, the proposed method achieves the best performance on the Twitter data, surpassing all comparison methods; on the microblog data it is slightly below the SCMGR model under the R-L metric but exceeds all other comparison models on the remaining metrics. These results confirm the effectiveness of the proposed method. In the ablation experiments (Tables 4 and 5), removing either module degrades performance, showing that each module contributes to the overall model. Removing the denoising module lowers performance, confirming that noisy relations inject extraneous information into the summarization process and harm the summary; by identifying and removing noisy relations in the network, the denoising module reduces their influence and improves summary quality. Removing the graph attention network causes an even larger drop, indicating that social relationship information in the post-level social relationship network effectively supports content analysis in the social media setting: on the one hand, the graph attention network alleviates the sparse content of individual posts by aggregating relevant background information from neighboring nodes; on the other hand, the topology of the post-level social relationship network offers an additional, sociologically grounded cue for identifying important posts.
To further examine whether the proposed denoising graph auto-encoder module actually removes noisy relations and improves the network structure, an additional experiment was conducted. Keeping the post representations fixed (encoded with the same pre-trained BERT model), the proportion of noisy relations was computed on the network after denoising; the results are shown in Table 6.
TABLE 6: Proportion of noisy relations in the denoised network on the Twitter and microblog data; values in parentheses give the drop relative to the network before denoising

Data set         False-relation rate   Potential-relation rate   Overall noise rate
Twitter data     13.60% (↓25.01%)      54.93% (↓0.86%)           54.50% (↓0.87%)
Microblog data   45.29% (↓37.88%)      49.48% (↓3.18%)           46.57% (↓6.10%)
As the table shows, with the textual content representations of the posts held fixed, the overall noise ratio in the network after denoising decreases, confirming the effectiveness of the denoising process. The false-relation rate drops by 25.01% on the Twitter data and by 37.88% on the microblog data after denoising, indicating that the denoising module is especially effective at removing false relations from the network.
To verify whether the post representations learned by the denoising graph auto-encoder (DGAE) are better than the original BERT representations, the distribution of noisy relations in the network was compared between the DGAE representations learned by the proposed method and the BERT representations, with the post-level social relationship network held fixed. Because the value of the threshold θ strongly affects the measured distribution of noisy relations, the experiment reports the noise distribution under different values of θ. Specifically, the threshold θ is computed according to the following formula:
θ = min Φ + δ · (max Φ − min Φ)
where Φ is the semantic similarity matrix between posts and δ is a tuning parameter. The experimental results are shown in Fig. 3.
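For illustration, the threshold computation above can be sketched as follows in Python; the toy similarity matrix and all variable names are assumptions for demonstration only.

```python
import numpy as np

def noise_threshold(phi: np.ndarray, delta: float) -> float:
    """theta = min(Phi) + delta * (max(Phi) - min(Phi))."""
    return phi.min() + delta * (phi.max() - phi.min())

# Toy cosine-similarity matrix between three posts.
phi = np.array([[1.0, 0.3, 0.7],
                [0.3, 1.0, 0.2],
                [0.7, 0.2, 1.0]])

for delta in (0.1, 0.5, 0.9):
    print(delta, noise_threshold(phi, delta))
```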
As Fig. 3 shows, as the threshold θ increases, the potential-relation rate decreases while the false-relation rate increases, and the overall noise rate remains at a generally high level. After DGAE denoising, the potential-relation rate drops sharply and the false-relation rate also stays at a lower level. Most importantly, the overall noise rate shows a marked downward trend compared with the network before denoising, confirming that DGAE effectively removes noisy relations from the network.
The x-axis in Fig. 3 is the tuning parameter δ. Subfigures (a) and (c) correspond to representations encoded with the BERT model; subfigures (b) and (d) correspond to representations learned by the denoising graph auto-encoder.
Additional experiments analyze how the order and the proportion of the two noise relations in the noise function affect model performance. The trend of model performance is observed while adjusting the probabilities of the two noise relations in the noise function. To further test whether the order in which the two noise relations are added affects performance, the two orders are denoted insertion-then-loss and loss-then-insertion, and the change in model performance is observed. The results are shown in Fig. 4a and Fig. 4b.
Fig. 4a and Fig. 4b show the influence of the two noise-addition orders and of the different noise-relation probabilities in the noise function on the experimental results. Fig. 4a shows the case where false relations are inserted first and potential relations are removed afterwards; Fig. 4b shows the reverse order. The horizontal axis is the insertion probability of the noise relations and the vertical axis is the loss probability.
References:
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010.
[2] Dragomir Radev, Sasha Blair-Goldensohn, and Zhu Zhang. 2001. Experiments in Single and Multi-Document Summarization Using MEAD. In First Document Understanding Conference, 1–8.
[3] Yihong Gong and Xin Liu. 2001. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 19–25.
[4] Gunes Erkan and Dragomir Radev. 2004. LexRank: Graph-based Lexical Centrality As Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457–479.
[5] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, and X. He. 2012. Document Summarization Based on Data Reconstruction. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 620–626.
[6] He Liu, Hongliang Yu, and Zhi-Hong Deng. 2015. Multi-Document Summarization Based on Two-Level Sparse Representation Model. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 196–202.
[7] Ruifang He and Xingyi Duan. 2018. Twitter Summarization Based on Social Network and Sparse Reconstruction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 5787–5794.
[8] Hao Zheng and Mirella Lapata. 2019. Sentence Centrality Revisited for Unsupervised Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6236–6247.
[9] Kexiang Wang, Baobao Chang, and Zhifang Sui. 2020. A Spectral Method for Unsupervised Multi-Document Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 435–445.
[10] Huanyu Liu, Ruifang He, Liangliang Zhao, Haocheng Wang, and Ruifang Wang. 2021. SCMGR: Using Social Context and Multi-Granularity Relations for Unsupervised Social Summarization. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management, 1058–1068.
[11] Ruifang He, Liangliang Zhao, and Huanyu Liu. 2020. TWEETSUM: Event-oriented Social Summarization Dataset. In Proceedings of the 28th International Conference on Computational Linguistics, 5731–5736.
[12] Jing Li, Wei Gao, Zhongyu Wei, Baolin Peng, and Kam-Fai Wong. 2015. Using Content-level Structures for Summarizing Microblog Repost Trees. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2168–2178.
the present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make various changes in form and details without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. An unsupervised social media summarization method based on a denoising graph auto-encoder, characterized by comprising the following steps:
S1, constructing a post-level social relationship network according to sociological theory; defining the noise-free relationships in the post-level social relationship network, i.e. the real social relationship network; and obtaining the content encoding of each post with a pre-trained BERT model as the post's initial content representation;
S2, defining two types of noisy relationship, false relationships and potential relationships, according to users' social behaviors and habits; adding instances of false and potential relationships into the original post-level social relationship network through corresponding noise functions, thereby constructing a post-level social relationship network containing noisy relationships, i.e. a pseudo social relationship network; sampling a plurality of the generated pseudo social relationship networks, and feeding the sampled pseudo social relationship network instances together with the posts' initial content representations into a residual graph attention network encoder, which contains a multi-head attention mechanism and encodes each post from its initial content representation and its social relationships to obtain the post's vector representation;
S3, constructing a decoder which, together with the residual graph attention network encoder, forms the denoising graph auto-encoder; the decoder reconstructs the real social relationship network from the posts' vector representations so as to capture the social relationship information among posts, and simultaneously reconstructs the semantic relationship between each post and the words it contains so as to capture the posts' textual content information; because the reconstruction target is the real social relationship network without noisy relationships, the residual graph attention network encoder and the decoder learn to exclude the noisy relationships in the post-level social relationship network, finally yielding accurate post representations;
S4, from the post representations obtained in step S3, selecting the final summary with a sparse-reconstruction-based summary extractor: iteratively selecting the highest-scoring post and adding it to the final summary set, and repeating this process until the summary length limit is reached.
2. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 1, wherein step S1 is as follows: the post-level social relationship network consists of a node set and an edge set, where each node represents a post and each edge represents a social relationship between the corresponding posts; posts carry two kinds of social relationship, the expression-consistency relationship and the expression-infectivity relationship; the expression-consistency relationship holds among posts published by the same user, and when building the post-level social relationship network an edge is established between post nodes with this relationship; the expression-infectivity relationship holds among posts published by users with a direct interaction relationship, where a direct interaction refers to a follow, retweet or comment interaction between users, and when building the post-level social relationship network an edge is established between post nodes with this relationship.
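A minimal Python sketch of the two construction rules in claim 2 follows, assuming the networkx package and toy input structures (posts_by_user mapping each user to the posts they published, neighbors mapping each user to the users they directly interact with); all names are illustrative, not from the original filing.

```python
import itertools
import networkx as nx

posts_by_user = {"u1": ["s1", "s2"], "u2": ["s3"], "u3": ["s4"]}
neighbors = {"u1": {"u2"}, "u2": {"u1"}, "u3": set()}

G = nx.Graph()
G.add_nodes_from(itertools.chain.from_iterable(posts_by_user.values()))

# Expression-consistency: connect posts published by the same user.
for posts in posts_by_user.values():
    G.add_edges_from(itertools.combinations(posts, 2))

# Expression-infectivity: connect posts of users with a direct interaction.
for u, nbrs in neighbors.items():
    for v in nbrs:
        for si in posts_by_user.get(u, []):
            for sj in posts_by_user.get(v, []):
                G.add_edge(si, sj)

print(sorted(G.edges()))  # e.g. [('s1', 's2'), ('s1', 's3'), ('s2', 's3')]
```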
3. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 2, wherein in step S1:
(101) the post-level social relationship network is formally described as follows: let S = {s_1, s_2, …, s_N} denote the set of posts, N being the number of posts, where s_i (1 ≤ i ≤ N) is the i-th post; let U = {u_1, u_2, …, u_M} denote the set of users, containing M users in total, where u_i (1 ≤ i ≤ M) is the i-th user; for user u_i, let N(u_i) denote the set of u_i's neighbor users, i.e. the users having a direct social relationship with u_i, and let S(u_i) denote the set of all posts published by user u_i; the post-level social relationship network G = (V, ε) is built according to the following rules, where V is the node set, each node corresponding to one post, and ε is the edge set between nodes, each edge corresponding to a social relationship between posts; expression-consistency social relationship: if posts s_i, s_j ∈ S(u_k), where u_k is the k-th user, then an edge e_ij ∈ ε is established between posts s_i and s_j; expression-infectivity social relationship: if s_i ∈ S(u_k) and s_j ∈ S(u_l) with u_l ∈ N(u_k) or u_k ∈ N(u_l), then an edge e_ij ∈ ε is established between s_i and s_j; the post-level social relationship network G = (V, ε) built from these two rules contains only the post node set V and the relationships between nodes, i.e. the edge set ε = {e_11, e_12, …, e_NN}; the adjacency matrix of the constructed network is denoted A ∈ R^{N×N}, where A_ij > 0 indicates a social relationship connection between post nodes s_i and s_j, and otherwise A_ij = 0;
(102) the content encoding of each post is obtained with the pre-trained BERT model as the post's initial content representation, as follows: each post s_i is fed into the pre-trained BERT model, and the last-layer representation of the sentence-start symbol is taken as the post's initial content representation, as in equation (1):
x_i = BERT(s_i)    (1)
where x_i is the initial content representation of post s_i; the initial content representations of all N posts are finally collected as X = [x_1, …, x_N].
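A hedged sketch of equation (1) follows, using the Hugging Face transformers package to take the last-layer [CLS] (sentence-start) vector of a pre-trained BERT model as the initial post representation; the checkpoint name and batch handling are assumptions, since the filing does not specify them.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the filing only says "pre-trained BERT model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

posts = ["示例帖子一", "示例帖子二"]  # toy posts
with torch.no_grad():
    enc = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # [CLS] position of the last layer: one vector x_i per post.
    X = out.last_hidden_state[:, 0, :]

print(X.shape)  # (num_posts, hidden_dim), e.g. (2, 768)
```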
4. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 1, wherein in step S2:
(201) the two noisy relationships, false and potential, are defined as follows:
(a) false relationship: if two posts have a social relationship between them but their content relevance is below a set threshold, the social relationship between them is defined as a false relationship;
(b) potential relationship: if two posts have no social relationship between them but their content relevance is above a set threshold, a potential relationship is defined between them;
the noise function corresponding to the false relationship is relation insertion, and the noise function corresponding to the potential relationship is relation loss, as follows:
(c) relation insertion: randomly add an edge between any two unconnected post nodes in the post-level social relationship network, connecting the two nodes;
(d) relation loss: randomly remove an edge between any two connected post nodes in the post-level social relationship network;
a pseudo social relationship network is constructed as training data by adding instances of the noisy relationships to the real social relationship network.
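The two noise functions of claim 4 can be sketched as below, applied directly to an adjacency matrix A; the probabilities p_ins and p_del and all names are illustrative assumptions.

```python
import numpy as np

def add_noise(A: np.ndarray, p_ins: float, p_del: float, seed: int = 0):
    """Corrupt a symmetric adjacency matrix to build a pseudo network."""
    rng = np.random.default_rng(seed)
    noisy = A.copy()
    n = A.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] == 0 and rng.random() < p_ins:   # relation insertion
                noisy[i, j] = noisy[j, i] = 1
            elif A[i, j] > 0 and rng.random() < p_del:  # relation loss
                noisy[i, j] = noisy[j, i] = 0
    return noisy

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(add_noise(A, p_ins=0.5, p_del=0.5))
```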
5. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 4, wherein in step S2 a residual graph attention network encoder encodes each post from its initial content representation and its social relationships, so as to integrate the posts' textual content information with their social relationship information; the residual graph attention network encoder is regarded as an information propagation model that learns node representations in the post-level social relationship network by aggregating information from neighboring nodes, where neighboring nodes are the nodes connected to a node by an edge in the post-level social relationship network; specifically:
the residual graph attention network encoder takes the initial content representations of the nodes X ∈ R^{N×D} and the adjacency matrix A ∈ R^{N×N} of the post-level social relationship network as input, where D is the dimension of the node feature representation and N is the number of posts; the propagation rules of the residual graph attention network encoder are given by equations (2) and (3):
H^{(l+1)} = σ(Ã H^{(l)} W^{(l)} + b^{(l)})    (2)
Ã_ij = α_ij^{(l)} · (A + I)_ij    (3)
where H^{(l)} is the hidden representation of the residual graph attention network encoder at layer l, A is the adjacency matrix of the post-level social relationship network with A_ij representing the relation between posts s_i and s_j, I is the identity matrix, Ã is the adjacency matrix after the attention weights have been added, with Ã_ij denoting the attention-weighted relation weight between posts s_i and s_j, σ(·) is a nonlinear activation function, α_ij^{(l)} is the attention score between posts s_i and s_j at layer l, and W^{(l)} and b^{(l)} are the learning parameters of the residual graph attention network encoder at layer l; to further integrate the posts' initial content representations, X = [x_1, …, x_N] is used as the input of the residual graph attention network encoder, i.e. H^{(0)} = X; the attention weights are computed with scaled dot-product attention [1], and the ordinary attention mechanism is extended to a multi-head attention mechanism by mapping the latent representations into K different subspaces, where K is the total number of heads of the multi-head attention mechanism and each subspace is called an attention head; the attention weight is computed separately in each subspace:
e_ij^{head_k} = ((W_Q^{head_k} h_i)^T (W_K^{head_k} h_j)) / sqrt(d_h)    (4)
α_ij^{head_k} = exp(e_ij^{head_k}) / Σ_{t∈N(i)} exp(e_it^{head_k})    (5)
where h_i and h_j are the vector representations of posts s_i and s_j encoded by the residual graph attention network encoder; e_ij^{head_k} and α_ij^{head_k} are respectively the attention score and the normalized attention weight between posts s_i and s_j in the k-th attention head; N(i) denotes the neighbors of post s_i in the network; (·)^T is the transpose operation; d_h is the dimension of the hidden representation in the attention computation; the layer superscript (l) is omitted here and the superscript head_k indicates the k-th attention head; W_Q^{head_k} and W_K^{head_k} are the corresponding learning parameters of the k-th attention head; equations (4) and (5) yield K attention weights; a max-pooling operation then automatically selects the strongest relation over all subspaces as the real relation between two post nodes, unifying the attention weights of the K attention heads into a final attention score:
α_ij = max(α_ij^{head_1}, …, α_ij^{head_K})    (6)
where α_ij is the final attention weight between posts s_i and s_j; the connections between the layers of an ordinary graph attention network are replaced with residual connections to form the residual graph attention network encoder, so that input information can be passed directly to the output layer; the encoding rule of the residual graph attention network encoder is therefore modified into the following form:
H^{(l+1)} = σ(Ã H^{(l)} W^{(l)} + b^{(l)}) + f(H^{(l)})    (7)
where f(·) is a mapping function implemented by a feed-forward neural network with a nonlinear activation function:
f(H^{(l)}) = σ(W_f H^{(l)} + b_f)    (8)
where W_f and b_f are the learning parameters of the mapping function and σ(·) is a nonlinear activation function; during encoding, the depth L of the residual graph attention network encoder determines the information propagation distance in the post-level social relationship network; the residual graph attention network encoder encodes the posts according to the rules of equations (7) and (8), and the output of the last layer H^{(L)} = [h_1, …, h_N] is the vector representation of the encoded posts, where h_i is the representation of post s_i encoded by the residual graph attention network encoder, used in the subsequent sparse-reconstruction-based summary extraction.
6. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 1, wherein step S3 is as follows:
a decoder is set up with two reconstruction targets: on the one hand the decoder reconstructs the real social relationship network without noisy relationships to capture the social relationship information among posts, and on the other hand it reconstructs the textual content contained in the posts, thereby capturing the posts' textual content information and further enriching the posts' vector representations;
for the reconstruction of the real social relationship network, the decoder predicts whether a social relationship exists between two post nodes from their vector representations; specifically, the probability that a social relationship exists between two nodes is predicted from the inner product of their vector representations:
Â_ij = σ(h_i^T h_j)    (9)
where (·)^T is the transpose operation on a vector representation; for each pair of posts s_i and s_j the decoder predicts the probability of a social relationship between them, where Â ∈ R^{N×N} is the adjacency matrix of the post-level social relationship network output by the decoder, Â_ij is the predicted probability that a social relationship exists between posts s_i and s_j, h_i and h_j are the vector representations of posts s_i and s_j encoded by the residual graph attention network encoder, and σ(·) is a nonlinear activation function;
for the reconstruction of textual content, the relationship between posts and words is reconstructed, and the textual content information of each post is preserved by reconstructing the words it contains; since each post typically contains several words, the content reconstruction is modeled as a multi-label classification task:
ŝ_i = σ(W_c h_i + b_c)    (10)
where W_c ∈ R^{V×Z} and b_c ∈ R^V are learning parameters of the decoder, Z is the dimension of the post vector representation obtained from the encoder, and V is the vocabulary size; ŝ_i is the prediction of the decoder, whose element ŝ_ij is the probability that post s_i contains word w_j;
a loss function is designed for each of the two reconstruction targets, so the overall training objective contains two parts; the first part is the loss of reconstructing the real social relationship network, denoted L_g, the binary cross-entropy between the prediction Â and the adjacency matrix A of the real social relationship network:
L_g = −Σ_i Σ_j [ A_ij log Â_ij + (1 − A_ij) log(1 − Â_ij) ]    (11)
the second part is the loss of reconstructing the posts' content, denoted L_c, the binary cross-entropy between the decoder prediction ŝ_i and the true result s_i:
L_c = −Σ_i Σ_j [ s_ij log ŝ_ij + (1 − s_ij) log(1 − ŝ_ij) ]    (12)
where s_ij is the real training label indicating whether post s_i contains word w_j: s_ij = 1 if post s_i contains word w_j, otherwise s_ij = 0; finally the two losses are combined with the balance parameter λ into the final loss function L:
L = λ L_g + (1 − λ) L_c    (13)
the residual graph attention network encoder and the decoder are trained with this loss function; after training, accurate post representations H = [h_1, h_2, …, h_N] are obtained that fuse social relationship information with textual content information and exclude the noisy relationships.
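The joint objective of equations (9)-(13) can be sketched as follows; the tensor shapes, the averaging inside the binary cross-entropy, and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dgae_loss(H, A, bow, W_c, b_c, lam=0.5):
    A_hat = torch.sigmoid(H @ H.T)              # eq. (9): inner-product decoder
    s_hat = torch.sigmoid(H @ W_c.T + b_c)      # eq. (10): multi-label word prediction
    L_g = F.binary_cross_entropy(A_hat, A)      # eq. (11): network reconstruction
    L_c = F.binary_cross_entropy(s_hat, bow)    # eq. (12): content reconstruction
    return lam * L_g + (1 - lam) * L_c          # eq. (13): balanced combination

N, Z, V = 4, 8, 20
H = torch.randn(N, Z)                           # post representations
A = torch.randint(0, 2, (N, N)).float()         # clean adjacency matrix
bow = torch.randint(0, 2, (N, V)).float()       # s_ij: post i contains word j
W_c, b_c = torch.randn(V, Z), torch.zeros(V)
print(dgae_loss(H, A, bow, W_c, b_c).item())
```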
7. The unsupervised social media summarization method based on a denoising graph auto-encoder according to claim 1, wherein step S4 is as follows:
given the accurate, noise-free post representations H = [h_1, h_2, …, h_N] encoded by the residual graph attention network encoder, the summary extraction is modeled as a sparse reconstruction process:
min_V ||H − (M ⊙ V) H||_F^2 + β ||V||_{2,1} + γ ||V||_F^2    (14)
where ||·||_F is the Frobenius norm and V ∈ R^{N×N} is the reconstruction coefficient matrix, whose element V_{i,j} is the contribution of post s_j to reconstructing post s_i; to avoid extracting repeated, redundant content, a similarity matrix M ∈ R^{N×N} is introduced to remove redundant information: if the cosine similarity of posts s_i and s_j is above the threshold η then M_{i,j} = 0, otherwise M_{i,j} = 1; ⊙ is the Hadamard product; to prevent a post from reconstructing itself, the diagonal elements of the reconstruction coefficient matrix V are fixed to 0 during reconstruction; β and γ are hyper-parameters controlling the weights of the corresponding regularization terms; H is the accurate post representation; ||·||_{2,1} is the L21 norm, defined as:
||V||_{2,1} = Σ_{i=1}^{N} sqrt( Σ_{j=1}^{N} V_{i,j}^2 )    (15)
adding the L21 constraint to the reconstruction coefficient matrix V makes each of its rows sparse, i.e. most elements of each row are 0, which means each post can be reconstructed by only a limited number of posts, thereby limiting the summary length; the final score of each post is defined as the sum of its contributions to the reconstruction of all other posts:
score(s_i) = Σ_{j=1}^{N} V_{j,i}    (16)
where score(s_i) is the final score of post s_i; finally all posts are ranked by their final scores, the highest-scoring post is iteratively selected into the final summary set, and the process is repeated until the summary length limit is reached.
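A hedged sketch of the sparse-reconstruction extractor of equations (14)-(16) follows, solving for the coefficient matrix V by gradient-based optimization (Adam); the solver, step count, the small epsilon inside the L21 norm, and the use of absolute contributions in the final score are assumptions, since the filing does not prescribe an optimization procedure.

```python
import torch

def extract_scores(H, M, beta=0.1, gamma=0.1, steps=500, lr=0.01):
    N = H.size(0)
    V = torch.zeros(N, N, requires_grad=True)
    opt = torch.optim.Adam([V], lr=lr)
    off_diag = 1.0 - torch.eye(N)                      # force V_ii = 0
    for _ in range(steps):
        opt.zero_grad()
        Vm = V * M * off_diag                          # masked coefficients
        recon = torch.norm(H - Vm @ H) ** 2            # Frobenius reconstruction
        l21 = (Vm.pow(2).sum(dim=1) + 1e-8).sqrt().sum()  # eq. (15): row sparsity
        loss = recon + beta * l21 + gamma * Vm.pow(2).sum()
        loss.backward()
        opt.step()
    Vm = (V * M * off_diag).detach()
    return Vm.abs().sum(dim=0)                         # eq. (16): column sums

H = torch.randn(6, 8)                                  # toy post representations
M = torch.ones(6, 6)                                   # no redundancy mask here
scores = extract_scores(H, M)
print(scores.argsort(descending=True))                 # post indices by importance
```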
CN202210393787.7A 2022-04-15 2022-04-15 Unsupervised social media summarization method based on de-noised image self-encoder Pending CN115017299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210393787.7A CN115017299A (en) 2022-04-15 2022-04-15 Unsupervised social media summarization method based on de-noised image self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210393787.7A CN115017299A (en) 2022-04-15 2022-04-15 Unsupervised social media summarization method based on de-noised image self-encoder

Publications (1)

Publication Number Publication Date
CN115017299A true CN115017299A (en) 2022-09-06

Family

ID=83066492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210393787.7A Pending CN115017299A (en) 2022-04-15 2022-04-15 Unsupervised social media summarization method based on de-noised image self-encoder

Country Status (1)

Country Link
CN (1) CN115017299A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240004907A1 (en) * 2022-06-30 2024-01-04 International Business Machines Corporation Knowledge graph question answering with neural machine translation
US12013884B2 (en) * 2022-06-30 2024-06-18 International Business Machines Corporation Knowledge graph question answering with neural machine translation
CN115545349A (en) * 2022-11-24 2022-12-30 天津师范大学 Time sequence social media popularity prediction method and device based on attribute sensitive interaction
CN115545349B (en) * 2022-11-24 2023-04-07 天津师范大学 Time sequence social media popularity prediction method and device based on attribute sensitive interaction
CN115934933A (en) * 2023-03-09 2023-04-07 合肥工业大学 Text abstract generation method and system based on double-end comparison learning
CN115934933B (en) * 2023-03-09 2023-07-04 合肥工业大学 Text abstract generation method and system based on double-end contrast learning
CN117131187A (en) * 2023-10-26 2023-11-28 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117131187B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117372631A (en) * 2023-12-07 2024-01-09 之江实验室 Training method and application method of multi-view image generation model
CN117372631B (en) * 2023-12-07 2024-03-08 之江实验室 Training method and application method of multi-view image generation model

Similar Documents

Publication Publication Date Title
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN108197111B (en) Text automatic summarization method based on fusion semantic clustering
CN107766324B (en) Text consistency analysis method based on deep neural network
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
CN110134946B (en) Machine reading understanding method for complex data
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN111651198B (en) Automatic code abstract generation method and device
CN112395393B (en) Remote supervision relation extraction method based on multitask and multiple examples
CN109840324B (en) Semantic enhancement topic model construction method and topic evolution analysis method
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
CN115329088B (en) Robustness analysis method of graph neural network event detection model
CN113378573A (en) Content big data oriented small sample relation extraction method and device
CN113988075A (en) Network security field text data entity relation extraction method based on multi-task learning
CN116127099A (en) Combined text enhanced table entity and type annotation method based on graph rolling network
CN113158659B (en) Case-related property calculation method based on judicial text
CN114742069A (en) Code similarity detection method and device
Sandhan et al. Evaluating neural word embeddings for Sanskrit
CN114218921A (en) Problem semantic matching method for optimizing BERT
CN113901813A (en) Event extraction method based on topic features and implicit sentence structure
CN115629800A (en) Code abstract generation method and system based on multiple modes
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
CN113312903B (en) Method and system for constructing word stock of 5G mobile service product
CN115617981A (en) Information level abstract extraction method for short text of social network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination