CN111639189B

CN111639189B - Text graph construction method based on text content features

Info

Publication number: CN111639189B
Application number: CN202010356482.XA
Authority: CN
Inventors: 杨黎斌; 梅欣; 戴航; 蔡晓妍
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2023-03-21
Anticipated expiration: 2040-04-29
Also published as: CN111639189A

Abstract

The invention discloses a text graph construction method based on text content characteristics, which is used for separating the dependence on the co-occurrence relationship when constructing the edges of a text graph and simultaneously reserving the semantic relationship of word nodes, thereby realizing the accurate expression of text semantic characteristics; the method comprises the steps that a graph obtained by the first method has a global node, namely the node with the largest degree, and the global node and other residual nodes have connecting edges; however, if the number of nodes in the graph is large and the weight difference of the nodes is not large, the difference between the degree set value of the intermediate node and the weight value of the node is too large in this way; the second method can solve the defects of the method to a certain extent, but the constructed graph may not be communicated, if a subsequently adopted learning algorithm has requirements on the connectivity of the graph or a global node feature is used for representing a graph feature method, the method can be flexibly selected according to actual requirements, and therefore the flexibility of text graph construction is improved.

Description

Text graph construction method based on text content features

Technical Field

The invention relates to a text graph construction method, in particular to a text graph construction method based on text content characteristics.

Background

With the continuous development of deep learning, algorithms in the image field become mature, and in recent years, the graph neural network is widely applied in the image field. Many people have therefore attempted to apply graph neural network related algorithms to the text domain for natural language processing. To apply an algorithm for processing structured data to unstructured data, a graph structure representation needs to be generated from unstructured data (e.g., text) first.

Most of the existing graph construction algorithms divide words into texts, take the words as points in the graphs, and add edges between word nodes or add connecting edges between word nodes appearing in the same sentence according to the co-occurrence relation of all the words in the texts in the same window. Firstly, most of the existing graph construction algorithms completely depend on the sequence of text sentences, and a plurality of sentences possibly exist for expressing similar meanings when the sentences are divided by taking the sentences as units, or words with opposite semantics exist in the same sentence; sliding windows are used to determine co-occurrence relationships and the choice of window size will directly affect the effectiveness of the construction algorithm. Furthermore, the graph constructed by the algorithms is likely to have many isolated nodes, and some of the algorithms are suitable for learning algorithms with requirements on graph structures.

Therefore, most of graph construction methods in the prior art have the problems that the graph construction methods depend on text surface word order information excessively, the constructed graph is not a simple graph in mathematical definition, and isolated nodes may exist.

Disclosure of Invention

The invention aims to provide a text graph construction method based on text content characteristics, which is used for solving the problem that semantic characteristics of a text cannot be prepared and expressed in the graph construction method in the prior art.

In order to realize the task, the invention adopts the following technical scheme:

a text graph construction method based on text content features is used for converting a text to be converted into a text graph, and the method is executed according to the following steps:

step 1, obtaining a text to be converted;

step 2, performing text preprocessing on the text to be converted to obtain a preprocessed text; the text preprocessing comprises word segmentation processing, cleaning processing and standardization processing which are sequentially carried out;

wherein the preprocessed text comprises a plurality of words;

step 3, extracting the characteristics of the preprocessed text obtained in the step 2 to obtain the weight value of each word in the preprocessed text;

step 4, acquiring degree values of nodes corresponding to each word according to the weight values of the words acquired in the step 3; the degree value of the node corresponding to the word with the highest weight value is highest;

obtaining degree values of a plurality of nodes;

and 5, obtaining a text graph according to the degree values of the plurality of nodes obtained in the step 4.

Further, when the feature extraction is performed on the preprocessed text obtained in step 2 in step 3, a weight value of each word in the preprocessed text is obtained by using a TextRank algorithm or a Tf-idf algorithm.

Further, the step 4 specifically includes:

step 4.1, obtaining nodes corresponding to each word according to the plurality of words in the preprocessed text obtained in the step 2; according to the weighted value of each word obtained in the step 3, sorting nodes corresponding to each word in a descending order to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;

step 4.2, setting the degree value of the first node in the node sequence as n-1, and establishing a node-degree value linear model after setting the degree value of the last node in the node sequence as 1; the horizontal axis unit of the node-degree value linear model is a node, and the vertical axis unit of the node-degree value linear model is a degree value;

and 4.3, obtaining the degree value of the node corresponding to each word according to the node-degree value linear model.

Further, the step 4 specifically includes:

step I, obtaining nodes corresponding to each word according to the plurality of words in the preprocessed text obtained in the step 2; according to the weight value of each word obtained in the step 3, performing descending ordering on the nodes corresponding to each word to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;

step II, obtaining the degree value d of the ith node by adopting the formula I _i ：

Wherein omega _i Represents the weight value, ω, of the word corresponding to the ith node _sum Representing the sum of the weighted values of the words corresponding to all the nodes;

and III, repeating the step II until the degree value of the node corresponding to each word is obtained.

Compared with the prior art, the invention has the following technical characteristics:

1. the text graph construction method based on the text content characteristics provided by the invention breaks away from the dependence on the co-occurrence relationship when constructing the edges of the text graph and simultaneously retains the semantic relationship of word nodes, thereby realizing the accurate expression of the text semantic characteristics;

2. the text graph construction method based on the text content features can select a proper feature extraction algorithm according to the actual application requirements, can select a TextRank algorithm when the complete semantic features are required, can assign a weight value according to co-occurrence information of words, can select Tf-idf to extract the features when the overall topic features sufficiently express semantic information, and uses the word frequency to measure the importance, thereby improving the flexibility of text graph construction.

3. The text graph construction method based on the text content features provides two methods for obtaining the degree value, and a proper method can be selected according to practical application, wherein the graph obtained by the first method has a global node, namely the node with the maximum degree value, and the global node and other residual nodes have connecting edges; however, if the number of nodes in the graph is large and the weight difference of the nodes is not large, the difference between the degree set value of the middle node and the weight value of the node is too large by the method; the second method can solve the defects of the method to a certain extent, but the constructed graph may not be communicated, if a subsequently adopted learning algorithm has requirements on the connectivity of the graph or a global node feature is used for representing a graph feature method, if the subsequent algorithm has strong dependency on the weight degree, the second method can be selected, and therefore the flexibility of text graph construction is improved.

Drawings

FIG. 1 is a textual diagram constructed in one embodiment of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and examples. So that those skilled in the art can better understand the present invention. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.

The following definitions or conceptual connotations relating to the present invention are provided for illustration:

text graph: the network graph is constructed through text contents, the text graph is composed of nodes and undirected edges, and the nodes are words in the text.

Word segmentation processing: i.e. word segmentation, in particular to chinese text, i.e. segmenting the text into a set of words by a word segmentation tool.

Cleaning treatment: since in most cases there are many useless parts in the text that we have prepared. The cleaning is to remove the unnecessary punctuation marks, stop words and the like.

And (3) standardization treatment: in principle, the stem extraction mainly adopts a 'reduction' method to convert words into stems, for example, to process "cats" as "cat" and "effective" as "effective". The word shape reduction mainly adopts a method of 'conversion' to convert a word into an original shape, for example, 'drove' is processed into 'drive', and 'driving' is processed into 'drive'.

Example one

The embodiment discloses a text graph construction method based on text content characteristics, which is used for converting a text to be converted into a text graph.

The text graph construction method provided by the embodiment breaks away from the dependency on the co-occurrence relationship when constructing the edges, and simultaneously retains the semantic relationship of the word nodes.

The method is executed according to the following steps:

step 1, obtaining a text to be converted;

the text can be a sentence or an article. The method can be used in both Chinese and English, and a corresponding text processing method is provided below;

wherein the preprocessed text comprises a plurality of words;

in this embodiment, the text is first preprocessed, which mainly includes word segmentation, washing, and normalization. In terms of word segmentation, the thinking of word segmentation is different due to the particularity of the language. In most cases, english can be segmented by directly using a blank space, but in Chinese, because grammar is more complex, a third-party library such as jieba and the like is usually used for segmenting words; text cleaning is to remove unnecessary punctuation marks, stop words and the like; the final standardization is word shape reduction and stem extraction (for english).

The preprocessing operations are those conventionally possible, and are selected as appropriate, and if the selected text is too short, they are not necessarily all used. We assume that the preprocessing is performed with all the operations described above, for example. For Chinese text, only the operations of word segmentation and word stop removal are needed. Such as: "what episodes or historical backgrounds in the water enteromorpha are different from the smith? "which plot or history background is different from the history in the water and enteromorpha after word segmentation? The method is characterized in that stop words are removed, namely certain auxiliary words without practical meaning are removed, and the operated text is 'different historical backgrounds of plots transferred by the water entermorphism', and is a word bag library. Taking an English text as an example, word segmentation is not needed, but English words are transformed according to tenses and parts of speech, and stem extraction or word shape reduction can be performed if necessary, namely, suffixes added due to tense and part of speech transformation are removed, and the original forms of the words are reduced. Such as: "The file is a vertical dull duel between two shifted versions formats". The text after operation becomes "file vertical dull field text former".

in this embodiment, a suitable feature extraction algorithm is selected according to the actual application requirements, when the complete semantic features are required, a TextRank algorithm may be selected, a weight value is assigned according to co-occurrence information of words, when the overall topic features sufficiently express semantic information, tf-idf may be selected for feature extraction, and the importance is measured by using word frequency.

Optionally, when the feature extraction is performed on the preprocessed text obtained in step 2 in step 3, a weight value of each word in the preprocessed text is obtained by using a TextRank algorithm or a Tf-idf algorithm.

In this embodiment, the algorithm for TextRank to obtain the word node weight is as follows:

(1) The given text T is segmented in complete sentences, i.e. T = [ S = [ [ S ] ₁ ,S ₂ ,…,S _m ]

(2) And for each sentence, performing word segmentation and part-of-speech tagging, filtering out stop words, and only reserving words with specified parts-of-speech, such as nouns, verbs and adjectives, wherein ti and j are reserved candidate keywords. Si = [ t ] _i,1 ,t _i,2 ,...,t _i,n ]

(3) And (3) constructing a candidate keyword graph G = (V, E), wherein V is a node set and consists of the candidate keywords generated in the step (2), then constructing an edge between any two points by adopting a Co-Occurrence relation (Co-Occurence), wherein the edges exist between the two nodes only when the corresponding words Co-occur in a window with the length of K, and K represents the window size, namely, at most K words Co-occur.

(4) And (5) iteratively propagating the weight of each node according to a formula of TextRank until convergence.

In this embodiment, the main idea of TF-IDF is that if a word occurs in an article with high frequency TF and rarely occurs in other articles, the word or phrase is considered to have good category discrimination capability and is suitable for classification. TF is Term Frequency (Term Frequency): the word frequency (TF) represents the frequency with which terms (keywords) appear in text. Inverse file frequency (IDF): the IDF for a particular term may be obtained by dividing the total number of documents by the number of documents that contain that term and taking the logarithm of the resulting quotient. If the documents containing the entry t are fewer and the IDF is larger, the entry has good category distinguishing capability. The calculation process of the TF-IDF weight is as follows:

TF-IDF＝TF*IDF

obtaining degree values of a plurality of nodes;

the invention provides two methods for obtaining the degree value, the structures of the graphs obtained by the two methods are different, and a proper method can be selected according to different application scenes.

Optionally, the step 4 specifically includes:

step 4.1, obtaining nodes corresponding to each word according to the plurality of words in the preprocessed text obtained in the step 2; according to the weight value of each word obtained in the step 3, performing descending ordering on the nodes corresponding to each word to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;

step 4.2, setting the degree value of the first node in the node sequence as n-1, and establishing a node-degree value linear model after setting the degree value of the last node in the node sequence as 1; the horizontal axis in the node-degree value linear model is a node, and the vertical axis in the node-degree value linear model is a degree value;

In this embodiment, the nodes are arranged according to the magnitude of the weight valueAnd sequentially, assuming that n nodes exist in the current graph, the node degree with the largest weight value is assigned to be n-1, and the node degree with the smallest weight value is assigned to be 1. For coordinate pair (x, y), x is the weight value of the node, y is the degree of the corresponding node, and the data point (w) _max N-1) and a point (w) _min ,1)(w _max And w _min Maximum weight value and minimum weight value respectively) can determine a straight line y = kx + b, that is, the degrees of the rest nodes can be determined according to a straight line equation. The graph obtained in this way is a fully connected graph, and there is one global node connected to all the remaining nodes.

In this example, it is assumed that the article after preprocessing has 5 words, W1, W2, W3, W4, W5. The weight values are 5, 4, 3, 2, respectively, in descending order. After step 4.2, the degree of W1 is 4, the degree of W5 is 1, and the equation of the two-point straight line is y = x-1. In 4.3, the degree values of the nodes are 4, 3, 2, 1 and 1 according to the linear equation.

Optionally, the step 4 specifically includes:

step I, obtaining a node corresponding to each word according to a plurality of words included in the preprocessed text obtained in the step 2; according to the weighted value of each word obtained in the step 3, sorting nodes corresponding to each word in a descending order to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;

Wherein ω is _i Represents the weight value, ω, of the word corresponding to the ith node _sum Representing the sum of the weighted values of the words corresponding to all the nodes;

In this embodiment, the nodes are sorted according to the weight value, and assuming that there are n nodes in the current graph, the degree of the node i is

Wherein w _i A weight value of i, w _sum Is the sum of all node weights. The degree of each node in the graph obtained in this way reflects the importance of the node in the whole graph, but the obtained graph may not be a fully connected graph, and if the connectivity is not strictly required, the method can achieve good effect.

In this example, it is assumed that the article after preprocessing has 5 words, W1, W2, W3, W4, W5. The weight values are 5, 4, 3, 2, respectively, in descending order. Omega _sum In order to 16, the degree values of the nodes are 2, 1 and 1 through the steps II and III.

In the two degree value obtaining methods provided by the invention, in the first method, the maximum degree of a node is n-1, and the minimum degree is 1, so that the graph obtained subsequently is a fully connected graph, and the graph obtained in this way has a global node, namely the node with the maximum degree, which has connecting edges with other rest nodes. However, if the number of nodes in the graph is large and the weight difference between the nodes is not large, this way will make the difference between the degree setting value of the middle node and the weight value of the node too large. The second method solves the above-mentioned drawbacks to some extent, but the constructed figures may not be connected. If the subsequently adopted learning algorithm has a requirement on the connectivity of the graph or a global node feature is used for representing a graph feature method, a second method can be selected according to the situation if the subsequent algorithm has strong dependency on the weight degree.

In this embodiment, the edges are connected according to the node degrees obtained in step 4, starting from the node with the highest degree, the node is connected with the next n-1 nodes with the degree greater than 0, then the connected node degree is decreased by 1, if the node degree is decreased by 1 and then is less than 0, the connection edge between the current node and the node is cancelled, and meanwhile, the degree of the current node is decreased by 1. And then reordering the nodes according to the updated degrees, and repeating the operation until the degrees of all the nodes are reduced to 0.

Assume that there are 5 nodes to be processed: w1, W2, W3, W4, W5. And 4, calculating degrees of 4, 3, 2, 1 and 1 respectively. W1 is connected with the rest nodes, the node degrees are changed into 0, 2, 1, 0 and 0 at the moment, and the nodes are rearranged in descending order according to the degrees and changed into W2, W3, W1, W4 and W5. Let W2 and W3 be connected, and all node degrees become 0, and the completion is completed. The connected pattern is shown in figure 1.

In the invention, the unstructured representation of a text content, namely a graph structure, can be obtained through the steps, and then the text features can be learned by utilizing the relevant properties of the processing graph and the deep learning algorithm of the processing graph, so as to process relevant practical problems, such as classification, recommendation and the like.

Example two

In this embodiment, the method provided by the present invention is experimentally verified, and taking classification as an example, a text graph is constructed by using the method provided by the present invention, and then classification is performed by using Graph Attention Network (GAN) learning. GAN is a Graph neural network based on Attention mechanism, referred to in the paper Graph Attention Networks. Text-GAN (1) is the first method used in constructing the graph (step 4.1-step 4.3 in example one), pre-training the word vectors; text-GAN (2) is the second method used in constructing graphs (step I-step III method in example one), pre-training word vectors; text-GAN (2) -rand also uses the second graph construction method (step I-step III in the first embodiment) to randomly initialize the word vector. Text-GCN is used as a comparison algorithm, the algorithm is from the paper Graph conditional Networks for Text Classification, a Graph is constructed based on texts at first, but texts of all training sets and test sets are converted into an integral large Graph, the chapter points in the Graph also have word nodes, and the algorithm requires that articles in the test sets are known and are more like a clustering process. In addition, two variants of Convolutional Neural Networks (CNN) were also used for comparison: CNN-non-static is a pre-trained word vector; the CNN-rand word vectors are randomly initialized. CNN comes from the paper "volumetric Neural Networks for Session Classification", which uses convolution kernels to extract text features. DBLP and MR are two common classification data sets, there are six categories of DBLP data and two categories of MR, and the Accuracy (Accuracy), precision (Precision) and Recall (Recall) of the test results are compared respectively, and the experimental results are shown in table 1.

Table 1 comparison of the method of the invention with the prior art

According to experimental results, compared with a conventional Text processing algorithm (CNN) or a Text-GCN which is converted into a graph structure and then learned through a graph neural network, the neural network processing algorithm applying the Text graph construction algorithm provided by the invention has better performance in the field of Text classification.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus necessary general hardware, and certainly may also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present invention may be substantially implemented or a part of the technical solutions contributing to the prior art may be embodied in the form of a software product, where the computer software product is stored in a readable storage medium, such as a floppy disk, a hard disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Claims

1. A text graph construction method based on text content features is used for converting texts to be converted into text graphs, and is characterized by comprising the following steps:

step 1, acquiring a text to be converted;

wherein the preprocessed text comprises a plurality of words;

obtaining degree values of a plurality of nodes;

step 5, obtaining a text graph according to the degree values of the plurality of nodes obtained in the step 4;

the step 4 specifically comprises:

4.3, obtaining the degree value of the node corresponding to each word according to the node-degree value linear model;

or, the step 4 specifically includes:

step I, obtaining a node corresponding to each word according to a plurality of words included in the preprocessed text obtained in the step 2; according to the weight value of each word obtained in the step 3, performing descending ordering on the nodes corresponding to each word to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;

2. The text diagram construction method based on text content features according to claim 1, wherein when the feature extraction is performed on the preprocessed text obtained in the step 2 in the step 3, a TextRank algorithm or a Tf-idf algorithm is adopted to obtain a weight value of each word in the preprocessed text.