CN111639189B - Text graph construction method based on text content features - Google Patents


Info

Publication number
CN111639189B
Authority
CN
China
Prior art keywords
text
node
word
nodes
graph
Prior art date
Legal status
Active
Application number
CN202010356482.XA
Other languages
Chinese (zh)
Other versions
CN111639189A (en)
Inventor
杨黎斌
梅欣
戴航
蔡晓妍
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010356482.XA priority Critical patent/CN111639189B/en
Publication of CN111639189A publication Critical patent/CN111639189A/en
Application granted granted Critical
Publication of CN111639189B publication Critical patent/CN111639189B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text graph construction method based on text content features. When constructing the edges of the text graph, the method breaks the dependence on co-occurrence relationships while preserving the semantic relationships between word nodes, thereby accurately expressing the text's semantic features. Two methods of assigning node degree values are provided. The graph obtained by the first method has a global node, i.e. the node with the largest degree, which has connecting edges to all remaining nodes; however, if the graph contains many nodes whose weights differ little, this method makes the assigned degree values of the intermediate nodes diverge too far from their weight values. The second method alleviates this defect to a certain extent, but the constructed graph may not be connected. Whether the subsequently adopted learning algorithm requires a connected graph, or uses a global node feature to represent the graph, the appropriate method can be chosen flexibly according to actual requirements, which improves the flexibility of text graph construction.

Description

Text graph construction method based on text content features
Technical Field
The invention relates to a text graph construction method, in particular to a text graph construction method based on text content characteristics.
Background
With the continuous development of deep learning, algorithms in the image field have matured, and in recent years graph neural networks have been widely applied. Many researchers have therefore attempted to apply graph neural network algorithms to the text domain for natural language processing. To apply an algorithm designed for structured data to unstructured data, a graph structure representation must first be generated from the unstructured data (e.g., text).
Most existing graph construction algorithms segment the text into words, take the words as the points of the graph, and add edges between word nodes according to the co-occurrence of words within the same window, or add connecting edges between word nodes appearing in the same sentence. These approaches have several problems. First, most existing graph construction algorithms depend entirely on the order of the text's sentences, yet when the text is split by sentence, several sentences may express similar meanings, or words with opposite semantics may appear in the same sentence. Second, sliding windows are used to determine co-occurrence, and the choice of window size directly affects the effectiveness of the construction algorithm. Furthermore, the graphs constructed by these algorithms are likely to contain many isolated nodes, making them unsuitable for learning algorithms that place requirements on graph structure.
Therefore, most graph construction methods in the prior art suffer from excessive dependence on surface word-order information, from constructed graphs that are not simple graphs in the mathematical sense, and from possible isolated nodes.
Disclosure of Invention
The invention aims to provide a text graph construction method based on text content features, to solve the problem that graph construction methods in the prior art cannot accurately express the semantic features of a text.
To achieve this task, the invention adopts the following technical scheme:
a text graph construction method based on text content features is used for converting a text to be converted into a text graph, and the method is executed according to the following steps:
step 1, obtaining a text to be converted;
step 2, performing text preprocessing on the text to be converted to obtain a preprocessed text; the text preprocessing comprises word segmentation processing, cleaning processing and standardization processing which are sequentially carried out;
wherein the preprocessed text comprises a plurality of words;
step 3, extracting the characteristics of the preprocessed text obtained in the step 2 to obtain the weight value of each word in the preprocessed text;
step 4, acquiring degree values of nodes corresponding to each word according to the weight values of the words acquired in the step 3; the degree value of the node corresponding to the word with the highest weight value is highest;
obtaining degree values of a plurality of nodes;
and 5, obtaining a text graph according to the degree values of the plurality of nodes obtained in the step 4.
Further, when the feature extraction is performed on the preprocessed text obtained in step 2 in step 3, a weight value of each word in the preprocessed text is obtained by using a TextRank algorithm or a Tf-idf algorithm.
Further, the step 4 specifically includes:
step 4.1, obtaining nodes corresponding to each word according to the plurality of words in the preprocessed text obtained in the step 2; according to the weighted value of each word obtained in the step 3, sorting nodes corresponding to each word in a descending order to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;
step 4.2, setting the degree value of the first node in the node sequence as n-1, and establishing a node-degree value linear model after setting the degree value of the last node in the node sequence as 1; the horizontal axis unit of the node-degree value linear model is a node, and the vertical axis unit of the node-degree value linear model is a degree value;
and 4.3, obtaining the degree value of the node corresponding to each word according to the node-degree value linear model.
Further, the step 4 specifically includes:
step I, obtaining nodes corresponding to each word according to the plurality of words in the preprocessed text obtained in the step 2; according to the weight value of each word obtained in the step 3, performing descending ordering on the nodes corresponding to each word to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;
step II, obtaining the degree value d_i of the i-th node by formula I:

d_i = [n · ω_i / ω_sum]   (formula I)

wherein [·] denotes rounding to the nearest integer, ω_i represents the weight value of the word corresponding to the i-th node, and ω_sum represents the sum of the weight values of the words corresponding to all the nodes;
and III, repeating the step II until the degree value of the node corresponding to each word is obtained.
Compared with the prior art, the invention has the following technical characteristics:
1. the text graph construction method based on the text content characteristics provided by the invention breaks away from the dependence on the co-occurrence relationship when constructing the edges of the text graph and simultaneously retains the semantic relationship of word nodes, thereby realizing the accurate expression of the text semantic characteristics;
2. the text graph construction method based on text content features can select a suitable feature extraction algorithm according to the actual application requirements: when complete semantic features are required, the TextRank algorithm can be selected, assigning weight values according to the co-occurrence information of words; when the overall topic features sufficiently express the semantic information, Tf-idf can be selected for feature extraction, measuring importance by word frequency. This improves the flexibility of text graph construction.
3. The text graph construction method based on text content features provides two methods for obtaining the degree values, and a suitable method can be selected according to the practical application. The graph obtained by the first method has a global node, i.e. the node with the maximum degree value, which has connecting edges to all remaining nodes; however, if the graph contains many nodes whose weights differ little, this method makes the assigned degree values of the intermediate nodes diverge too far from their weight values. The second method alleviates this drawback to a certain extent, but the constructed graph may not be connected. If the subsequently adopted learning algorithm requires a connected graph or represents the graph by a global node feature, the first method can be selected; if the subsequent algorithm depends strongly on the correspondence between degree and weight, the second method can be selected. This improves the flexibility of text graph construction.
Drawings
FIG. 1 is a textual diagram constructed in one embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples, so that those skilled in the art can better understand it. It should be expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.
The following definitions or conceptual connotations relating to the present invention are provided for illustration:
text graph: the network graph is constructed through text contents, the text graph is composed of nodes and undirected edges, and the nodes are words in the text.
Word segmentation processing: i.e. segmenting the text into a set of words with a word segmentation tool; this applies particularly to Chinese text.
Cleaning processing: in most cases the prepared text contains many useless parts; cleaning removes the unnecessary punctuation marks, stop words, and the like.
Standardization processing: this mainly includes stem extraction and word-shape reduction. Stem extraction mainly adopts a "truncation" method to reduce words to their stems, for example processing "cats" as "cat" and "effective" as "effect". Word-shape reduction mainly adopts a "conversion" method to convert a word to its original form, for example processing "drove" as "drive" and "driving" as "drive".
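As a toy illustration of the "truncation" idea behind stem extraction (this naive suffix stripper is our own sketch, not the Porter algorithm that real pipelines use):

```python
def naive_stem(word: str) -> str:
    """Strip a few common English suffixes (toy stemmer, illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least a 3-letter stem so short words are left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word
```

naive_stem("cats") gives "cat" and naive_stem("walked") gives "walk"; irregular forms such as "drove" → "drive" require a dictionary-based lemmatizer rather than suffix stripping.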
Example one
The embodiment discloses a text graph construction method based on text content characteristics, which is used for converting a text to be converted into a text graph.
The text graph construction method provided by the embodiment breaks away from the dependency on the co-occurrence relationship when constructing the edges, and simultaneously retains the semantic relationship of the word nodes.
The method is executed according to the following steps:
step 1, obtaining a text to be converted;
the text can be a sentence or an article. The method can be used in both Chinese and English, and a corresponding text processing method is provided below;
step 2, performing text preprocessing on the text to be converted to obtain a preprocessed text; the text preprocessing comprises word segmentation processing, cleaning processing and standardization processing which are sequentially carried out;
wherein the preprocessed text comprises a plurality of words;
in this embodiment, the text is first preprocessed, which mainly includes word segmentation, washing, and normalization. In terms of word segmentation, the thinking of word segmentation is different due to the particularity of the language. In most cases, english can be segmented by directly using a blank space, but in Chinese, because grammar is more complex, a third-party library such as jieba and the like is usually used for segmenting words; text cleaning is to remove unnecessary punctuation marks, stop words and the like; the final standardization is word shape reduction and stem extraction (for english).
The preprocessing operations described above are conventional options, selected as appropriate; if the chosen text is very short, not all of them need be used. Here we assume, for illustration, that all of the above operations are performed. For Chinese text, only word segmentation and stop-word removal are needed: the sentence is first segmented into words, and then stop words, i.e. certain auxiliary words without practical meaning, are removed, leaving a bag-of-words representation. Taking an English text as an example, word segmentation is not needed, but English words are inflected by tense and part of speech, so stem extraction or word-shape reduction can be applied where necessary, i.e. suffixes added by tense and part-of-speech transformation are removed and words are reduced to their original forms. For example, "The film is a verbal duel between two gifted performers" becomes, after stop-word removal and normalization, "film verbal duel gift perform".
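For English, the segmentation-and-cleaning part of the pipeline can be sketched in a few lines of Python (a minimal sketch: the stop-word set here is a small illustrative subset, and a real pipeline would add a full stop-word list, jieba for Chinese segmentation, and NLTK-style stemming or lemmatization):

```python
import re

# Illustrative stop-word subset (a real list is much larger).
STOPWORDS = {"the", "is", "a", "an", "of", "in", "to", "between", "two"}

def preprocess(text: str) -> list[str]:
    """Segment an English text into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())      # word segmentation
    return [t for t in tokens if t not in STOPWORDS]  # cleaning
```

For example, preprocess("The film is a verbal duel between two gifted performers.") yields ['film', 'verbal', 'duel', 'gifted', 'performers'].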
Step 3, extracting the characteristics of the preprocessed text obtained in the step 2 to obtain the weight value of each word in the preprocessed text;
In this embodiment, a suitable feature extraction algorithm is selected according to the actual application requirements: when complete semantic features are required, the TextRank algorithm may be selected, assigning weight values according to the co-occurrence information of words; when the overall topic features sufficiently express the semantic information, Tf-idf may be selected for feature extraction, measuring importance by word frequency.
Optionally, when the feature extraction is performed on the preprocessed text obtained in step 2 in step 3, a weight value of each word in the preprocessed text is obtained by using a TextRank algorithm or a Tf-idf algorithm.
In this embodiment, the algorithm for TextRank to obtain the word node weight is as follows:
(1) Segment the given text T by complete sentences, i.e. T = [S_1, S_2, …, S_m].
(2) For each sentence S_i, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words of specified parts of speech, such as nouns, verbs and adjectives, giving S_i = [t_{i,1}, t_{i,2}, …, t_{i,n}], where the t_{i,j} are the retained candidate keywords.
(3) Construct a candidate keyword graph G = (V, E), where V is the node set composed of the candidate keywords generated in step (2); then add edges between pairs of nodes using the co-occurrence relation (Co-Occurrence): an edge exists between two nodes only when the corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.
(4) Iteratively propagate the weight of each node according to the TextRank formula until convergence.
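The steps above can be sketched as follows; this is a simplified TextRank in which the damping factor d = 0.85 and a fixed iteration count stand in for a proper convergence test (both are our assumptions, not values given in the text):

```python
from collections import defaultdict

def textrank_weights(sentences, window=2, d=0.85, iters=50):
    """sentences: lists of candidate keywords. Returns word -> TextRank score."""
    # Build the co-occurrence graph: an edge when two words fall in the same window.
    neighbors = defaultdict(set)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(i + 1, min(i + window, len(sent))):
                if sent[j] != w:
                    neighbors[w].add(sent[j])
                    neighbors[sent[j]].add(w)
    # Iteratively propagate weights with the PageRank-style update.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[v] / len(neighbors[v])
                                      for v in neighbors[w])
                 for w in score}
    return score
```

In the toy two-sentence corpus used in the test, "a" neighbours both "b" and "c" and therefore ends up with the highest score.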
In this embodiment, the main idea of TF-IDF is that if a word occurs with high frequency (high TF) in one article but rarely in other articles, the word or phrase is considered to have good category discrimination capability and to be suitable for classification. TF is the term frequency: the frequency with which a term (keyword) appears in a text. IDF is the inverse document frequency: the IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing that term and taking the logarithm of the quotient. The fewer the documents containing the term t, the larger the IDF, and the better the term's category discrimination capability. The TF-IDF weight is computed as:

TF = (number of occurrences of the term in the document) / (total number of terms in the document)

IDF = log(total number of documents / number of documents containing the term)

TF-IDF = TF × IDF
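Using the definitions above (term frequency normalized by document length, and idf as the logarithm of total documents over documents containing the term), a minimal sketch:

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns, per document, a dict word -> tf-idf."""
    n_docs = len(docs)
    df = {}                                  # document frequency of each term
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
        weights.append({w: (c / len(doc)) * math.log(n_docs / df[w])
                        for w, c in counts.items()})
    return weights
```

Note that a term occurring in every document gets weight 0 under these raw definitions; production implementations often smooth the idf to avoid this.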
step 4, acquiring degree values of nodes corresponding to each word according to the weight values of the words acquired in the step 3; the degree value of the node corresponding to the word with the highest weight value is highest;
obtaining degree values of a plurality of nodes;
the invention provides two methods for obtaining the degree value, the structures of the graphs obtained by the two methods are different, and a proper method can be selected according to different application scenes.
Optionally, the step 4 specifically includes:
step 4.1, obtaining nodes corresponding to each word according to the plurality of words in the preprocessed text obtained in the step 2; according to the weight value of each word obtained in the step 3, performing descending ordering on the nodes corresponding to each word to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;
step 4.2, setting the degree value of the first node in the node sequence as n-1, and establishing a node-degree value linear model after setting the degree value of the last node in the node sequence as 1; the horizontal axis in the node-degree value linear model is a node, and the vertical axis in the node-degree value linear model is a degree value;
and 4.3, obtaining the degree value of the node corresponding to each word according to the node-degree value linear model.
In this embodiment, the nodes are arranged in order of weight value. Assuming there are n nodes in the current graph, the node with the largest weight is assigned degree n-1 and the node with the smallest weight is assigned degree 1. For a coordinate pair (x, y), x is the weight value of a node and y is the degree of the corresponding node; the data points (w_max, n-1) and (w_min, 1), where w_max and w_min are the maximum and minimum weight values respectively, determine a straight line y = kx + b, and the degrees of the remaining nodes can then be determined from the line equation. The graph obtained in this way is a connected graph with one global node connected to all the remaining nodes.
In this example, assume the preprocessed article has 5 words: W1, W2, W3, W4, W5, with weight values 5, 4, 3, 2 and 2 respectively in descending order. After step 4.2, the degree of W1 is 4 and the degree of W5 is 1, and the two-point line equation is y = x - 1. In step 4.3, the degree values of the nodes obtained from the line equation are 4, 3, 2, 1 and 1.
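The two-point line construction of steps 4.1-4.3 can be sketched as follows (the function name and the rounding of non-integer degrees are our assumptions):

```python
def degrees_linear(weights):
    """weights: dict word -> weight. Degrees from the line through
    (w_max, n-1) and (w_min, 1), rounded to the nearest integer."""
    n = len(weights)
    w_max, w_min = max(weights.values()), min(weights.values())
    if w_max == w_min:
        return {w: 1 for w in weights}       # degenerate case: all weights equal
    k = (n - 2) / (w_max - w_min)            # slope
    b = 1 - k * w_min                        # intercept, since y(w_min) = 1
    return {w: round(k * wv + b) for w, wv in weights.items()}
```

With the example weights 5, 4, 3, 2, 2 this reproduces the degrees 4, 3, 2, 1, 1 (the line y = x - 1).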
Optionally, the step 4 specifically includes:
step I, obtaining a node corresponding to each word according to a plurality of words included in the preprocessed text obtained in the step 2; according to the weighted value of each word obtained in the step 3, sorting nodes corresponding to each word in a descending order to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;
step II, obtaining the degree value d_i of the i-th node by formula I:

d_i = [n · ω_i / ω_sum]   (formula I)

wherein [·] denotes rounding to the nearest integer, ω_i represents the weight value of the word corresponding to the i-th node, and ω_sum represents the sum of the weight values of the words corresponding to all the nodes;
and III, repeating the step II until the degree value of the node corresponding to each word is obtained.
In this embodiment, the nodes are sorted by weight value. Assuming there are n nodes in the current graph, the degree of node i is d_i = [n · ω_i / ω_sum], where ω_i is the weight value of node i, ω_sum is the sum of all node weights, and [·] denotes rounding to the nearest integer. The degree of each node obtained in this way reflects the node's importance in the whole graph, but the resulting graph may not be connected; if connectivity is not strictly required, this method can achieve good results.
In this example, assume the preprocessed article has 5 words: W1, W2, W3, W4, W5, with weight values 5, 4, 3, 2 and 2 respectively in descending order. ω_sum is 16, and through steps II and III the degree values of the nodes are 2, 1, 1, 1 and 1.
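Judging from the worked example (weights 5, 4, 3, 2, 2 giving degrees 2, 1, 1, 1, 1), formula I appears to be n·ω_i/ω_sum rounded to the nearest integer; under that assumption, a sketch:

```python
def degrees_proportional(weights):
    """weights: dict word -> weight. Degree of node i = round(n * w_i / w_sum)."""
    n = len(weights)
    w_sum = sum(weights.values())
    return {w: round(n * wv / w_sum) for w, wv in weights.items()}
```

Python's round resolves exact halves to the nearest even integer, which does not affect this example.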
Of the two degree-value assignment methods provided by the invention, the first assigns a maximum degree of n-1 and a minimum degree of 1, so the subsequently obtained graph is connected and has a global node, i.e. the node with the maximum degree, which has connecting edges to all remaining nodes. However, if the graph contains many nodes whose weights differ little, this method makes the assigned degree values of the intermediate nodes diverge too far from their weight values. The second method alleviates this drawback to some extent, but the constructed graph may not be connected. If the subsequently adopted learning algorithm requires a connected graph, or represents the graph by a global node feature, the first method can be selected; if the subsequent algorithm depends strongly on the correspondence between degree and weight, the second method can be selected.
And 5, obtaining a text graph according to the degree values of the plurality of nodes obtained in the step 4.
In this embodiment, edges are connected according to the node degrees obtained in step 4. Starting from the node with the highest degree, that node is connected to the following nodes whose degree is greater than 0, and each connected node's degree is decreased by 1; if decreasing a node's degree by 1 would make it less than 0, the edge between the current node and that node is cancelled, and the current node's degree is decreased by 1 instead. The nodes are then reordered according to the updated degrees, and the operation is repeated until the degrees of all nodes have been reduced to 0.
Assume there are 5 nodes to be processed: W1, W2, W3, W4, W5, with degrees 4, 3, 2, 1 and 1 calculated in step 4. W1 is connected to the remaining nodes, after which the node degrees become 0, 2, 1, 0 and 0; the nodes are rearranged in descending order of degree as W2, W3, W1, W4, W5. W2 and W3 are then connected, all node degrees become 0, and the construction is complete. The constructed graph is shown in FIG. 1.
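Step 5 can be sketched in a Havel-Hakimi style; the handling of a node whose residual degree cannot be satisfied is our interpretation of the cancellation rule described above:

```python
def build_edges(degrees):
    """degrees: dict node -> target degree. Returns a set of undirected edges."""
    residual = dict(degrees)
    edges = set()
    while True:
        order = sorted(residual, key=lambda v: -residual[v])
        u = order[0]
        if residual[u] <= 0:                 # all degrees consumed
            return edges
        connected = False
        for v in order[1:]:
            if residual[v] > 0 and (u, v) not in edges and (v, u) not in edges:
                edges.add((u, v))
                residual[u] -= 1
                residual[v] -= 1
                connected = True
                if residual[u] <= 0:
                    break
        if not connected:                    # leftover degree cannot be satisfied
            residual[u] = 0
```

For the degrees 4, 3, 2, 1, 1 above this yields the five edges W1-W2, W1-W3, W1-W4, W1-W5 and W2-W3, matching the worked example.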
In the invention, a structured representation of the text content, namely a graph structure, can be obtained from unstructured text through the above steps; the relevant properties of graphs and deep learning algorithms for processing graphs can then be used to learn text features and address practical problems such as classification and recommendation.
Example two
In this embodiment, the method provided by the invention is verified experimentally. Taking classification as an example, a text graph is constructed by the method provided by the invention and then classified by learning with a Graph Attention Network (GAN). GAN is a graph neural network based on the attention mechanism, from the paper "Graph Attention Networks". Text-GAN (1) uses the first graph construction method (steps 4.1-4.3 in example one) with pre-trained word vectors; Text-GAN (2) uses the second graph construction method (steps I-III in example one) with pre-trained word vectors; Text-GAN (2)-rand also uses the second graph construction method (steps I-III in example one), with randomly initialized word vectors. Text-GCN is used as a comparison algorithm; it comes from the paper "Graph Convolutional Networks for Text Classification". It also first constructs a graph from text, but converts the texts of the entire training and test sets into a single large graph containing both document nodes and word nodes, and it requires the test-set articles to be known in advance, making it more like a clustering process. In addition, two variants of Convolutional Neural Networks (CNN) are used for comparison: CNN-non-static uses pre-trained word vectors; CNN-rand uses randomly initialized word vectors. CNN comes from the paper "Convolutional Neural Networks for Sentence Classification" and uses convolution kernels to extract text features. DBLP and MR are two common classification datasets; the DBLP data has six categories and MR has two. The accuracy (Accuracy), precision (Precision) and recall (Recall) of the test results are compared respectively, and the experimental results are shown in Table 1.
Table 1 comparison of the method of the invention with the prior art
[Table 1: accuracy, precision and recall of each method on the DBLP and MR datasets; rendered as an image in the original document.]
According to the experimental results, compared with a conventional text processing algorithm (CNN) and with Text-GCN, which converts text into a graph structure and then learns it through a graph neural network, neural network algorithms applying the text graph construction method provided by the invention perform better in the field of text classification.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus necessary general hardware, and certainly may also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present invention may be substantially implemented or a part of the technical solutions contributing to the prior art may be embodied in the form of a software product, where the computer software product is stored in a readable storage medium, such as a floppy disk, a hard disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Claims (2)

1. A text graph construction method based on text content features is used for converting texts to be converted into text graphs, and is characterized by comprising the following steps:
step 1, acquiring a text to be converted;
step 2, performing text preprocessing on the text to be converted to obtain a preprocessed text; the text preprocessing comprises word segmentation processing, cleaning processing and standardization processing which are sequentially carried out;
wherein the preprocessed text comprises a plurality of words;
step 3, extracting the characteristics of the preprocessed text obtained in the step 2 to obtain the weight value of each word in the preprocessed text;
step 4, acquiring degree values of nodes corresponding to each word according to the weight values of the words acquired in the step 3; the degree value of the node corresponding to the word with the highest weight value is highest;
obtaining degree values of a plurality of nodes;
step 5, obtaining a text graph according to the degree values of the plurality of nodes obtained in the step 4;
the step 4 specifically comprises:
step 4.1, obtaining nodes corresponding to each word according to the plurality of words in the preprocessed text obtained in the step 2; according to the weighted value of each word obtained in the step 3, sorting nodes corresponding to each word in a descending order to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;
step 4.2, setting the degree value of the first node in the node sequence as n-1, and establishing a node-degree value linear model after setting the degree value of the last node in the node sequence as 1; the horizontal axis unit of the node-degree value linear model is a node, and the vertical axis unit of the node-degree value linear model is a degree value;
4.3, obtaining the degree value of the node corresponding to each word according to the node-degree value linear model;
or, the step 4 specifically includes:
step I, obtaining a node corresponding to each word according to a plurality of words included in the preprocessed text obtained in the step 2; according to the weight value of each word obtained in the step 3, performing descending ordering on the nodes corresponding to each word to obtain a node sequence; the node sequence comprises n nodes, wherein n is a positive integer;
step II, obtaining the degree value d_i of the i-th node by adopting formula I:

d_i = ⌈(ω_i / ω_sum) × (n − 1)⌉  (formula I)

wherein ω_i represents the weight value of the word corresponding to the i-th node, and ω_sum represents the sum of the weight values of the words corresponding to all the nodes;
and III, repeating the step II until the degree value of the node corresponding to each word is obtained.
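The weight-proportional variant of step 4 (steps I-III) can be sketched as follows. Note that the original formula I is published only as an image, so the exact expression used here, d_i = ⌈(ω_i / ω_sum) × (n − 1)⌉ with a floor of 1, is an assumption consistent with the definitions of ω_i and ω_sum; the sample weights are hypothetical:

```python
import math

def proportional_degrees(word_weights):
    """Assign each node a degree proportional to its word's share of the
    total weight (assumed form of formula I, steps I-III)."""
    nodes = sorted(word_weights, key=word_weights.get, reverse=True)
    n = len(nodes)
    w_sum = sum(word_weights.values())
    # ceil keeps degrees integral; max(1, ...) keeps every node connectable
    return {node: max(1, math.ceil(word_weights[node] / w_sum * (n - 1)))
            for node in nodes}

weights = {"graph": 0.9, "text": 0.7, "node": 0.4, "word": 0.2}  # hypothetical
print(proportional_degrees(weights))
```

Unlike the linear-model variant, here the magnitudes of the weights matter: a word that dominates the total weight receives proportionally more edges.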
2. The text graph construction method based on text content features according to claim 1, wherein, when feature extraction is performed in the step 3 on the preprocessed text obtained in the step 2, a TextRank algorithm or a TF-IDF algorithm is adopted to obtain the weight value of each word in the preprocessed text.
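Claim 2 allows either TextRank or TF-IDF for the step-3 weights. A minimal sketch of the TF-IDF option, on a hypothetical three-document toy corpus, looks like this:

```python
import math
from collections import Counter

def tfidf_weights(target_doc, corpus):
    """Return word -> TF-IDF weight for one tokenized document.

    target_doc: list of word tokens (the preprocessed text from step 2).
    corpus: list of tokenized documents, including target_doc.
    """
    tf = Counter(target_doc)
    weights = {}
    for word, count in tf.items():
        # document frequency; df >= 1 because target_doc is in the corpus
        df = sum(1 for doc in corpus if word in doc)
        weights[word] = (count / len(target_doc)) * math.log(len(corpus) / df)
    return weights

docs = [["text", "graph", "node"], ["text", "word"], ["graph", "edge"]]
print(tfidf_weights(docs[0], docs))  # "node" scores highest: it occurs in one document only
```

Any weighting scheme that outputs one non-negative score per word could be substituted here; the node-degree assignment in step 4 consumes only the resulting word-to-weight mapping.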
CN202010356482.XA 2020-04-29 2020-04-29 Text graph construction method based on text content features Active CN111639189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010356482.XA CN111639189B (en) 2020-04-29 2020-04-29 Text graph construction method based on text content features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010356482.XA CN111639189B (en) 2020-04-29 2020-04-29 Text graph construction method based on text content features

Publications (2)

Publication Number Publication Date
CN111639189A CN111639189A (en) 2020-09-08
CN111639189B true CN111639189B (en) 2023-03-21

Family

ID=72332001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010356482.XA Active CN111639189B (en) 2020-04-29 2020-04-29 Text graph construction method based on text content features

Country Status (1)

Country Link
CN (1) CN111639189B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487024A (en) * 2021-06-29 2021-10-08 Ren Lijun Alternating-sequence generation model training method and method for extracting a graph from text
CN116150509B (en) * 2023-04-24 2023-08-04 Qilu University of Technology (Shandong Academy of Sciences) Threat intelligence identification method, system, device and medium for social media networks

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202042A (en) * 2016-07-06 2016-12-07 Minzu University of China Graph-based keyword extraction method
CN109241377A (en) * 2018-08-30 2019-01-18 Shanxi University Text document representation method and device based on deep-learning topic information enhancement
CN110059311A (en) * 2019-03-27 2019-07-26 Yinjiang Co., Ltd. Keyword extraction method and system for judicial text data
EP3528144A1 (en) * 2018-02-20 2019-08-21 INESC TEC - Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência Device and method for keyword extraction from a text stream

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631882B (en) * 2013-11-14 2017-01-18 Beijing University of Posts and Telecommunications Semantic service generation system and method based on graph mining techniques
US20170193393A1 (en) * 2016-01-04 2017-07-06 International Business Machines Corporation Automated Knowledge Graph Creation
CN107291760A (en) * 2016-04-05 2017-10-24 Alibaba Group Holding Ltd. Unsupervised feature selection method and device
US10503791B2 (en) * 2017-09-04 2019-12-10 Borislav Agapiev System for creating a reasoning graph and for ranking of its nodes
CN108595425A (en) * 2018-04-20 2018-09-28 Kunming University of Science and Technology Keyword extraction method for dialogue corpora based on topic and semantics
CN109726402B (en) * 2019-01-11 2022-12-23 The Seventh Research Institute of China Electronics Technology Group Corporation Automatic extraction method for document subject terms
CN110717042A (en) * 2019-09-24 2020-01-21 Beijing Technology and Business University Method for constructing a document-keyword heterogeneous network model
CN110874396B (en) * 2019-11-07 2024-02-09 Tencent Technology (Shenzhen) Co., Ltd. Keyword extraction method and device, and computer storage medium
CN111061839B (en) * 2019-12-19 2024-01-23 Guo Qun Keyword joint generation method and system based on semantics and knowledge graph

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202042A (en) * 2016-07-06 2016-12-07 Minzu University of China Graph-based keyword extraction method
EP3528144A1 (en) * 2018-02-20 2019-08-21 INESC TEC - Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência Device and method for keyword extraction from a text stream
CN109241377A (en) * 2018-08-30 2019-01-18 Shanxi University Text document representation method and device based on deep-learning topic information enhancement
CN110059311A (en) * 2019-03-27 2019-07-26 Yinjiang Co., Ltd. Keyword extraction method and system for judicial text data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Aman Mehta et al. Scalable Knowledge Graph Construction over Text using Deep Learning based Predicate Mapping. 2019, pp. 705-713. *
Zhao Jingsheng; Zhang Li; Xiao Na. Research on keyword extraction from Chinese text based on complex networks. 2018, (03), pp. 102-108. *

Also Published As

Publication number Publication date
CN111639189A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
CN111930929A (en) Article title generation method and device and computing equipment
CN111639189B (en) Text graph construction method based on text content features
CN110705247A Text similarity calculation method based on χ2-C
CN111859961A (en) Text keyword extraction method based on improved TopicRank algorithm
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN111177375A (en) Electronic document classification method and device
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN115545041A (en) Model construction method and system for enhancing semantic vector representation of medical statement
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
CN111681731A (en) Method for automatically marking colors of inspection report
CN114943220B (en) Sentence vector generation method and duplicate checking method for scientific research establishment duplicate checking
CN117291190A (en) User demand calculation method based on emotion dictionary and LDA topic model
CN116629238A (en) Text enhancement quality evaluation method, electronic device and storage medium
CN111199154B (en) Fault-tolerant rough set-based polysemous word expression method, system and medium
CN115345158A (en) New word discovery method, device, equipment and storage medium based on unsupervised learning
CN112182159B (en) Personalized search type dialogue method and system based on semantic representation
CN113836892A (en) Sample size data extraction method and device, electronic equipment and storage medium
CN114138936A (en) Text abstract generation method and device, electronic equipment and storage medium
CN113792546A (en) Corpus construction method, apparatus, device and storage medium
CN112800211A (en) Method for extracting critical information of criminal process in legal document based on TextRank algorithm
CN113377901B (en) Mongolian text emotion analysis method based on multi-size CNN and LSTM models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant