CN108090049B - Multi-document abstract automatic extraction method and system based on sentence vectors


Info

Publication number
CN108090049B
Authority
CN
China
Prior art keywords
sentence
document
sentences
vectors
steps
Prior art date
Legal status
Expired - Fee Related
Application number
CN201810045090.4A
Other languages
Chinese (zh)
Other versions
CN108090049A (en)
Inventor
窦全胜
朱翔
Current Assignee
Shandong Technology and Business University
Original Assignee
Shandong Technology and Business University
Priority date
Filing date
Publication date
Application filed by Shandong Technology and Business University
Priority to CN201810045090.4A
Publication of CN108090049A
Application granted
Publication of CN108090049B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering


Abstract

The invention discloses a method and system for automatically extracting a multi-document abstract based on sentence vectors, comprising the following steps: S1, preprocessing the document set; S2, training a doc2vec model to generate sentence vectors; S3, clustering the sentences into sub-topic documents; S4, establishing a sentence relation graph model within each sub-topic document; S5, calculating sentence weights; and S6, extracting and ordering sentences to form the abstract. The method represents every sentence in the target document set as a vector through a doc2vec model trained on a large corpus; spectral clustering groups the sentences into sub-topics and one sentence is extracted from each sub-topic, which avoids sentence redundancy; and the abstract is assembled in the sentences' original document order, improving its front-to-back coherence.

Description

Multi-document abstract automatic extraction method and system based on sentence vectors
Technical Field
The invention relates to the field of computer text mining, in particular to a method and a system for automatically extracting multiple document abstracts based on sentence vectors.
Background
Automatic document summarization uses a computer to condense and refine a text, providing the user with its gist. By briefly reading the abstract, a user gains an initial view of the key content of the full text, which greatly improves the efficiency of acquiring and understanding information. In single-document automatic summarization, a computer generates an abstract of the main content of one document through an algorithm; since Luhn proposed a method for automatically generating document abstracts in 1958, research on single-document summarization has developed actively, and its results have by now reached a generally accepted level. Multi-document automatic summarization, by contrast, generates a comprehensive abstract of the main content of several different documents. To date, multi-document summarization has been closely combined with artificial intelligence and related algorithms, most recently with evolutionary algorithms and deep learning.
Yan et al will use deep learning for text summarization for the first time, with the input layer being word frequency vectors, the hidden layer consisting of a constrained boltzmann machine, and the summary consisting of dynamically planning and selecting important sentences. Rush uses deep learning to summarize the original document, uses convolutional networks to encode the original document, and uses context attention feedforward neural networks to generate the summary. Google in 2016 originated the deep learning based automatic digest module Textsum in the deep learning framework, tensoflow. The multi-document automatic digest may be divided into a decimated digest and an abstract digest according to whether or not a sentence forming the digest is derived from an original text. The abstract type abstract is mainly used for evaluating the importance of sentences of an original document and selecting key sentences from the sentences to form the abstract. The abstract is mainly used for extracting word information from an original document and then organizing word tandem sentences to form the abstract.
At present, abstractive implementations are overly complex: machines still understand natural language poorly, considerable manual involvement is required, and development is slow; the approach remains in its infancy. Extractive summarization is therefore the commonly used method. The maximum common subgraph method and the edge-weight similarity method from graph-model-based text classification are commonly used similarity measures. Similarity can also be measured with the eigenvectors corresponding to the left singular values of the text map matrix, which is essentially PCA dimensionality reduction under the assumption that the sample mean is 0. Conventional extractive methods mainly suffer from sentence redundancy and poor sentence-to-sentence flow.
Disclosure of Invention
Aiming at defects of the prior art such as redundant extracted sentences and disordered sentence order, the invention provides a sentence-vector-based multi-document abstract automatic extraction method that produces document abstracts with higher accuracy and readability.
The technical scheme adopted by the invention is as follows:
the method for automatically extracting the multi-document abstract based on the sentence vector comprises the following steps:
s1: preprocessing a document set of the abstract to be extracted;
s2: training by adopting a doc2vec model to generate a sentence vector;
s3: clustering sentence vectors and storing corresponding sentences as sub-topic documents;
s4: establishing a sentence relation graph model in each subtopic document;
s5: calculating sentence weight in each subtopic document according to the relation graph model established in the step S4;
s6: the sentences are extracted and sorted to form the abstract.
Further, S1 includes the steps of:
step S101: dividing each document of the document set into sentences at sentence-end punctuation and recording the sentences line by line, one sentence per line;
step S102: recording the corresponding position of each sentence;
step S103: copying the content of every sentence-divided document in the document set into a single document to merge the document set, one sentence per line in the merged document;
step S104: performing word segmentation on each line of the merged document and removing stop words.
Further, the position of a sentence recorded in step S102 of step S1 is expressed as:

h_{n,i} = i / len(text_n)

where h_{n,i} denotes the position of the i-th sentence in the n-th document, text_n denotes the n-th document, and len(text_n) denotes the number of sentences contained in the n-th document.
Further, step S2 includes the following steps:
step S201: preprocessing all documents in the large corpus through steps S101 to S104 of step S1, inputting the preprocessed documents into the sentence-vector distributed memory model PV-DM in doc2vec, and training the model;
step S202: inputting the target documents preprocessed through steps S101 to S104 of step S1 into the trained sentence-vector distributed memory model PV-DM to obtain the sentence vectors.
In step S201, the training of the sentence-vector distributed memory model PV-DM comprises the following steps:
step S2011: initializing every sentence (one per line) and every word in the preprocessed large-corpus documents as k-dimensional vectors, and inputting the word vectors of the context of a word w, together with the sentence vector of the sentence containing w, into the deep neural network model;
step S2012: summing the input vectors in the hidden layer of the deep neural network model, the accumulated vector serving as the input of the output layer;
step S2013: the output layer of the deep neural network model corresponds to a binary tree, namely a Huffman tree built with the words of the corpus as leaf nodes and each word's number of occurrences in the corpus as its weight; each word corresponds to one leaf node, each branch in the tree is treated as one binary classification, and the Label of the j-th tree node on the path from the root to the leaf node of word w is 1 - p_j, where p_j is the Huffman code of that node; apart from the root and leaf nodes, each tree node carries an auxiliary vector of the same length as the sentence vector, used to assist in training the model;
step S2014: continuously correcting the sentence vectors, word vectors, and auxiliary vectors by gradient ascent, finally obtaining the trained distributed memory model PV-DM of the sentence vectors.
The context of the word w is the C words before and after the word w.
The objective function of the neural network training is

Σ_{doc} Σ_{sentence ∈ doc} Σ_{w ∈ sentence} log p(w | Context(w), sentence)

where sentence is a sentence, doc is a preprocessed document, w is a word, and Context(w) denotes the context words of w together with the sentence containing w.
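As an illustrative sketch only (not the claimed implementation; the helper name, learning rate, and data layout are assumptions), the following Python fragment shows one gradient-ascent step of this hierarchical-softmax training for a single target word w:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pv_dm_step(context_vecs, sent_vec, path_nodes, path_codes, lr=0.025):
    """One hierarchical-softmax gradient-ascent step of PV-DM for one target
    word w. context_vecs: word vectors of the C words around w; sent_vec:
    vector of the sentence containing w; path_nodes: auxiliary vectors of the
    inner Huffman-tree nodes on the path from the root to w's leaf;
    path_codes: the Huffman code p_j of each such node (0 or 1)."""
    h = sent_vec + sum(context_vecs)            # hidden layer: sum of inputs (S2012)
    e = np.zeros_like(h)                        # accumulated gradient for the inputs
    for theta, p in zip(path_nodes, path_codes):
        label = 1 - p                           # Label = 1 - p_j (S2013)
        g = lr * (label - sigmoid(h @ theta))   # binary-classification gradient
        e += g * theta                          # propagate back toward the inputs
        theta += g * h                          # ascend: correct the auxiliary vector
    for v in context_vecs:
        v += e                                  # correct the context word vectors
    sent_vec += e                               # correct the sentence vector (S2014)
```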
Further, the clustering of sentence vectors in step S3 adopts spectral clustering;
further, step S3 includes the following steps:
step S301: constructing the similarity matrix W between all sentence vectors, using a Gaussian kernel as the kernel function,

W_{i,j} = exp(-||x_i - x_j||^2 / (2σ^2))

where W_{i,j} is the similarity between sentences x_i and x_j, and σ is the Gaussian radius;
step S302: calculating a Laplace matrix L;
L=D-W
where D is a diagonal matrix whose n-th diagonal entry is the sum of the n-th row of W;
step S303: constructing the normalized Laplacian matrix

L_norm = D^{-1/2} L D^{-1/2}

step S304: computing the k smallest eigenvalues of L_norm and the corresponding eigenvectors V;
step S305: arranging the eigenvectors as columns to form a feature matrix and normalizing each row of the feature matrix to unit length to form the matrix F, i.e., the vector formed by each row of F has modulus 1;
step S306: treating each row of the matrix F as a k-dimensional sample and clustering with the K-means algorithm into C classes;
step S307: storing the sentences corresponding to the vectors of the C classes as C sub-topic documents.
Further, step S4 specifically includes:
in each sub-topic document, a sentence relation graph model is established with sentences as nodes and the similarity between sentences as edge weights;
further, the similarity between sentences is calculated by cosine value of included angle between sentence vectors, cosine value sim (x)i,xj) The calculation formula of (2) is as follows:
Figure GDA0002849190890000041
wherein x isi,xjTwo sentence vectors.
Further, the step S5 specifically includes:
initializing the weight of every sentence and iteratively updating the weights according to the relation graph model established in step S4:

S(i) = (1 - d) + d · Σ_{j ∈ δ(i)} S(j) / |δ(j)|

where S(i) is the weight of sentence i, δ(i) is the set of all sentences in the same sub-topic document whose similarity to sentence i exceeds the set threshold, |δ(j)| is the number of sentences in the same sub-topic document whose similarity to sentence j exceeds the set threshold, S(j) is the weight of sentence j, and d is the damping coefficient, conventionally set to 0.85.
Further, in step S5 the sentence weights are initialized to 1, and the similarity threshold used in δ(i) is set to 0.05.
Further, step S6 specifically comprises: extracting the sentence with the largest weight from each sub-topic document, sorting the sentences according to their document positions recorded in step S102 of step S1, and combining them into the abstract.
A sentence-vector-based multi-document abstract automatic extraction system, comprising: a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.
A computer-readable storage medium having a computer program thereon, which, when executed by a processor, performs the steps of any of the methods described above.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of the preprocessing steps of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
FIG. 1 is a flow chart of the present invention, comprising the steps of:
s1: preprocessing a document set;
s2: training by adopting a doc2vec model to generate a sentence vector;
s3: clustering sentence vectors and storing corresponding sentences as sub-topic documents;
s4: establishing a sentence relation graph model in each subtopic document;
s5: calculating sentence weights in the sub-topic documents;
s6: the sentences are extracted and sorted to form the abstract.
Specifically, the implementation of step S1 is shown in FIG. 2 and comprises the following steps:
step S101: dividing each document in the document set into sentences at sentence-end punctuation, one sentence per line;
step S102: recording the corresponding position of each sentence;
step S103: copying the content of every sentence-divided document in the document set into a single document to merge the document set, one sentence per line in the merged document;
step S104: performing word segmentation on each line of the merged document and removing stop words.
Further, the sentence positions in step S102 are expressed as:

h_{n,i} = i / len(text_n)

where h_{n,i} denotes the position of the i-th sentence in the n-th document, text_n denotes the n-th document, and len(text_n) denotes the number of sentences it contains.
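A minimal sketch of steps S101 to S104 follows, under the assumptions that sentences end at Chinese sentence-end punctuation, that jieba is used for word segmentation (any segmenter would do), and that the stopword list is supplied by the caller:

```python
import re
import jieba  # assumed segmenter; not mandated by the invention

SENT_END = re.compile(r"(?<=[。！？])")  # split after sentence-end punctuation

def preprocess(docs, stopwords):
    """docs: list of raw document strings -> (sentences, positions, tokens)."""
    sentences, positions = [], []
    for n, text in enumerate(docs):
        sents = [s.strip() for s in SENT_END.split(text) if s.strip()]
        for i, s in enumerate(sents):
            sentences.append(s)                    # one sentence per line (S101/S103)
            positions.append((n, i / len(sents)))  # h_{n,i} = i / len(text_n) (S102)
    tokens = [[w for w in jieba.cut(s) if w not in stopwords]
              for s in sentences]                  # segmentation + stopword removal (S104)
    return sentences, positions, tokens
```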
Step S2 specifically includes the following steps:
step S201: preprocessing all documents in the large corpus through steps S101 to S104 of step S1, and training the sentence-vector distributed memory model PV-DM in doc2vec on the preprocessed documents;
step S202: feeding the target documents preprocessed through steps S101 to S104 of step S1 into the trained model to obtain the sentence vectors.
The training of the PV-DM model in step S201 specifically includes the following steps:
(1) initializing every sentence (one per line) and every word in the preprocessed large-corpus documents as k-dimensional vectors, and inputting the word vectors of the context of a word w (the C words before and after it) and the sentence vector of the sentence containing w into the deep neural network model;
(2) summing the input vectors in the hidden layer, the accumulated vector serving as the input of the output layer;
(3) the output layer corresponds to a binary tree, namely a Huffman tree built with the words appearing in the corpus as leaf nodes and each word's number of occurrences in the corpus as its weight; each word corresponds to a leaf node in the tree, each branch in the tree performs one binary classification, and the Label of the j-th tree node on the path from the root to the leaf node of word w is 1 - p_j, where p_j is the Huffman code of that node; apart from the root and leaf nodes, each tree node carries a vector of the same length as the sentence vector, called an auxiliary vector, used to assist in training the model;
(4) continuously correcting the sentence vectors, word vectors, and auxiliary vectors by gradient ascent to train the model, finally obtaining the trained distributed memory model of the sentence vectors.
The objective function of the neural network training is

Σ_{doc} Σ_{sentence ∈ doc} Σ_{w ∈ sentence} log p(w | Context(w), sentence)

where sentence is a sentence, doc is a preprocessed document, w is a word, and Context(w) denotes the context words of w together with the sentence containing w.
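A minimal training sketch with gensim's Doc2Vec, whose dm=1 setting selects the PV-DM variant described above (the file name and hyperparameter values are illustrative assumptions, not values fixed by the invention):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one preprocessed (segmented, stopword-free) sentence per line
with open("corpus_merged.txt", encoding="utf-8") as f:
    tagged = [TaggedDocument(words=line.split(), tags=[i])
              for i, line in enumerate(f)]

model = Doc2Vec(tagged, dm=1, vector_size=100, window=5,  # k = 100, C = 5
                hs=1, min_count=2, epochs=20)             # hs=1: Huffman-tree softmax

# infer the vector of one preprocessed target-document sentence
vec = model.infer_vector("已 分词 的 目标 句子".split())
```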
Clustering into sub-topic documents in step S3 uses spectral clustering and specifically comprises the following steps:
step S301: constructing the similarity matrix W between all sentences, using a Gaussian kernel as the kernel function,

W_{i,j} = exp(-||x_i - x_j||^2 / (2σ^2))

where W_{i,j} is the similarity between sentences x_i and x_j, and σ is the Gaussian radius;
step S302: the laplacian matrix L is calculated,
L=D-W
where D is a diagonal matrix whose n-th diagonal entry is the sum of the n-th row of W;
step S303: constructing the normalized Laplacian matrix

L_norm = D^{-1/2} L D^{-1/2}

step S304: computing the k smallest eigenvalues of L_norm and the corresponding eigenvectors V;
step S305: arranging the eigenvectors as columns to form a feature matrix and normalizing each row of the feature matrix to unit length to form the matrix F, i.e., the vector formed by each row of F has modulus 1;
step S306: treating each row of F as a k-dimensional sample and clustering with the K-means algorithm into C classes;
step S307: storing the sentences corresponding to the vectors of the C classes as C sub-topic documents.
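A minimal numpy/scikit-learn sketch of steps S301 to S307, under the assumptions that k = C and that σ is chosen by the caller:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_subtopics(X, C, sigma=1.0):
    """X: (n, k) array of sentence vectors -> a sub-topic label per sentence."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))                    # S301: Gaussian kernel
    d = W.sum(axis=1)
    L = np.diag(d) - W                                    # S302: L = D - W
    Dm = np.diag(1.0 / np.sqrt(d))
    Ln = Dm @ L @ Dm                                      # S303: normalized Laplacian
    _, V = eigh(Ln)                                       # S304: eigenvalues ascending
    F = V[:, :C]                                          # k = C smallest eigenvectors
    F /= np.linalg.norm(F, axis=1, keepdims=True)         # S305: unit-length rows
    return KMeans(n_clusters=C, n_init=10).fit_predict(F) # S306; the caller groups the
                                                          # labels into C documents (S307)
```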
Step S4 specifically includes:
in each sub-topic document, a sentence relation graph model is established with sentences as nodes and the similarity between sentences as edge weights;
further, the cosine similarity sim (x)i,xj) The calculation formula of (2) is as follows:
Figure GDA0002849190890000071
wherein x isi,xjTwo sentence vectors.
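A minimal sketch of building the sentence relation graph inside one sub-topic document (the 0.05 threshold is the value given for step S5 below):

```python
import numpy as np

def cosine(xi, xj):
    """sim(x_i, x_j) = x_i · x_j / (||x_i|| ||x_j||)."""
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

def build_graph(X, threshold=0.05):
    """X: list of sentence vectors of one sub-topic document.
    Returns {i: delta(i)}: the sentences whose similarity to i exceeds the threshold."""
    n = len(X)
    return {i: [j for j in range(n)
                if j != i and cosine(X[i], X[j]) > threshold]
            for i in range(n)}
```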
The step S5 includes the following steps:
each sentence weight is initialized, and the weights are iteratively updated according to the relation graph model established in step S4 using the following formula:

S(i) = (1 - d) + d · Σ_{j ∈ δ(i)} S(j) / |δ(j)|

where S(i) is the weight of sentence i, δ(i) is the set of all sentences in the same sub-topic document whose similarity to sentence i exceeds the set threshold, |δ(j)| is the number of sentences in the same sub-topic document whose similarity to sentence j exceeds the set threshold, S(j) is the weight of sentence j, and d is the damping coefficient, conventionally set to 0.85.
Further, in step S5 the sentence weights are initialized to 1, and the similarity threshold used in δ(i) is set to 0.05.
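A minimal sketch of the weight iteration (the iteration cap and convergence tolerance are illustrative assumptions; the formula and constants follow the description above):

```python
def sentence_weights(graph, d=0.85, iters=100, tol=1e-6):
    """graph: {i: delta(i)} from build_graph -> {i: S(i)}; weights start at 1."""
    s = {i: 1.0 for i in graph}
    for _ in range(iters):
        new = {i: (1 - d) + d * sum(s[j] / len(graph[j])
                                    for j in graph[i] if graph[j])
               for i in graph}
        if max(abs(new[i] - s[i]) for i in graph) < tol:  # stop once converged
            return new
        s = new
    return s
```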
Further, step S6 specifically comprises:
the sentence with the largest weight is extracted from each sub-topic document, and the extracted sentences are sorted according to their positions in the original documents recorded in step S102 of step S1 and combined into the abstract.
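A minimal sketch of step S6, under the assumption that sentence ids are global and that positions holds the (document, position) records from step S102:

```python
def extract_summary(subtopics, weights, positions, sentences):
    """subtopics: list of sentence-id lists, one per sub-topic document;
    weights: {id: S(i)}; positions: {id: (n, h_n_i)} from step S102."""
    picks = [max(ids, key=lambda i: weights[i]) for ids in subtopics]
    picks.sort(key=lambda i: positions[i])   # restore original document order
    return "".join(sentences[i] for i in picks)
```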
To further illustrate the multi-document abstract automatic extraction method of the present invention, the following is the abstract produced for three documents about "the death of Wu Qingyuan":
Sina Sports news: the hundred-year-old Wu Qingyuan has passed away, leaving the kingdom of Go an eternal legend. Wu Qingyuan, the third son in his family, was born in Fujian province, China, on June 12, 1914. He was awarded the third dan by the Japanese Go institute in his second year there and reached the ninth dan in 1950. Wu Qingyuan pioneered new opening theory in Go and tirelessly guided the players who came after him. He was the leading figure of the golden age of Japanese Go and was called the "Showa Kisei" (Go saint of the Showa era) in Japan. In 1961 Wu Qingyuan was injured in a traffic accident and gradually faded from the front line, formally retiring in 1984. Before Wu Qingyuan, no player in the kingdom of Go had ever reached his height. In 2014 the Chinese Go world held a grand centenary celebration for Mr. Wu; among Go players, such honors could be attained by Wu alone. Wu Qingyuan's family planned to hold a memorial ceremony for him on December 3. Wu Qingyuan once said: "After one hundred I still want to play Go, and after two hundred I want to play Go in the universe." In pursuing the way of Go, Mr. Wu saw through life and death.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (7)

1. A method for automatically extracting a multi-document abstract based on sentence vectors, characterized by comprising the following steps:
s1: preprocessing a document set of the abstract to be extracted;
the step S1 includes the steps of:
step S101: dividing each document of the document set into sentences at sentence-end punctuation and recording the sentences line by line, one sentence per line;
step S102: recording the corresponding position of each sentence;
step S103: copying the content of every sentence-divided document in the document set into a single document to merge the document set, one sentence per line in the merged document;
step S104: performing word segmentation on each line of the merged document and removing stop words;
s2: training by adopting a doc2vec model to generate a sentence vector;
the step S2 includes the steps of:
step S201: preprocessing all documents in the large corpus through steps S101 to S104 of step S1, inputting the preprocessed documents into the sentence-vector distributed memory model PV-DM in doc2vec, and training the model;
step S202: inputting the target documents preprocessed through steps S101 to S104 of step S1 into the trained sentence-vector distributed memory model PV-DM to obtain the sentence vectors;
in step S201, training the sentence-vector distributed memory model PV-DM comprises the following steps:
step S2011: initializing every sentence (one per line) and every word in the preprocessed large-corpus documents as k-dimensional vectors, and inputting the word vectors of the context of a word w, together with the sentence vector of the sentence containing w, into the deep neural network model;
step S2012: summing the input vectors in the hidden layer of the deep neural network model, the accumulated vector serving as the input of the output layer;
step S2013: the output layer of the deep neural network model corresponds to a binary tree, namely a Huffman tree built with the words of the corpus as leaf nodes and each word's number of occurrences in the corpus as its weight; each word corresponds to one leaf node, each branch in the tree is treated as one binary classification, and the Label of the j-th tree node on the path from the root to the leaf node of word w is 1 - p_j, where p_j is the Huffman code of that node; apart from the root and leaf nodes, each tree node carries an auxiliary vector of the same length as the sentence vector, used to assist in training the model;
step S2014: continuously correcting the sentence vectors, word vectors, and auxiliary vectors by gradient ascent, finally obtaining the trained distributed memory model PV-DM of the sentence vectors;
s3: clustering sentence vectors and storing corresponding sentences as sub-topic documents;
s4: establishing a sentence relation graph model in each subtopic document;
s5, according to the relation graph model established in the step S4, sentence weight is calculated in each sub-topic document;
s6: the sentences are extracted and sorted to form the abstract.
2. The method for automatically extracting a multiple document abstract based on sentence vectors as claimed in claim 1, wherein the clustering generation of sentence vectors in step S3 adopts a spectral clustering method.
3. The sentence vector-based multi-document digest automatic extraction method of claim 2, wherein the step S3 comprises the steps of:
step S301: constructing a similarity matrix W among all sentence vectors, using a Gaussian kernel function as a kernel function,
step S302: calculating a Laplace matrix L;
step S303: constructing a standardized Laplace matrix;
step S304: calculating k minimum eigenvalues of the Laplace matrix and corresponding eigenvectors V;
step S305: arranging the characteristic vectors according to columns to form a characteristic matrix, unitizing each row in the characteristic matrix to form a matrix F, namely the modulus value of the vector formed by each row of the matrix F is 1;
step S306: treating each row of the matrix F as a k-dimensional sample and clustering with the K-means algorithm into C classes;
step S307: storing the sentences corresponding to the vectors of the C classes as C sub-topic documents.
4. The method for automatically extracting a multiple document abstract based on sentence vectors as claimed in claim 1, wherein the step S4 specifically comprises:
in each sub-topic document, a sentence relation graph model is established with sentences as nodes and the similarity between sentences as edge weights.
5. The method for automatically extracting a multi-document abstract based on sentence vectors as claimed in claim 1, wherein step S6 specifically comprises: extracting the sentence with the largest weight from each sub-topic document, sorting the extracted sentences according to their document positions recorded in step S102 of step S1, and combining them into the abstract.
6. A system for automatically extracting a multi-document abstract based on sentence vectors, characterized by comprising: a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-5.
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1 to 5.
CN201810045090.4A 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors Expired - Fee Related CN108090049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810045090.4A CN108090049B (en) 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors


Publications (2)

Publication Number Publication Date
CN108090049A CN108090049A (en) 2018-05-29
CN108090049B (en) 2021-02-05

Family

ID=62181661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810045090.4A Expired - Fee Related CN108090049B (en) 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors

Country Status (1)

Country Link
CN (1) CN108090049B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897896B (en) * 2018-07-13 2020-06-02 深圳追一科技有限公司 Keyword extraction method based on reinforcement learning
CN108959269B (en) * 2018-07-27 2019-07-05 首都师范大学 A kind of sentence auto ordering method and device
CN109325109B (en) * 2018-08-27 2021-11-19 中国人民解放军国防科技大学 Attention encoder-based extraction type news abstract generating device
CN109582967B (en) * 2018-12-03 2023-08-18 深圳前海微众银行股份有限公司 Public opinion abstract extraction method, device, equipment and computer readable storage medium
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN109902284A (en) * 2018-12-30 2019-06-18 中国科学院软件研究所 A kind of unsupervised argument extracting method excavated based on debate
CN109902168B (en) * 2019-01-25 2022-02-11 北京创新者信息技术有限公司 Patent evaluation method and system
CN109885683B (en) * 2019-01-29 2022-12-02 桂林远望智能通信科技有限公司 Method for generating text abstract based on K-means model and neural network model
CN109829161B (en) * 2019-01-30 2023-08-04 延边大学 Method for automatically abstracting multiple languages
CN109977196A (en) * 2019-03-29 2019-07-05 云南电网有限责任公司电力科学研究院 A kind of detection method and device of magnanimity document similarity
CN110162778B (en) * 2019-04-02 2023-05-26 创新先进技术有限公司 Text abstract generation method and device
CN111914083B (en) * 2019-05-10 2024-07-09 腾讯科技(深圳)有限公司 Statement processing method, device and storage medium
CN110362823B (en) * 2019-06-21 2023-07-28 北京百度网讯科技有限公司 Training method and device for descriptive text generation model
US10902191B1 (en) * 2019-08-05 2021-01-26 International Business Machines Corporation Natural language processing techniques for generating a document summary
CN110717333B (en) * 2019-09-02 2024-01-16 平安科技(深圳)有限公司 Automatic generation method and device for article abstract and computer readable storage medium
CN110737768B (en) * 2019-10-16 2022-04-08 信雅达科技股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111813925A (en) * 2020-07-14 2020-10-23 混沌时代(北京)教育科技有限公司 Semantic-based unsupervised automatic summarization method and system
CN112784043B (en) * 2021-01-18 2024-05-10 辽宁工程技术大学 Aspect-level emotion classification method based on gating convolutional neural network
CN112949299A (en) * 2021-02-26 2021-06-11 深圳市北科瑞讯信息技术有限公司 Method and device for generating news manuscript, storage medium and electronic device
CN113220853B (en) * 2021-05-12 2022-10-04 燕山大学 Automatic generation method and system for legal questions
CN113836295B (en) * 2021-09-28 2024-07-19 平安科技(深圳)有限公司 Text abstract extraction method, system, terminal and storage medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7398196B1 (en) * 2000-09-07 2008-07-08 Intel Corporation Method and apparatus for summarizing multiple documents using a subsumption model
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN104778157A (en) * 2015-03-02 2015-07-15 华南理工大学 Multi-document abstract sentence generating method
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN107357899A (en) * 2017-07-14 2017-11-17 吉林大学 Based on the short text sentiment analysis method with product network depth autocoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-document summarization method based on the PV_DM model; Liu Xin et al.; Computer Applications and Software; 2016-10-31; vol. 33, no. 10; p. 2 *

Also Published As

Publication number Publication date
CN108090049A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108090049B (en) Multi-document abstract automatic extraction method and system based on sentence vectors
CN113239181B (en) Scientific and technological literature citation recommendation method based on deep learning
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN109635280A (en) A kind of event extraction method based on mark
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN110263325B (en) Chinese word segmentation system
CN106202010A (en) The method and apparatus building Law Text syntax tree based on deep neural network
CN110413768B (en) Automatic generation method of article titles
CN106844349A (en) Comment spam recognition methods based on coorinated training
CN111967258B (en) Method for constructing coreference resolution model, coreference resolution method and medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN113869053A (en) Method and system for recognizing named entities oriented to judicial texts
CN107506377A (en) This generation system is painted in interaction based on commending system
CN109344403A (en) A kind of document representation method of enhancing semantic feature insertion
CN111859961A (en) Text keyword extraction method based on improved TopicRank algorithm
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113505583A (en) Sentiment reason clause pair extraction method based on semantic decision diagram neural network
CN115238026A (en) Medical text subject segmentation method and device based on deep learning
CN106886565A (en) A kind of basic house type auto-polymerization method
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN116522165B (en) Public opinion text matching system and method based on twin structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210205)