CN105243152A - Graph model-based automatic abstracting method - Google Patents


Info

Publication number
CN105243152A
Authority
CN
China
Prior art keywords
sentence
document
theme
word
probability distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510703353.2A
Other languages
Chinese (zh)
Other versions
CN105243152B (en)
Inventor
王俊丽
魏绍臣
管敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201510703353.2A priority Critical patent/CN105243152B/en
Publication of CN105243152A publication Critical patent/CN105243152A/en
Application granted granted Critical
Publication of CN105243152B publication Critical patent/CN105243152B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of automatic abstracting and discloses a graph model-based automatic abstracting method. In the technical scheme, an LDA probabilistic topic model is applied to measure the semantic correlation between sentences and to improve the accuracy of sentence-correlation measurement, and the notions of sentence topic relevance and sentence position sensitivity are introduced so that summary generation becomes more reasonable and effective. The method comprises the following steps: first, the topic probability distribution of each document and the word probability distribution of each topic are obtained by training the LDA topic model, the topic probability distribution of each sentence is then determined, and the semantic similarity measurement between sentences is thereby effectively converted into a similarity measurement over sentence topic probability distributions; with sentences as nodes, edges are built from the semantic similarity between sentences combined with the cosine similarity, generating a text graph that represents the document; the topic relevance of each sentence is computed from the topic probability distributions of the sentence and of the document; and the position sensitivity of each sentence is computed from its position in the document.

Description

A graph model-based automatic abstracting method
Technical field
The present invention relates to the field of automatic abstracting, and more particularly to a graph model-based automatic abstracting method.
Background technology
Automatic summarization technology uses a computer to process documents automatically and generate a summary containing the core content of the original documents, thereby compressing the documents so that people can find and obtain the information they need in less time, which effectively alleviates the information overload problem.
Although automatic summarization has a long research history dating back to the 1960s, extractive summarization, i.e. directly extracting key sentences from the original text to form the abstract, remains the mainstream approach in this field. The core idea of extractive summarization is to first perform statistical analysis on various features of the sentences in one or more documents, compute the importance of each sentence, and then select summary sentences by a suitable extraction method to form the abstract. The techniques currently applied to extractive summarization fall into five categories: statistics-based techniques, topic-based techniques, discourse-relation-based techniques, machine learning techniques, and graph-model-based techniques.
Statistical models are widely used in natural language processing, and statistical techniques were the earliest applied to automatic summarization; compared with other techniques, they require no complicated modeling and are simple and easy to implement. The importance of a text unit, however, is determined not only by literal word repetition but also by the semantic associations behind the words; topic-based methods mine these semantic associations and incorporate topical knowledge into the weighting of text units. Besides the above methods, the summarization problem can also be approached from a linguistic angle, and discourse-relation analysis is likewise widely used in this field. Binary classifiers, Hidden Markov Models and Bayesian methods were the earliest machine learning methods applied to automatic summarization, and many other machine learning methods have since been applied extensively as well.
In recent years, automatic summarization techniques based on graph ranking algorithms have received increasing attention, because such methods can judge text-unit weights from the global information of the document rather than relying only on limited local information, which is close to the way humans summarize. These methods take text units as the nodes of a graph, generate edges between nodes according to the correlation between text units, and thus represent the document as a text graph; a graph ranking algorithm such as PageRank is then used to rank the text units. In this way, analogous to the ranking of web pages, text units that are tightly connected with other text units obtain higher weights.
In graph ranking algorithms, the accuracy of the correlation measurement between nodes directly affects the ranking result, so in graph-ranking-based summarization the correlation measurement of text units is the core task. In much previous research, sentences are the most common choice of graph node, and the correlation measurement between sentences is mostly confined to the word level, for example word co-occurrence between sentences, the cosine similarity of sentences, or word correlation measured with WordNet; however, word-level measures can hardly capture the semantic correlation between sentences accurately. On the other hand, graph-ranking-based summarization methods that only consider the correlation between text units ignore an important indicator, namely the correlation between a text unit and the document topic, which may cause the ranking of text units to fall into local optima. For example, an article may contain a large block of non-core, off-topic content whose sentences are nevertheless closely correlated with one another; after graph ranking, sentences in this block may receive high weights, yet such locally optimal sentences can only represent that block rather than the whole document. In addition, graph-ranking-based methods also ignore some attributes of the text unit itself, such as sentence length and sentence position. In many articles, especially news articles, the first paragraph usually states the gist of the article; ignoring the position attribute of sentences undoubtedly affects the ranking of sentence weights.
Summary of the invention
The object of the invention is to overcome the problems in the prior art, namely the difficulty of accurately measuring the semantic correlation between sentences and the neglect of attributes of the text unit itself, and to provide an improved graph model-based automatic abstracting method.
To achieve the above object, the present invention proposes a two-layer similarity measurement model combining the LDA topic model and cosine similarity, which measures the correlation between sentences at both the semantic level and the word level. It further defines the topic relevance and the position sensitivity of sentences; in the graph ranking step, sentences are given initial weights according to their topic relevance and position sensitivity, and a Biased-PageRank algorithm is used for ranking, optimizing the effect of sentence ranking.
The present invention is achieved through the following technical solutions:
A graph model-based automatic abstracting method: the method first obtains the topic probability distribution of each document and the word probability distribution of each topic by training an LDA topic model, then derives the topic probability distribution of each sentence, so that the semantic similarity measurement between sentences is effectively converted into a similarity measurement over sentence topic probability distributions; next, with sentences as nodes, edges are built according to the semantic similarity between sentences combined with the cosine similarity, generating a text graph that represents the document; then the topic relevance of each sentence is computed from the topic probability distributions of the sentence and the document, the position sensitivity of each sentence is computed from its position in the document, and nodes are given static weights according to these two attributes, after which the Biased-PageRank algorithm is used to rank the sentences; finally, high-weight sentences are selected as required and combined in their original order to obtain the abstract.
The graph model-based automatic abstracting method specifically comprises the following steps:
(1) Document preprocessing: remove useless information from the corpus. Given a group of document collections, apply word segmentation, stop-word removal and stemming, removing the useless information in the corpus to obtain a cleaned corpus.
(2) Document vectorization, to prepare for LDA topic model training: number all the words in the cleaned corpus from (1), and convert every document into a corresponding vector according to these numbers.
(3) Word frequency statistics: based on the statistics of word occurrence frequencies in the documents, generate a document-term frequency matrix in which each entry records how often each word occurs in each document of the corpus.
(4) Sentence vectorization: convert each sentence in each document into a corresponding vector according to the frequency matrix in (3); each dimension of the vector is the TF*IDF (term frequency * inverse document frequency) value of the corresponding word.
(5) LDA model training: train the LDA topic model on the vectorized documents from (2) with the Gibbs sampling algorithm, estimating the topic probability distribution of each document and the word probability distribution of each topic.
(6) Similarity calculation between sentences: use the LDA training result from (5) to compute the topic probability distribution of each sentence, then compute a quantized value of the semantic similarity between sentences from the Jensen-Shannon distance between their topic probability distributions; in addition, compute the cosine similarity between sentences from their TF*IDF vectors as a complement to the semantic similarity.
(7) Construction of the text graph: with sentences as nodes, generate weighted edges according to the similarities obtained in (6), representing the document as a text graph.
(8) Topic relevance calculation: compute the topic relevance of each sentence from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution.
(9) Position sensitivity calculation: compute the position sensitivity of each sentence from its position in the document.
(10) Sentence ranking: give each sentence an initial weight according to the topic relevance from (8) and the position sensitivity from (9), and rank the text graph generated in (7) with the Biased-PageRank algorithm.
(11) Abstract generation: according to the ranking result of (10), select the sentences with higher weights and combine them to generate the abstract.
In an implementation, the topic probability distribution of a sentence in step (6) is computed by the formula:

$$P(T_j \mid S_r) = \sum_{W_i \in S_r} P(W_i \mid T_j) \cdot P(T_j \mid D_k) \cdot \sum_{k=1}^{M} P(T_j \mid D_k)$$

where P(T_j|S_r) is the probability that sentence S_r in document D_k belongs to topic T_j; P(W_i|T_j) is the probability that word W_i represents topic T_j, computed from the topic-word distribution P(W|T) trained by the LDA topic model; and P(T_j|D_k) is the probability that document D_k belongs to topic T_j, computed from the document-topic distribution P(T|D) trained by the LDA topic model. The beneficial effect is that the semantic correlation measurement between sentences is effectively converted into a correlation measurement over sentence topic probability distributions.
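As an illustration only, the following is a minimal Python sketch (not the patent's reference code) of how a sentence's topic distribution could be assembled from an LDA model's topic-word matrix P(W|T) and document-topic matrix P(T|D), following the formula above as printed; the array shapes and the final normalization to a proper distribution are assumptions.

```python
import numpy as np

def sentence_topic_distribution(word_ids, doc_id, p_w_given_t, p_t_given_d):
    """word_ids: indices of the words in sentence S_r (within document doc_id).
    p_w_given_t: array of shape (V, K), P(W_i | T_j).
    p_t_given_d: array of shape (M, K), P(T_j | D_k).
    Returns a length-K vector approximating P(T_j | S_r)."""
    # Per-topic mass summed over all M documents (the trailing factor in the formula).
    topic_mass_over_docs = p_t_given_d.sum(axis=0)            # shape (K,)
    # Per-topic contribution of every word occurring in the sentence.
    word_contrib = p_w_given_t[word_ids, :].sum(axis=0)       # shape (K,)
    scores = word_contrib * p_t_given_d[doc_id, :] * topic_mass_over_docs
    return scores / scores.sum()                               # normalize (assumption)

# Tiny synthetic example: vocabulary of 5 words, 3 topics, 2 documents.
rng = np.random.default_rng(0)
p_w_t = rng.random((5, 3)); p_w_t /= p_w_t.sum(axis=0)        # each topic's word column sums to 1
p_t_d = rng.random((2, 3)); p_t_d /= p_t_d.sum(axis=1, keepdims=True)
print(sentence_topic_distribution([0, 2, 4], doc_id=0, p_w_given_t=p_w_t, p_t_given_d=p_t_d))
```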
In an implementation, step (6) computes the quantized value of the semantic similarity between sentences from the Jensen-Shannon distance between their topic probability distributions; for the topic distributions P and Q of sentences P and Q, the formula is:

$$SemSim_{PQ} = 1 - \tfrac{1}{2}\big(KL(P \,\|\, M) + KL(Q \,\|\, M)\big)$$

where M = (P + Q)/2 is the average of the two distributions, and KL(P‖M) is the KL divergence between distributions P and M, computed as:

$$KL(P \,\|\, M) = \sum_i P(i)\,\ln\frac{P(i)}{M(i)}$$

The beneficial effect is that the semantic correlation between sentences is measured more accurately.
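Below is a minimal sketch of this Jensen-Shannon-based semantic similarity, assuming the standard construction M = (P + Q)/2; the small epsilon used to guard against log(0) is an implementation assumption, not part of the patent.

```python
import numpy as np

def kl(p, m, eps=1e-12):
    """KL(P || M) = sum_i P(i) * ln(P(i) / M(i))."""
    p = np.asarray(p, dtype=float) + eps
    m = np.asarray(m, dtype=float) + eps
    return float(np.sum(p * np.log(p / m)))

def sem_sim(p, q):
    """SemSim_PQ = 1 - 1/2 * (KL(P||M) + KL(Q||M)), with M the average distribution."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2.0
    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))

# Identical distributions give similarity 1; disjoint ones give 1 - ln(2) ≈ 0.307.
print(sem_sim([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))
print(sem_sim([1.0, 0.0], [0.0, 1.0]))
```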
In an implementation, step (6) additionally computes the cosine similarity between sentences from their word-frequency (TF*IDF) vectors as a complement to the semantic similarity. For sentences P and Q, the final similarity measurement combining the cosine similarity is:

$$Sim_{PQ} = (1-\lambda)\cdot SemSim_{PQ} + \lambda \cdot CosSim_{PQ}$$

The beneficial effect is that the similarity between sentences is measured at both the word level and the semantic level, so the two measurements complement each other and the result is more accurate.

λ takes a value between 0 and 1 and is used to adjust the proportions of SemSim and CosSim. CosSim_PQ is computed as:

$$CosSim_{PQ} = \frac{\sum_{w \in P,Q} TF_{w,P}\cdot TF_{w,Q}\cdot (IDF_w)^2}{\sqrt{\sum_{w \in P}\big(TF_{w,P}\cdot IDF_w\big)^2}\,\cdot\,\sqrt{\sum_{w \in Q}\big(TF_{w,Q}\cdot IDF_w\big)^2}}$$

where TF_{w,P} is the term frequency of word w in sentence P, computed as:

$$TF_{w,P} = \frac{N_{w,P}}{N_P}$$

N_{w,P} is the number of times word w occurs in sentence P, and N_P is the total number of words in sentence P.

IDF_w is the inverse document frequency of word w, computed as:

$$IDF_w = \log\frac{N}{N_w}$$

where N is the total number of documents in the corpus and N_w is the number of occurrences of word w in the corpus.
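The following minimal sketch shows the word-level complement: TF*IDF weights, the cosine similarity between two sentences, and the combined score Sim_PQ = (1 - λ)·SemSim_PQ + λ·CosSim_PQ. The tokenization, the use of the standard document-frequency IDF, and the helper names are assumptions for illustration.

```python
import math
from collections import Counter

def tf(sentence_tokens):
    """Term frequency of each word within one sentence."""
    counts = Counter(sentence_tokens)
    total = len(sentence_tokens)
    return {w: c / total for w, c in counts.items()}

def idf(word, documents):
    """IDF_w = log(N / N_w), with N_w taken as the number of documents containing w."""
    n_w = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / n_w) if n_w else 0.0

def cos_sim(p_tokens, q_tokens, documents):
    tf_p, tf_q = tf(p_tokens), tf(q_tokens)
    shared = set(tf_p) & set(tf_q)                      # words appearing in both sentences
    num = sum(tf_p[w] * tf_q[w] * idf(w, documents) ** 2 for w in shared)
    norm_p = math.sqrt(sum((tf_p[w] * idf(w, documents)) ** 2 for w in tf_p))
    norm_q = math.sqrt(sum((tf_q[w] * idf(w, documents)) ** 2 for w in tf_q))
    return num / (norm_p * norm_q) if norm_p and norm_q else 0.0

def combined_sim(sem, cos, lam=0.2):
    """Sim_PQ = (1 - lambda) * SemSim_PQ + lambda * CosSim_PQ."""
    return (1 - lam) * sem + lam * cos

docs = [{"explosion", "occur", "lounge", "barrack"}, {"rescue", "team", "arrive"}]
p, q = ["explosion", "occur", "lounge"], ["explosion", "barrack"]
print(combined_sim(sem=0.8, cos=cos_sim(p, q, docs), lam=0.2))
```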
In an implementation, step (8) computes the topic relevance of a sentence from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution. Let D be the topic distribution of the article and P the topic distribution of sentence P; then the normalized topic relevance TR_P of sentence P is computed as:

$$TR_P = \frac{SemSim_{PD}}{\sum_{P \in D} SemSim_{PD}}$$

The beneficial effect is that the topic relevance of sentences is measured more accurately at the semantic level.
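A minimal sketch of the normalized topic relevance TR_P follows: each sentence's JS-based similarity to the document's topic distribution, divided by the sum over all sentences. The sem_sim helper is the similarity sketched earlier, repeated here only so the snippet is self-contained.

```python
import numpy as np

def sem_sim(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    m = (p + q) / 2.0
    js = 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))
    return 1.0 - js

def topic_relevance(sentence_topics, doc_topics):
    """sentence_topics: list of per-sentence topic distributions; doc_topics: P(T|D)."""
    raw = np.array([sem_sim(s, doc_topics) for s in sentence_topics])
    return raw / raw.sum()                      # normalize so the values sum to 1

doc = [0.5, 0.3, 0.2]
sents = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.4, 0.4, 0.2]]
print(topic_relevance(sents, doc))              # sentences closest to the document topics score highest
```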
In an implementation, the position sensitivity in step (9) is computed as:

$$PS_P = \frac{1/pos}{\sum_{i=1}^{len(D)} \frac{1}{i}}$$

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence of the document, then pos = 2) and len(D) is the number of sentences in document D. The beneficial effect is that the position sensitivity of sentences is measured reasonably, giving earlier sentences higher weights.
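A minimal sketch of this position sensitivity is shown below; its output matches the 10-sentence table given later in the embodiment up to rounding (the table values appear to be truncated rather than rounded).

```python
def position_sensitivity(pos, n_sentences):
    """PS_P = (1 / pos) / sum_{i=1}^{len(D)} (1 / i)."""
    harmonic = sum(1.0 / i for i in range(1, n_sentences + 1))
    return (1.0 / pos) / harmonic

print([round(position_sensitivity(p, 10), 3) for p in range(1, 11)])
# ≈ [0.341, 0.171, 0.114, 0.085, 0.068, 0.057, 0.049, 0.043, 0.038, 0.034]
```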
In an implementation, in step (10) each sentence is given an initial weight according to the topic relevance from step (8) and the position sensitivity from step (9), and the text graph generated in (7) is ranked with the Biased-PageRank algorithm; the ranking iteration formula is:

$$R(P) = d \cdot \frac{TR_P + PS_P}{2} + (1-d)\sum_{Q \in adj(P)} \frac{Sim_{PQ}}{\sum_{Z \in adj(Q)} Sim_{ZQ}}\, R(Q)$$

where R(P) is the weight of node P, d is the damping ratio, generally taken as 0.85, N is the total number of nodes in the graph, and Q ∈ adj(P) denotes all nodes Q connected with P. The beneficial effect is that topic relevance and position sensitivity are merged into the ranking, making the ranking result more reasonable and effective.
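A minimal sketch of this biased ranking iteration follows: each sentence's weight is pulled toward its static prior (TR_P + PS_P)/2 and redistributed along similarity-weighted edges until convergence. The dense similarity matrix, the damping factor and the convergence tolerance are assumptions mirroring the embodiment's settings.

```python
import numpy as np

def biased_pagerank(sim, tr, ps, d=0.85, tol=1e-4, max_iter=200):
    """sim: (n, n) symmetric sentence-similarity matrix with zero diagonal.
    tr, ps: length-n arrays of topic relevance and position sensitivity."""
    n = sim.shape[0]
    prior = (np.asarray(tr) + np.asarray(ps)) / 2.0
    # Column-normalize so Sim_PQ / sum_Z Sim_ZQ becomes a transition weight.
    col_sums = sim.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    trans = sim / col_sums
    r = np.ones(n)
    for _ in range(max_iter):
        r_new = d * prior + (1 - d) * (trans @ r)
        if np.abs(r_new - r).sum() < tol:        # stop once the weights stabilize
            return r_new
        r = r_new
    return r

sim = np.array([[0.0, 0.6, 0.3], [0.6, 0.0, 0.4], [0.3, 0.4, 0.0]])
print(biased_pagerank(sim, tr=[0.5, 0.3, 0.2], ps=[0.55, 0.27, 0.18]))
```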
Brief description of the drawings
Fig. 1 is a structural block diagram of the present invention.
Embodiment
To make the objects, technical schemes and advantages of the present invention clearer, the automatic abstracting method implemented according to the invention is further described below. It should be understood that the specific embodiment described here is only intended to explain the present invention and not to limit it; the protection scope of the invention is not restricted to the following embodiment. On the contrary, according to the inventive concept, those of ordinary skill in the art can make appropriate changes, and such changes fall within the scope of the invention defined by the claims.
The automatic abstracting method according to the specific embodiment of the invention comprises the following steps:
1) Document preprocessing module:
Select a specific document set; the document set should not be too small, at least several hundred documents, otherwise the LDA topic model training result will be seriously affected. This example uses the DUC02 conference document set from the automatic summarization field, consisting of 586 news articles covering 60 topics, with about 10 articles per topic on average. First, stop words are removed from the documents according to a stop-word list and punctuation marks are stripped, then stemming is applied to extract word stems. After these natural language processing steps, the cleaned document set D is obtained. For example, the sentence "The explosion occurred at 8:26 a.m. in a lounge in the barracks." becomes "explosion occur lounge barrack" after preprocessing: the stop words are removed and the word tenses are normalized.
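A minimal preprocessing sketch in the spirit of this step is given below: lowercase, strip punctuation and digits, drop stop words, and stem. The tiny inline stop-word list and the use of NLTK's PorterStemmer are assumptions; the patent specifies only a stop list plus stemming, not a particular library, and Porter stems more aggressively than the lemmatized forms shown in the example above.

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "at", "in", "of", "on", "and", "to", "is", "was"}
stemmer = PorterStemmer()

def preprocess(sentence):
    tokens = re.findall(r"[a-z]+", sentence.lower())        # strip digits and punctuation
    return [stemmer.stem(t) for t in tokens
            if t not in STOP_WORDS and len(t) > 1]           # drop stop words and stray single letters

print(preprocess("The explosion occurred at 8:26 a.m. in a lounge in the barracks."))
# ['explos', 'occur', 'loung', 'barrack']
```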
2) Data preparation module:
Document vectorization: starting from the cleaned corpus D, first perform tokenization: count all words occurring in document set D and give each word a unique number from 0 to N (the total number of distinct words); then convert each document into a vector according to the numbers of the words it contains, each dimension of the vector being the number of a word occurring in that document. This step yields the vectorized document set, denoted T_D (Tokenized Documents); T_D is the set of vectors of the 586 documents, each dimension of T_D corresponding to one document vector. For example, the vector (2, 3, 56, 78, ..., 120) represents a document, where each number is the id of a word appearing in that document.

Sentence vectorization: each sentence in each document is converted into an N-dimensional vector, where N is the total number of words. For each word occurring in the sentence, the value of its corresponding dimension in the sentence vector is the TF*IDF value of that word (TF*IDF = term frequency * inverse document frequency). Let TF_{w,P} be the term frequency of word w in sentence P, computed as:

$$TF_{w,P} = \frac{N_{w,P}}{N_P}$$

N_{w,P} is the number of times word w occurs in sentence P, and N_P is the total number of words in sentence P. IDF_w is the inverse document frequency of word w, computed as:

$$IDF_w = \log\frac{N}{N_w}$$

where N is the total number of documents in the corpus and N_w is the number of occurrences of word w in the corpus. The document set after sentence vectorization is denoted S_T_D (Sentence-Tokenized Documents). Each dimension of S_T_D corresponds to one document, and each dimension of a document is the TF*IDF vector of one of its sentences. For example, the sentence "explosion occur lounge barrack" is vectorized as (4.324, 0, 0, 0.836, 0, 0, ..., 2.563, 0, 0, ..., 3.125, ...); "explosion" is a relatively core word, and "lounge" and "barrack" are also fairly important, so their TF*IDF values are large, whereas "occur", being a common word, has a smaller TF*IDF value.
3) Computing module:
This module comprises sentence similarity measurement, topic relevance calculation and position sensitivity calculation.
T_D is used as the document set and the LDA topic model is trained with Gibbs sampling. For 586 documents, the number of topics of the LDA topic model is set to 20, the prior α is set to 2.5 (50 / number of topics) and β is set to 0.01; this is a common setting under which Gibbs sampling converges faster during training. After training, the topic-word distribution of each topic and the document-topic distribution of each document are obtained. The topic-word distribution is denoted P(W|T), where P(W_i|T_j) is the probability that word W_i represents topic T_j. The document-topic distribution is denoted P(T|D), where P(T_j|D_k) is the probability that document D_k belongs to topic T_j. The dimension of W is the number of words in the corpus, the dimension of T is the number of topics set for the LDA model, and the dimension of D is the number of documents in the corpus.
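For illustration, a minimal training sketch with gensim is shown below. Note the substitution: gensim's LdaModel trains with online variational Bayes rather than the Gibbs sampling named in the patent, but it yields the same two outputs needed downstream, the topic-word distribution P(W|T) and the document-topic distribution P(T|D). The hyperparameters mirror the embodiment (20 topics, α = 50/K = 2.5, β = 0.01), and the toy three-document corpus is purely illustrative.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["explosion", "occur", "lounge", "barrack"],
        ["rescue", "team", "arrive", "barrack"],
        ["explosion", "injure", "soldier"]]
dictionary = corpora.Dictionary(docs)                    # word <-> id mapping (step 2)
corpus = [dictionary.doc2bow(d) for d in docs]           # bag-of-words vectors (step 3)

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
               alpha=2.5, eta=0.01, passes=10, random_state=0)

p_w_given_t = lda.get_topics()                           # shape (K, V): P(W|T)
p_t_given_d = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
print(p_w_given_t.shape, p_t_given_d[0][:3])
```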
For one document, the required topic distribution of its sentences is P(T|S); for sentence S_r in document D_k, the probability that it belongs to topic T_j is computed as:

$$P(T_j \mid S_r) = \sum_{W_i \in S_r} P(W_i \mid T_j) \cdot P(T_j \mid D_k) \cdot \sum_{k=1}^{M} P(T_j \mid D_k)$$

This yields the topic distribution P(T|S) of each sentence; for example, P(T|S_r) is the topic probability distribution of sentence S_r. For convenience, the topic distributions of sentences P and Q are denoted P and Q. For these topic distributions, the semantic similarity is computed as:

$$SemSim_{PQ} = 1 - \tfrac{1}{2}\big(KL(P \,\|\, M) + KL(Q \,\|\, M)\big)$$

where M = (P + Q)/2 and KL(P‖M) is the KL divergence between distributions P and M, computed as:

$$KL(P \,\|\, M) = \sum_i P(i)\,\ln\frac{P(i)}{M(i)}$$
According to the sentence-vectorized document set S_T_D, the cosine similarity of sentences P and Q is computed as:

$$CosSim_{PQ} = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$

where A and B are the TF*IDF vectors of sentences P and Q.

For sentences P and Q, the final similarity measurement is:

$$Sim_{PQ} = (1-\lambda)\cdot SemSim_{PQ} + \lambda \cdot CosSim_{PQ}$$

With this formula, the similarity between any two sentences in each document is obtained. λ = 0.2 is used when computing the similarity between sentences; the following table shows the sentence similarities in one document:
Sentence   S1      S2      S3      S4      S5      ……
S1         1
S2         0.653   1
S3         0.436   0.554   1
S4         0.362   0.432   0.235   1
S5         0.375   0.343   0.482   0.275   1
……         ……      ……      ……      ……      ……      ……
The computation of sentence topic relevance is exactly the same as that of the semantic similarity between sentences, except that the two sentence distributions are replaced by the distribution of a sentence and the distribution of the document. Let D be the topic distribution of the article and P the topic distribution of sentence P; then the normalized topic relevance TR_P of sentence P is computed as:

$$TR_P = \frac{SemSim_{PD}}{\sum_{P \in D} SemSim_{PD}}$$

With this formula, the topic relevance of every sentence in each document is computed. The following table shows the topic relevance of the sentences in one document:
Sentence          S1      S2      S3      S4      S5      ……
Topic relevance   0.436   0.343   0.276   0.145   0.193   ……
It can be seen that earlier sentences also have higher topic relevance.
The position sensitivity is computed as:

$$PS_P = \frac{1/pos}{\sum_{i=1}^{len(D)} \frac{1}{i}}$$

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence of the document then pos = 2) and len(D) is the number of sentences in D. With this formula, the position sensitivity of every sentence in each document is computed. For example, for a document consisting of 10 sentences, the position sensitivities are:
Sentence               S1     S2     S3     S4     S5     S6     S7     S8     S9     S10
Position sensitivity   0.341  0.170  0.113  0.085  0.068  0.056  0.048  0.042  0.037  0.034
4) Sentence ranking module:
The text graph is an abstract graph and does not need to be actually constructed; for the iterative sentence ranking, the sentence similarities, topic relevances and position sensitivities computed in 3) are sufficient.
For a document, its weight vector is denoted R(P); each dimension of R(P) is the weight of one sentence and is initially set to 1, i.e. R(P) = (1, 1, ..., 1).
The weights are then updated iteratively according to the formula:

$$R(P) = d \cdot \frac{TR_P + PS_P}{2} + (1-d)\sum_{Q \in adj(P)} \frac{Sim_{PQ}}{\sum_{Z \in adj(Q)} Sim_{ZQ}}\, R(Q)$$

Each dimension of R(P), i.e. the weight of each sentence, is updated in this way; one update of R(P) counts as one iteration, and the iteration continues until the values of R(P) are stable and no longer change, which usually happens within a few dozen iterations, giving the final weight of each sentence. Here d = 0.85 is used for ranking and the convergence threshold is set to 0.0001, i.e. the iteration is considered converged once the weights in R(P) change by no more than the threshold between iterations. For example, the converged R(P) of one article, kept to 4 decimal places, is:
[0.1823, 0.1590, 0.0901, 0.0835, 0.0800, 0.0830, 0.1685, 0.0532, 0.0712, 0.0823, 0.0810, 0.0855]
It can be seen that the sentences with the highest weights are sentences 1, 2 and 7.
5) Abstract generation:
According to the ranking result of 4), the sentences with the highest weights are selected within the required word-count limit and combined in their original order to generate the summary.
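A minimal sketch of this selection rule follows: take sentences in descending weight order until a word budget is reached, then emit them in their original document order. The specific word-budget parameter and the whitespace word counting are assumptions; the patent only states that the summary length is limited by a word count.

```python
def build_summary(sentences, weights, max_words=100):
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n > max_words:                 # skip sentences that would exceed the budget
            continue
        chosen.append(i)
        used += n
    return " ".join(sentences[i] for i in sorted(chosen))   # restore original document order

weights = [0.1823, 0.1590, 0.0901, 0.0835, 0.0800, 0.0830,
           0.1685, 0.0532, 0.0712, 0.0823, 0.0810, 0.0855]
sentences = [f"Sentence {i+1} of the article." for i in range(len(weights))]
print(build_summary(sentences, weights, max_words=20))
```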

Claims (6)

1. A graph model-based automatic abstracting method, characterized in that: the method first obtains the topic probability distribution of each document and the word probability distribution of each topic by training an LDA topic model, then derives the topic probability distribution of each sentence, so that the semantic similarity measurement between sentences is effectively converted into a similarity measurement over sentence topic probability distributions; next, with sentences as nodes, edges are built according to the semantic similarity between sentences combined with the cosine similarity, generating a text graph that represents the document; then the topic relevance of each sentence is computed from the topic probability distributions of the sentence and the document, the position sensitivity of each sentence is computed from its position in the document, and nodes are given static weights according to these two attributes, after which the Biased-PageRank algorithm is used to rank the sentences; finally, high-weight sentences are selected as required and combined in their original order to obtain the abstract.
2. The graph model-based automatic abstracting method according to claim 1, comprising the following steps:
(1) Document preprocessing: remove useless information from the corpus; given a group of document collections, apply word segmentation, stop-word removal and stemming to remove the useless information in the corpus and obtain a cleaned corpus;
(2) Document vectorization, to prepare for LDA topic model training: number all the words in the cleaned corpus from (1), and convert every document into a corresponding vector according to these numbers;
(3) Word frequency statistics: based on the statistics of word occurrence frequencies in the documents, generate a document-term frequency matrix in which each entry records how often each word occurs in each document of the corpus;
(4) Sentence vectorization: convert each sentence in each document into a corresponding vector according to the frequency matrix in (3), each dimension of the vector being the TF*IDF (term frequency * inverse document frequency) value of the corresponding word;
(5) LDA model training: train the LDA topic model on the vectorized documents from (2) with the Gibbs sampling algorithm, estimating the topic probability distribution of each document and the word probability distribution of each topic;
(6) Similarity calculation between sentences: use the LDA training result from (5) to compute the topic probability distribution of each sentence, then compute a quantized value of the semantic similarity between sentences from the JS distance between their topic probability distributions; in addition, compute the cosine similarity between sentences from their word-frequency vectors as a complement to the semantic similarity;
(7) Construction of the text graph: with sentences as nodes, generate weighted edges according to the similarities obtained in (6), representing the document as a text graph;
(8) Topic relevance calculation: compute the topic relevance of each sentence from the Jensen-Shannon distance (JS distance) between the sentence's topic probability distribution and the document's topic probability distribution;
(9) Position sensitivity calculation: compute the position sensitivity of each sentence from its position in the document;
(10) Sentence ranking: give each sentence an initial weight according to the topic relevance from (8) and the position sensitivity from (9), and rank the text graph generated in (7) with the Biased-PageRank algorithm;
(11) Abstract generation: according to the ranking result of (10), select the sentences with higher weights and combine them to generate the abstract.
3. The method according to claim 2, characterized in that the topic probability distribution of a sentence in step (6) is computed by the formula:

$$P(T_j \mid S_r) = \sum_{W_i \in S_r} P(W_i \mid T_j) \cdot P(T_j \mid D_k) \cdot \sum_{k=1}^{M} P(T_j \mid D_k)$$

where P(T_j|S_r) is the probability that sentence S_r in document D_k belongs to topic T_j; P(W_i|T_j) is the probability that word W_i represents topic T_j, computed from the topic-word distribution P(W|T) trained by the LDA topic model; and P(T_j|D_k) is the probability that document D_k belongs to topic T_j, computed from the document-topic distribution P(T|D) trained by the LDA topic model.
4. The method according to claim 2, characterized in that step (6) computes the quantized value of the semantic similarity between sentences from the JS distance between the sentence topic probability distributions, and step (8) computes the topic relevance of a sentence from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution; for the topic distributions P and Q of sentences P and Q, the formula is:

$$SemSim_{PQ} = 1 - \tfrac{1}{2}\big(KL(P \,\|\, M) + KL(Q \,\|\, M)\big)$$

where KL(P‖M) is the KL divergence between distributions P and M, computed as:

$$KL(P \,\|\, M) = \sum_i P(i)\,\ln\frac{P(i)}{M(i)}$$

Let D be the topic distribution of the article and P the topic distribution of sentence P; then the normalized topic relevance TR_P of sentence P is computed as:

$$TR_P = \frac{SemSim_{PD}}{\sum_{P \in D} SemSim_{PD}}.$$
5. The method according to claim 2, characterized in that the position sensitivity in step (9) is computed as:

$$PS_P = \frac{1/pos}{\sum_{i=1}^{len(D)} \frac{1}{i}}$$

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence of the document then pos = 2) and len(D) is the number of sentences in document D.
6. The method according to claim 2, characterized in that in step (10) each sentence is given an initial weight according to the topic relevance from (8) and the position sensitivity from (9), and the text graph generated in (7) is ranked with the Biased-PageRank algorithm; the ranking iteration formula is:

$$R(P) = d \cdot \frac{TR_P + PS_P}{2} + (1-d)\sum_{Q \in adj(P)} \frac{Sim_{PQ}}{\sum_{Z \in adj(Q)} Sim_{ZQ}}\, R(Q)$$

where R(P) is the weight of node P, d is the damping ratio, generally 0.85, N is the total number of nodes in the graph, and Q ∈ adj(P) denotes all nodes Q connected with node P; Sim_PQ is the similarity of sentences P and Q, obtained by combining their semantic similarity and cosine similarity:

$$Sim_{PQ} = (1-\lambda)\cdot SemSim_{PQ} + \lambda \cdot CosSim_{PQ}$$

λ takes a value between 0 and 1 and is used to adjust the proportions of SemSim and CosSim; CosSim_PQ is computed as:

$$CosSim_{PQ} = \frac{\sum_{w \in P,Q} TF_{w,P}\cdot TF_{w,Q}\cdot (IDF_w)^2}{\sqrt{\sum_{w \in P}\big(TF_{w,P}\cdot IDF_w\big)^2}\,\cdot\,\sqrt{\sum_{w \in Q}\big(TF_{w,Q}\cdot IDF_w\big)^2}}$$

where TF_{w,P} is the term frequency of word w in sentence P, computed as:

$$TF_{w,P} = \frac{N_{w,P}}{N_P}$$

N_{w,P} is the number of times word w occurs in sentence P and N_P is the total number of words in sentence P;

IDF_w is the inverse document frequency of word w, computed as:

$$IDF_w = \log\frac{N}{N_w}$$

where N is the total number of documents in the corpus and N_w is the number of occurrences of word w.
CN201510703353.2A 2015-10-26 2015-10-26 Graph model-based automatic abstracting method Active CN105243152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510703353.2A CN105243152B (en) 2015-10-26 2015-10-26 Graph model-based automatic abstracting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510703353.2A CN105243152B (en) 2015-10-26 2015-10-26 Graph model-based automatic abstracting method

Publications (2)

Publication Number Publication Date
CN105243152A true CN105243152A (en) 2016-01-13
CN105243152B CN105243152B (en) 2018-08-24

Family

ID=55040800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510703353.2A Active CN105243152B (en) 2015-10-26 2015-10-26 Graph model-based automatic abstracting method

Country Status (1)

Country Link
CN (1) CN105243152B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020078090A1 (en) * 2000-06-30 2002-06-20 Hwang Chung Hee Ontological concept-based, user-centric text summarization
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GUNES ERKAN et al.: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research *
边晋强: "Research on Document Summarization Based on the LDA Topic Model" (基于LDA主题模型的文档文摘研究), China Master's Theses Full-text Database, Information Science and Technology *
邓光喜: "Research on Topic-Oriented Automatic Summarization of Web Documents" (面向主题的Web文档自动文摘生成方法研究), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740354A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Adaptive potential Dirichlet model selection method and apparatus
CN105488033A (en) * 2016-01-26 2016-04-13 中国人民解放军国防科学技术大学 Preprocessing method and device for correlation calculation
CN105488033B (en) * 2016-01-26 2018-01-02 中国人民解放军国防科学技术大学 Associate the preprocess method and device calculated
CN105740354B (en) * 2016-01-26 2018-11-30 中国人民解放军国防科学技术大学 The method and device of adaptive potential Di Li Cray model selection
CN105868178B (en) * 2016-03-28 2018-07-17 浙江大学 A kind of multi-document auto-abstracting generation method of phrase-based theme modeling
CN105868178A (en) * 2016-03-28 2016-08-17 浙江大学 Multi-document automatic abstract generation method based on phrase subject modeling
CN105938481A (en) * 2016-04-07 2016-09-14 北京航空航天大学 Anomaly detection method of multi-mode text data in cities
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary
CN106227722B (en) * 2016-09-12 2019-07-05 中山大学 A kind of extraction method based on listed company's bulletin abstract
CN106708803A (en) * 2016-12-21 2017-05-24 东软集团股份有限公司 Feature extraction method and device
US10621219B2 (en) 2017-02-10 2020-04-14 International Business Machines Corporation Techniques for determining a semantic distance between subjects
US10599694B2 (en) 2017-02-10 2020-03-24 International Business Machines Corporation Determining a semantic distance between subjects
US11295219B2 (en) 2017-02-10 2022-04-05 International Business Machines Corporation Answering questions based on semantic distances between subjects
CN106997344A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 Keyword abstraction system
CN107291836A (en) * 2017-05-31 2017-10-24 北京大学 A kind of Chinese text summary acquisition methods based on semantic relevancy model
CN107291836B (en) * 2017-05-31 2020-06-02 北京大学 Chinese text abstract obtaining method based on semantic relevancy model
CN108182247A (en) * 2017-12-28 2018-06-19 东软集团股份有限公司 Text summarization method and apparatus
CN110609997A (en) * 2018-06-15 2019-12-24 北京百度网讯科技有限公司 Method and device for generating abstract of text
CN110609997B (en) * 2018-06-15 2023-05-23 北京百度网讯科技有限公司 Method and device for generating abstract of text
CN108985370B (en) * 2018-07-10 2021-04-16 中国人民解放军国防科技大学 Automatic generation method of image annotation sentences
CN108985370A (en) * 2018-07-10 2018-12-11 中国人民解放军国防科技大学 Automatic generation method of image annotation sentences
CN109213853A (en) * 2018-08-16 2019-01-15 昆明理工大学 A kind of Chinese community's question and answer cross-module state search method based on CCA algorithm
CN109213853B (en) * 2018-08-16 2022-04-12 昆明理工大学 CCA algorithm-based Chinese community question-answer cross-modal retrieval method
CN109284357B (en) * 2018-08-29 2022-07-19 腾讯科技(深圳)有限公司 Man-machine conversation method, device, electronic equipment and computer readable medium
CN109284357A (en) * 2018-08-29 2019-01-29 腾讯科技(深圳)有限公司 Interactive method, device, electronic equipment and computer-readable medium
US11775760B2 (en) 2018-08-29 2023-10-03 Tencent Technology (Shenzhen) Company Limited Man-machine conversation method, electronic device, and computer-readable medium
CN109344280B (en) * 2018-10-13 2021-09-17 中山大学 Method and system for retrieving flow chart based on graph model
CN109344280A (en) * 2018-10-13 2019-02-15 中山大学 A kind of flow chart search method and system based on graph model
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN110399606A (en) * 2018-12-06 2019-11-01 国网信息通信产业集团有限公司 A kind of unsupervised electric power document subject matter generation method and system
WO2020258948A1 (en) * 2019-06-24 2020-12-30 北京大米科技有限公司 Text generation method and apparatus, storage medium, and electronic device
CN110728144B (en) * 2019-10-06 2023-04-07 湖北工业大学 Extraction type document automatic summarization method based on context semantic perception
CN110728144A (en) * 2019-10-06 2020-01-24 湖北工业大学 Extraction type document automatic summarization method based on context semantic perception
CN111177365B (en) * 2019-12-20 2022-08-02 山东科技大学 Unsupervised automatic abstract extraction method based on graph model
CN111177365A (en) * 2019-12-20 2020-05-19 山东科技大学 Unsupervised automatic abstract extraction method based on graph model
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V
CN111339287A (en) * 2020-02-24 2020-06-26 成都网安科技发展有限公司 Abstract generation method and device
CN111339287B (en) * 2020-02-24 2023-04-21 成都网安科技发展有限公司 Abstract generation method and device
CN112560479A (en) * 2020-12-24 2021-03-26 北京百度网讯科技有限公司 Abstract extraction model training method, abstract extraction device and electronic equipment
CN112560479B (en) * 2020-12-24 2024-01-12 北京百度网讯科技有限公司 Abstract extraction model training method, abstract extraction device and electronic equipment
CN112765344A (en) * 2021-01-12 2021-05-07 哈尔滨工业大学 Method, device and storage medium for generating meeting abstract based on meeting record
CN114064885A (en) * 2021-11-25 2022-02-18 北京航空航天大学 Unsupervised Chinese multi-document extraction type abstract method
CN114064885B (en) * 2021-11-25 2024-05-31 北京航空航天大学 Unsupervised Chinese multi-document extraction type abstract method
CN116108831A (en) * 2023-04-11 2023-05-12 宁波深擎信息科技有限公司 Method, device, equipment and medium for extracting text abstract based on field words

Also Published As

Publication number Publication date
CN105243152B (en) 2018-08-24

Similar Documents

Publication Publication Date Title
CN105243152A (en) Graph model-based automatic abstracting method
CN100517330C (en) Word sense based local file searching method
CN103235772B (en) A kind of text set character relation extraction method
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
CN103399901B (en) A kind of keyword abstraction method
CN100353361C (en) New method of characteristic vector weighting for text classification and its device
CN106844424A (en) A kind of file classification method based on LDA
CN106599029A (en) Chinese short text clustering method
CN105868178A (en) Multi-document automatic abstract generation method based on phrase subject modeling
CN106445920A (en) Sentence similarity calculation method based on sentence meaning structure characteristics
CN104268197A (en) Industry comment data fine grain sentiment analysis method
CN106294863A (en) A kind of abstract method for mass text fast understanding
CN101630312A (en) Clustering method for question sentences in question-and-answer platform and system thereof
CN104615593A (en) Method and device for automatic detection of microblog hot topics
CN106970910A (en) A kind of keyword extracting method and device based on graph model
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
CN103631858A (en) Science and technology project similarity calculation method
CN101782898A (en) Method for analyzing tendentiousness of affective words
CN101231634A (en) Autoabstract method for multi-document
CN104484380A (en) Personalized search method and personalized search device
Sharma et al. Proposed stemming algorithm for Hindi information retrieval
CN101295294A (en) Improved Bayes acceptation disambiguation method based on information gain
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant