CN105243152A - Graph model-based automatic abstracting method - Google Patents


Info

Publication number
CN105243152A
Authority
CN
China
Prior art keywords
sentence
document
theme
word
probability distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510703353.2A
Other languages
Chinese (zh)
Other versions
CN105243152B (en)
Inventor
王俊丽
魏绍臣
管敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201510703353.2A priority Critical patent/CN105243152B/en
Publication of CN105243152A publication Critical patent/CN105243152A/en
Application granted granted Critical
Publication of CN105243152B publication Critical patent/CN105243152B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of automatic abstracting and discloses a graph model-based automatic abstracting method. In the technical scheme, an LDA probabilistic topic model is applied to measure the semantic correlation between sentences and to improve the accuracy of sentence-correlation measurement, and the notions of sentence topic relevance and sentence position sensitivity are introduced so that summary generation becomes more reasonable and effective. The method comprises the following steps: first, the topic probability distribution of each document and the word probability distribution of each topic are obtained by training the LDA topic model, the topic probability distribution of each sentence is then determined, and the semantic similarity measurement between sentences is thereby effectively converted into a similarity measurement over sentence topic probability distributions; with sentences as nodes, edges are built from the semantic similarity between sentences combined with the cosine similarity, generating a text graph that represents the document; the topic relevance of each sentence is computed from the topic probability distributions of the sentence and of the document; and the position sensitivity of each sentence is computed from its position in the document.

Description

A graph model-based automatic abstracting method
Technical field
The present invention relates to the field of automatic abstracting, and more particularly to a graph model-based automatic abstracting method.
Background technology
Automatic summarization technology uses a computer to process documents automatically and generate a summary containing the core content of the original documents, thereby compressing the documents so that people can find and obtain the information they need in less time, which effectively alleviates the information overload problem.
Although automatic summarization has a long research history dating back to the 1960s, extractive summarization, i.e. directly extracting key sentences from the original text to form the abstract, remains the mainstream approach in this field. The core idea of extractive summarization is to first perform statistical analysis on various features of the sentences in one or more documents, compute the importance of each sentence, and then select summary sentences by a suitable extraction method to form the abstract. The techniques currently applied to extractive summarization fall into five categories: statistics-based techniques, topic-based techniques, discourse-relation-based techniques, machine learning techniques, and graph-model-based techniques.
Statistical models are widely used in natural language processing, and statistical techniques were the earliest applied to automatic summarization; compared with other techniques, they require no complicated modeling and are simple and easy to implement. The importance of a text unit, however, is determined not only by literal word repetition but also by the semantic associations behind the words; topic-based methods mine these semantic associations and incorporate topical knowledge into the weighting of text units. Besides the above methods, the summarization problem can also be approached from a linguistic angle, and discourse-relation analysis is likewise widely used in this field. Binary classifiers, Hidden Markov Models and Bayesian methods were the earliest machine learning methods applied to automatic summarization, and many other machine learning methods have since been applied extensively as well.
In recent years, automatic summarization techniques based on graph ranking algorithms have received increasing attention, because such methods can judge text-unit weights from the global information of the document rather than relying only on limited local information, which is close to the way humans summarize. These methods take text units as the nodes of a graph, generate edges between nodes according to the correlation between text units, and thus represent the document as a text graph; a graph ranking algorithm such as PageRank is then used to rank the text units. In this way, analogous to the ranking of web pages, text units that are tightly connected with other text units obtain higher weights.
In graph ranking algorithms, the accuracy of the correlation measurement between nodes directly affects the ranking result, so in graph-ranking-based summarization the correlation measurement of text units is the core task. In much previous research, sentences are the most common choice of graph node, and the correlation measurement between sentences is mostly confined to the word level, for example word co-occurrence between sentences, the cosine similarity of sentences, or word correlation measured with WordNet; however, word-level measures can hardly capture the semantic correlation between sentences accurately. On the other hand, graph-ranking-based summarization methods that only consider the correlation between text units ignore an important indicator, namely the correlation between a text unit and the document topic, which may cause the ranking of text units to fall into local optima. For example, an article may contain a large block of non-core, off-topic content whose sentences are nevertheless closely correlated with one another; after graph ranking, sentences in this block may receive high weights, yet such locally optimal sentences can only represent that block rather than the whole document. In addition, graph-ranking-based methods also ignore some attributes of the text unit itself, such as sentence length and sentence position. In many articles, especially news articles, the first paragraph usually states the gist of the article; ignoring the position attribute of sentences undoubtedly affects the ranking of sentence weights.
Summary of the invention
The object of the invention is to overcome the problems in the prior art, namely the difficulty of accurately measuring the semantic correlation between sentences and the neglect of attributes of the text unit itself, and to provide an improved graph model-based automatic abstracting method.
To achieve the above object, the present invention proposes a two-layer similarity measurement model combining the LDA topic model and cosine similarity, which measures the correlation between sentences at both the semantic level and the word level. It further defines the topic relevance and the position sensitivity of sentences; in the graph ranking step, sentences are given initial weights according to their topic relevance and position sensitivity, and a Biased-PageRank algorithm is used for ranking, optimizing the effect of sentence ranking.
The present invention is achieved through the following technical solutions:
A graph model-based automatic abstracting method: the method first obtains the topic probability distribution of each document and the word probability distribution of each topic by training an LDA topic model, then derives the topic probability distribution of each sentence, so that the semantic similarity measurement between sentences is effectively converted into a similarity measurement over sentence topic probability distributions; next, with sentences as nodes, edges are built according to the semantic similarity between sentences combined with the cosine similarity, generating a text graph that represents the document; then the topic relevance of each sentence is computed from the topic probability distributions of the sentence and the document, the position sensitivity of each sentence is computed from its position in the document, and nodes are given static weights according to these two attributes, after which the Biased-PageRank algorithm is used to rank the sentences; finally, high-weight sentences are selected as required and combined in their original order to obtain the abstract.
The graph model-based automatic abstracting method specifically comprises the following steps:
(1) Document preprocessing: remove useless information from the corpus. Given a group of document collections, apply word segmentation, stop-word removal and stemming, removing the useless information in the corpus to obtain a cleaned corpus.
(2) Document vectorization, to prepare for LDA topic model training: number all the words in the cleaned corpus from (1), and convert every document into a corresponding vector according to these numbers.
(3) Word frequency statistics: based on the statistics of word occurrence frequencies in the documents, generate a document-term frequency matrix in which each entry records how often each word occurs in each document of the corpus.
(4) Sentence vectorization: convert each sentence in each document into a corresponding vector according to the frequency matrix in (3); each dimension of the vector is the TF*IDF (term frequency * inverse document frequency) value of the corresponding word.
(5) LDA model training: train the LDA topic model on the vectorized documents from (2) with the Gibbs sampling algorithm, estimating the topic probability distribution of each document and the word probability distribution of each topic.
(6) Similarity calculation between sentences: use the LDA training result from (5) to compute the topic probability distribution of each sentence, then compute a quantized value of the semantic similarity between sentences from the Jensen-Shannon distance between their topic probability distributions; in addition, compute the cosine similarity between sentences from their TF*IDF vectors as a complement to the semantic similarity.
(7) Construction of the text graph: with sentences as nodes, generate weighted edges according to the similarities obtained in (6), representing the document as a text graph.
(8) Topic relevance calculation: compute the topic relevance of each sentence from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution.
(9) Position sensitivity calculation: compute the position sensitivity of each sentence from its position in the document.
(10) Sentence ranking: give each sentence an initial weight according to the topic relevance from (8) and the position sensitivity from (9), and rank the text graph generated in (7) with the Biased-PageRank algorithm.
(11) Abstract generation: according to the ranking result of (10), select the sentences with higher weights and combine them to generate the abstract.
In an implementation, the topic probability distribution of a sentence in step (6) is computed by the formula:

$$P(T_j \mid S_r) = \sum_{W_i \in S_r} P(W_i \mid T_j) \cdot P(T_j \mid D_k) \cdot \sum_{k=1}^{M} P(T_j \mid D_k)$$

where P(T_j|S_r) is the probability that sentence S_r in document D_k belongs to topic T_j; P(W_i|T_j) is the probability that word W_i represents topic T_j, computed from the topic-word distribution P(W|T) trained by the LDA topic model; and P(T_j|D_k) is the probability that document D_k belongs to topic T_j, computed from the document-topic distribution P(T|D) trained by the LDA topic model. The beneficial effect is that the semantic correlation measurement between sentences is effectively converted into a correlation measurement over sentence topic probability distributions.
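As an illustration only, the following is a minimal Python sketch (not the patent's reference code) of how a sentence's topic distribution could be assembled from an LDA model's topic-word matrix P(W|T) and document-topic matrix P(T|D), following the formula above as printed; the array shapes and the final normalization to a proper distribution are assumptions.

```python
import numpy as np

def sentence_topic_distribution(word_ids, doc_id, p_w_given_t, p_t_given_d):
    """word_ids: indices of the words in sentence S_r (within document doc_id).
    p_w_given_t: array of shape (V, K), P(W_i | T_j).
    p_t_given_d: array of shape (M, K), P(T_j | D_k).
    Returns a length-K vector approximating P(T_j | S_r)."""
    # Per-topic mass summed over all M documents (the trailing factor in the formula).
    topic_mass_over_docs = p_t_given_d.sum(axis=0)            # shape (K,)
    # Per-topic contribution of every word occurring in the sentence.
    word_contrib = p_w_given_t[word_ids, :].sum(axis=0)       # shape (K,)
    scores = word_contrib * p_t_given_d[doc_id, :] * topic_mass_over_docs
    return scores / scores.sum()                               # normalize (assumption)

# Tiny synthetic example: vocabulary of 5 words, 3 topics, 2 documents.
rng = np.random.default_rng(0)
p_w_t = rng.random((5, 3)); p_w_t /= p_w_t.sum(axis=0)        # each topic's word column sums to 1
p_t_d = rng.random((2, 3)); p_t_d /= p_t_d.sum(axis=1, keepdims=True)
print(sentence_topic_distribution([0, 2, 4], doc_id=0, p_w_given_t=p_w_t, p_t_given_d=p_t_d))
```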
In an implementation, step (6) computes the quantized value of the semantic similarity between sentences from the Jensen-Shannon distance between their topic probability distributions; for the topic distributions P and Q of sentences P and Q, the formula is:

$$SemSim_{PQ} = 1 - \tfrac{1}{2}\big(KL(P \,\|\, M) + KL(Q \,\|\, M)\big)$$

where M = (P + Q)/2 is the average of the two distributions, and KL(P‖M) is the KL divergence between distributions P and M, computed as:

$$KL(P \,\|\, M) = \sum_i P(i)\,\ln\frac{P(i)}{M(i)}$$

The beneficial effect is that the semantic correlation between sentences is measured more accurately.
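Below is a minimal sketch of this Jensen-Shannon-based semantic similarity, assuming the standard construction M = (P + Q)/2; the small epsilon used to guard against log(0) is an implementation assumption, not part of the patent.

```python
import numpy as np

def kl(p, m, eps=1e-12):
    """KL(P || M) = sum_i P(i) * ln(P(i) / M(i))."""
    p = np.asarray(p, dtype=float) + eps
    m = np.asarray(m, dtype=float) + eps
    return float(np.sum(p * np.log(p / m)))

def sem_sim(p, q):
    """SemSim_PQ = 1 - 1/2 * (KL(P||M) + KL(Q||M)), with M the average distribution."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2.0
    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))

# Identical distributions give similarity 1; disjoint ones give 1 - ln(2) ≈ 0.307.
print(sem_sim([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))
print(sem_sim([1.0, 0.0], [0.0, 1.0]))
```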
In an implementation, step (6) additionally computes the cosine similarity between sentences from their word-frequency (TF*IDF) vectors as a complement to the semantic similarity. For sentences P and Q, the final similarity measurement combining the cosine similarity is:

$$Sim_{PQ} = (1-\lambda)\cdot SemSim_{PQ} + \lambda \cdot CosSim_{PQ}$$

The beneficial effect is that the similarity between sentences is measured at both the word level and the semantic level, so the two measurements complement each other and the result is more accurate.

λ takes a value between 0 and 1 and is used to adjust the proportions of SemSim and CosSim. CosSim_PQ is computed as:

$$CosSim_{PQ} = \frac{\sum_{w \in P,Q} TF_{w,P}\cdot TF_{w,Q}\cdot (IDF_w)^2}{\sqrt{\sum_{w \in P}\big(TF_{w,P}\cdot IDF_w\big)^2}\,\cdot\,\sqrt{\sum_{w \in Q}\big(TF_{w,Q}\cdot IDF_w\big)^2}}$$

where TF_{w,P} is the term frequency of word w in sentence P, computed as:

$$TF_{w,P} = \frac{N_{w,P}}{N_P}$$

N_{w,P} is the number of times word w occurs in sentence P, and N_P is the total number of words in sentence P.

IDF_w is the inverse document frequency of word w, computed as:

$$IDF_w = \log\frac{N}{N_w}$$

where N is the total number of documents in the corpus and N_w is the number of occurrences of word w in the corpus.
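The following minimal sketch shows the word-level complement: TF*IDF weights, the cosine similarity between two sentences, and the combined score Sim_PQ = (1 - λ)·SemSim_PQ + λ·CosSim_PQ. The tokenization, the use of the standard document-frequency IDF, and the helper names are assumptions for illustration.

```python
import math
from collections import Counter

def tf(sentence_tokens):
    """Term frequency of each word within one sentence."""
    counts = Counter(sentence_tokens)
    total = len(sentence_tokens)
    return {w: c / total for w, c in counts.items()}

def idf(word, documents):
    """IDF_w = log(N / N_w), with N_w taken as the number of documents containing w."""
    n_w = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / n_w) if n_w else 0.0

def cos_sim(p_tokens, q_tokens, documents):
    tf_p, tf_q = tf(p_tokens), tf(q_tokens)
    shared = set(tf_p) & set(tf_q)                      # words appearing in both sentences
    num = sum(tf_p[w] * tf_q[w] * idf(w, documents) ** 2 for w in shared)
    norm_p = math.sqrt(sum((tf_p[w] * idf(w, documents)) ** 2 for w in tf_p))
    norm_q = math.sqrt(sum((tf_q[w] * idf(w, documents)) ** 2 for w in tf_q))
    return num / (norm_p * norm_q) if norm_p and norm_q else 0.0

def combined_sim(sem, cos, lam=0.2):
    """Sim_PQ = (1 - lambda) * SemSim_PQ + lambda * CosSim_PQ."""
    return (1 - lam) * sem + lam * cos

docs = [{"explosion", "occur", "lounge", "barrack"}, {"rescue", "team", "arrive"}]
p, q = ["explosion", "occur", "lounge"], ["explosion", "barrack"]
print(combined_sim(sem=0.8, cos=cos_sim(p, q, docs), lam=0.2))
```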
In an implementation, step (8) computes the topic relevance of a sentence from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution. Let D be the topic distribution of the article and P the topic distribution of sentence P; then the normalized topic relevance TR_P of sentence P is computed as:

$$TR_P = \frac{SemSim_{PD}}{\sum_{P \in D} SemSim_{PD}}$$

The beneficial effect is that the topic relevance of sentences is measured more accurately at the semantic level.
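A minimal sketch of the normalized topic relevance TR_P follows: each sentence's JS-based similarity to the document's topic distribution, divided by the sum over all sentences. The sem_sim helper is the similarity sketched earlier, repeated here only so the snippet is self-contained.

```python
import numpy as np

def sem_sim(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    m = (p + q) / 2.0
    js = 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))
    return 1.0 - js

def topic_relevance(sentence_topics, doc_topics):
    """sentence_topics: list of per-sentence topic distributions; doc_topics: P(T|D)."""
    raw = np.array([sem_sim(s, doc_topics) for s in sentence_topics])
    return raw / raw.sum()                      # normalize so the values sum to 1

doc = [0.5, 0.3, 0.2]
sents = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.4, 0.4, 0.2]]
print(topic_relevance(sents, doc))              # sentences closest to the document topics score highest
```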
In an implementation, the position sensitivity in step (9) is computed as:

$$PS_P = \frac{1/pos}{\sum_{i=1}^{len(D)} \frac{1}{i}}$$

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence of the document, then pos = 2) and len(D) is the number of sentences in document D. The beneficial effect is that the position sensitivity of sentences is measured reasonably, giving earlier sentences higher weights.
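A minimal sketch of this position sensitivity is shown below; its output matches the 10-sentence table given later in the embodiment up to rounding (the table values appear to be truncated rather than rounded).

```python
def position_sensitivity(pos, n_sentences):
    """PS_P = (1 / pos) / sum_{i=1}^{len(D)} (1 / i)."""
    harmonic = sum(1.0 / i for i in range(1, n_sentences + 1))
    return (1.0 / pos) / harmonic

print([round(position_sensitivity(p, 10), 3) for p in range(1, 11)])
# ≈ [0.341, 0.171, 0.114, 0.085, 0.068, 0.057, 0.049, 0.043, 0.038, 0.034]
```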
In an implementation, in step (10) each sentence is given an initial weight according to the topic relevance from step (8) and the position sensitivity from step (9), and the text graph generated in (7) is ranked with the Biased-PageRank algorithm; the ranking iteration formula is:

$$R(P) = d \cdot \frac{TR_P + PS_P}{2} + (1-d)\sum_{Q \in adj(P)} \frac{Sim_{PQ}}{\sum_{Z \in adj(Q)} Sim_{ZQ}}\, R(Q)$$

where R(P) is the weight of node P, d is the damping ratio, generally taken as 0.85, N is the total number of nodes in the graph, and Q ∈ adj(P) denotes all nodes Q connected with P. The beneficial effect is that topic relevance and position sensitivity are merged into the ranking, making the ranking result more reasonable and effective.
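A minimal sketch of this biased ranking iteration follows: each sentence's weight is pulled toward its static prior (TR_P + PS_P)/2 and redistributed along similarity-weighted edges until convergence. The dense similarity matrix, the damping factor and the convergence tolerance are assumptions mirroring the embodiment's settings.

```python
import numpy as np

def biased_pagerank(sim, tr, ps, d=0.85, tol=1e-4, max_iter=200):
    """sim: (n, n) symmetric sentence-similarity matrix with zero diagonal.
    tr, ps: length-n arrays of topic relevance and position sensitivity."""
    n = sim.shape[0]
    prior = (np.asarray(tr) + np.asarray(ps)) / 2.0
    # Column-normalize so Sim_PQ / sum_Z Sim_ZQ becomes a transition weight.
    col_sums = sim.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    trans = sim / col_sums
    r = np.ones(n)
    for _ in range(max_iter):
        r_new = d * prior + (1 - d) * (trans @ r)
        if np.abs(r_new - r).sum() < tol:        # stop once the weights stabilize
            return r_new
        r = r_new
    return r

sim = np.array([[0.0, 0.6, 0.3], [0.6, 0.0, 0.4], [0.3, 0.4, 0.0]])
print(biased_pagerank(sim, tr=[0.5, 0.3, 0.2], ps=[0.55, 0.27, 0.18]))
```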
Brief description of the drawings
Fig. 1 is a structural block diagram of the present invention.
Embodiment
To make the objects, technical schemes and advantages of the present invention clearer, the automatic abstracting method implemented according to the invention is further described below. It should be understood that the specific embodiment described here is only intended to explain the present invention and not to limit it; the protection scope of the invention is not restricted to the following embodiment. On the contrary, according to the inventive concept, those of ordinary skill in the art can make appropriate changes, and such changes fall within the scope of the invention defined by the claims.
The automatic abstracting method according to the specific embodiment of the invention comprises the following steps:
1) Document preprocessing module:
Select a specific document set; the document set should not be too small, at least several hundred documents, otherwise the LDA topic model training result will be seriously affected. This example uses the DUC02 conference document set from the automatic summarization field, consisting of 586 news articles covering 60 topics, with about 10 articles per topic on average. First, stop words are removed from the documents according to a stop-word list and punctuation marks are stripped, then stemming is applied to extract word stems. After these natural language processing steps, the cleaned document set D is obtained. For example, the sentence "The explosion occurred at 8:26 a.m. in a lounge in the barracks." becomes "explosion occur lounge barrack" after preprocessing: the stop words are removed and the word tenses are normalized.
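A minimal preprocessing sketch in the spirit of this step is given below: lowercase, strip punctuation and digits, drop stop words, and stem. The tiny inline stop-word list and the use of NLTK's PorterStemmer are assumptions; the patent specifies only a stop list plus stemming, not a particular library, and Porter stems more aggressively than the lemmatized forms shown in the example above.

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "at", "in", "of", "on", "and", "to", "is", "was"}
stemmer = PorterStemmer()

def preprocess(sentence):
    tokens = re.findall(r"[a-z]+", sentence.lower())        # strip digits and punctuation
    return [stemmer.stem(t) for t in tokens
            if t not in STOP_WORDS and len(t) > 1]           # drop stop words and stray single letters

print(preprocess("The explosion occurred at 8:26 a.m. in a lounge in the barracks."))
# ['explos', 'occur', 'loung', 'barrack']
```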
2) Data preparation module:
Document vectorization: starting from the cleaned corpus D, first perform tokenization: count all words occurring in document set D and give each word a unique number from 0 to N (the total number of distinct words); then convert each document into a vector according to the numbers of the words it contains, each dimension of the vector being the number of a word occurring in that document. This step yields the vectorized document set, denoted T_D (Tokenized Documents); T_D is the set of vectors of the 586 documents, each dimension of T_D corresponding to one document vector. For example, the vector (2, 3, 56, 78, ..., 120) represents a document, where each number is the id of a word appearing in that document.

Sentence vectorization: each sentence in each document is converted into an N-dimensional vector, where N is the total number of words. For each word occurring in the sentence, the value of its corresponding dimension in the sentence vector is the TF*IDF value of that word (TF*IDF = term frequency * inverse document frequency). Let TF_{w,P} be the term frequency of word w in sentence P, computed as:

$$TF_{w,P} = \frac{N_{w,P}}{N_P}$$

N_{w,P} is the number of times word w occurs in sentence P, and N_P is the total number of words in sentence P. IDF_w is the inverse document frequency of word w, computed as:

$$IDF_w = \log\frac{N}{N_w}$$

where N is the total number of documents in the corpus and N_w is the number of occurrences of word w in the corpus. The document set after sentence vectorization is denoted S_T_D (Sentence-Tokenized Documents). Each dimension of S_T_D corresponds to one document, and each dimension of a document is the TF*IDF vector of one of its sentences. For example, the sentence "explosion occur lounge barrack" is vectorized as (4.324, 0, 0, 0.836, 0, 0, ..., 2.563, 0, 0, ..., 3.125, ...); "explosion" is a relatively core word, and "lounge" and "barrack" are also fairly important, so their TF*IDF values are large, whereas "occur", being a common word, has a smaller TF*IDF value.
3) Computing module:
This module comprises sentence similarity measurement, topic relevance calculation and position sensitivity calculation.
T_D is used as the document set and the LDA topic model is trained with Gibbs sampling. For 586 documents, the number of topics of the LDA topic model is set to 20, the prior α is set to 2.5 (50 / number of topics) and β is set to 0.01; this is a common setting under which Gibbs sampling converges faster during training. After training, the topic-word distribution of each topic and the document-topic distribution of each document are obtained. The topic-word distribution is denoted P(W|T), where P(W_i|T_j) is the probability that word W_i represents topic T_j. The document-topic distribution is denoted P(T|D), where P(T_j|D_k) is the probability that document D_k belongs to topic T_j. The dimension of W is the number of words in the corpus, the dimension of T is the number of topics set for the LDA model, and the dimension of D is the number of documents in the corpus.
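For illustration, a minimal training sketch with gensim is shown below. Note the substitution: gensim's LdaModel trains with online variational Bayes rather than the Gibbs sampling named in the patent, but it yields the same two outputs needed downstream, the topic-word distribution P(W|T) and the document-topic distribution P(T|D). The hyperparameters mirror the embodiment (20 topics, α = 50/K = 2.5, β = 0.01), and the toy three-document corpus is purely illustrative.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["explosion", "occur", "lounge", "barrack"],
        ["rescue", "team", "arrive", "barrack"],
        ["explosion", "injure", "soldier"]]
dictionary = corpora.Dictionary(docs)                    # word <-> id mapping (step 2)
corpus = [dictionary.doc2bow(d) for d in docs]           # bag-of-words vectors (step 3)

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
               alpha=2.5, eta=0.01, passes=10, random_state=0)

p_w_given_t = lda.get_topics()                           # shape (K, V): P(W|T)
p_t_given_d = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
print(p_w_given_t.shape, p_t_given_d[0][:3])
```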
For one document, the required topic distribution of its sentences is P(T|S); for sentence S_r in document D_k, the probability that it belongs to topic T_j is computed as:

$$P(T_j \mid S_r) = \sum_{W_i \in S_r} P(W_i \mid T_j) \cdot P(T_j \mid D_k) \cdot \sum_{k=1}^{M} P(T_j \mid D_k)$$

This yields the topic distribution P(T|S) of each sentence; for example, P(T|S_r) is the topic probability distribution of sentence S_r. For convenience, the topic distributions of sentences P and Q are denoted P and Q. For these topic distributions, the semantic similarity is computed as:

$$SemSim_{PQ} = 1 - \tfrac{1}{2}\big(KL(P \,\|\, M) + KL(Q \,\|\, M)\big)$$

where M = (P + Q)/2 and KL(P‖M) is the KL divergence between distributions P and M, computed as:

$$KL(P \,\|\, M) = \sum_i P(i)\,\ln\frac{P(i)}{M(i)}$$
According to the sentence-vectorized document set S_T_D, the cosine similarity of sentences P and Q is computed as:

$$CosSim_{PQ} = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$

where A and B are the TF*IDF vectors of sentences P and Q.

For sentences P and Q, the final similarity measurement is:

$$Sim_{PQ} = (1-\lambda)\cdot SemSim_{PQ} + \lambda \cdot CosSim_{PQ}$$

With this formula, the similarity between any two sentences in each document is obtained. λ = 0.2 is used when computing the similarity between sentences; the following table shows the sentence similarities in one document:
Sentence   S1      S2      S3      S4      S5      ……
S1         1
S2         0.653   1
S3         0.436   0.554   1
S4         0.362   0.432   0.235   1
S5         0.375   0.343   0.482   0.275   1
……         ……      ……      ……      ……      ……      ……
The computation of sentence topic relevance is exactly the same as that of the semantic similarity between sentences, except that the two sentence distributions are replaced by the distribution of a sentence and the distribution of the document. Let D be the topic distribution of the article and P the topic distribution of sentence P; then the normalized topic relevance TR_P of sentence P is computed as:

$$TR_P = \frac{SemSim_{PD}}{\sum_{P \in D} SemSim_{PD}}$$

With this formula, the topic relevance of every sentence in each document is computed. The following table shows the topic relevance of the sentences in one document:
Sentence          S1      S2      S3      S4      S5      ……
Topic relevance   0.436   0.343   0.276   0.145   0.193   ……
It can be seen that earlier sentences also have higher topic relevance.
The position sensitivity is computed as:

$$PS_P = \frac{1/pos}{\sum_{i=1}^{len(D)} \frac{1}{i}}$$

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence of the document then pos = 2) and len(D) is the number of sentences in D. With this formula, the position sensitivity of every sentence in each document is computed. For example, for a document consisting of 10 sentences, the position sensitivities are:
Sentence               S1     S2     S3     S4     S5     S6     S7     S8     S9     S10
Position sensitivity   0.341  0.170  0.113  0.085  0.068  0.056  0.048  0.042  0.037  0.034
4) Sentence ranking module:
The text graph is an abstract graph and does not need to be actually constructed; for the iterative sentence ranking, the sentence similarities, topic relevances and position sensitivities computed in 3) are sufficient.
For a document, its weight vector is denoted R(P); each dimension of R(P) is the weight of one sentence and is initially set to 1, i.e. R(P) = (1, 1, ..., 1).
The weights are then updated iteratively according to the formula:

$$R(P) = d \cdot \frac{TR_P + PS_P}{2} + (1-d)\sum_{Q \in adj(P)} \frac{Sim_{PQ}}{\sum_{Z \in adj(Q)} Sim_{ZQ}}\, R(Q)$$

Each dimension of R(P), i.e. the weight of each sentence, is updated in this way; one update of R(P) counts as one iteration, and the iteration continues until the values of R(P) are stable and no longer change, which usually happens within a few dozen iterations, giving the final weight of each sentence. Here d = 0.85 is used for ranking and the convergence threshold is set to 0.0001, i.e. the iteration is considered converged once the weights in R(P) change by no more than the threshold between iterations. For example, the converged R(P) of one article, kept to 4 decimal places, is:
[0.1823, 0.1590, 0.0901, 0.0835, 0.0800, 0.0830, 0.1685, 0.0532, 0.0712, 0.0823, 0.0810, 0.0855]
It can be seen that the sentences with the highest weights are sentences 1, 2 and 7.
5) Abstract generation:
According to the ranking result of 4), the sentences with the highest weights are selected within the required word-count limit and combined in their original order to generate the summary.
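A minimal sketch of this selection rule follows: take sentences in descending weight order until a word budget is reached, then emit them in their original document order. The specific word-budget parameter and the whitespace word counting are assumptions; the patent only states that the summary length is limited by a word count.

```python
def build_summary(sentences, weights, max_words=100):
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n > max_words:                 # skip sentences that would exceed the budget
            continue
        chosen.append(i)
        used += n
    return " ".join(sentences[i] for i in sorted(chosen))   # restore original document order

weights = [0.1823, 0.1590, 0.0901, 0.0835, 0.0800, 0.0830,
           0.1685, 0.0532, 0.0712, 0.0823, 0.0810, 0.0855]
sentences = [f"Sentence {i+1} of the article." for i in range(len(weights))]
print(build_summary(sentences, weights, max_words=20))
```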

Claims (6)

1. A graph model-based automatic abstracting method, characterized in that: the method first obtains the topic probability distribution of each document and the word probability distribution of each topic by training an LDA topic model, then derives the topic probability distribution of each sentence, so that the semantic similarity measurement between sentences is effectively converted into a similarity measurement over sentence topic probability distributions; next, with sentences as nodes, edges are built according to the semantic similarity between sentences combined with the cosine similarity, generating a text graph that represents the document; then the topic relevance of each sentence is computed from the topic probability distributions of the sentence and the document, the position sensitivity of each sentence is computed from its position in the document, and nodes are given static weights according to these two attributes, after which the Biased-PageRank algorithm is used to rank the sentences; finally, high-weight sentences are selected as required and combined in their original order to obtain the abstract.
2. The graph model-based automatic abstracting method according to claim 1, comprising the following steps:
(1) Document preprocessing: remove useless information from the corpus; given a group of document collections, apply word segmentation, stop-word removal and stemming to remove the useless information in the corpus and obtain a cleaned corpus;
(2) Document vectorization, to prepare for LDA topic model training: number all the words in the cleaned corpus from (1), and convert every document into a corresponding vector according to these numbers;
(3) Word frequency statistics: based on the statistics of word occurrence frequencies in the documents, generate a document-term frequency matrix in which each entry records how often each word occurs in each document of the corpus;
(4) Sentence vectorization: convert each sentence in each document into a corresponding vector according to the frequency matrix in (3), each dimension of the vector being the TF*IDF (term frequency * inverse document frequency) value of the corresponding word;
(5) LDA model training: train the LDA topic model on the vectorized documents from (2) with the Gibbs sampling algorithm, estimating the topic probability distribution of each document and the word probability distribution of each topic;
(6) Similarity calculation between sentences: use the LDA training result from (5) to compute the topic probability distribution of each sentence, then compute a quantized value of the semantic similarity between sentences from the JS distance between their topic probability distributions; in addition, compute the cosine similarity between sentences from their word-frequency vectors as a complement to the semantic similarity;
(7) Construction of the text graph: with sentences as nodes, generate weighted edges according to the similarities obtained in (6), representing the document as a text graph;
(8) Topic relevance calculation: compute the topic relevance of each sentence from the Jensen-Shannon distance (JS distance) between the sentence's topic probability distribution and the document's topic probability distribution;
(9) Position sensitivity calculation: compute the position sensitivity of each sentence from its position in the document;
(10) Sentence ranking: give each sentence an initial weight according to the topic relevance from (8) and the position sensitivity from (9), and rank the text graph generated in (7) with the Biased-PageRank algorithm;
(11) Abstract generation: according to the ranking result of (10), select the sentences with higher weights and combine them to generate the abstract.
3. The method according to claim 2, characterized in that the topic probability distribution of a sentence in step (6) is computed by the formula:

$$P(T_j \mid S_r) = \sum_{W_i \in S_r} P(W_i \mid T_j) \cdot P(T_j \mid D_k) \cdot \sum_{k=1}^{M} P(T_j \mid D_k)$$

where P(T_j|S_r) is the probability that sentence S_r in document D_k belongs to topic T_j; P(W_i|T_j) is the probability that word W_i represents topic T_j, computed from the topic-word distribution P(W|T) trained by the LDA topic model; and P(T_j|D_k) is the probability that document D_k belongs to topic T_j, computed from the document-topic distribution P(T|D) trained by the LDA topic model.
4. The method according to claim 2, characterized in that step (6) computes the quantized value of the semantic similarity between sentences from the JS distance between the sentence topic probability distributions, and step (8) computes the topic relevance of a sentence from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution; for the topic distributions P and Q of sentences P and Q, the formula is:

$$SemSim_{PQ} = 1 - \tfrac{1}{2}\big(KL(P \,\|\, M) + KL(Q \,\|\, M)\big)$$

where KL(P‖M) is the KL divergence between distributions P and M, computed as:

$$KL(P \,\|\, M) = \sum_i P(i)\,\ln\frac{P(i)}{M(i)}$$

Let D be the topic distribution of the article and P the topic distribution of sentence P; then the normalized topic relevance TR_P of sentence P is computed as:

$$TR_P = \frac{SemSim_{PD}}{\sum_{P \in D} SemSim_{PD}}.$$
5. The method according to claim 2, characterized in that the position sensitivity in step (9) is computed as:

$$PS_P = \frac{1/pos}{\sum_{i=1}^{len(D)} \frac{1}{i}}$$

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence of the document then pos = 2) and len(D) is the number of sentences in document D.
6. The method according to claim 2, characterized in that in step (10) each sentence is given an initial weight according to the topic relevance from (8) and the position sensitivity from (9), and the text graph generated in (7) is ranked with the Biased-PageRank algorithm; the ranking iteration formula is:

$$R(P) = d \cdot \frac{TR_P + PS_P}{2} + (1-d)\sum_{Q \in adj(P)} \frac{Sim_{PQ}}{\sum_{Z \in adj(Q)} Sim_{ZQ}}\, R(Q)$$

where R(P) is the weight of node P, d is the damping ratio, generally 0.85, N is the total number of nodes in the graph, and Q ∈ adj(P) denotes all nodes Q connected with node P; Sim_PQ is the similarity of sentences P and Q, obtained by combining their semantic similarity and cosine similarity:

$$Sim_{PQ} = (1-\lambda)\cdot SemSim_{PQ} + \lambda \cdot CosSim_{PQ}$$

λ takes a value between 0 and 1 and is used to adjust the proportions of SemSim and CosSim; CosSim_PQ is computed as:

$$CosSim_{PQ} = \frac{\sum_{w \in P,Q} TF_{w,P}\cdot TF_{w,Q}\cdot (IDF_w)^2}{\sqrt{\sum_{w \in P}\big(TF_{w,P}\cdot IDF_w\big)^2}\,\cdot\,\sqrt{\sum_{w \in Q}\big(TF_{w,Q}\cdot IDF_w\big)^2}}$$

where TF_{w,P} is the term frequency of word w in sentence P, computed as:

$$TF_{w,P} = \frac{N_{w,P}}{N_P}$$

N_{w,P} is the number of times word w occurs in sentence P and N_P is the total number of words in sentence P;

IDF_w is the inverse document frequency of word w, computed as:

$$IDF_w = \log\frac{N}{N_w}$$

where N is the total number of documents in the corpus and N_w is the number of occurrences of word w.
CN201510703353.2A 2015-10-26 2015-10-26 Graph model-based automatic abstracting method Active CN105243152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510703353.2A CN105243152B (en) 2015-10-26 2015-10-26 Graph model-based automatic abstracting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510703353.2A CN105243152B (en) 2015-10-26 2015-10-26 Graph model-based automatic abstracting method

Publications (2)

Publication Number Publication Date
CN105243152A true CN105243152A (en) 2016-01-13
CN105243152B CN105243152B (en) 2018-08-24

Family

ID=55040800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510703353.2A Active CN105243152B (en) 2015-10-26 2015-10-26 Graph model-based automatic abstracting method

Country Status (1)

Country Link
CN (1) CN105243152B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020078090A1 (en) * 2000-06-30 2002-06-20 Hwang Chung Hee Ontological concept-based, user-centric text summarization
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GUNES ERKAN et al.: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research *
边晋强: "Research on Document Summarization Based on the LDA Topic Model" (基于LDA主题模型的文档文摘研究), China Master's Theses Full-text Database, Information Science and Technology *
邓光喜: "Research on Topic-Oriented Automatic Summarization of Web Documents" (面向主题的Web文档自动文摘生成方法研究), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740354A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Adaptive potential Dirichlet model selection method and apparatus
CN105488033A (en) * 2016-01-26 2016-04-13 中国人民解放军国防科学技术大学 Preprocessing method and device for correlation calculation
CN105488033B (en) * 2016-01-26 2018-01-02 中国人民解放军国防科学技术大学 Associate the preprocess method and device calculated
CN105740354B (en) * 2016-01-26 2018-11-30 中国人民解放军国防科学技术大学 The method and device of adaptive potential Di Li Cray model selection
CN105868178B (en) * 2016-03-28 2018-07-17 浙江大学 A kind of multi-document auto-abstracting generation method of phrase-based theme modeling
CN105868178A (en) * 2016-03-28 2016-08-17 浙江大学 Multi-document automatic abstract generation method based on phrase subject modeling
CN105938481A (en) * 2016-04-07 2016-09-14 北京航空航天大学 Anomaly detection method of multi-mode text data in cities
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary
CN106227722B (en) * 2016-09-12 2019-07-05 中山大学 A kind of extraction method based on listed company's bulletin abstract
CN106708803A (en) * 2016-12-21 2017-05-24 东软集团股份有限公司 Feature extraction method and device
US10621219B2 (en) 2017-02-10 2020-04-14 International Business Machines Corporation Techniques for determining a semantic distance between subjects
US10599694B2 (en) 2017-02-10 2020-03-24 International Business Machines Corporation Determining a semantic distance between subjects
US11295219B2 (en) 2017-02-10 2022-04-05 International Business Machines Corporation Answering questions based on semantic distances between subjects
CN106997344A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 Keyword abstraction system
CN107291836A (en) * 2017-05-31 2017-10-24 北京大学 A kind of Chinese text summary acquisition methods based on semantic relevancy model
CN107291836B (en) * 2017-05-31 2020-06-02 北京大学 Chinese text abstract obtaining method based on semantic relevancy model
CN108182247A (en) * 2017-12-28 2018-06-19 东软集团股份有限公司 Text summarization method and apparatus
CN110609997A (en) * 2018-06-15 2019-12-24 北京百度网讯科技有限公司 Method and device for generating abstract of text
CN110609997B (en) * 2018-06-15 2023-05-23 北京百度网讯科技有限公司 Method and device for generating abstract of text
CN108985370B (en) * 2018-07-10 2021-04-16 中国人民解放军国防科技大学 Automatic generation method of image annotation sentences
CN108985370A (en) * 2018-07-10 2018-12-11 中国人民解放军国防科技大学 Automatic generation method of image annotation sentences
CN109213853A (en) * 2018-08-16 2019-01-15 昆明理工大学 A kind of Chinese community's question and answer cross-module state search method based on CCA algorithm
CN109213853B (en) * 2018-08-16 2022-04-12 昆明理工大学 CCA algorithm-based Chinese community question-answer cross-modal retrieval method
CN109284357B (en) * 2018-08-29 2022-07-19 腾讯科技(深圳)有限公司 Man-machine conversation method, device, electronic equipment and computer readable medium
CN109284357A (en) * 2018-08-29 2019-01-29 腾讯科技(深圳)有限公司 Interactive method, device, electronic equipment and computer-readable medium
US11775760B2 (en) 2018-08-29 2023-10-03 Tencent Technology (Shenzhen) Company Limited Man-machine conversation method, electronic device, and computer-readable medium
CN109344280B (en) * 2018-10-13 2021-09-17 中山大学 Method and system for retrieving flow chart based on graph model
CN109344280A (en) * 2018-10-13 2019-02-15 中山大学 A kind of flow chart search method and system based on graph model
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN110399606A (en) * 2018-12-06 2019-11-01 国网信息通信产业集团有限公司 A kind of unsupervised electric power document subject matter generation method and system
WO2020258948A1 (en) * 2019-06-24 2020-12-30 北京大米科技有限公司 Text generation method and apparatus, storage medium, and electronic device
CN110728144B (en) * 2019-10-06 2023-04-07 湖北工业大学 Extraction type document automatic summarization method based on context semantic perception
CN110728144A (en) * 2019-10-06 2020-01-24 湖北工业大学 Extraction type document automatic summarization method based on context semantic perception
CN111177365B (en) * 2019-12-20 2022-08-02 山东科技大学 Unsupervised automatic abstract extraction method based on graph model
CN111177365A (en) * 2019-12-20 2020-05-19 山东科技大学 Unsupervised automatic abstract extraction method based on graph model
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V
CN111339287A (en) * 2020-02-24 2020-06-26 成都网安科技发展有限公司 Abstract generation method and device
CN111339287B (en) * 2020-02-24 2023-04-21 成都网安科技发展有限公司 Abstract generation method and device
CN112560479A (en) * 2020-12-24 2021-03-26 北京百度网讯科技有限公司 Abstract extraction model training method, abstract extraction device and electronic equipment
CN112560479B (en) * 2020-12-24 2024-01-12 北京百度网讯科技有限公司 Abstract extraction model training method, abstract extraction device and electronic equipment
CN112765344A (en) * 2021-01-12 2021-05-07 哈尔滨工业大学 Method, device and storage medium for generating meeting abstract based on meeting record
CN114064885A (en) * 2021-11-25 2022-02-18 北京航空航天大学 Unsupervised Chinese multi-document extraction type abstract method
CN114064885B (en) * 2021-11-25 2024-05-31 北京航空航天大学 Unsupervised Chinese multi-document extraction type abstract method
CN116108831A (en) * 2023-04-11 2023-05-12 宁波深擎信息科技有限公司 Method, device, equipment and medium for extracting text abstract based on field words

Also Published As

Publication number Publication date
CN105243152B (en) 2018-08-24

Similar Documents

Publication Publication Date Title
CN105243152A (en) Graph model-based automatic abstracting method
CN100517330C (en) Word sense based local file searching method
CN103235772B (en) A kind of text set character relation extraction method
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
CN103399901B (en) A kind of keyword abstraction method
CN100353361C (en) New method of characteristic vector weighting for text classification and its device
CN106844424A (en) A kind of file classification method based on LDA
CN106599029A (en) Chinese short text clustering method
CN105868178A (en) Multi-document automatic abstract generation method based on phrase subject modeling
CN106445920A (en) Sentence similarity calculation method based on sentence meaning structure characteristics
CN104268197A (en) Industry comment data fine grain sentiment analysis method
CN106294863A (en) A kind of abstract method for mass text fast understanding
CN101630312A (en) Clustering method for question sentences in question-and-answer platform and system thereof
CN104615593A (en) Method and device for automatic detection of microblog hot topics
CN106970910A (en) A kind of keyword extracting method and device based on graph model
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
CN103631858A (en) Science and technology project similarity calculation method
CN101782898A (en) Method for analyzing tendentiousness of affective words
CN101231634A (en) Autoabstract method for multi-document
CN104484380A (en) Personalized search method and personalized search device
Sharma et al. Proposed stemming algorithm for Hindi information retrieval
CN101295294A (en) Improved Bayes acceptation disambiguation method based on information gain
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant