CN105243152A - Graph model-based automatic abstracting method - Google Patents
Graph model-based automatic abstracting method
- Publication number
- CN105243152A (application CN201510703353.2A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- document
- theme
- word
- probability distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Abstract
The invention relates to the field of automatic summarization and discloses a graph model-based automatic summarization method. In the disclosed scheme, an LDA probabilistic topic model is applied to measure the semantic relevance between sentences, improving the quality of sentence-similarity measurement, and the notions of sentence topic relevance and position sensitivity are introduced so that summary generation is more reasonable and effective. The method comprises the following steps: first, the topic probability distribution of each document and the word probability distribution of each topic are obtained by training an LDA topic model; the topic probability distribution of each sentence is then derived, which effectively converts semantic similarity measurement between sentences into a similarity measurement between their topic probability distributions. With sentences as nodes, edges are built from the semantic similarity between sentences combined with cosine similarity, yielding a text graph that represents the document. The topic relevance of each sentence is computed from the topic probability distributions of the sentence and the document, and the position sensitivity of each sentence is computed from its position in the document.
Description
Technical field
The present invention relates to the field of automatic summarization, and in particular to an automatic summarization method based on a graph model.
Background technology
Automatic summarization uses a computer to process documents automatically and generate a summary containing the core content of the original document. By compressing documents, it lets people find and obtain the information they need in less time, effectively alleviating the problem of information overload.
Although automatic summarization has a long research history dating back to 1960, extractive summarization, which directly extracts key sentences from the original text to form a summary, remains the mainstream approach in this field. The core idea of extractive summarization is to first perform statistical analysis on various features of the sentences in one or more documents, compute the importance of each sentence, and then select summary sentences by a suitable extraction method to form the summary. The techniques currently applied to extractive summarization fall into five categories: statistics-based techniques, topic-based techniques, discourse-relation-based techniques, machine learning techniques, and graph-model-based techniques.
Statistical models are widely used in natural language processing, and statistical techniques were the earliest applied to automatic summarization; compared with other techniques, they require no complex modeling and are simple to implement. The importance of a text unit, however, is determined not only by literal word repetition but also by the semantic associations behind the words; topic-based methods mine these semantic associations and incorporate topical knowledge into the weighting of text units. Beyond these methods, automatic summarization can also be approached from a linguistic angle: discourse relation analysis is likewise widely used in this field. Binary classifiers, hidden Markov models, and Bayesian methods were the earliest machine learning methods applied to automatic summarization, and many other machine learning methods have since been applied extensively.
In recent years, automatic summarization techniques based on graph ranking algorithms have received increasing attention, because such methods can rank text units using the global information of the document rather than relying only on limited local information, which resembles how humans summarize. These methods take text units as the nodes of a graph, generate edges between nodes according to the correlation between text units, represent the document as a text graph, and then rank the text units with a graph ranking algorithm such as PageRank. In this approach, similar to web page ranking, text units closely connected to other text units obtain higher weights.
In graph ranking algorithms, the accuracy of the correlation measure between nodes directly affects the ranking result, so in graph-based summarization the correlation measurement between text units is the core task. In much previous research, sentences are the most common choice of graph node, and the correlation measure between sentences is mostly confined to the word level, for example word co-occurrence between sentences, cosine similarity of sentence vectors, or word relatedness measured with WordNet; but word-level measures struggle to accurately capture the semantic relevance between sentences. On the other hand, graph-based summarization methods that consider only the correlation between text units ignore an important indicator, namely the correlation between a text unit and the document's topic, which can make the ranking fall into local optima. For example, an article may contain a long passage of non-core, off-topic content whose sentences are nevertheless closely correlated with one another; after graph ranking, sentences in this passage may obtain high weights, yet such locally optimal sentences represent only that passage, not the whole document. In addition, graph-ranking methods also ignore intrinsic attributes of text units such as sentence length and sentence position. In many articles, especially news articles, the opening passage usually states the gist; ignoring sentence position undoubtedly degrades the ranking of sentence weights.
Summary of the invention
The object of the invention is to overcome the prior-art difficulties of accurately measuring the semantic relevance between sentences and of ignoring intrinsic attributes of text units, and to provide an improved automatic summarization method based on a graph model.
To achieve the above object, the present invention proposes a two-layer similarity measurement model that combines an LDA topic model with cosine similarity, measuring the correlation between sentences at both the semantic and the word level. It further defines the topic relevance and position sensitivity of a sentence; in graph ranking, sentences are given initial weights from their topic relevance and position sensitivity, and a Biased-PageRank algorithm is used to rank them, optimizing the ranking result.
The present invention is achieved through the following technical solutions:
An automatic summarization method based on a graph model: the method first obtains the topic probability distribution of each document and the word probability distribution of each topic by training an LDA topic model, then derives the topic probability distribution of each sentence, effectively converting semantic similarity measurement between sentences into a similarity measurement between their topic probability distributions. Next, with sentences as nodes, edges are built from the semantic similarity between sentences combined with cosine similarity, generating a text graph that represents the document. The topic relevance of each sentence is then computed from the topic probability distributions of the sentence and the document, the position sensitivity of each sentence is computed from its position in the document, and the nodes are given static weights according to these two attributes; the sentences are then ranked with a Biased-PageRank algorithm. Finally, the highest-weighted sentences are selected as required and combined in original text order to obtain the summary.
An automatic summarization method based on a graph model specifically comprises the following steps:
(1) Document preprocessing: remove useless material from the corpus. Given a group of document collections, apply word segmentation, stop-word removal, and stemming to remove useless material and obtain a cleaned corpus.
(2) Document vectorization, in preparation for LDA topic model training: number all the words in the cleaned corpus of (1) and convert each document into a corresponding vector according to these numbers.
(3) Word frequency statistics: based on the frequency with which words occur in the documents, generate a document-term frequency matrix; each entry of the matrix records how often a word occurs in each document of the corpus.
(4) Sentence vectorization: convert each sentence in each document into a corresponding vector according to the frequency matrix of (3); each dimension of the vector is the TF*IDF (term frequency * inverse document frequency) value of the corresponding word.
(5) LDA model training: train the LDA topic model on the vectorized documents of (2) with the Gibbs sampling algorithm, estimating the topic probability distribution of each document and the word probability distribution of each topic.
(6) Similarity computation between sentences: use the LDA training result of (5) to compute the topic probability distribution of each sentence, then compute a quantified semantic similarity between sentences from the Jensen-Shannon (JS) distance of their topic probability distributions; additionally compute the cosine similarity between sentences from their TF*IDF vectors as a complement to the semantic similarity.
(7) Construction of the text graph: with sentences as nodes, generate weighted edges from the similarities obtained in (6), representing the document as a text graph.
(8) Topic relevance computation: compute the topic relevance of each sentence from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution.
(9) Position sensitivity computation: compute the position sensitivity of each sentence from its position in the document.
(10) Sentence ranking: give each sentence an initial weight according to the topic relevance of (8) and the position sensitivity of (9), and rank the text graph generated in (7) with the Biased-PageRank algorithm.
(11) Summary generation: select the higher-weighted sentences according to the ranking result of (10) and combine them to generate the summary.
In an implementation, the topic probability distribution of a sentence in step (6) is computed as:

P(T_j | S_r) = Σ_{W_i ∈ S_r} P(W_i | T_j) · P(T_j | D_k), normalized over topics so that Σ_j P(T_j | S_r) = 1

where P(T_j | S_r) is the probability that sentence S_r in document D_k belongs to topic T_j; P(W_i | T_j) is the probability of word W_i under topic T_j, taken from the topic-word distribution P(W|T) trained by the LDA topic model; and P(T_j | D_k) is the probability that document D_k belongs to topic T_j, taken from the document-topic distribution P(T|D) trained by the LDA topic model. The beneficial effect is that the semantic relevance measurement between sentences is effectively converted into a relevance measurement between their topic probability distributions.
In an implementation, step (6) computes the quantified semantic similarity between sentences from the Jensen-Shannon distance of their topic probability distributions. For the topic distributions P and Q of sentences P and Q:

SemSim_PQ = 1 − JS(P, Q), where JS(P, Q) = (1/2) KL(P || M) + (1/2) KL(Q || M) and M = (P + Q)/2

KL(P || M) is the Kullback-Leibler distance of distributions P and M:

KL(P || M) = Σ_i P(i) · log₂( P(i) / M(i) )

The beneficial effect is a more accurate measurement of the semantic relevance between sentences.
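The JS-based similarity can be sketched directly from the definitions, assuming base-2 logarithms (which bound JS by 1, so 1 − JS lies in [0, 1]):

```python
import math

def kl(p, m):
    """KL divergence KL(P||M) with base-2 logs; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL distance to the midpoint M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sem_sim(p, q):
    """Semantic similarity of two sentence topic distributions: 1 - JS."""
    return 1.0 - js(p, q)
```

Identical distributions give similarity 1, and completely disjoint distributions give 0.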
In an implementation, step (6) also computes the cosine similarity between sentences from their term-frequency vectors as a complement to the semantic similarity. For sentences P and Q, the final similarity measure combining cosine similarity is:

Sim_PQ = (1 − λ) * SemSim_PQ + λ * CosSim_PQ

The beneficial effect is that the similarity between sentences is measured at both the word and the semantic level, so the two measures complement each other and the result is more accurate.

λ takes a value between 0 and 1 and adjusts the proportions of SemSim and CosSim. CosSim_PQ is computed as:

CosSim_PQ = (A · B) / (|A| · |B|)

where A and B are the TF*IDF vectors of sentences P and Q. TF_{w,P}, the term frequency of word w in sentence P, is computed as:

TF_{w,P} = N_{w,P} / N_P

where N_{w,P} is the number of times word w occurs in sentence P and N_P is the total number of words in sentence P. IDF_w, the inverse document frequency of word w, is computed as:

IDF_w = log( N / N_w )

where N is the total number of documents in the corpus and N_w is the number of documents in which word w occurs.
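A sketch of the word-level measures under the definitions above (sentences as token lists, documents as lists of sentences' tokens flattened); the function names are illustrative, not from the patent:

```python
import math

def tf(word, sentence):
    """Term frequency: occurrences of `word` in `sentence` / sentence length."""
    return sentence.count(word) / len(sentence)

def idf(word, documents):
    """Inverse document frequency over the corpus: log(N / N_w),
    with N_w = number of documents containing the word."""
    n_w = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / n_w) if n_w else 0.0

def cos_sim(a, b):
    """Cosine similarity of two TF*IDF vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def combined_sim(sem, cos, lam=0.2):
    """Sim = (1 - lambda) * SemSim + lambda * CosSim; lambda = 0.2 in the embodiment."""
    return (1 - lam) * sem + lam * cos
```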
In an implementation, the topic relevance of a sentence in step (8) is computed from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution. If the topic distribution of the article is D and the topic distribution of sentence P is P, the normalized topic relevance TR_P of sentence P is:

TR_P = (1 − JS(P, D)) / Σ_{S ∈ D} (1 − JS(S, D))

where the sum runs over all sentences S of the document. The beneficial effect is a more accurate, semantic-level measurement of the topic relevance of each sentence.
In an implementation, the position sensitivity in step (9) is computed as:

PS_P = (1 / pos) / Σ_{i=1}^{len(D)} (1 / i)

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence of the document, pos = 2) and len(D) is the number of sentences in document D. The beneficial effect is a reasonable measurement of position sensitivity that gives earlier sentences higher weight.
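Assuming the normalized 1/pos form, the position sensitivities of a 10-sentence document can be computed in a few lines (the embodiment's table values appear truncated to three decimals):

```python
def position_sensitivity(pos, num_sentences):
    """Normalized 1/pos weighting: earlier sentences score higher,
    and the scores over one document sum to 1."""
    harmonic = sum(1.0 / i for i in range(1, num_sentences + 1))
    return (1.0 / pos) / harmonic

# Position sensitivities for a document of 10 sentences.
ps = [position_sensitivity(p, 10) for p in range(1, 11)]
```

The first value comes out near 0.341, matching the example table for a 10-sentence document.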
In an implementation, step (10) gives each sentence an initial weight according to the topic relevance of step (8) and the position sensitivity of step (9), and ranks the text graph generated in (7) with the Biased-PageRank algorithm. The ranking iteration is:

R(P) = (1 − d) · B(P) + d · Σ_{Q ∈ adj(P)} ( Sim_QP / Σ_{O ∈ adj(Q)} Sim_QO ) · R(Q)

where R(P) is the weight of node P; d is the damping ratio, generally 0.85; B(P) is the static weight of node P, obtained by normalizing the combined topic relevance and position sensitivity of the sentence (in the unbiased case B(P) = 1/N, with N the total number of nodes in the graph); and Q ∈ adj(P) ranges over all nodes Q connected to P. The beneficial effect is that merging topic relevance and position sensitivity into the ranking makes the result more reasonable and effective.
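One way to realize this iteration is sketched below, under the assumption that `bias` is the normalized static weight vector derived from topic relevance and position sensitivity; the similarity matrix values are invented for illustration (diagonal zero, i.e. no self-loops):

```python
def biased_pagerank(sim, bias, d=0.85, tol=1e-4, max_iter=200):
    """Biased PageRank on a weighted sentence graph.
    sim[q][p]: similarity edge weight between sentences q and p (0 = no edge);
    bias[p]: static weight of sentence p, assumed normalized to sum to 1."""
    n = len(sim)
    r = [1.0] * n                          # initial weights, all 1 as in the patent
    out = [sum(row) for row in sim]        # total outgoing weight of each node
    for _ in range(max_iter):
        new = []
        for p in range(n):
            rank = sum(sim[q][p] / out[q] * r[q]
                       for q in range(n) if sim[q][p] > 0 and out[q] > 0)
            new.append((1 - d) * bias[p] + d * rank)
        converged = sum(abs(a - b) for a, b in zip(new, r)) <= tol
        r = new
        if converged:
            break
    return r

# Illustrative 3-sentence graph and static weights (made-up numbers).
sim = [[0.0, 0.6, 0.2],
       [0.6, 0.0, 0.4],
       [0.2, 0.4, 0.0]]
bias = [0.5, 0.3, 0.2]
weights = biased_pagerank(sim, bias)
```

Sentence 2, which is weakly connected and has the lowest static weight, ends up ranked last.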
Accompanying drawing explanation
Fig. 1 is a structural block diagram of the present invention.
Embodiment
To make the object, technical scheme, and advantages of the present invention clearer, the automatic summarization method implemented according to the invention is further described below. It should be understood that the specific embodiments described herein only explain the present invention and are not intended to limit it; that is, the protection scope of the present invention is not limited to the following embodiments. On the contrary, in accordance with the inventive concept, those of ordinary skill in the art can make appropriate changes, and these changes fall within the scope defined by the claims.
The automatic summarization method according to this specific embodiment comprises the following steps:
1) Document preprocessing module:
Select a specific document set. The document set should not be too small, at least several hundred documents, otherwise the LDA topic model training result is severely affected. This example uses the DUC 2002 document set from the automatic summarization community, consisting of 586 news articles covering 60 topics, with about 10 articles per topic. First remove stop words (according to a stop-word list) and punctuation from the documents, then apply stemming to extract word stems. After these natural language processing steps, the cleaned document set D is obtained. For example, the sentence "The explosion occurred at 8:26 a.m. in a lounge in the barracks." becomes "explosion occur lounge barrack" after preprocessing: the stop words are removed and the inflected words are reduced to their stems.
2) Data preparation module:
Document vectorization: this embodiment first tokenizes the cleaned corpus D by collecting all words that occur in the document set and giving each a unique number from 0 to N (the total number of distinct words); each document is then converted into a vector of the numbers of the words occurring in it. This step yields the vectorized document set, denoted T_D (Tokenized Documents). T_D is the set of vectors of the 586 documents; each element of T_D is the vector of one document. For example, the vector (2, 3, 56, 78, ..., 120) represents a document, where each number is the id of a word appearing in that document.
Sentence vectorization: convert each sentence in each document into an N-dimensional vector, N being the total number of distinct words. For each word occurring in the sentence, the value of its corresponding dimension in the sentence vector is the TF*IDF (term frequency * inverse document frequency) value of that word. The term frequency TF_{w,P} of word w in sentence P is:

TF_{w,P} = N_{w,P} / N_P

where N_{w,P} is the number of times word w occurs in sentence P and N_P is the total number of words in sentence P. The inverse document frequency IDF_w of word w is:

IDF_w = log( N / N_w )

where N is the total number of documents in the corpus and N_w is the number of documents in which word w occurs. The document set after sentence vectorization is denoted S_T_D (Sentence-Tokenized Documents). Each element of S_T_D corresponds to a document, and each element of that document is the TF*IDF vector of one of its sentences. For example, the sentence "explosion occur lounge barrack" is vectorized as (4.324, 0, 0, 0.836, 0, 0, ..., 2.563, 0, 0, ..., 3.125, ...): "explosion" is a relatively core word, and "lounge" and "barrack" are also important, so their TF*IDF values are large, while "occur", being a common word, has a smaller TF*IDF value.
3) Computation module:
This module comprises sentence similarity measurement, topic relevance computation, and position sensitivity computation.
Use T_D as the document set and train the LDA topic model with Gibbs sampling. For a set of 586 documents, we set the number of topics of the LDA model to 20, the prior α to 2.5 (50 / number of topics), and β to 0.01; these are the usual settings, under which Gibbs sampling converges faster. Training yields the topic-word distribution of each topic and the document-topic distribution of each document. We denote the topic-word distribution by P(W|T), where P(W_i | T_j) is the probability of word W_i under topic T_j, and the document-topic distribution by P(T|D), where P(T_i | D_k) is the probability that document D_k belongs to topic T_i. The dimension of W is the number of distinct words in the corpus, the dimension of T is the number of topics set for the LDA model, and the dimension of D is the number of documents in the corpus.
For one document, the desired topic distribution of its sentences is P(T|S). For sentence S_r in document D_k, the probability that it belongs to topic T_j is computed as:

P(T_j | S_r) = Σ_{W_i ∈ S_r} P(W_i | T_j) · P(T_j | D_k), normalized over topics

This yields the sentence topic distribution P(T|S); for example, P(T|S_r) is the topic probability distribution of sentence S_r. For convenience, we denote the topic distributions of sentences P and Q by P and Q. For topic distributions P and Q, the semantic similarity is measured as:

SemSim_PQ = 1 − JS(P, Q), where JS(P, Q) = (1/2) KL(P || M) + (1/2) KL(Q || M) and M = (P + Q)/2

KL(P || M) is the KL distance of distributions P and M:

KL(P || M) = Σ_i P(i) · log₂( P(i) / M(i) )
From the sentence-vectorized document set S_T_D, the cosine similarity of sentences P and Q is:

CosSim_PQ = (A · B) / (|A| · |B|)

where A and B are the TF*IDF vectors of sentences P and Q.
For sentences P and Q, the final similarity measure is:

Sim_PQ = (1 − λ) * SemSim_PQ + λ * CosSim_PQ

With this formula we obtain the similarity between any two sentences in each document. We take λ = 0.2 when computing the similarities; the following table shows the sentence similarities of one document:
| Sentence | S1 | S2 | S3 | S4 | S5 | … |
|---|---|---|---|---|---|---|
| S1 | 1 | | | | | … |
| S2 | 0.653 | 1 | | | | … |
| S3 | 0.436 | 0.554 | 1 | | | … |
| S4 | 0.362 | 0.432 | 0.235 | 1 | | … |
| S5 | 0.375 | 0.343 | 0.482 | 0.275 | 1 | … |
| … | … | … | … | … | … | … |
The topic relevance of a sentence is computed with exactly the same formula as the semantic similarity between sentences; simply replace the two sentence distributions with the sentence's distribution and the document's distribution. If the topic distribution of the article is D and the topic distribution of sentence P is P, the normalized topic relevance TR_P of sentence P is:

TR_P = (1 − JS(P, D)) / Σ_{S ∈ D} (1 − JS(S, D))
With this formula we compute the topic relevance of every sentence in each document. The following table shows the topic relevance of the sentences of one document:
| Sentence | S1 | S2 | S3 | S4 | S5 | … |
|---|---|---|---|---|---|---|
| Topic relevance | 0.436 | 0.343 | 0.276 | 0.145 | 0.193 | … |
It can be seen that earlier sentences tend to have higher topic relevance.
The position sensitivity is computed as:

PS_P = (1 / pos) / Σ_{i=1}^{len(D)} (1 / i)

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence, pos = 2) and len(D) is the number of sentences in document D. With this formula we compute the position sensitivity of every sentence in each document. For example, for a document consisting of 10 sentences, the following table shows the position sensitivities of its sentences:
| Sentence | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Position sensitivity | 0.341 | 0.170 | 0.113 | 0.085 | 0.068 | 0.056 | 0.048 | 0.042 | 0.037 | 0.034 |
4) Sentence ranking module:
The text graph is abstract and need not actually be built: to rank the sentences iteratively, the sentence similarities, topic relevances, and position sensitivities computed in 3) are sufficient.
For one document, we denote its weight vector by R(P); each dimension of R(P) is the weight of one sentence, initially set to 1, i.e. R(P) = (1, 1, ..., 1).
We then iteratively update each dimension of R(P), i.e. the weight of each sentence, according to the formula:

R(P) = (1 − d) · B(P) + d · Σ_{Q ∈ adj(P)} ( Sim_QP / Σ_{O ∈ adj(Q)} Sim_QO ) · R(Q)

where B(P) is the static weight of sentence P obtained from its topic relevance and position sensitivity. One update of R(P) counts as one iteration; iteration continues until R(P) is numerically stable and no longer changes, which usually happens within a few dozen iterations, yielding the final weight of each sentence. We rank the sentences with d = 0.85 and set the convergence threshold to 0.0001: convergence is declared when the total change of the weights in R(P) between two iterations is at most 0.0001. For example, for one article the converged R(P), kept to 4 decimals, is:
[0.1823, 0.1590, 0.0901, 0.0835, 0.0800, 0.0830, 0.1685, 0.0532, 0.0712, 0.0823, 0.0810, 0.0855]
The highest-weighted sentences are seen to be sentences 1, 2, and 7.
5) Summary generation:
According to the ranking result of 4), select the highest-weighted sentences subject to the required word limit and combine them in original text order to generate the summary.
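The selection step can be sketched as a greedy pick of the highest-weighted sentences within a word budget, emitted in original order; the sentences and weights below are placeholders:

```python
def build_summary(sentences, weights, word_limit):
    """Pick the highest-weighted sentences that fit in the word budget,
    then emit them in their original document order."""
    order = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    chosen, words = [], 0
    for i in order:
        length = len(sentences[i].split())
        if words + length <= word_limit:
            chosen.append(i)
            words += length
    return " ".join(sentences[i] for i in sorted(chosen))

# Toy document of 3 sentences with made-up ranking weights.
sents = ["Sentence one is here.", "Second sentence follows.", "Third one ends it."]
summary = build_summary(sents, [0.5, 0.2, 0.3], word_limit=8)
# → "Sentence one is here. Third one ends it."
```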
Claims (6)
1. An automatic summarization method based on a graph model, characterized in that: the method first obtains the topic probability distribution of each document and the word probability distribution of each topic by training an LDA topic model, then derives the topic probability distribution of each sentence, effectively converting semantic similarity measurement between sentences into a similarity measurement between their topic probability distributions; then, with sentences as nodes, edges are built from the semantic similarity between sentences combined with cosine similarity, generating a text graph that represents the document; next, the topic relevance of each sentence is computed from the topic probability distributions of the sentence and the document, the position sensitivity of each sentence is computed from its position in the document, and the nodes are given static weights according to these two attributes, after which the sentences are ranked with a Biased-PageRank algorithm; finally, the highest-weighted sentences are selected as required and combined in original text order to obtain the summary.
2. The automatic summarization method based on a graph model as claimed in claim 1, comprising the following steps:
(1) document preprocessing: given a group of document collections, apply word segmentation, stop-word removal, and stemming to remove useless material from the corpus and obtain a cleaned corpus;
(2) document vectorization, in preparation for LDA topic model training: number all words in the cleaned corpus of (1) and convert each document into a corresponding vector according to these numbers;
(3) word frequency statistics: based on the frequency with which words occur in the documents, generate a document-term frequency matrix, each entry of which records how often a word occurs in each document of the corpus;
(4) sentence vectorization: convert each sentence in each document into a corresponding vector according to the frequency matrix of (3), each dimension of the vector being the TF*IDF (term frequency * inverse document frequency) value of the corresponding word;
(5) LDA model training: train the LDA topic model on the vectorized documents of (2) with the Gibbs sampling algorithm, estimating the topic probability distribution of each document and the word probability distribution of each topic;
(6) similarity computation between sentences: use the LDA training result of (5) to compute the topic probability distribution of each sentence, then compute a quantified semantic similarity between sentences from the JS distance of their topic probability distributions, and additionally compute the cosine similarity between sentences from their term-frequency vectors as a complement to the semantic similarity;
(7) construction of the text graph: with sentences as nodes, generate weighted edges from the similarities obtained in (6), representing the document as a text graph;
(8) topic relevance computation: compute the topic relevance of each sentence from the Jensen-Shannon distance (JS distance) between the sentence's topic probability distribution and the document's topic probability distribution;
(9) position sensitivity computation: compute the position sensitivity of each sentence from its position in the document;
(10) sentence ranking: give each sentence an initial weight according to the topic relevance of (8) and the position sensitivity of (9), and rank the text graph generated in (7) with the Biased-PageRank algorithm;
(11) summary generation: select the higher-weighted sentences according to the ranking result of (10) and combine them to generate the summary.
3. The method according to claim 2, characterized in that the topic probability distribution of a sentence in said step (6) is computed as:

P(T_j | S_r) = Σ_{W_i ∈ S_r} P(W_i | T_j) · P(T_j | D_k), normalized over topics

where P(T_j | S_r) is the probability that sentence S_r in document D_k belongs to topic T_j; P(W_i | T_j) is the probability of word W_i under topic T_j, taken from the topic-word distribution P(W|T) trained by the LDA topic model; and P(T_j | D_k) is the probability that document D_k belongs to topic T_j, taken from the document-topic distribution P(T|D) trained by the LDA topic model.
4. The method according to claim 2, characterized in that the quantified semantic similarity between sentences in said step (6) is computed from the JS distance of their topic probability distributions, and the topic relevance of a sentence in said step (8) is computed from the JS distance between the sentence's topic probability distribution and the document's topic probability distribution; for the topic distributions P and Q of sentences P and Q:

SemSim_PQ = 1 − JS(P, Q), where JS(P, Q) = (1/2) KL(P || M) + (1/2) KL(Q || M) and M = (P + Q)/2

KL(P || M) is the KL distance of distributions P and M:

KL(P || M) = Σ_i P(i) · log₂( P(i) / M(i) )

If the topic distribution of the article is D and the topic distribution of sentence P is P, the normalized topic relevance TR_P of sentence P is:

TR_P = (1 − JS(P, D)) / Σ_{S ∈ D} (1 − JS(S, D))
5. The method according to claim 2, characterized in that the position sensitivity in said step (9) is computed as:

PS_P = (1 / pos) / Σ_{i=1}^{len(D)} (1 / i)

where pos is the position index of sentence P in document D (for example, if P is the 2nd sentence of the document, pos = 2) and len(D) is the number of sentences in document D.
6. The method according to claim 2, characterized in that in step (10) each sentence is given an initial weight according to the degree of topic correlation from step (8) and the position sensitivity from step (9), and the Biased-PageRank algorithm is used to rank the text graph generated in step (7); the iterative formula of the ranking is:

R(P) = (1 − d)/N + d · Σ_{Q∈adj[P]} (Sim_PQ / Σ_{V∈adj[Q]} Sim_QV) · R(Q)

where R(P) denotes the weight of node P, d is the damping factor (typically 0.85), N is the total number of nodes in the graph, and Q ∈ adj[P] ranges over all nodes Q connected to node P. Sim_PQ is the similarity of sentences P and Q, obtained by combining the semantic similarity of the sentences with their cosine similarity:

Sim_PQ = (1 − λ)·SemSim_PQ + λ·CosSim_PQ

where λ takes a value between 0 and 1 and regulates the proportion of SemSim relative to CosSim. CosSim_PQ is computed as follows:

CosSim_PQ = Σ_w (TF_{w,P}·IDF_w)·(TF_{w,Q}·IDF_w) / ( √(Σ_w (TF_{w,P}·IDF_w)²) · √(Σ_w (TF_{w,Q}·IDF_w)²) )

where TF_{w,P} is the term frequency of word w in sentence P, computed as follows:

TF_{w,P} = N_{w,P} / N_P

with N_{w,P} the number of times word w occurs in sentence P and N_P the total number of words in sentence P; IDF_w is the inverse document frequency of word w, computed as follows:

IDF_w = log(N / N_w)

where N is the total number of documents in the corpus and N_w is the number of documents in which word w occurs.
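The ranking pipeline of this claim can be sketched in Python (an illustrative sketch under stated assumptions: the names `tfidf_cosine`, `combined_sim`, and `biased_pagerank` are chosen here, and using the normalized initial weights as the bias term of the PageRank jump is an assumption, since the original formula image is not reproduced):

```python
import math

def tfidf_cosine(s1, s2, idf):
    """CosSim_PQ: cosine similarity of two tokenized sentences, each word
    weighted by TF (relative frequency in the sentence) times IDF."""
    def weights(tokens):
        n = len(tokens)
        return {w: (tokens.count(w) / n) * idf.get(w, 0.0) for w in set(tokens)}
    a, b = weights(s1), weights(s2)
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_sim(sem_sim, cos_sim, lam=0.5):
    """Sim_PQ = (1 - lambda) * SemSim_PQ + lambda * CosSim_PQ."""
    return (1 - lam) * sem_sim + lam * cos_sim

def biased_pagerank(sim, init_weights, d=0.85, iters=50):
    """Rank sentence nodes of the text graph; sim[i][j] is Sim_PQ between
    sentences i and j, init_weights come from topic correlation and
    position sensitivity."""
    n = len(sim)
    total = sum(init_weights)
    bias = [w / total for w in init_weights]  # normalize bias to sum to 1
    r = bias[:]                               # start from the initial weights
    out = [sum(row) for row in sim]           # total edge weight per node
    for _ in range(iters):
        r = [(1 - d) * bias[p]
             + d * sum(sim[q][p] / out[q] * r[q]
                       for q in range(n) if out[q] > 0)
             for p in range(n)]
    return r
```

On a connected graph with symmetric similarities the scores stay normalized (they sum to 1), and sentences with strong ties to many well-ranked neighbors accumulate weight, which is what makes them candidates for the abstract.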
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510703353.2A CN105243152B (en) | 2015-10-26 | 2015-10-26 | A kind of automatic abstracting method based on graph model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105243152A true CN105243152A (en) | 2016-01-13 |
CN105243152B CN105243152B (en) | 2018-08-24 |
Family
ID=55040800
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510703353.2A Active CN105243152B (en) | 2015-10-26 | 2015-10-26 | A kind of automatic abstracting method based on graph model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105243152B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020078090A1 (en) * | 2000-06-30 | 2002-06-20 | Hwang Chung Hee | Ontological concept-based, user-centric text summarization |
CN101231634A (en) * | 2007-12-29 | 2008-07-30 | 中国科学院计算技术研究所 | Autoabstract method for multi-document |
Non-Patent Citations (3)
Title |
---|
GUNES ERKAN et al.: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", 《JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH》 * |
边晋强 (BIAN, Jinqiang): "Research on Document Summarization Based on the LDA Topic Model", 《China Master's Theses Full-text Database, Information Science and Technology》 * |
邓光喜 (DENG, Guangxi): "Research on Topic-Oriented Automatic Abstract Generation Methods for Web Documents", 《China Master's Theses Full-text Database, Information Science and Technology》 * |
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740354A (en) * | 2016-01-26 | 2016-07-06 | 中国人民解放军国防科学技术大学 | Adaptive potential Dirichlet model selection method and apparatus |
CN105488033A (en) * | 2016-01-26 | 2016-04-13 | 中国人民解放军国防科学技术大学 | Preprocessing method and device for correlation calculation |
CN105488033B (en) * | 2016-01-26 | 2018-01-02 | Preprocessing method and device for correlation calculation |
CN105740354B (en) * | 2016-01-26 | 2018-11-30 | Adaptive latent Dirichlet model selection method and device |
CN105868178B (en) * | 2016-03-28 | 2018-07-17 | Multi-document automatic abstract generation method based on phrase topic modeling |
CN105868178A (en) * | 2016-03-28 | 2016-08-17 | 浙江大学 | Multi-document automatic abstract generation method based on phrase subject modeling |
CN105938481A (en) * | 2016-04-07 | 2016-09-14 | 北京航空航天大学 | Anomaly detection method of multi-mode text data in cities |
CN106294863A (en) * | 2016-08-23 | 2017-01-04 | 电子科技大学 | A kind of abstract method for mass text fast understanding |
CN106227722A (en) * | 2016-09-12 | 2016-12-14 | 中山大学 | A kind of extraction method based on listed company's bulletin summary |
CN106227722B (en) * | 2016-09-12 | 2019-07-05 | 中山大学 | A kind of extraction method based on listed company's bulletin abstract |
CN106708803A (en) * | 2016-12-21 | 2017-05-24 | 东软集团股份有限公司 | Feature extraction method and device |
US10621219B2 (en) | 2017-02-10 | 2020-04-14 | International Business Machines Corporation | Techniques for determining a semantic distance between subjects |
US10599694B2 (en) | 2017-02-10 | 2020-03-24 | International Business Machines Corporation | Determining a semantic distance between subjects |
US11295219B2 (en) | 2017-02-10 | 2022-04-05 | International Business Machines Corporation | Answering questions based on semantic distances between subjects |
CN106997344A (en) * | 2017-03-31 | 2017-08-01 | 成都数联铭品科技有限公司 | Keyword abstraction system |
CN107291836A (en) * | 2017-05-31 | 2017-10-24 | 北京大学 | A kind of Chinese text summary acquisition methods based on semantic relevancy model |
CN107291836B (en) * | 2017-05-31 | 2020-06-02 | 北京大学 | Chinese text abstract obtaining method based on semantic relevancy model |
CN108182247A (en) * | 2017-12-28 | 2018-06-19 | 东软集团股份有限公司 | Text summarization method and apparatus |
CN110609997A (en) * | 2018-06-15 | 2019-12-24 | 北京百度网讯科技有限公司 | Method and device for generating abstract of text |
CN110609997B (en) * | 2018-06-15 | 2023-05-23 | 北京百度网讯科技有限公司 | Method and device for generating abstract of text |
CN108985370B (en) * | 2018-07-10 | 2021-04-16 | 中国人民解放军国防科技大学 | Automatic generation method of image annotation sentences |
CN108985370A (en) * | 2018-07-10 | 2018-12-11 | 中国人民解放军国防科技大学 | Automatic generation method of image annotation sentences |
CN109213853A (en) * | 2018-08-16 | 2019-01-15 | 昆明理工大学 | A kind of Chinese community's question and answer cross-module state search method based on CCA algorithm |
CN109213853B (en) * | 2018-08-16 | 2022-04-12 | 昆明理工大学 | CCA algorithm-based Chinese community question-answer cross-modal retrieval method |
CN109284357B (en) * | 2018-08-29 | 2022-07-19 | 腾讯科技(深圳)有限公司 | Man-machine conversation method, device, electronic equipment and computer readable medium |
CN109284357A (en) * | 2018-08-29 | 2019-01-29 | 腾讯科技(深圳)有限公司 | Interactive method, device, electronic equipment and computer-readable medium |
US11775760B2 (en) | 2018-08-29 | 2023-10-03 | Tencent Technology (Shenzhen) Company Limited | Man-machine conversation method, electronic device, and computer-readable medium |
CN109344280B (en) * | 2018-10-13 | 2021-09-17 | 中山大学 | Method and system for retrieving flow chart based on graph model |
CN109344280A (en) * | 2018-10-13 | 2019-02-15 | 中山大学 | A kind of flow chart search method and system based on graph model |
CN110399606B (en) * | 2018-12-06 | 2023-04-07 | 国网信息通信产业集团有限公司 | Unsupervised electric power document theme generation method and system |
CN110399606A (en) * | 2018-12-06 | 2019-11-01 | 国网信息通信产业集团有限公司 | A kind of unsupervised electric power document subject matter generation method and system |
WO2020258948A1 (en) * | 2019-06-24 | 2020-12-30 | 北京大米科技有限公司 | Text generation method and apparatus, storage medium, and electronic device |
CN110728144B (en) * | 2019-10-06 | 2023-04-07 | 湖北工业大学 | Extraction type document automatic summarization method based on context semantic perception |
CN110728144A (en) * | 2019-10-06 | 2020-01-24 | 湖北工业大学 | Extraction type document automatic summarization method based on context semantic perception |
CN111177365B (en) * | 2019-12-20 | 2022-08-02 | 山东科技大学 | Unsupervised automatic abstract extraction method based on graph model |
CN111177365A (en) * | 2019-12-20 | 2020-05-19 | 山东科技大学 | Unsupervised automatic abstract extraction method based on graph model |
CN111159393A (en) * | 2019-12-30 | 2020-05-15 | 电子科技大学 | Text generation method for abstracting abstract based on LDA and D2V |
CN111159393B (en) * | 2019-12-30 | 2023-10-10 | 电子科技大学 | Text generation method for abstract extraction based on LDA and D2V |
CN111339287A (en) * | 2020-02-24 | 2020-06-26 | 成都网安科技发展有限公司 | Abstract generation method and device |
CN111339287B (en) * | 2020-02-24 | 2023-04-21 | 成都网安科技发展有限公司 | Abstract generation method and device |
CN112560479A (en) * | 2020-12-24 | 2021-03-26 | 北京百度网讯科技有限公司 | Abstract extraction model training method, abstract extraction device and electronic equipment |
CN112560479B (en) * | 2020-12-24 | 2024-01-12 | 北京百度网讯科技有限公司 | Abstract extraction model training method, abstract extraction device and electronic equipment |
CN112765344A (en) * | 2021-01-12 | 2021-05-07 | 哈尔滨工业大学 | Method, device and storage medium for generating meeting abstract based on meeting record |
CN114064885A (en) * | 2021-11-25 | 2022-02-18 | 北京航空航天大学 | Unsupervised Chinese multi-document extraction type abstract method |
CN114064885B (en) * | 2021-11-25 | 2024-05-31 | 北京航空航天大学 | Unsupervised Chinese multi-document extraction type abstract method |
CN116108831A (en) * | 2023-04-11 | 2023-05-12 | 宁波深擎信息科技有限公司 | Method, device, equipment and medium for extracting text abstract based on field words |
Also Published As
Publication number | Publication date |
---|---|
CN105243152B (en) | 2018-08-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105243152A (en) | Graph model-based automatic abstracting method | |
CN100517330C (en) | Word sense based local file searching method | |
CN103235772B (en) | A kind of text set character relation extraction method | |
CN101398814B (en) | Method and system for simultaneously abstracting document summarization and key words | |
CN103399901B (en) | A kind of keyword abstraction method | |
CN100353361C (en) | New method of characteristic vector weighting for text classification and its device | |
CN106844424A (en) | A kind of file classification method based on LDA | |
CN106599029A (en) | Chinese short text clustering method | |
CN105868178A (en) | Multi-document automatic abstract generation method based on phrase subject modeling | |
CN106445920A (en) | Sentence similarity calculation method based on sentence meaning structure characteristics | |
CN104268197A (en) | Industry comment data fine grain sentiment analysis method | |
CN106294863A (en) | A kind of abstract method for mass text fast understanding | |
CN101630312A (en) | Clustering method for question sentences in question-and-answer platform and system thereof | |
CN104615593A (en) | Method and device for automatic detection of microblog hot topics | |
CN106970910A (en) | A kind of keyword extracting method and device based on graph model | |
CN108519971B (en) | Cross-language news topic similarity comparison method based on parallel corpus | |
CN103631858A (en) | Science and technology project similarity calculation method | |
CN101782898A (en) | Method for analyzing tendentiousness of affective words | |
CN101231634A (en) | Autoabstract method for multi-document | |
CN104484380A (en) | Personalized search method and personalized search device | |
Sharma et al. | Proposed stemming algorithm for Hindi information retrieval | |
CN101295294A (en) | Improved Bayes acceptation disambiguation method based on information gain | |
CN101702167A (en) | Method for extracting attribution and comment word with template based on internet | |
CN103646112A (en) | Dependency parsing field self-adaption method based on web search | |
CN110362678A (en) | A kind of method and apparatus automatically extracting Chinese text keyword |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||