CN107526841A - A Web-based Tibetan text summarization generation method - Google Patents

A Web-based Tibetan text summarization generation method

Info

Publication number
CN107526841A
CN107526841A CN201710847326.1A CN201710847326A
Authority
CN
China
Prior art keywords
sentence
article
text
vocabulary
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710847326.1A
Other languages
Chinese (zh)
Inventor
胥桂仙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minzu University of China
Original Assignee
Minzu University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minzu University of China filed Critical Minzu University of China
Priority to CN201710847326.1A priority Critical patent/CN107526841A/en
Publication of CN107526841A publication Critical patent/CN107526841A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention relates to a Web-based Tibetan text summarization generation method comprising the following steps: matching the sentences of the source article against a topic lexicon and computing a weight for each sentence; ranking the sentences by weight and selecting a fixed percentage of the article's sentences as summary sentences; and re-ordering the extracted sentences according to their order in the source text and splicing them together to generate the summary. The invention adopts an extractive approach to automatic summary generation, selecting a certain number of sentences that best represent the topic of the text to compose the summary, which effectively makes Tibetan-language information easier to obtain and improves the efficiency with which people obtain it.

Description

A Web-based Tibetan text summarization generation method
Technical field
The present invention relates to the field of information processing, and in particular to a Web-based Tibetan text summarization generation method.
Background technology
An abstract is a short text whose purpose is to outline the content of a document, describing its important content concisely and accurately without commentary or additional explanation. Automatic summarization technology uses a computer to analyze a text and extract the content that reflects its topic, thereby forming a summary. Summaries can be divided into different categories according to different criteria. By the relationship between the summary content and the source document, summaries fall into two classes: extractive and abstractive. An extractive summary is composed of sentences extracted from the source text, whereas an abstractive summary is generated on the basis of discourse-level semantic understanding, so not all of its content appears in the source text. By the amount of source text, summaries can be divided into single-document and multi-document summaries: single-document summarization extracts a summary from one source text only, while multi-document summarization produces a comprehensive summary of multiple source documents on the same topic. Herein, extractive summarization of single texts is performed.
At present, many researchers have carried out a great deal of research on the extraction of Chinese and English summaries. In terms of research strategy, the study of automatic summarization can be divided into three periods: mechanical summarization, summarization by understanding, and comprehensive summarization. Mechanical summarization is not limited to a particular field and its methods are simple, but the summary quality is not high; summarization by understanding yields high quality but applies only to some small field, and it is difficult to realize. People therefore gradually began to combine multiple methods to extract summaries so that their advantages complement one another, which gave rise to comprehensive summarization.
Research on automatic summarization abroad began in 1958, when H. P. Luhn of IBM in the United States carried out the first automatic abstracting experiment. Luhn used the number of occurrences of a word as its weight, scored the sentences containing these high-frequency words, and extracted the highest-scoring sentences from the text as the summary. P. E. Baxendale's research pointed out the relationship between the position of a sentence within a paragraph and its importance. Later, researchers proposed methods based on the naive Bayes algorithm, decision-tree algorithms, hidden Markov models, and so on, achieving certain results.
Domestic research on automatic summarization started later. With the popularization of computers in China and the demand of the network era for processing information flows, research on automatic summarization gradually developed in the 1990s. Beginning in 1988, Professor Wang Yongcheng of Shanghai Jiao Tong University worked on Chinese automatic abstracting systems, successively developing a pilot Chinese document abstracting system, the Automatic Abstracting System on Chinese Documents (CAES), and, in 1997, the OA Chinese document automatic abstracting system. The OA system employs a human-simulating algorithm and comprehensively considers many factors such as position, indicative expressions, keywords, and the title; it is not limited to a particular field and is a relatively practical system. In addition, Professor Wang Kaizhu of Harbin Institute of Technology combined statistics-based mechanical summarization with meaning-based understanding summarization to develop the HIT-97I English automatic summarization system.
At present there is little research on Tibetan automatic summarization; the main existing work is a researcher's proposal of a Web text summarization method based on sentence extraction, in which the weight of each Web sentence is decomposed mainly into the contributions of Web feature words and Web sentence structure, and a certain number of sentences are then selected as the summary according to the sentence weights.
With the country's strong support for information-technology construction in Tibetan areas, the number of Tibetan-language websites keeps growing, which provides abundant corpus material for research on Tibetan text summarization. On the other hand, the extraction of Tibetan web-page summaries provides useful technology for the informatization of Tibetan areas and facilitates the retrieval of Tibetan-language information: it lets readers quickly judge whether a source text contains content of interest and quickly find the information they actually need, instead of wasting time reading irrelevant documents. This greatly improves the efficiency with which people obtain information and therefore has practical value for social development and economic construction.
Summary of the invention
At present there is little research on Tibetan automatic summarization. To help people obtain Tibetan-language information more conveniently, the present invention proposes an extractive automatic summary generation method that selects a certain number of sentences that best represent the topic of the text to compose the summary, effectively facilitating access to Tibetan information while improving the efficiency with which people obtain it.
To achieve the above object, the invention provides a Web-based Tibetan text summarization generation method comprising the following steps: matching the sentences of the source article against a topic lexicon and computing the weight of each sentence; ranking the sentences by weight and selecting a fixed percentage of the article's sentences as summary sentences; and re-ordering the extracted sentences according to their order in the source text and splicing them together to generate the summary.
Preferably, the topic lexicon is built as follows: counting the word frequencies of the article and adding a certain number of high-frequency words to a candidate topic word list; matching the segmented article against a domain keyword list and adding the matched keywords to the candidate topic word list; extracting some words from the article title and adding them to the candidate topic word list; and finally extracting the topic lexicon from the above three word lists according to a topic-word extraction algorithm.
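The three-source construction of the topic lexicon described above can be sketched as follows. The function name, the `top_n` cutoff, and the use of plain sets are illustrative assumptions; the patent does not specify its topic-word extraction algorithm, so this sketch simply merges the three candidate sources:

```python
from collections import Counter

def build_topic_lexicon(words, domain_keywords, title_words,
                        top_n=20, stopwords=frozenset()):
    """Sketch: merge high-frequency words, matched domain keywords,
    and title words into one candidate topic word list."""
    # 1. a certain number of high-frequency words from the article body
    freq = Counter(w for w in words if w not in stopwords)
    candidates = {w for w, _ in freq.most_common(top_n)}
    # 2. domain keywords that actually occur in the segmented article
    candidates |= set(words) & set(domain_keywords)
    # 3. words extracted from the article title (minus stopwords)
    candidates |= set(title_words) - stopwords
    return candidates
```

A real implementation would then apply the topic-word extraction algorithm mentioned in the text to this candidate list rather than using it directly.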
Preferably, the sentence weight function is designed as W(Sk) = Wp(Sk) + (Σi wki)/lk + Σj wkj + Σm wkm + Wc(Sk), where W(Sk) denotes the weight of sentence Sk, Wp(Sk) the position weight of the sentence, wki the weighting of a high-frequency word for the sentence, lk the length of sentence Sk, wkj the weighting of a keyword for sentence Sk, wkm the weighting of a title word for sentence Sk, and Wc(Sk) the weighting of cue words for sentence Sk.
Preferably, to avoid repetition in the summary, sentence novelty is computed; the sentence similarity formula is Sim(Si,Sj) = (Σk wik·wjk) / √((Σk wik²)(Σk wjk²)), where Sim(Si,Sj) denotes the similarity between sentence Si and sentence Sj.
Preferably, the vocabulary is obtained by word-segmenting the sentences of the source article and removing meaningless stopwords; the stopwords are identified by filtering the high-frequency vocabulary.
Preferably, the step of re-ordering the extracted sentences according to their order in the source text and splicing them to generate the summary comprises: filtering redundant sentences, re-ordering the extracted summary sentences according to their order in the source text, and splicing them together as the summary.
Preferably, the factors influencing the sentence weight computation include one or more of: word frequency, domain keywords, the title, position, and cue words.
The present invention performs extractive Tibetan summarization of a single text, selecting a certain number of sentences that best represent the topic of the text to compose the summary, effectively facilitating access to Tibetan information while improving the efficiency with which people obtain it.
Brief description of the drawings
Fig. 1 is a schematic flowchart of an automatic summary generation method provided by an embodiment of the present invention;
Fig. 2 is a test-sample interface screenshot of an embodiment of the present invention;
Fig. 3 is a summary-extraction interface screenshot of an embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a schematic flowchart of a Web-based Tibetan text summarization generation method provided by an embodiment of the present invention.
As shown in Fig. 1, the specific steps of the automatic summary generation method include:
Step S110: match the sentences of the source article against the topic lexicon, and compute the weight of each sentence.
The sentence is the basic unit of linguistic expression. In both Chinese and Tibetan text, the sentence is the smallest unit that has a semantic-logical structure and complete syntax, and it can express the relations among multiple semantic objects. Therefore the sentence is chosen here as the basic unit of summary extraction.
In Chinese, punctuation marks are auxiliary symbols of written text that indicate pauses, tone, and the nature and function of words. The stop marks indicate pauses of different lengths and include the full stop (。), question mark (？), exclamation mark (！), comma (，), enumeration comma (、), semicolon (；), and colon (：). Together with other marks and symbols, they divide an article into many sentences and thereby facilitate understanding of its meaning. Tibetan differs from Chinese; its main symbols are:
Table 1 Tibetan symbols
Statistical analysis of network text shows that web text mainly uses the single shad (།) to divide sentences. Therefore the single shad (།) is used here as the separator for sentence segmentation and extraction.
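The delimiter-based segmentation just described can be sketched as follows; the single shad named above is U+0F0D, and the function name is illustrative:

```python
def split_tibetan_sentences(text):
    """Split Tibetan web text on the single shad (U+0F0D), which the
    statistics above identify as the dominant sentence delimiter,
    and drop empty fragments."""
    return [p.strip() for p in text.split("\u0f0d") if p.strip()]
```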
An extractive approach is taken here: extractive automatic summarization methods usually select a certain number of sentences that best represent the topic of the text to compose the summary. Sentence selection must be screened according to the importance of each sentence within the text, so each sentence Sk is assigned a weight, denoted W(Sk). The calculation of the sentence weight W(Sk) here considers the following factors:
(1) Word frequency
When writing an article, people tend to reuse the vocabulary closely related to its theme. In a statistical sense, therefore, the more frequently a word occurs, the more likely it is, to some extent, to be related to the topic the article expresses. Furthermore, the higher the frequency of a word in an article, the greater its importance and the more representative the sentence in which it occurs, with the exception of words that cannot represent the meaning of the article, namely stopwords.
(2) Domain keywords
Domain keywords reflect well the text topics of the related field, so sentences containing keywords can be regarded as candidate summary sentences.
(3) Title
The title is a phrase provided by the author to indicate the content of the article, and it reflects the article's theme. Here the title is word-segmented and the stopwords it contains are removed using a stopword list (stoplist); the remaining words usually bear a close relation to the topic of the source text.
(4) Position
Position is an important feature. The survey results of P. E. Baxendale in the United States show that the probability that a paragraph's topic sentence is its first sentence is 85%, and the probability that it is its last sentence is 7%. Therefore the weights of sentences in the first and last paragraphs of an article, and of the first and last sentences of each paragraph, should be increased appropriately.
(5) Cue words
Sentences containing certain special words are more likely to be selected into the summary than other sentences; we call such words summary cue words. If a sentence contains a summarizing expression such as "this paper discusses", "this paper proposes", "all in all", or "finally", the sentence can summarize the meaning of the article and its weight should be increased appropriately.
Based on the analysis of the above factors, the sentence weight function is designed here as follows:
W(Sk) = Wp(Sk) + (Σi wki)/lk + Σj wkj + Σm wkm + Wc(Sk)    (3)
where: W(Sk) denotes the weight of sentence Sk;
Wp(Sk) denotes the position weight of the sentence; its value is set according to the sentence position as follows:
wki denotes the weighting of a high-frequency word for the sentence; its specific values are as follows:
lk denotes the length of sentence Sk. In general, longer sentences tend to contain more high-frequency words, so the count of high-frequency words must be normalized by the sentence length to eliminate its influence. Here the sum of the high-frequency word weights is divided by the total number of terms contained in the sentence, giving the average term weight of the sentence.
wkj denotes the weighting of a keyword for sentence Sk; its values are set as follows:
wkm denotes the weighting of a title word for sentence Sk; its values are set as follows:
Wc(Sk) denotes the weighting of cue words for sentence Sk; its values are set as follows:
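Since the concrete position, keyword, title-word, and cue-word values are not reproduced in the source, the weight function can only be sketched with illustrative constants. Everything except the length normalization of the high-frequency term weights is an assumption:

```python
def sentence_weight(tokens, position, high_freq, keywords,
                    title_words, cue_words):
    """Sketch of W(Sk): position weight plus length-normalized
    high-frequency term weight plus keyword, title-word, and
    cue-word weightings. All constants are illustrative."""
    wp = 1.0 if position in ("first", "last") else 0.0        # Wp(Sk)
    # high-frequency term weights, normalized by sentence length lk
    w_hf = sum(1.0 for t in tokens if t in high_freq) / max(len(tokens), 1)
    w_kw = sum(1.0 for t in tokens if t in keywords)          # sum of wkj
    w_ti = sum(0.5 for t in tokens if t in title_words)       # sum of wkm
    w_cue = 1.0 if any(t in cue_words for t in tokens) else 0.0  # Wc(Sk)
    return wp + w_hf + w_kw + w_ti + w_cue
```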
Step S120: rank the sentences by weight, and select a fixed percentage of the article's sentences as summary sentences.
Step S130: re-order the extracted sentences according to their order in the source text, and splice them together to generate the summary.
The summary of the source text is generated here by extraction; the extraction algorithm is as follows:
Input: a Tibetan text
Output: a text summary
Process:
(1) extract the sentences of the text;
(2) filter out sentences that are too short or too long;
(3) compute the novelty between sentences to filter out redundant sentences;
(4) compute the weight of each sentence according to formula (3);
(5) generate the text summary;
(6) output the generated summary.
The selection of summary sentences mainly considers the following factors:
(1) Filtering out sentences that are too long or too short.
Very long or very short sentences rarely appear in the summary of an article, so such sentences are not suitable for selection as summary sentences. Here the length of a sentence is counted in Tibetan words; based on statistics, the minimum and maximum length thresholds chosen here are 5 and 40, respectively.
(2) Filtering redundant sentences.
When writing an article, in order to highlight its central idea, people often repeatedly use sentences that reflect that central idea. Such sentences are all likely to be selected as summary sentences, causing repetition in the summary. Therefore the novelty of sentences is computed here during the selection of summary sentences, by computing the cosine similarity between sentences. Using the vector space model (VSM), sentence i is represented as Si = (wi1, …, wik, …, win), where wik denotes the weight of a term in the sentence; the number of occurrences of a feature word in the sentence is used as the feature value. The sentence similarity formula is:
Sim(Si,Sj) = (Σk wik·wjk) / √((Σk wik²)(Σk wjk²))
where Sim(Si,Sj) denotes the similarity between sentence Si and sentence Sj.
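The cosine measure just defined, with raw term counts as the feature values, can be written directly:

```python
import math
from collections import Counter

def cosine_similarity(sent_i, sent_j):
    """Sim(Si, Sj): dot product of the term-count vectors of two
    token lists, divided by the product of their norms."""
    vi, vj = Counter(sent_i), Counter(sent_j)
    dot = sum(c * vj[t] for t, c in vi.items())
    norm = (math.sqrt(sum(c * c for c in vi.values()))
            * math.sqrt(sum(c * c for c in vj.values())))
    return dot / norm if norm else 0.0
```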
(3) Sentence weight.
After the too-long and too-short sentences of the article have been filtered out, the weights of the remaining sentences are computed using the topic lexicon.
Once the sentence weights have been computed, the sentences are ranked by weight, and the top 30% of the total number of sentences are chosen as candidate summary sentences. Sentence redundancy is then computed over the candidate summary sentences and redundant ones are filtered out. Finally, the extracted summary sentences are re-ordered according to their order in the source text and spliced together as the summary.
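The candidate filtering and final re-ordering just described can be sketched as follows; the similarity threshold is an illustrative assumption, as the text gives no value for it:

```python
def select_summary(candidates, similarity, threshold=0.8):
    """Greedily drop candidates that are too similar to an already
    kept sentence, then restore original-text order for splicing.
    `candidates` is a list of (original_index, sentence) pairs,
    already ranked by weight."""
    kept = []
    for idx, sent in candidates:
        if all(similarity(sent, s) < threshold for _, s in kept):
            kept.append((idx, sent))
    kept.sort(key=lambda p: p[0])  # original order, as in step S130
    return [s for _, s in kept]
```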
As shown in Fig. 2, a test sample is chosen from the acquired Tibetan corpus for case analysis.
Before summary sentences are selected, i.e., before sentence weights are computed, the sentences in the text that are too long or too short must be removed. As shown in Fig. 2, sentence (2) was filtered out by the thresholds set (sentence length less than 5 or greater than 40), and the remaining sentences were renumbered.
Through word-frequency statistics and keyword matching on the text, the topic lexicon was obtained; each sentence was then weighted according to formula (3). Table 2 lists the weights of the text's sentences: the second column is the initial value assigned to each sentence according to its position; the third column is the result of computing each sentence's weight by formula (3); the fourth column is the result of ranking the sentences by weight.
Table 2 Sentence weights
Fig. 3 is a summary-extraction interface screenshot of an embodiment of the present invention. As shown in Fig. 3, the summary ratio is set to 30% here. The top 30% of the text's 12 sentences ranked by the weights of Table 2 are selected, taking the strategy of rounding down; the four sentences (3), (13), (9), and (10) are chosen, and the summary of this text is obtained after re-ordering them according to their positions in the source text.
Comparing the source text with the summary in Fig. 2 and Fig. 3 shows that the summarization achieves the expected result: the extracted summary content basically reflects the main content of the source text.
The quality of the automatic summaries is evaluated by comparison with manual summaries, which are extracted by hand by Tibetan-speaking personnel. Taking the sentence as the unit, the precision P, recall R, and F value are computed, with the F value the most important index. The calculation formulas of these three indexes are formulas (3), (4), and (5).
where P is the precision, P = A/(A+C); R is the recall, R = A/(A+B); and F = 2PR/(P+R), with:
A: the number of sentences that are both in the generated summary and marked as summary sentences;
B: the number of sentences not in the generated summary but marked as summary sentences;
C: the number of sentences in the generated summary but not marked as summary sentences.
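With A, B, and C as defined above, the three evaluation indexes can be computed as below; taking precision as A/(A+C) and recall as A/(A+B) follows from those definitions, and the guards against empty denominators are an addition:

```python
def prf(a, b, c):
    """Precision, recall, and F value from the counts above:
    a - extracted and marked as summary sentences,
    b - marked as summary sentences but not extracted,
    c - extracted but not marked as summary sentences."""
    p = a / (a + c) if a + c else 0.0          # precision
    r = a / (a + b) if a + b else 0.0          # recall
    f = 2 * p * r / (p + r) if p + r else 0.0  # F value
    return p, r, f
```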
Twenty Tibetan texts were randomly selected from the corpus; after the automatic summaries were generated, they were compared with the manual summaries, and the precision P, recall R, and F value of each article were computed, as shown in Table 3.
Table 3 P, R, F values
The resulting averages of P, R, and F are 69.35%, 70.95%, and 70.1%, respectively. Judging from the F value, the summarization achieves a fairly satisfactory result.
The above embodiments further describe in detail the objects, technical solutions, and beneficial effects of the present invention. It should be understood that the above are only embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

1. A Web-based Tibetan text summarization generation method, characterized by comprising the following steps:
matching the sentences of the source article against a topic lexicon, and computing the weight of each sentence;
ranking the sentences by weight, and selecting a fixed percentage of the article's sentences as summary sentences;
re-ordering the extracted sentences according to their order in the source text, and splicing the sentences together to generate the summary.
2. The summary generation method according to claim 1, characterized in that the topic lexicon is built as follows:
counting the word frequencies of the article, and adding a certain number of high-frequency words to a candidate topic word list;
matching the segmented article against a domain keyword list, and adding the matched keywords to the candidate topic word list;
extracting some words from the article title and adding them to the candidate topic word list;
finally extracting the topic lexicon from the above three word lists according to a topic-word extraction algorithm.
3. The automatic summary generation method according to claim 1, characterized in that the sentence weight function is designed as W(Sk) = Wp(Sk) + (Σi wki)/lk + Σj wkj + Σm wkm + Wc(Sk), where W(Sk) denotes the weight of sentence Sk, Wp(Sk) the position weight of the sentence, wki the weighting of a high-frequency word for the sentence, lk the length of sentence Sk, wkj the weighting of a keyword for sentence Sk, wkm the weighting of a title word for sentence Sk, and Wc(Sk) the weighting of cue words for sentence Sk.
4. The automatic summary generation method according to claim 1, characterized in that, to avoid repetition in the summary, sentence novelty is computed; the sentence similarity formula is Sim(Si,Sj) = (Σk wik·wjk) / √((Σk wik²)(Σk wjk²)), where Sim(Si,Sj) denotes the similarity between sentence Si and sentence Sj.
5. The automatic summary generation method according to claim 1, characterized in that the vocabulary is obtained by word-segmenting the sentences of the source article and removing meaningless stopwords; the stopwords are identified by filtering the high-frequency vocabulary.
6. The automatic summary generation method according to claim 1, characterized in that re-ordering the extracted sentences according to their order in the source text and splicing the sentences to generate the summary comprises:
filtering redundant sentences, re-ordering the extracted summary sentences according to their order in the source text, and splicing them together as the summary.
7. The automatic summary generation method according to claim 1, characterized in that the factors influencing the sentence weight computation include one or more of: word frequency, domain keywords, the title, position, and cue words.
8. The automatic summary generation method according to claim 1, characterized in that ranking by sentence weight and selecting a percentage of the article's total sentences as summary sentences comprises:
ranking the sentences by weight, and choosing 30% of the article's total sentences as summary sentences.
CN201710847326.1A 2017-09-19 2017-09-19 A Web-based Tibetan text summarization generation method Pending CN107526841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710847326.1A CN107526841A (en) 2017-09-19 2017-09-19 A Web-based Tibetan text summarization generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710847326.1A CN107526841A (en) 2017-09-19 2017-09-19 A Web-based Tibetan text summarization generation method

Publications (1)

Publication Number Publication Date
CN107526841A true CN107526841A (en) 2017-12-29

Family

ID=60737091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710847326.1A Pending CN107526841A (en) A Web-based Tibetan text summarization generation method

Country Status (1)

Country Link
CN (1) CN107526841A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815328A (en) * 2018-12-28 2019-05-28 东软集团股份有限公司 A kind of abstraction generating method and device
CN110489543A (en) * 2019-08-14 2019-11-22 北京金堤科技有限公司 A kind of extracting method and device of news in brief
CN110781291A (en) * 2019-10-25 2020-02-11 北京市计算中心 Text abstract extraction method, device, server and readable storage medium
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN111651588A (en) * 2020-06-10 2020-09-11 扬州大学 Article abstract information extraction algorithm based on directed graph
CN111797225A (en) * 2020-06-16 2020-10-20 北京北大软件工程股份有限公司 Text abstract generation method and device
CN112328946A (en) * 2020-12-10 2021-02-05 青海民族大学 Method and system for automatically generating Tibetan language webpage abstract

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101393545A (en) * 2008-11-06 2009-03-25 新百丽鞋业(深圳)有限公司 Method for implementing automatic abstracting by utilizing association model
CN106021226A (en) * 2016-05-16 2016-10-12 中国建设银行股份有限公司 Text abstract generation method and apparatus
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101393545A (en) * 2008-11-06 2009-03-25 新百丽鞋业(深圳)有限公司 Method for implementing automatic abstracting by utilizing association model
CN106021226A (en) * 2016-05-16 2016-10-12 中国建设银行股份有限公司 Text abstract generation method and apparatus
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
南奎娘若: "基于敏感信息的藏文文本摘要提取的研究" [Research on Tibetan text summary extraction based on sensitive information], 《网络安全》 [Network Security] *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815328A (en) * 2018-12-28 2019-05-28 东软集团股份有限公司 A kind of abstraction generating method and device
CN109815328B (en) * 2018-12-28 2021-05-25 东软集团股份有限公司 Abstract generation method and device
CN110489543A (en) * 2019-08-14 2019-11-22 北京金堤科技有限公司 A kind of extracting method and device of news in brief
CN110781291A (en) * 2019-10-25 2020-02-11 北京市计算中心 Text abstract extraction method, device, server and readable storage medium
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V
CN111651588A (en) * 2020-06-10 2020-09-11 扬州大学 Article abstract information extraction algorithm based on directed graph
CN111651588B (en) * 2020-06-10 2024-03-05 扬州大学 Article abstract information extraction algorithm based on directed graph
CN111797225A (en) * 2020-06-16 2020-10-20 北京北大软件工程股份有限公司 Text abstract generation method and device
CN111797225B (en) * 2020-06-16 2023-08-22 北京北大软件工程股份有限公司 Text abstract generation method and device
CN112328946A (en) * 2020-12-10 2021-02-05 青海民族大学 Method and system for automatically generating Tibetan language webpage abstract

Similar Documents

Publication Publication Date Title
CN107526841A (en) A Web-based Tibetan text summarization generation method
CN102360383B (en) Method for extracting text-oriented field term and term relationship
CN101599071B (en) Automatic extraction method of conversation text topic
CN109710947B (en) Electric power professional word bank generation method and device
Choi et al. Domain-specific sentiment analysis using contextual feature generation
CN101520802A (en) Question-answer pair quality evaluation method and system
El-Shishtawy et al. Arabic keyphrase extraction using linguistic knowledge and machine learning techniques
Ayadi et al. Latent topic model for indexing arabic documents
CN109062895A (en) A kind of intelligent semantic processing method
CN111563372B (en) Typesetting document content self-duplication checking method based on teaching book publishing
Sembok et al. Arabic word stemming algorithms and retrieval effectiveness
Fodil et al. Theme classification of Arabic text: A statistical approach
Chader et al. Sentiment Analysis for Arabizi: Application to Algerian Dialect.
Alhanjouri Pre processing techniques for Arabic documents clustering
Al Taawab et al. Transliterated bengali comment classification from social media
Cherif et al. New rules-based algorithm to improve Arabic stemming accuracy
Ringlstetter et al. Adaptive text correction with Web-crawled domain-dependent dictionaries
Yapinus et al. Automatic multi-document summarization for Indonesian documents using hybrid abstractive-extractive summarization technique
Alam et al. Bangla news trend observation using LDA based topic modeling
Heidary et al. Automatic Persian text summarization using linguistic features from text structure analysis
Li-Juan et al. A classification method of Vietnamese news events based on maximum entropy model
Maheswari et al. Rule based morphological variation removable stemming algorithm
CN111209737B (en) Method for screening out noise document and computer readable storage medium
Liao et al. Combining Language Model with Sentiment Analysis for Opinion Retrieval of Blog-Post.
CN103646058B (en) Method and system for identifying key words in technical documents

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171229

RJ01 Rejection of invention patent application after publication