CN107526841A - A Web-based Tibetan text summarization generation method - Google Patents

A Web-based Tibetan text summarization generation method

Info

Publication number
CN107526841A
CN107526841A CN201710847326.1A CN201710847326A
Authority
CN
China
Prior art keywords
sentence
article
text
vocabulary
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710847326.1A
Other languages
Chinese (zh)
Inventor
胥桂仙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minzu University of China
Original Assignee
Minzu University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minzu University of China filed Critical Minzu University of China
Priority to CN201710847326.1A priority Critical patent/CN107526841A/en
Publication of CN107526841A publication Critical patent/CN107526841A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention relates to a Web-based Tibetan text summarization generation method comprising the following steps: matching the sentences of the source article against a topic lexicon and computing a weight for each sentence; ranking the sentences by weight and selecting a fixed percentage of the article's sentences as summary sentences; and re-ordering the extracted sentences according to their order in the source text and splicing them together to generate the summary. The invention adopts an extractive approach to automatic summary generation, selecting a certain number of sentences that best represent the topic of the text to compose the summary, which effectively makes Tibetan-language information easier to obtain and improves the efficiency with which people obtain it.

Description

A Web-based Tibetan text summarization generation method
Technical field
The present invention relates to the field of information processing, and in particular to a Web-based Tibetan text summarization generation method.
Background technology
An abstract is a short text whose purpose is to outline the content of a document, describing its important content concisely and accurately without commentary or additional explanation. Automatic summarization technology uses a computer to analyze a text and extract the content that reflects its topic, thereby forming a summary. Summaries can be divided into different categories according to different criteria. By the relationship between the summary content and the source document, summaries fall into two classes: extractive and abstractive. An extractive summary is composed of sentences extracted from the source text, whereas an abstractive summary is generated on the basis of discourse-level semantic understanding, so not all of its content appears in the source text. By the amount of source text, summaries can be divided into single-document and multi-document summaries: single-document summarization extracts a summary from one source text only, while multi-document summarization produces a comprehensive summary of multiple source documents on the same topic. Herein, extractive summarization of single texts is performed.
At present, many researchers have carried out a great deal of research on the extraction of Chinese and English summaries. In terms of research strategy, the study of automatic summarization can be divided into three periods: mechanical summarization, summarization by understanding, and comprehensive summarization. Mechanical summarization is not limited to a particular field and its methods are simple, but the summary quality is not high; summarization by understanding yields high quality but applies only to some small field, and it is difficult to realize. People therefore gradually began to combine multiple methods to extract summaries so that their advantages complement one another, which gave rise to comprehensive summarization.
Research on automatic summarization abroad began in 1958, when H. P. Luhn of IBM in the United States carried out the first automatic abstracting experiment. Luhn used the number of occurrences of a word as its weight, scored the sentences containing these high-frequency words, and extracted the highest-scoring sentences from the text as the summary. P. E. Baxendale's research pointed out the relationship between the position of a sentence within a paragraph and its importance. Later, researchers proposed methods based on the naive Bayes algorithm, decision-tree algorithms, hidden Markov models, and so on, achieving certain results.
Domestic research on automatic summarization started later. With the popularization of computers in China and the demand of the network era for processing information flows, research on automatic summarization gradually developed in the 1990s. Beginning in 1988, Professor Wang Yongcheng of Shanghai Jiao Tong University worked on Chinese automatic abstracting systems, successively developing a pilot Chinese document abstracting system, the Automatic Abstracting System on Chinese Documents (CAES), and, in 1997, the OA Chinese document automatic abstracting system. The OA system employs a human-simulating algorithm and comprehensively considers many factors such as position, indicative expressions, keywords, and the title; it is not limited to a particular field and is a relatively practical system. In addition, Professor Wang Kaizhu of Harbin Institute of Technology combined statistics-based mechanical summarization with meaning-based understanding summarization to develop the HIT-97I English automatic summarization system.
At present there is little research on Tibetan automatic summarization; the main existing work is a researcher's proposal of a Web text summarization method based on sentence extraction, in which the weight of each Web sentence is decomposed mainly into the contributions of Web feature words and Web sentence structure, and a certain number of sentences are then selected as the summary according to the sentence weights.
With the country's strong support for information-technology construction in Tibetan areas, the number of Tibetan-language websites keeps growing, which provides abundant corpus material for research on Tibetan text summarization. On the other hand, the extraction of Tibetan web-page summaries provides useful technology for the informatization of Tibetan areas and facilitates the retrieval of Tibetan-language information: it lets readers quickly judge whether a source text contains content of interest and quickly find the information they actually need, instead of wasting time reading irrelevant documents. This greatly improves the efficiency with which people obtain information and therefore has practical value for social development and economic construction.
Summary of the invention
At present there is little research on Tibetan automatic summarization. To help people obtain Tibetan-language information more conveniently, the present invention proposes an extractive automatic summary generation method that selects a certain number of sentences that best represent the topic of the text to compose the summary, effectively facilitating access to Tibetan information while improving the efficiency with which people obtain it.
To achieve the above object, the invention provides a Web-based Tibetan text summarization generation method comprising the following steps: matching the sentences of the source article against a topic lexicon and computing the weight of each sentence; ranking the sentences by weight and selecting a fixed percentage of the article's sentences as summary sentences; and re-ordering the extracted sentences according to their order in the source text and splicing them together to generate the summary.
Preferably, the topic lexicon is built as follows: counting the word frequencies of the article and adding a certain number of high-frequency words to a candidate topic word list; matching the segmented article against a domain keyword list and adding the matched keywords to the candidate topic word list; extracting some words from the article title and adding them to the candidate topic word list; and finally extracting the topic lexicon from the above three word lists according to a topic-word extraction algorithm.
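The three-source construction of the topic lexicon described above can be sketched as follows. The function name, the `top_n` cutoff, and the use of plain sets are illustrative assumptions; the patent does not specify its topic-word extraction algorithm, so this sketch simply merges the three candidate sources:

```python
from collections import Counter

def build_topic_lexicon(words, domain_keywords, title_words,
                        top_n=20, stopwords=frozenset()):
    """Sketch: merge high-frequency words, matched domain keywords,
    and title words into one candidate topic word list."""
    # 1. a certain number of high-frequency words from the article body
    freq = Counter(w for w in words if w not in stopwords)
    candidates = {w for w, _ in freq.most_common(top_n)}
    # 2. domain keywords that actually occur in the segmented article
    candidates |= set(words) & set(domain_keywords)
    # 3. words extracted from the article title (minus stopwords)
    candidates |= set(title_words) - stopwords
    return candidates
```

A real implementation would then apply the topic-word extraction algorithm mentioned in the text to this candidate list rather than using it directly.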
Preferably, the sentence weight function is designed as W(Sk) = Wp(Sk) + (Σi wki)/lk + Σj wkj + Σm wkm + Wc(Sk), where W(Sk) denotes the weight of sentence Sk, Wp(Sk) the position weight of the sentence, wki the weighting of a high-frequency word for the sentence, lk the length of sentence Sk, wkj the weighting of a keyword for sentence Sk, wkm the weighting of a title word for sentence Sk, and Wc(Sk) the weighting of cue words for sentence Sk.
Preferably, to avoid repetition in the summary, sentence novelty is computed; the sentence similarity formula is Sim(Si,Sj) = (Σk wik·wjk) / √((Σk wik²)(Σk wjk²)), where Sim(Si,Sj) denotes the similarity between sentence Si and sentence Sj.
Preferably, the vocabulary is obtained by word-segmenting the sentences of the source article and removing meaningless stopwords; the stopwords are identified by filtering the high-frequency vocabulary.
Preferably, the step of re-ordering the extracted sentences according to their order in the source text and splicing them to generate the summary comprises: filtering redundant sentences, re-ordering the extracted summary sentences according to their order in the source text, and splicing them together as the summary.
Preferably, the factors influencing the sentence weight computation include one or more of: word frequency, domain keywords, the title, position, and cue words.
The present invention performs extractive Tibetan summarization of a single text, selecting a certain number of sentences that best represent the topic of the text to compose the summary, effectively facilitating access to Tibetan information while improving the efficiency with which people obtain it.
Brief description of the drawings
Fig. 1 is a schematic flowchart of an automatic summary generation method provided by an embodiment of the present invention;
Fig. 2 is a test-sample interface screenshot of an embodiment of the present invention;
Fig. 3 is a summary-extraction interface screenshot of an embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a schematic flowchart of a Web-based Tibetan text summarization generation method provided by an embodiment of the present invention.
As shown in Fig. 1, the specific steps of the automatic summary generation method include:
Step S110: match the sentences of the source article against the topic lexicon, and compute the weight of each sentence.
The sentence is the basic unit of linguistic expression. In both Chinese and Tibetan text, the sentence is the smallest unit that has a semantic-logical structure and complete syntax, and it can express the relations among multiple semantic objects. Therefore the sentence is chosen here as the basic unit of summary extraction.
In Chinese, punctuation marks are auxiliary symbols of written text that indicate pauses, tone, and the nature and function of words. The stop marks indicate pauses of different lengths and include the full stop (。), question mark (？), exclamation mark (！), comma (，), enumeration comma (、), semicolon (；), and colon (：). Together with other marks and symbols, they divide an article into many sentences and thereby facilitate understanding of its meaning. Tibetan differs from Chinese; its main symbols are:
Table 1 Tibetan symbols
Statistical analysis of network text shows that web text mainly uses the single shad (།) to divide sentences. Therefore the single shad (།) is used here as the separator for sentence segmentation and extraction.
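The delimiter-based segmentation just described can be sketched as follows; the single shad named above is U+0F0D, and the function name is illustrative:

```python
def split_tibetan_sentences(text):
    """Split Tibetan web text on the single shad (U+0F0D), which the
    statistics above identify as the dominant sentence delimiter,
    and drop empty fragments."""
    return [p.strip() for p in text.split("\u0f0d") if p.strip()]
```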
An extractive approach is taken here: extractive automatic summarization methods usually select a certain number of sentences that best represent the topic of the text to compose the summary. Sentence selection must be screened according to the importance of each sentence within the text, so each sentence Sk is assigned a weight, denoted W(Sk). The calculation of the sentence weight W(Sk) here considers the following factors:
(1) Word frequency
When writing an article, people tend to reuse the vocabulary closely related to its theme. In a statistical sense, therefore, the more frequently a word occurs, the more likely it is, to some extent, to be related to the topic the article expresses. Furthermore, the higher the frequency of a word in an article, the greater its importance and the more representative the sentence in which it occurs, with the exception of words that cannot represent the meaning of the article, namely stopwords.
(2) Domain keywords
Domain keywords reflect well the text topics of the related field, so sentences containing keywords can be regarded as candidate summary sentences.
(3) Title
The title is a phrase provided by the author to indicate the content of the article, and it reflects the article's theme. Here the title is word-segmented and the stopwords it contains are removed using a stopword list (stoplist); the remaining words usually bear a close relation to the topic of the source text.
(4) Position
Position is an important feature. The survey results of P. E. Baxendale in the United States show that the probability that a paragraph's topic sentence is its first sentence is 85%, and the probability that it is its last sentence is 7%. Therefore the weights of sentences in the first and last paragraphs of an article, and of the first and last sentences of each paragraph, should be increased appropriately.
(5) Cue words
Sentences containing certain special words are more likely to be selected into the summary than other sentences; we call such words summary cue words. If a sentence contains a summarizing expression such as "this paper discusses", "this paper proposes", "all in all", or "finally", the sentence can summarize the meaning of the article and its weight should be increased appropriately.
Based on the analysis of the above factors, the sentence weight function is designed here as follows:
W(Sk) = Wp(Sk) + (Σi wki)/lk + Σj wkj + Σm wkm + Wc(Sk)    (3)
where: W(Sk) denotes the weight of sentence Sk;
Wp(Sk) denotes the position weight of the sentence; its value is set according to the sentence position as follows:
wki denotes the weighting of a high-frequency word for the sentence; its specific values are as follows:
lk denotes the length of sentence Sk. In general, longer sentences tend to contain more high-frequency words, so the count of high-frequency words must be normalized by the sentence length to eliminate its influence. Here the sum of the high-frequency word weights is divided by the total number of terms contained in the sentence, giving the average term weight of the sentence.
wkj denotes the weighting of a keyword for sentence Sk; its values are set as follows:
wkm denotes the weighting of a title word for sentence Sk; its values are set as follows:
Wc(Sk) denotes the weighting of cue words for sentence Sk; its values are set as follows:
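Since the concrete position, keyword, title-word, and cue-word values are not reproduced in the source, the weight function can only be sketched with illustrative constants. Everything except the length normalization of the high-frequency term weights is an assumption:

```python
def sentence_weight(tokens, position, high_freq, keywords,
                    title_words, cue_words):
    """Sketch of W(Sk): position weight plus length-normalized
    high-frequency term weight plus keyword, title-word, and
    cue-word weightings. All constants are illustrative."""
    wp = 1.0 if position in ("first", "last") else 0.0        # Wp(Sk)
    # high-frequency term weights, normalized by sentence length lk
    w_hf = sum(1.0 for t in tokens if t in high_freq) / max(len(tokens), 1)
    w_kw = sum(1.0 for t in tokens if t in keywords)          # sum of wkj
    w_ti = sum(0.5 for t in tokens if t in title_words)       # sum of wkm
    w_cue = 1.0 if any(t in cue_words for t in tokens) else 0.0  # Wc(Sk)
    return wp + w_hf + w_kw + w_ti + w_cue
```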
Step S120: rank the sentences by weight, and select a fixed percentage of the article's sentences as summary sentences.
Step S130: re-order the extracted sentences according to their order in the source text, and splice them together to generate the summary.
The summary of the source text is generated here by extraction; the extraction algorithm is as follows:
Input: a Tibetan text
Output: a text summary
Process:
(1) extract the sentences of the text;
(2) filter out sentences that are too short or too long;
(3) compute the novelty between sentences to filter out redundant sentences;
(4) compute the weight of each sentence according to formula (3);
(5) generate the text summary;
(6) output the generated summary.
The selection of summary sentences mainly considers the following factors:
(1) Filtering out sentences that are too long or too short.
Very long or very short sentences rarely appear in the summary of an article, so such sentences are not suitable for selection as summary sentences. Here the length of a sentence is counted in Tibetan words; based on statistics, the minimum and maximum length thresholds chosen here are 5 and 40, respectively.
(2) Filtering redundant sentences.
When writing an article, in order to highlight its central idea, people often repeatedly use sentences that reflect that central idea. Such sentences are all likely to be selected as summary sentences, causing repetition in the summary. Therefore the novelty of sentences is computed here during the selection of summary sentences, by computing the cosine similarity between sentences. Using the vector space model (VSM), sentence i is represented as Si = (wi1, …, wik, …, win), where wik denotes the weight of a term in the sentence; the number of occurrences of a feature word in the sentence is used as the feature value. The sentence similarity formula is:
Sim(Si,Sj) = (Σk wik·wjk) / √((Σk wik²)(Σk wjk²))
where Sim(Si,Sj) denotes the similarity between sentence Si and sentence Sj.
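The cosine measure just defined, with raw term counts as the feature values, can be written directly:

```python
import math
from collections import Counter

def cosine_similarity(sent_i, sent_j):
    """Sim(Si, Sj): dot product of the term-count vectors of two
    token lists, divided by the product of their norms."""
    vi, vj = Counter(sent_i), Counter(sent_j)
    dot = sum(c * vj[t] for t, c in vi.items())
    norm = (math.sqrt(sum(c * c for c in vi.values()))
            * math.sqrt(sum(c * c for c in vj.values())))
    return dot / norm if norm else 0.0
```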
(3) Sentence weight.
After the too-long and too-short sentences of the article have been filtered out, the weights of the remaining sentences are computed using the topic lexicon.
Once the sentence weights have been computed, the sentences are ranked by weight, and the top 30% of the total number of sentences are chosen as candidate summary sentences. Sentence redundancy is then computed over the candidate summary sentences and redundant ones are filtered out. Finally, the extracted summary sentences are re-ordered according to their order in the source text and spliced together as the summary.
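The candidate filtering and final re-ordering just described can be sketched as follows; the similarity threshold is an illustrative assumption, as the text gives no value for it:

```python
def select_summary(candidates, similarity, threshold=0.8):
    """Greedily drop candidates that are too similar to an already
    kept sentence, then restore original-text order for splicing.
    `candidates` is a list of (original_index, sentence) pairs,
    already ranked by weight."""
    kept = []
    for idx, sent in candidates:
        if all(similarity(sent, s) < threshold for _, s in kept):
            kept.append((idx, sent))
    kept.sort(key=lambda p: p[0])  # original order, as in step S130
    return [s for _, s in kept]
```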
As shown in Fig. 2, a test sample is chosen from the acquired Tibetan corpus for case analysis.
Before summary sentences are selected, i.e., before sentence weights are computed, the sentences in the text that are too long or too short must be removed. As shown in Fig. 2, sentence (2) was filtered out by the thresholds set (sentence length less than 5 or greater than 40), and the remaining sentences were renumbered.
Through word-frequency statistics and keyword matching on the text, the topic lexicon was obtained; each sentence was then weighted according to formula (3). Table 2 lists the weights of the text's sentences: the second column is the initial value assigned to each sentence according to its position; the third column is the result of computing each sentence's weight by formula (3); the fourth column is the result of ranking the sentences by weight.
Table 2 Sentence weights
Fig. 3 is a summary-extraction interface screenshot of an embodiment of the present invention. As shown in Fig. 3, the summary ratio is set to 30% here. The top 30% of the text's 12 sentences ranked by the weights of Table 2 are selected, taking the strategy of rounding down; the four sentences (3), (13), (9), and (10) are chosen, and the summary of this text is obtained after re-ordering them according to their positions in the source text.
Comparing the source text with the summary in Fig. 2 and Fig. 3 shows that the summarization achieves the expected result: the extracted summary content basically reflects the main content of the source text.
The quality of the automatic summaries is evaluated by comparison with manual summaries, which are extracted by hand by Tibetan-speaking personnel. Taking the sentence as the unit, the precision P, recall R, and F value are computed, with the F value the most important index. The calculation formulas of these three indexes are formulas (3), (4), and (5).
where P is the precision, P = A/(A+C); R is the recall, R = A/(A+B); and F = 2PR/(P+R), with:
A: the number of sentences that are both in the generated summary and marked as summary sentences;
B: the number of sentences not in the generated summary but marked as summary sentences;
C: the number of sentences in the generated summary but not marked as summary sentences.
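With A, B, and C as defined above, the three evaluation indexes can be computed as below; taking precision as A/(A+C) and recall as A/(A+B) follows from those definitions, and the guards against empty denominators are an addition:

```python
def prf(a, b, c):
    """Precision, recall, and F value from the counts above:
    a - extracted and marked as summary sentences,
    b - marked as summary sentences but not extracted,
    c - extracted but not marked as summary sentences."""
    p = a / (a + c) if a + c else 0.0          # precision
    r = a / (a + b) if a + b else 0.0          # recall
    f = 2 * p * r / (p + r) if p + r else 0.0  # F value
    return p, r, f
```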
Twenty Tibetan texts were randomly selected from the corpus; after the automatic summaries were generated, they were compared with the manual summaries, and the precision P, recall R, and F value of each article were computed, as shown in Table 3.
Table 3 P, R, F values
The resulting averages of P, R, and F are 69.35%, 70.95%, and 70.1%, respectively. Judging from the F value, the summarization achieves a fairly satisfactory result.
The above embodiments further describe in detail the objects, technical solutions, and beneficial effects of the present invention. It should be understood that the above are only embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

1. A Web-based Tibetan text summarization generation method, characterized by comprising the following steps:
matching the sentences of the source article against a topic lexicon, and computing the weight of each sentence;
ranking the sentences by weight, and selecting a fixed percentage of the article's sentences as summary sentences;
re-ordering the extracted sentences according to their order in the source text, and splicing the sentences together to generate the summary.
2. The summary generation method according to claim 1, characterized in that the topic lexicon is built as follows:
counting the word frequencies of the article, and adding a certain number of high-frequency words to a candidate topic word list;
matching the segmented article against a domain keyword list, and adding the matched keywords to the candidate topic word list;
extracting some words from the article title and adding them to the candidate topic word list;
finally extracting the topic lexicon from the above three word lists according to a topic-word extraction algorithm.
3. The automatic summary generation method according to claim 1, characterized in that the sentence weight function is designed as W(Sk) = Wp(Sk) + (Σi wki)/lk + Σj wkj + Σm wkm + Wc(Sk), where W(Sk) denotes the weight of sentence Sk, Wp(Sk) the position weight of the sentence, wki the weighting of a high-frequency word for the sentence, lk the length of sentence Sk, wkj the weighting of a keyword for sentence Sk, wkm the weighting of a title word for sentence Sk, and Wc(Sk) the weighting of cue words for sentence Sk.
4. The automatic summary generation method according to claim 1, characterized in that, to avoid repetition in the summary, sentence novelty is computed; the sentence similarity formula is Sim(Si,Sj) = (Σk wik·wjk) / √((Σk wik²)(Σk wjk²)), where Sim(Si,Sj) denotes the similarity between sentence Si and sentence Sj.
5. The automatic summary generation method according to claim 1, characterized in that the vocabulary is obtained by word-segmenting the sentences of the source article and removing meaningless stopwords; the stopwords are identified by filtering the high-frequency vocabulary.
6. The automatic summary generation method according to claim 1, characterized in that re-ordering the extracted sentences according to their order in the source text and splicing the sentences to generate the summary comprises:
filtering redundant sentences, re-ordering the extracted summary sentences according to their order in the source text, and splicing them together as the summary.
7. The automatic summary generation method according to claim 1, characterized in that the factors influencing the sentence weight computation include one or more of: word frequency, domain keywords, the title, position, and cue words.
8. The automatic summary generation method according to claim 1, characterized in that ranking by sentence weight and selecting a percentage of the article's total sentences as summary sentences comprises:
ranking the sentences by weight, and choosing 30% of the article's total sentences as summary sentences.
CN201710847326.1A 2017-09-19 2017-09-19 A Web-based Tibetan text summarization generation method Pending CN107526841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710847326.1A CN107526841A (en) 2017-09-19 2017-09-19 A Web-based Tibetan text summarization generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710847326.1A CN107526841A (en) 2017-09-19 2017-09-19 A Web-based Tibetan text summarization generation method

Publications (1)

Publication Number Publication Date
CN107526841A true CN107526841A (en) 2017-12-29

Family

ID=60737091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710847326.1A Pending CN107526841A (en) A Web-based Tibetan text summarization generation method

Country Status (1)

Country Link
CN (1) CN107526841A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815328A (en) * 2018-12-28 2019-05-28 东软集团股份有限公司 A kind of abstraction generating method and device
CN110489543A (en) * 2019-08-14 2019-11-22 北京金堤科技有限公司 A kind of extracting method and device of news in brief
CN110781291A (en) * 2019-10-25 2020-02-11 北京市计算中心 Text abstract extraction method, device, server and readable storage medium
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN111651588A (en) * 2020-06-10 2020-09-11 扬州大学 Article abstract information extraction algorithm based on directed graph
CN111797225A (en) * 2020-06-16 2020-10-20 北京北大软件工程股份有限公司 Text abstract generation method and device
CN112328946A (en) * 2020-12-10 2021-02-05 青海民族大学 Method and system for automatically generating Tibetan language webpage abstract

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101393545A (en) * 2008-11-06 2009-03-25 新百丽鞋业(深圳)有限公司 Method for implementing automatic abstracting by utilizing association model
CN106021226A (en) * 2016-05-16 2016-10-12 中国建设银行股份有限公司 Text abstract generation method and apparatus
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101393545A (en) * 2008-11-06 2009-03-25 新百丽鞋业(深圳)有限公司 Method for implementing automatic abstracting by utilizing association model
CN106021226A (en) * 2016-05-16 2016-10-12 中国建设银行股份有限公司 Text abstract generation method and apparatus
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
南奎娘若: "基于敏感信息的藏文文本摘要提取的研究" [Research on Tibetan text summary extraction based on sensitive information], 《网络安全》 [Network Security] *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815328A (en) * 2018-12-28 2019-05-28 东软集团股份有限公司 A kind of abstraction generating method and device
CN109815328B (en) * 2018-12-28 2021-05-25 东软集团股份有限公司 Abstract generation method and device
CN110489543A (en) * 2019-08-14 2019-11-22 北京金堤科技有限公司 A kind of extracting method and device of news in brief
CN110781291A (en) * 2019-10-25 2020-02-11 北京市计算中心 Text abstract extraction method, device, server and readable storage medium
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V
CN111651588A (en) * 2020-06-10 2020-09-11 扬州大学 Article abstract information extraction algorithm based on directed graph
CN111651588B (en) * 2020-06-10 2024-03-05 扬州大学 Article abstract information extraction algorithm based on directed graph
CN111797225A (en) * 2020-06-16 2020-10-20 北京北大软件工程股份有限公司 Text abstract generation method and device
CN111797225B (en) * 2020-06-16 2023-08-22 北京北大软件工程股份有限公司 Text abstract generation method and device
CN112328946A (en) * 2020-12-10 2021-02-05 青海民族大学 Method and system for automatically generating Tibetan language webpage abstract

Similar Documents

Publication Publication Date Title
CN107526841A (en) A Web-based Tibetan text summarization generation method
CN102360383B (en) Method for extracting text-oriented field term and term relationship
CN101599071B (en) Automatic extraction method of conversation text topic
CN109710947B (en) Electric power professional word bank generation method and device
Choi et al. Domain-specific sentiment analysis using contextual feature generation
CN101520802A (en) Question-answer pair quality evaluation method and system
El-Shishtawy et al. Arabic keyphrase extraction using linguistic knowledge and machine learning techniques
Ayadi et al. Latent topic model for indexing arabic documents
CN109062895A (en) A kind of intelligent semantic processing method
CN111563372B (en) Typesetting document content self-duplication checking method based on teaching book publishing
Sembok et al. Arabic word stemming algorithms and retrieval effectiveness
Fodil et al. Theme classification of Arabic text: A statistical approach
Chader et al. Sentiment Analysis for Arabizi: Application to Algerian Dialect.
Alhanjouri Pre processing techniques for Arabic documents clustering
Al Taawab et al. Transliterated bengali comment classification from social media
Cherif et al. New rules-based algorithm to improve Arabic stemming accuracy
Ringlstetter et al. Adaptive text correction with Web-crawled domain-dependent dictionaries
Yapinus et al. Automatic multi-document summarization for Indonesian documents using hybrid abstractive-extractive summarization technique
Alam et al. Bangla news trend observation using LDA based topic modeling
Heidary et al. Automatic Persian text summarization using linguistic features from text structure analysis
Li-Juan et al. A classification method of Vietnamese news events based on maximum entropy model
Maheswari et al. Rule based morphological variation removable stemming algorithm
CN111209737B (en) Method for screening out noise document and computer readable storage medium
Liao et al. Combining Language Model with Sentiment Analysis for Opinion Retrieval of Blog-Post.
CN103646058B (en) Method and system for identifying key words in technical documents

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171229

RJ01 Rejection of invention patent application after publication