WO2017092316A1 - Summary generation method and device - Google Patents

Summary generation method and device

Info

Publication number
WO2017092316A1
Authority
WO
WIPO (PCT)
Prior art keywords
statement
statements
combination
document
processed
Prior art date
Application number
PCT/CN2016/088929
Other languages
English (en)
French (fr)
Inventor
赵九龙
Original Assignee
乐视控股(北京)有限公司
乐视网信息技术(北京)股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司, 乐视网信息技术(北京)股份有限公司 filed Critical 乐视控股(北京)有限公司
Priority to US15/239,768 priority Critical patent/US20170161259A1/en
Publication of WO2017092316A1 publication Critical patent/WO2017092316A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search

Definitions

  • the present disclosure relates to computer technology, and more particularly to a method and apparatus for generating a digest.
  • a news headline is a short text, placed before the body of a news item, that summarizes or evaluates the news content.
  • its function is to divide, organize, reveal and evaluate the news content, and to attract readers.
  • the present disclosure provides a method and apparatus for generating a digest to solve the technical problem that the news headline does not match the news content in the prior art, and the user may not be able to obtain the desired content by reading such news.
  • a digest generating method including:
  • the statement with the largest weight value in the combination of sentences is selected as a candidate statement
  • the candidate statements corresponding to a part of the statement combination are combined into a summary of the document to be processed.
  • a digest generating apparatus including:
  • a dividing module configured to divide the to-be-processed document into a plurality of statement combinations, each of the statement combinations comprising a preset number of statements;
  • a calculation module configured to calculate a weight value of all statements in each of the combination of statements
  • a combination module configured to combine the candidate statements corresponding to a part of the statement combination into a summary of the to-be-processed document.
  • a terminal comprising the digest generating apparatus according to any one of the implementation manners of the second aspect of the embodiments of the present disclosure.
  • the present disclosure divides a document to be processed into a plurality of sentence combinations, each of which includes a preset number of sentences; calculates a weight value for every sentence in each of the sentence combinations; for each sentence combination, selects the sentence with the largest weight value in the combination as a candidate sentence; and may combine the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document.
  • the method provided by the present disclosure can automatically generate a summary according to the content of the document, facilitate the user to quickly obtain the required information by reading the abstract, help people understand the document profile, and determine whether the original text should be read according to the document profile.
  • FIG. 1 is a flowchart of a digest generating method according to an exemplary embodiment
  • FIG. 2 is a flow chart of step S102 of Figure 1;
  • FIG. 3 is a flow chart of step S101 of Figure 1;
  • FIG. 4 is a flow chart of step S104 of Figure 1;
  • FIG. 5 is a flow chart of step S104 of Figure 1;
  • FIG. 6 is a device diagram of a digest generating apparatus according to an exemplary embodiment.
  • a digest generating method including the following steps.
  • step S101 the document to be processed is divided into a plurality of sentence combinations, and each of the sentence combinations includes a preset number of sentences.
  • the document may be divided into a plurality of sentences according to punctuation marks indicating a long pause, such as a period, an exclamation mark or a question mark, and a preset number of sentences may be combined into one statement combination; in the embodiment of the present disclosure, each statement combination may contain five sentences.
  • step S102 the weight values of all the statements in each of the statement combinations are calculated.
  • the TextRank formula can be used to calculate the weight of the statement in the document to be processed, and the BM25 algorithm can be used to calculate the similarity between the two sentences.
  • step S103 for each combination of sentences, a sentence with the highest weight value in the combination of sentences is selected as a candidate statement.
  • step S104 the candidate sentences corresponding to a part of the statement combinations are combined into a summary of the to-be-processed document.
  • when the candidate sentences are C, F, P, Q, R and S, a preset number of them with the largest weights may be selected as the summary of the document to be processed, for example, CPQRS, CFPQS, and the like.
  • the present disclosure can automatically generate a summary according to the content of the document, so that the user can quickly obtain the required information by reading the abstract, help people understand the document overview, and determine whether the original text should be read according to the document overview.
  • the step S102 includes the following steps.
  • step S201 the text in the document is divided into a plurality of words.
  • step S202 each word is marked with part of speech.
  • step S201 and step S202 the document to be processed can be segmented by using a tokenizer to realize entity recognition such as person name and place name, and obtain words and their part of speech.
  • step S203 among the words obtained by segmenting each sentence, the words whose part of speech is a predetermined part of speech, and the words located in a preset blacklist, are deleted.
  • the words belonging to the preset part of speech and the words in the preset blacklist may be filtered according to the preset part of speech and the preset blacklist.
  • for example, when the preset part of speech includes personal names, the person names appearing in the document to be processed may be deleted; when the preset blacklist includes place names, the place names in the document to be processed may be deleted.
  • step S204 the similarity of every two sentences in the combination of sentences is calculated.
  • the similarity between two sentences can be calculated using the BM25 algorithm.
  • the BM25 algorithm is as follows:
  • Score(Q, d) = Σ_i W_i · R(q_i, d)
  • Q and d represent two sentences, q_i is a word in the sentence, W_i represents the weight of q_i, and R(q_i, d) represents the relevance score of the morpheme q_i with respect to d; Score(Q, d) is then the similarity of the two sentences Q and d.
  • step S205 the weight values of all the sentences in each of the sentence combinations are calculated using the similarity.
  • the TextRank formula is as follows:
  • WS(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] · WS(V_j)
  • WS(V_i) on the left side of the equation represents the weight of a sentence (WS is an abbreviation of weight_sum), and the summation on the right side indicates the contribution of each adjacent sentence to this sentence; the numerator w_ji of the summed term indicates the similarity of the two sentences, and the denominator is again a weight sum, WS(V_j) representing the weight of node V_j from the previous iteration.
  • In(V_i) represents the set of nodes pointing to node V_i, and Out(V_j) represents the set of nodes pointed to by node V_j.
  • d is the damping factor, generally taking a value of 0.85; the whole formula is an iterative process.
  • the method provided by the embodiment of the present disclosure can treat each article as a whole, reflect the relevance between sentences, facilitate the calculation of weights, take the similarity between sentences into account, and avoid repeated sentences in the extracted abstract.
  • the step S101 includes the following steps.
  • step S301 the content of the document to be processed is divided into a plurality of sentences according to a preset punctuation.
  • step S302 for each statement, according to the ordering of the statements in the document to be processed, the statement and a preset number of consecutive statements after the statement are selected as one sentence combination.
  • for example, if the divided document includes an A statement, a B statement, a C statement, a D statement, an E statement, an F statement and a G statement, the A, B, C, D and E statements can be regarded as a first statement combination, the B, C, D, E and F statements as a second statement combination, and the C, D, E, F and G statements as a third statement combination.
  • the method provided by the embodiment of the present disclosure can combine each statement with its adjacent sentence constituent sentences, so that the similarity and weight value between the calculated sentences will be more accurate.
  • the step S104 includes the following steps.
  • step S401 a statement corresponding to the largest weight value in each sentence combination is determined as the target sentence.
  • step S402 a predetermined number of target sentences are determined as candidate statements.
  • the embodiment of the present disclosure can determine the "most important" sentence in each sentence combination as the target sentence, sort all the target sentences, and select the "most important" statements as candidate statements, which accurately picks out the most important candidate statements in the document so that a summary can be generated from them; the amount of calculation is small, and the selection range is more comprehensive.
  • the step S104 includes the following steps.
  • step S501 the ordering of the candidate sentences corresponding to the statement combination in the document to be processed is acquired.
  • step S502 a summary of the to-be-processed document is generated according to the sorting.
  • a summary of the document can be generated in the order in which the partial statements are combined in the document.
  • the method provided by the embodiment of the present disclosure can display the finally selected candidate statements according to their order in the document, which is convenient for the user to understand.
  • the embodiment of the present disclosure further provides a computer storage medium, wherein the computer storage medium can store a program, and when executed the program can implement part or all of the steps in each implementation manner of the digest generating method provided by the embodiments shown in FIG. 1 to FIG. 5.
  • a digest generating apparatus including: a dividing module 601, a calculating module 602, a selecting module 603, and a combining module 604.
  • the dividing module 601 is configured to divide the to-be-processed document into a plurality of statement combinations, and each of the statement combinations includes a preset number of statements.
  • the calculation module 602 is configured to calculate a weight value of all the statements in each of the statement combinations.
  • the selection module 603 is configured to select, as a candidate statement, a statement with the largest weight value in the combination of sentences for each combination of sentences.
  • the combination module 604 is configured to combine the candidate statements corresponding to a part of the statement combination into a summary of the to-be-processed document.
  • the calculation module 602 includes: a segmentation submodule, an annotation submodule, a deletion submodule, a similarity calculation submodule, and a weight calculation submodule.
  • a segmentation submodule configured to segment the text in the document into a plurality of words.
  • an annotation submodule configured to tag each word with its part of speech.
  • a deletion submodule configured to delete, from the plurality of words obtained by segmentation in each sentence, the words whose part of speech is a preset part of speech and the words located in a preset blacklist.
  • the similarity calculation submodule is configured to calculate the similarity of each two sentences in the combination of sentences.
  • a weight calculation sub-module is configured to calculate a weight value of all statements in each of the statement combinations by using the similarity.
  • the dividing module 601 includes: a dividing submodule and a selecting submodule.
  • a dividing submodule configured to divide the content of the document to be processed into a plurality of sentences according to preset punctuation.
  • a selection submodule configured to select, for each sentence, according to the order of the sentences in the to-be-processed document, the sentence and a preset number of consecutive sentences after it as one sentence combination.
  • the combining module 604 includes: a first determining submodule and a second determining submodule.
  • a first determining submodule configured to determine a statement corresponding to a largest weight value in each statement combination as a target statement
  • the second determining submodule is configured to determine a preset number of target statements as candidate statements.
  • the combining module 604 includes: an obtaining submodule and a generating submodule.
  • an obtaining submodule configured to obtain the ordering, within the to-be-processed document, of the candidate statements corresponding to a part of the statement combinations; and a generating submodule configured to generate the summary of the to-be-processed document according to the ordering.
  • the embodiment of the present disclosure further provides a terminal, which may be an electronic device having a document reading function, such as a personal computer, a mobile phone or a tablet computer, wherein the terminal includes the digest generating apparatus according to any implementation manner of the embodiment shown in FIG. 6 above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A summary generation method and device, the method comprising the following steps: dividing a to-be-processed document into a plurality of sentence combinations (S101), each of the sentence combinations containing a preset number of sentences; calculating a weight value for every sentence in each of the sentence combinations (S102); for each sentence combination, selecting the sentence with the largest weight value in the combination as a candidate sentence (S103); and combining the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document (S104). The method can automatically generate a summary from the content of a document, making it convenient for users to quickly obtain the information they need by reading the summary, helping people understand the outline of the document, and letting them decide, based on that outline, whether the original text should be read in full.

Description

Summary generation method and device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 3, 2015, with application number 201510882825.5 and the invention title "摘要生成方法及装置" (Summary generation method and device), the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to computer technology, and in particular to a summary generation method and device.
Background
With the popularization of the Internet and the increase in ways of obtaining information, a massive amount of information emerges every day. News items therefore generally carry a news headline: a short text, placed before the news body, that summarizes or evaluates the news content; its function is to divide, organize, reveal and evaluate the news content and to attract readers.
However, because there is now a great deal of news data on the network, some media outlets, in order to catch users' eyes and obtain more page views, may set certain news headlines that are excessively exaggerated and bear little relation to the article content. After reading such news, a user may not have obtained the needed information, wasting the user's time and energy.
Summary of the invention
The present disclosure provides a summary generation method and device to solve the technical problem in the prior art that news headlines do not match the news content, so that users may be unable to obtain the desired content by reading such news.
According to a first aspect of the embodiments of the present disclosure, a summary generation method is provided, comprising:
dividing a to-be-processed document into a plurality of sentence combinations, each of the sentence combinations containing a preset number of sentences;
calculating a weight value for every sentence in each of the sentence combinations;
for each sentence combination, selecting the sentence with the largest weight value in the combination as a candidate sentence;
combining the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document.
According to a second aspect of the embodiments of the present disclosure, a summary generation device is provided, comprising:
a dividing module configured to divide a to-be-processed document into a plurality of sentence combinations, each of the sentence combinations containing a preset number of sentences;
a calculation module configured to calculate a weight value for every sentence in each of the sentence combinations;
a selection module configured to select, for each sentence combination, the sentence with the largest weight value in the combination as a candidate sentence;
a combination module configured to combine the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document.
According to a third aspect of the embodiments of the present disclosure, a terminal is provided, the terminal comprising the summary generation device according to any implementation manner of the second aspect of the embodiments of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
The present disclosure divides a to-be-processed document into a plurality of sentence combinations, each of the sentence combinations containing a preset number of sentences; calculates a weight value for every sentence in each of the sentence combinations; for each sentence combination, selects the sentence with the largest weight value in the combination as a candidate sentence; and may combine the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document.
The method provided by the present disclosure can automatically generate a summary from the content of a document, making it convenient for users to quickly obtain the information they need by reading the summary, helping people understand the outline of the document, and letting them decide, based on that outline, whether the original text should be read in full.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings are incorporated into and form a part of this specification; they show embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the present invention.
FIG. 1 is a flowchart of a summary generation method according to an exemplary embodiment;
FIG. 2 is a flowchart of step S102 in FIG. 1;
FIG. 3 is a flowchart of step S101 in FIG. 1;
FIG. 4 is a flowchart of step S104 in FIG. 1;
FIG. 5 is a flowchart of step S104 in FIG. 1;
FIG. 6 is a device diagram of a summary generation device according to an exemplary embodiment.
Detailed description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. Where the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims.
With the popularization of the Internet and the increase in ways of obtaining information, a massive amount of information emerges every day. To obtain useful information quickly and accurately from this mass of information, automatic summarization of documents is becoming more and more important. To this end, as shown in FIG. 1, in one embodiment of the present disclosure a summary generation method is provided, comprising the following steps.
In step S101, a to-be-processed document is divided into a plurality of sentence combinations, each of the sentence combinations containing a preset number of sentences.
In this step, the document may be divided into a plurality of sentences according to punctuation marks indicating a long pause, such as periods, exclamation marks and question marks, and a preset number of sentences may be combined into one sentence combination; in the embodiment of the present disclosure, each sentence combination may contain five sentences.
In step S102, a weight value is calculated for every sentence in each of the sentence combinations.
In this step, the TextRank formula may be used to calculate the weight of a sentence within the to-be-processed document, and the BM25 algorithm may be used to calculate the similarity between two sentences.
In step S103, for each sentence combination, the sentence with the largest weight value in the combination is selected as a candidate sentence.
For example, if there is a sentence combination M containing five sentences A, B, C, D and E, and after the weights of A, B, C, D and E within the to-be-processed document are calculated with the TextRank formula sentence C has the largest weight, then C may be selected as the candidate sentence. Likewise, if there is a sentence combination N containing five sentences F, G, H, I and J, the sentence F with the largest calculated weight may be selected as the candidate sentence, and so on; besides candidate sentences C and F, candidate sentences P, Q, R, S and so on may also be obtained.
In step S104, the candidate sentences corresponding to a part of the sentence combinations are combined into a summary of the to-be-processed document.
In this step, when the candidate sentences are C, F, P, Q, R and S, a preset number of them with the largest weights may be selected as the summary of the to-be-processed document, for example CPQRS, CFPQS, and so on.
The present disclosure can automatically generate a summary from the content of a document, making it convenient for users to quickly obtain the information they need by reading the summary, helping people understand the outline of the document, and letting them decide, based on that outline, whether the original text should be read in full.
As shown in FIG. 2, in a further embodiment of the present disclosure, step S102 comprises the following steps.
In step S201, the text in the document is segmented into a plurality of words.
In step S202, each word is tagged with its part of speech.
In steps S201 and S202, a tokenizer may be used to segment the text of the to-be-processed document, performing entity recognition such as person names and place names, to obtain the words and their parts of speech.
In step S203, among the words obtained by segmenting each sentence, the words whose part of speech is a preset part of speech, and the words located in a preset blacklist, are deleted.
In this step, the words belonging to the preset part of speech and the words in the preset blacklist may be filtered out according to the preset part of speech and the preset blacklist. For example, when the preset part of speech includes personal names, the person names appearing in the to-be-processed document may be deleted; when the preset blacklist includes place names, the place names in the to-be-processed document may be deleted.
In step S204, the similarity of every two sentences in the sentence combination is calculated.
In this step, the BM25 algorithm may be used to calculate the similarity between two sentences. The BM25 algorithm is as follows:
Score(Q, d) = Σ_i W_i · R(q_i, d)
In the embodiment of the present disclosure, Q and d represent two sentences, q_i is a word in the sentence, W_i represents the weight of q_i, and R(q_i, d) represents the relevance score of the morpheme q_i with respect to d; Score(Q, d) is then the similarity of the two sentences Q and d.
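A minimal sketch of a BM25-style Score(Q, d) = Σ_i W_i · R(q_i, d) computation follows. The patent does not specify how W_i and R(q_i, d) are computed, so the IDF weight and the saturating term-frequency score below are conventional BM25 choices, not the authors' implementation; the parameter values k1 and b are likewise assumptions.

```python
import math

def bm25_similarity(q_words, d_words, corpus, k1=1.5, b=0.75):
    """Similarity of sentence Q to sentence d, in the spirit of
    Score(Q, d) = sum_i W_i * R(q_i, d).

    W_i is taken to be the IDF of q_i over `corpus` (a list of
    tokenized sentences) and R(q_i, d) is the standard BM25 term
    score with length normalization."""
    n = len(corpus)
    avgdl = sum(len(s) for s in corpus) / n  # average sentence length
    score = 0.0
    for qi in set(q_words):
        df = sum(1 for s in corpus if qi in s)           # sentence frequency of q_i
        w_i = math.log((n - df + 0.5) / (df + 0.5) + 1)  # IDF weight W_i
        f = d_words.count(qi)                            # frequency of q_i in d
        # R(q_i, d): saturating term-frequency score, length-normalized
        r = f * (k1 + 1) / (f + k1 * (1 - b + b * len(d_words) / avgdl))
        score += w_i * r
    return score
```

Identical sentences score highest, and sentences sharing no words score zero, which is the property the similarity graph in the next step relies on.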
In step S205, the similarity is used to calculate a weight value for every sentence in each of the sentence combinations.
In this step, the TextRank formula may be used to calculate the weight values of the sentences. The TextRank formula is as follows:
WS(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] · WS(V_j)
Here WS(V_i) on the left side of the equation represents the weight of a sentence (WS is an abbreviation of weight_sum), and the summation on the right side indicates the contribution of each adjacent sentence to this sentence. The numerator w_ji of the summed term indicates the similarity of the two sentences, and the denominator is again a weight sum, WS(V_j) representing the weight of node V_j from the previous iteration. In(V_i) denotes the set of nodes pointing to node V_i, Out(V_j) denotes the set of nodes pointed to by node V_j, and d is the damping factor, generally taking the value 0.85. The whole formula is an iterative process.
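The iterative TextRank update above can be sketched as follows, assuming a dense symmetric similarity matrix so that In(Vi) and Out(Vj) are simply all other sentences; the initial weights and the fixed iteration count are illustrative assumptions (the patent does not state a convergence criterion).

```python
def textrank(sim, d=0.85, iters=50):
    """Iterate WS(Vi) = (1 - d) + d * sum_j [w_ji / sum_k w_jk] * WS(Vj)
    over a sentence-similarity matrix `sim` (sim[j][i] = similarity of
    sentences j and i); returns one weight per sentence."""
    n = len(sim)
    ws = [1.0] * n  # initial weights
    for _ in range(iters):
        prev = ws[:]  # WS(Vj) from the previous iteration
        for i in range(n):
            total = 0.0
            for j in range(n):
                if j == i:
                    continue
                # denominator: total outgoing similarity of node j
                out_sum = sum(sim[j][k] for k in range(n) if k != j)
                if out_sum > 0:
                    total += sim[j][i] / out_sum * prev[j]
            ws[i] = (1 - d) + d * total
    return ws
```

A sentence similar to many others accumulates more contributions and ends with a larger weight, which is exactly what step S103 exploits when picking the highest-weight sentence of each combination.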
The method provided by this embodiment of the present disclosure can treat each article as a whole, reflect the relevance between sentences, facilitate the calculation of weights, and at the same time take the similarity between sentences into account, avoiding repeated sentences in the extracted summary.
As shown in FIG. 3, in a further embodiment of the present disclosure, step S101 comprises the following steps.
In step S301, the content of the to-be-processed document is divided into a plurality of sentences according to preset punctuation.
In step S302, for each sentence, according to the order of the sentences in the to-be-processed document, the sentence and a preset number of consecutive sentences after it are selected as one sentence combination.
For example, if the document after sentence division includes an A sentence, a B sentence, a C sentence, a D sentence, an E sentence, an F sentence and a G sentence, then the A, B, C, D and E sentences may be taken as a first sentence combination, the B, C, D, E and F sentences as a second sentence combination, and the C, D, E, F and G sentences as a third sentence combination.
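The sentence splitting and sliding-window grouping of steps S301 and S302 can be sketched as follows; the regular expression and the window size of five are illustrative choices, since the patent only requires "long pause" punctuation and a preset number of sentences.

```python
import re

def split_sentences(text):
    """Step S301: split on long-pause punctuation (period,
    exclamation mark, question mark, Chinese or Western form)."""
    parts = re.split(r"[。！？.!?]+", text)
    return [p.strip() for p in parts if p.strip()]

def sentence_combinations(sentences, size=5):
    """Step S302: each sentence together with the following consecutive
    sentences forms one combination of `size` sentences (a sliding
    window), so sentences A..G yield A-E, B-F and C-G."""
    return [sentences[i:i + size]
            for i in range(len(sentences) - size + 1)]
```

Running `sentence_combinations(list("ABCDEFG"))` reproduces the three combinations of the example above.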
The method provided by this embodiment of the present disclosure can combine each sentence with its adjacent sentences into sentence combinations, so that the similarity and weight values calculated between sentences will be more accurate.
As shown in FIG. 4, in a further embodiment of the present disclosure, step S104 comprises the following steps.
In step S401, the sentence corresponding to the largest weight value in each sentence combination is determined as a target sentence.
In step S402, a preset number of target sentences are determined as candidate sentences.
In this step, all target sentences may be sorted by weight value, and the preset number of target sentences with the largest weight values may be selected as the candidate sentences.
This embodiment of the present disclosure can determine the "most important" sentence, i.e. the one with the largest weight value, in each sentence combination as the target sentence, and after sorting all target sentences select the "most important" ones as candidate sentences. This accurately picks out the most important candidate sentences in the document so that a summary can be generated from them; the amount of calculation is small, and the selection range is more comprehensive.
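Steps S401 and S402 (taking the highest-weight sentence of each combination as a target sentence, then keeping the top-weighted targets as candidates) might look like the sketch below; the function name and data structures are hypothetical, since the patent leaves them unspecified.

```python
def pick_candidates(combinations, weights, top_n=3):
    """S401: the highest-weight sentence of each combination becomes a
    target sentence (a set, since overlapping windows can repeat a
    winner).  S402: keep the `top_n` targets with the largest weights.
    `weights` maps each sentence to its (e.g. TextRank) weight."""
    targets = {max(combo, key=lambda s: weights[s]) for combo in combinations}
    return sorted(targets, key=lambda s: weights[s], reverse=True)[:top_n]
```

Because the windows overlap, one sentence can win several combinations; collecting targets as a set keeps the summary free of duplicates, matching the stated benefit of the method.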
As shown in FIG. 5, in a further embodiment of the present disclosure, step S104 comprises the following steps.
In step S501, the ordering, within the to-be-processed document, of the candidate sentences corresponding to the part of the sentence combinations is acquired.
In this step, the positions of the partial sentence combinations in the document, or their order of appearance in the document, may be acquired.
In step S502, a summary of the to-be-processed document is generated according to the ordering.
In this step, the summary of the document may be generated according to the order of appearance of the partial sentence combinations in the document.
The method provided by this embodiment of the present disclosure can display the finally selected candidate sentences in their order of appearance in the document, which is convenient for the user to understand.
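Steps S501 and S502 (re-ordering the candidates by their position in the document and emitting the summary) can be sketched as follows; joining with a Chinese full stop is an assumption, as the patent only requires that the original document order be preserved.

```python
def generate_summary(document_sentences, candidates):
    """S501: recover each candidate's position in the original document
    (sentences are assumed unique, so list.index suffices).
    S502: join the candidates in that order into the summary."""
    ordered = sorted(candidates, key=document_sentences.index)
    return "。".join(ordered) + "。"
```

For example, with candidates selected out of order, the summary still reads in the document's own sequence.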
In addition, an embodiment of the present disclosure further provides a computer storage medium, wherein the computer storage medium may store a program which, when executed, can implement part or all of the steps in each implementation manner of the summary generation method provided by the embodiments shown in FIG. 1 to FIG. 5.
As shown in FIG. 6, in a further embodiment of the present disclosure, a summary generation device is provided, comprising: a dividing module 601, a calculation module 602, a selection module 603 and a combination module 604.
The dividing module 601 is configured to divide a to-be-processed document into a plurality of sentence combinations, each of the sentence combinations containing a preset number of sentences.
The calculation module 602 is configured to calculate a weight value for every sentence in each of the sentence combinations.
The selection module 603 is configured to select, for each sentence combination, the sentence with the largest weight value in the combination as a candidate sentence.
The combination module 604 is configured to combine the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document.
In a further embodiment of the present disclosure, the calculation module 602 comprises: a segmentation submodule, a tagging submodule, a deletion submodule, a similarity calculation submodule and a weight calculation submodule.
The segmentation submodule is configured to segment the text in the document into a plurality of words.
The tagging submodule is configured to tag each word with its part of speech.
The deletion submodule is configured to delete, from the plurality of words obtained by segmentation in each sentence, the words whose part of speech is a preset part of speech and the words located in a preset blacklist.
The similarity calculation submodule is configured to calculate the similarity of every two sentences in the sentence combination.
The weight calculation submodule is configured to use the similarity to calculate a weight value for every sentence in each of the sentence combinations.
In a further embodiment of the present disclosure, the dividing module 601 comprises: a dividing submodule and a selection submodule.
The dividing submodule is configured to divide the content of the to-be-processed document into a plurality of sentences according to preset punctuation.
The selection submodule is configured to select, for each sentence, according to the order of the sentences in the to-be-processed document, the sentence and a preset number of consecutive sentences after it as one sentence combination.
In a further embodiment of the present disclosure, the combination module 604 comprises: a first determining submodule and a second determining submodule.
The first determining submodule is configured to determine the sentence corresponding to the largest weight value in each sentence combination as a target sentence;
The second determining submodule is configured to determine a preset number of target sentences as candidate sentences.
In a further embodiment of the present disclosure, the combination module 604 comprises: an acquisition submodule and a generation submodule.
The acquisition submodule is configured to acquire the ordering, within the to-be-processed document, of the candidate sentences corresponding to the part of the sentence combinations;
The generation submodule is configured to generate the summary of the to-be-processed document according to the ordering.
In addition, an embodiment of the present disclosure further provides a terminal, which may be an electronic device with a document reading function such as a personal computer, a mobile phone or a tablet computer, wherein the terminal comprises the summary generation device according to any implementation manner of the embodiment shown in FIG. 6 above.
Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed by the present disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention indicated by the appended claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (11)

  1. A summary generation method, characterized by comprising:
    dividing a to-be-processed document into a plurality of sentence combinations, each of the sentence combinations containing a preset number of sentences;
    calculating a weight value for every sentence in each of the sentence combinations;
    for each sentence combination, selecting the sentence with the largest weight value in the combination as a candidate sentence;
    combining the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document.
  2. The summary generation method according to claim 1, characterized in that calculating a weight value for every sentence in each of the sentence combinations comprises:
    segmenting the text in the document into a plurality of words;
    tagging each word with its part of speech;
    deleting, from the plurality of words obtained by segmentation in each sentence, the words whose part of speech is a preset part of speech and the words located in a preset blacklist;
    calculating the similarity of every two sentences in the sentence combination;
    using the similarity to calculate a weight value for every sentence in each of the sentence combinations.
  3. The summary generation method according to claim 1, characterized in that dividing a to-be-processed document into a plurality of sentence combinations comprises:
    dividing the content of the to-be-processed document into a plurality of sentences according to preset punctuation;
    for each sentence, according to the order of the sentences in the to-be-processed document, selecting the sentence and a preset number of consecutive sentences after it as one sentence combination.
  4. The summary generation method according to claim 1, characterized in that combining the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document comprises:
    determining the sentence corresponding to the largest weight value in each sentence combination as a target sentence;
    determining a preset number of target sentences as candidate sentences.
  5. The summary generation method according to claim 1, characterized in that combining the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document comprises:
    acquiring the ordering, within the to-be-processed document, of the candidate sentences corresponding to the part of the sentence combinations;
    generating the summary of the to-be-processed document according to the ordering.
  6. A summary generation device, characterized by comprising:
    a dividing module configured to divide a to-be-processed document into a plurality of sentence combinations, each of the sentence combinations containing a preset number of sentences;
    a calculation module configured to calculate a weight value for every sentence in each of the sentence combinations;
    a selection module configured to select, for each sentence combination, the sentence with the largest weight value in the combination as a candidate sentence;
    a combination module configured to combine the candidate sentences corresponding to a part of the sentence combinations into a summary of the to-be-processed document.
  7. The summary generation device according to claim 6, characterized in that the calculation module comprises:
    a segmentation submodule configured to segment the text in the document into a plurality of words;
    a tagging submodule configured to tag each word with its part of speech;
    a deletion submodule configured to delete, from the plurality of words obtained by segmentation in each sentence, the words whose part of speech is a preset part of speech and the words located in a preset blacklist;
    a similarity calculation submodule configured to calculate the similarity of every two sentences in the sentence combination;
    a weight calculation submodule configured to use the similarity to calculate a weight value for every sentence in each of the sentence combinations.
  8. The summary generation device according to claim 6, characterized in that the dividing module comprises:
    a dividing submodule configured to divide the content of the to-be-processed document into a plurality of sentences according to preset punctuation;
    a selection submodule configured to select, for each sentence, according to the order of the sentences in the to-be-processed document, the sentence and a preset number of consecutive sentences after it as one sentence combination.
  9. The summary generation device according to claim 6, characterized in that the combination module comprises:
    a first determining submodule configured to determine the sentence corresponding to the largest weight value in each sentence combination as a target sentence;
    a second determining submodule configured to determine a preset number of target sentences as candidate sentences.
  10. The summary generation device according to claim 6, characterized in that the combination module comprises:
    an acquisition submodule configured to acquire the ordering, within the to-be-processed document, of the candidate sentences corresponding to the part of the sentence combinations;
    a generation submodule configured to generate the summary of the to-be-processed document according to the ordering.
  11. A terminal, characterized by comprising: the summary generation device according to any one of claims 6 to 10.
PCT/CN2016/088929 2015-12-03 2016-07-06 Summary generation method and device WO2017092316A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/239,768 US20170161259A1 (en) 2015-12-03 2016-08-17 Method and Electronic Device for Generating a Summary

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510882825.5A CN105868175A (zh) 2015-12-03 2015-12-03 Summary generation method and device
CN201510882825.5 2015-12-03

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/239,768 Continuation US20170161259A1 (en) 2015-12-03 2016-08-17 Method and Electronic Device for Generating a Summary

Publications (1)

Publication Number Publication Date
WO2017092316A1 true WO2017092316A1 (zh) 2017-06-08

Family

ID=56624346

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/088929 WO2017092316A1 (zh) 2015-12-03 2016-07-06 Summary generation method and device

Country Status (3)

Country Link
US (1) US20170161259A1 (zh)
CN (1) CN105868175A (zh)
WO (1) WO2017092316A1 (zh)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11544306B2 (en) 2015-09-22 2023-01-03 Northern Light Group, Llc System and method for concept-based search summaries
US11886477B2 (en) 2015-09-22 2024-01-30 Northern Light Group, Llc System and method for quote-based search summaries
US11226946B2 (en) 2016-04-13 2022-01-18 Northern Light Group, Llc Systems and methods for automatically determining a performance index
CN106708932A (zh) * 2016-11-21 2017-05-24 百度在线网络技术(北京)有限公司 Method and device for extracting summaries of replies on question-and-answer websites
CN106959945B (zh) * 2017-03-23 2021-01-05 北京百度网讯科技有限公司 Artificial-intelligence-based method and device for generating short headlines for news
CN109299454A (zh) * 2017-07-24 2019-02-01 北京京东尚科信息技术有限公司 Chat-log-based summary generation method and device, storage medium and electronic terminal
CN109947929A (zh) * 2017-07-24 2019-06-28 北京京东尚科信息技术有限公司 Conversation summary generation method and device, storage medium and electronic terminal
US10127323B1 (en) 2017-07-26 2018-11-13 International Business Machines Corporation Extractive query-focused multi-document summarization
CN108304445B (zh) * 2017-12-07 2021-08-03 新华网股份有限公司 Text summary generation method and device
CN108197103B (zh) * 2017-12-27 2019-05-17 掌阅科技股份有限公司 Electronic abridged-book generation method, electronic device and computer storage medium
CN108399265A (zh) * 2018-03-23 2018-08-14 北京奇虎科技有限公司 Search-based real-time hot news provision method and device
CN108628833B (zh) * 2018-05-11 2021-01-22 北京三快在线科技有限公司 Method and device for determining summaries of original content, and method and device for recommending original content
CN108897852B (zh) * 2018-06-29 2020-10-23 北京百度网讯科技有限公司 Method, device and apparatus for judging the coherence of dialogue content
CN110781659A (zh) * 2018-07-11 2020-02-11 株式会社Ntt都科摩 Neural-network-based text processing method and text processing device
CN108959269B (zh) * 2018-07-27 2019-07-05 首都师范大学 Automatic sentence ordering method and device
CN109726282A (zh) * 2018-12-26 2019-05-07 东软集团股份有限公司 Method, device, apparatus and storage medium for generating article summaries
CN110245230A (zh) * 2019-05-15 2019-09-17 北京思源智通科技有限责任公司 Book grading method, ***, storage medium and server
CN110334192B (zh) * 2019-07-15 2021-09-24 河北科技师范学院 Text summary generation method and ***, electronic device and storage medium
CN111241267B (zh) * 2020-01-10 2022-12-06 科大讯飞股份有限公司 Summary extraction and summary-extraction model training method, related device, and storage medium
CN114595684A (zh) * 2022-02-11 2022-06-07 北京三快在线科技有限公司 Summary generation method and device, electronic device and storage medium
CN114328883B (zh) * 2022-03-08 2022-06-28 恒生电子股份有限公司 Data processing method, device, apparatus and medium for machine reading comprehension

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102945228A (zh) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation technology
CN103136359A (zh) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Single-document summary generation method
US20140250376A1 (en) * 2013-03-04 2014-09-04 Microsoft Corporation Summarizing and navigating data using counting grids
CN104156452A (zh) * 2014-08-18 2014-11-19 中国人民解放军国防科学技术大学 Web page text summary generation method and device

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
CA2184518A1 (en) * 1996-08-30 1998-03-01 Jim Reed Real time structured summary search engine
US7017114B2 (en) * 2000-09-20 2006-03-21 International Business Machines Corporation Automatic correlation method for generating summaries for text documents
US20040133560A1 (en) * 2003-01-07 2004-07-08 Simske Steven J. Methods and systems for organizing electronic documents
US8055713B2 (en) * 2003-11-17 2011-11-08 Hewlett-Packard Development Company, L.P. Email application with user voice interface
CN100418093C (zh) * 2006-04-13 2008-09-10 北大方正集团有限公司 Topic- or query-oriented multi-document summarization method based on cluster arrangement
US20110295612A1 (en) * 2010-05-28 2011-12-01 Thierry Donneau-Golencer Method and apparatus for user modelization
CN102411621B (zh) * 2011-11-22 2014-01-08 华中师范大学 Cloud-model-based Chinese query-oriented multi-document automatic summarization method
CN103246687B (zh) * 2012-06-13 2016-08-17 苏州大学 Automatic Blog summarization method based on feature information
US9461876B2 (en) * 2012-08-29 2016-10-04 Loci System and method for fuzzy concept mapping, voting ontology crowd sourcing, and technology prediction


Also Published As

Publication number Publication date
CN105868175A (zh) 2016-08-17
US20170161259A1 (en) 2017-06-08

Similar Documents

Publication Publication Date Title
WO2017092316A1 (zh) Summary generation method and device
Singh et al. Sentiment analysis of movie reviews: A new feature-based heuristic for aspect-level sentiment classification
CN107168954B (zh) Text keyword generation method and device, electronic device and readable storage medium
TWI536181B (zh) Language identification in multilingual text
WO2019153607A1 (zh) Intelligent response method, electronic device and storage medium
CA2777520C (en) System and method for phrase identification
US8073877B2 (en) Scalable semi-structured named entity detection
US9183274B1 (en) System, methods, and data structure for representing object and properties associations
US20150269163A1 (en) Providing search recommendation
WO2021189951A1 (zh) Text search method and device, computer apparatus and storage medium
EP2812883A1 (en) System and method for semantically annotating images
CN107885717B (zh) Keyword extraction method and device
US20160335244A1 (en) System and method for text normalization in noisy channels
US10685012B2 (en) Generating feature embeddings from a co-occurrence matrix
CN112364624B (zh) Keyword extraction method based on a deep-learning language model fused with semantic features
JP2019530063A (ja) Systems and methods for tagging electronic records
Habib et al. An exploratory approach to find a novel metric based optimum language model for automatic bangla word prediction
CN106663123B (zh) Comment-centric news reader
CN115794995A (zh) Target answer acquisition method, related device, electronic device and storage medium
CN110196910B (zh) Corpus classification method and device
Dubuisson Duplessis et al. Utterance retrieval based on recurrent surface text patterns
WO2020133186A1 (zh) Document information extraction method, storage medium and terminal
TW201421267A (zh) Search system and method
CN115062135B (zh) Patent screening method and electronic device
JP5366179B2 (ja) System, method and program for estimating the importance of information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16869629

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16869629

Country of ref document: EP

Kind code of ref document: A1