CN110728143A

CN110728143A - Method and equipment for identifying document key sentences

Info

Publication number: CN110728143A
Application number: CN201910900141.1A
Authority: CN
Inventors: 翟光景; 田进太; 赵庆平; 刘益东
Original assignee: Shanghai Midu Information Technology Co Ltd
Current assignee: Shanghai Midu Information Technology Co Ltd
Priority date: 2019-09-23
Filing date: 2019-09-23
Publication date: 2020-01-24

Abstract

The application provides a method and equipment for identifying document key sentences. Compared with the prior art, the application performs word segmentation processing on a document based on the text content in the document to obtain a plurality of entries corresponding to the document; calculates the entry importance score of each entry and determines the M entries whose entry importance scores rank highest, wherein M is a preset value; performs sentence splitting processing on the document to obtain a sentence set for the document; traverses the sentence set and screens out sentences containing one or more of the M entries; and calculates the sentence importance scores of the screened sentences based on the entry importance scores of the M entries, determining the one or more sentences with the highest sentence importance scores as document key sentences.

Description

Method and equipment for identifying document key sentences

Technical Field

The application relates to the technical field of computers, in particular to a technology for identifying key sentences of a document.

Background

A large amount of document data exists on public websites. A document generally contains a central sentence that represents the information in the document, that is, a key sentence. If the key sentence can be extracted, the information in the document can be understood quickly, which helps with sharing or classifying the document. However, the prior art provides no technology for identifying the key sentences in a document.

Disclosure of Invention

The application aims to provide a method and equipment for identifying document key sentences.

According to one aspect of the application, a method for identifying a document key sentence is provided, wherein the method comprises the following steps:

performing word segmentation processing on a document based on the text content in the document to obtain a plurality of entries corresponding to the document;

calculating the entry importance score of each entry, and determining M entries with the entry importance scores ranked at the top, wherein M is a preset value;

performing sentence splitting processing on the document to obtain a sentence set related to the document;

traversing the sentence set, and screening out sentences containing one or more of the M entries;

and calculating the sentence importance scores of the screened sentences based on the entry importance scores of the M entries, and determining one or more sentences with the highest sentence importance scores as document key sentences.

Further, the performing word segmentation processing on the document based on the text content in the document to obtain a plurality of entries corresponding to the document includes:

acquiring a title and a text of the document;

respectively carrying out word segmentation processing on the text contents of the title and the text of the document to obtain a plurality of title entries and text entries;

wherein the method further comprises:

and adding preset weight to the title entries to calculate entry importance scores of the weighted title entries.

Further, wherein the method further comprises:

performing semantic analysis on the screened sentences, and respectively endowing the screened sentences with preset weight values according to semantic analysis results;

wherein the calculating the sentence importance scores of the screened sentences based on the entry importance scores of the M entries and the determining one or more sentences with the highest sentence importance scores as the document key sentences comprises:

and calculating the sentence importance scores of the screened sentences based on the importance scores of the M entries and the preset weight values, and determining one or more sentences with the highest sentence importance scores as document key sentences.

Further, wherein the method further comprises:

obtaining D public documents as a basic corpus set, wherein D is a preset value;

performing word segmentation processing on the documents in the basic corpus set to obtain basic entries;

the word segmentation processing of the document based on the text content in the document to obtain a plurality of entries corresponding to the document comprises:

and performing word segmentation processing on the document based on the text content in the document, and acquiring a plurality of entries corresponding to the document based on the basic entries.

Further, the formula for calculating the importance score of each entry is as follows:

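$f_i = tf_{i,j} \times idf_i$,

where $tf_{i,j}$ is the term frequency of entry $t_i$ in document $d_j$ and $idf_i$ is the inverse document frequency of entry $t_i$; $n$ represents the number of times an entry appears in a document, $D$ is the number of documents in the basic corpus set, and $|\{j : t_i \in d_j\}|$ represents the number of documents in the basic corpus set that contain the entry.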
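The entry importance score can be illustrated with a short sketch. The exact expressions for the term frequency and the inverse document frequency are not preserved in this text, so the code assumes the common TF-IDF form: term frequency as the entry count divided by the total number of entries in the document, and inverse document frequency as the logarithm of D divided by one plus the number of corpus documents containing the entry. The function name `entry_importance_scores` and the add-one smoothing are illustrative assumptions, not part of the application.

```python
import math
from collections import Counter


def entry_importance_scores(doc_entries, corpus):
    """Return {entry: tf * idf} for the entries of one segmented document.

    Assumed forms (the source text does not preserve them):
      tf  = n / total number of entries in the document
      idf = log(D / (1 + number of corpus documents containing the entry))
    """
    counts = Counter(doc_entries)            # n for each entry
    total = sum(counts.values())
    d = len(corpus)                          # D, size of the basic corpus set
    corpus_sets = [set(doc) for doc in corpus]
    scores = {}
    for entry, n in counts.items():
        # Number of corpus documents that contain this entry.
        df = sum(1 for doc in corpus_sets if entry in doc)
        scores[entry] = (n / total) * math.log(d / (1 + df))
    return scores


# Hypothetical, already segmented inputs.
corpus = [["a", "b"], ["a", "c"], ["a", "d"]]
scores = entry_importance_scores(["a", "b", "b"], corpus)
```

With this assumed form, an entry that appears in every corpus document (here `"a"`) receives a lower score than an entry that is frequent in the document but rare in the corpus (here `"b"`).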

Further, the formula for calculating the sentence importance scores of the screened sentences based on the entry importance scores of the M entries is as follows:

$F_i$ is the sum of the entry importance scores of the one or more of the M entries contained in the i-th sentence.

Further, based on the importance scores of the M entries and the preset weight values, the formula for calculating the sentence importance scores of the screened sentences is as follows:

$S_i = F_i + E_i$, where $E_i$ represents the preset weight value of the i-th sentence.

According to another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the operations of the method as described above.

According to still another aspect of the present application, there is also provided an apparatus for document key sentence recognition, wherein the apparatus includes:

one or more processors; and

a memory storing computer readable instructions that, when executed, cause the processor to perform operations of the method as previously described.

Compared with the prior art, the application performs word segmentation processing on a document based on the text content in the document to obtain a plurality of entries corresponding to the document; calculates the entry importance score of each entry and determines the M entries whose entry importance scores rank highest, wherein M is a preset value; performs sentence splitting processing on the document to obtain a sentence set for the document; traverses the sentence set and screens out sentences containing one or more of the M entries; and calculates the sentence importance scores of the screened sentences based on the entry importance scores of the M entries, determining the one or more sentences with the highest sentence importance scores as document key sentences.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:

FIG. 1 illustrates a flow diagram of a method for document key sentence identification in accordance with an aspect of the subject application;

FIG. 2 illustrates a flow diagram of a method for word segmentation processing in accordance with a preferred embodiment of the present application.

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

The present invention is described in further detail below with reference to the attached drawing figures.

In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

To further illustrate the technical means and effects adopted by the present application, the following description clearly and completely describes the technical solution of the present application with reference to the accompanying drawings and preferred embodiments.

FIG. 1 illustrates a flow diagram of a method for document key sentence identification provided in one aspect of the present application. The method is performed at a device 1, the method comprising the steps of:

S11: performing word segmentation processing on the document based on the text content in the document to obtain a plurality of entries corresponding to the document;

S12: calculating the entry importance score of each entry, and determining M entries with the entry importance scores ranked at the top, wherein M is a preset value;

S13: performing sentence splitting processing on the document to obtain a sentence set related to the document;

S14: traversing the sentence set, and screening out sentences containing one or more of the M entries;

S15: calculating the sentence importance scores of the screened sentences based on the entry importance scores of the M entries, and determining one or more sentences with the highest sentence importance scores as document key sentences.

In this embodiment, in step S11, the device 1 performs word segmentation processing on the document based on the text content in the document, and obtains a plurality of entries corresponding to the document.

In the present application, the device 1 includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers. Here, the cloud is composed of a large number of computers or network servers based on cloud computing, which is a kind of distributed computing: a virtual supercomputer consisting of a collection of loosely coupled computers. The specific form of the device 1 is not limited in this application.

Specifically, the device 1 obtains all the text content of the document and performs word segmentation processing on the document based on that text content. The segmentation may use any existing word segmentation method; word segmentation methods that may appear in the future, where applicable to the present application, also fall within the scope of protection of the present application and are incorporated herein by reference.

FIG. 2 shows a flowchart of a method for word segmentation processing according to a preferred embodiment of the present application. In the first step, variables are initialized: S1 is the word string to be segmented, S2 is the segmented word string to be output, and the maximum word length MaxLen is set to limit the length of each segmented word.
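Only the initialization step of FIG. 2 is described in the text, so the remaining steps are not known here. The variables S1, S2, and MaxLen are consistent with dictionary-based forward maximum matching, and the sketch below assumes that technique purely for illustration; the function name `forward_max_match`, the `dictionary` argument, and the single-character fallback are assumptions rather than the procedure of FIG. 2.

```python
def forward_max_match(s1, dictionary, max_len):
    """Segment s1 greedily from left to right (assumed technique).

    s1         -- the word string to be segmented
    dictionary -- a set of known entries (hypothetical input)
    max_len    -- MaxLen, the longest candidate word to try
    Returns s2, the list of segmented words.
    """
    s2 = []
    i = 0
    while i < len(s1):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(s1) - i), 0, -1):
            candidate = s1[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                s2.append(candidate)
                i += size
                break
    return s2


# Hypothetical dictionary for illustration.
words = forward_max_match("文档关键句识别", {"文档", "关键句", "识别"}, 3)
```

In this sketch MaxLen bounds the inner loop, so a larger value allows longer entries to be matched at the cost of more dictionary lookups per position.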

Preferably, wherein the step S11 includes: acquiring a title and a text of the document; respectively carrying out word segmentation processing on the text contents of the title and the text of the document to obtain a plurality of title entries and text entries;

wherein the method further comprises: and adding preset weight to the title entries to calculate entry importance scores of the weighted title entries.

In this embodiment, the title and the text of the document may be segmented separately. Because entries obtained from the title may be more important to the document, a preset weight is applied to the entries obtained by segmenting the title. For example, if the frequency of a title entry after word segmentation is n, then after the preset weight a is applied the frequency becomes the product of a and n, where a is a number greater than 1. The specific value of a is not limited here.
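The title weighting can be sketched as follows. The text states that a title entry with frequency n becomes a × n; how the weighted title frequencies are merged with the body frequencies is not specified, so the sketch assumes they are simply added per entry. The function name and the default a = 2.0 are hypothetical.

```python
from collections import Counter


def weighted_entry_frequencies(title_entries, text_entries, a=2.0):
    """Apply the preset weight a (> 1) to title entry frequencies.

    Returns (weighted_title, combined), where combined adds the weighted
    title frequency to the body frequency of the same entry. The merge rule
    is an assumption made for this sketch.
    """
    title_counts = Counter(title_entries)
    text_counts = Counter(text_entries)
    # n -> a * n for every title entry.
    weighted_title = {w: a * n for w, n in title_counts.items()}
    combined = dict(text_counts)
    for w, n in weighted_title.items():
        combined[w] = combined.get(w, 0) + n
    return weighted_title, combined


weighted_title, combined = weighted_entry_frequencies(
    ["key", "sentence"], ["key", "document", "document"], a=2.0
)
```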

Continuing in this embodiment, in step S12, the device 1 calculates the entry importance score of each entry, and determines M entries with the entry importance scores ranked at the top, where M is a preset value.

Specifically, the device 1 may count the occurrence frequency of each segmented entry, and may represent the entry importance score of the entry by the occurrence frequency, for example, the occurrence frequency of each entry may be directly used as the entry importance score, or a value obtained by normalizing the occurrence frequency may be used as the entry importance score, which is not specifically limited herein. After the entry importance scores of the entries are calculated, M entries with the top rank are selected, wherein M can be preset, and specific numerical values are not limited.
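As a minimal sketch of this frequency-based variant, the occurrence count of each entry can serve directly as its importance score, with the M highest-scoring entries kept. The function name `top_m_entries` is hypothetical; `Counter.most_common` is a standard-library call that returns the most frequent items first.

```python
from collections import Counter


def top_m_entries(entries, m):
    """Return the m most frequent entries as (entry, frequency) pairs."""
    return Counter(entries).most_common(m)


top = top_m_entries(["a", "b", "a", "c", "a", "b"], 2)
```

Replacing the raw count with a normalized value, for example the count divided by the total number of entries, changes the scores but not the ranking.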

Continuing in this embodiment, in step S13, the device 1 performs sentence splitting processing on the document to obtain a sentence set for the document. Specifically, the document may be split into sentences at punctuation marks, for example at commas, periods, or other punctuation, to obtain the sentence set for the document.
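Punctuation-based splitting can be sketched with a regular expression. The default delimiter set below is an assumption chosen for illustration; the text leaves the choice of punctuation open, and commas can be added through the `delimiters` argument.

```python
import re

# Assumed default: Chinese and Western sentence-ending punctuation.
DEFAULT_DELIMITERS = "。！？；.!?;"


def split_sentences(text, delimiters=DEFAULT_DELIMITERS):
    """Split text at any run of delimiter characters and drop empty pieces."""
    pattern = "[" + re.escape(delimiters) + "]+"
    parts = re.split(pattern, text)
    return [p.strip() for p in parts if p.strip()]


sentences = split_sentences("First part. Second part! Third?")
```

Splitting on a bare period also breaks decimal numbers and abbreviations, so a production splitter would need additional rules.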

Continuing in this embodiment, in step S14, the device 1 traverses the sentence set and screens out sentences containing one or more of the M entries. Specifically, after the document has been split into sentences, each sentence is searched for matches to check whether it contains one or more of the M entries; if it does, the sentence is screened out, that is, retained for scoring.
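The screening step amounts to a containment test per sentence. The sketch below assumes plain substring matching, which suits unsegmented Chinese text; for space-delimited languages, matching against the segmented tokens of each sentence would avoid partial-word hits. The function name is hypothetical.

```python
def screen_sentences(sentences, top_entries):
    """Keep the sentences that contain at least one of the top entries."""
    return [s for s in sentences if any(e in s for e in top_entries)]


kept = screen_sentences(
    ["the cat sat", "a bird flew", "dog and cat"], ["cat", "dog"]
)
```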

Continuing in this embodiment, in said step S15, the device 1 calculates the sentence importance scores of the screened sentences based on the term importance scores of said M terms, and determines one or more sentences having the highest sentence importance scores as the document key sentences.

Specifically, the sentence importance score may be based on the sum of the entry importance scores. For example, if a sentence includes P of the M entries, where P is a value smaller than M, the sentence importance score of that sentence may be obtained by adding the entry importance scores of those P entries.

The sentence importance scores of all screened sentences are calculated and ranked, and the one or more sentences with the highest sentence importance scores are taken as the document key sentences.
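The scoring and ranking can be sketched as follows, assuming each contained entry contributes its importance score once regardless of how many times it occurs in the sentence, and assuming substring matching. The function name `rank_sentences` and the parameter `k` (the number of key sentences to return) are hypothetical.

```python
def rank_sentences(sentences, entry_scores, k=1):
    """Score each sentence by the summed scores of the entries it contains.

    entry_scores maps each of the M top entries to its importance score.
    Returns the k highest-scoring (sentence, score) pairs.
    """
    scored = []
    for s in sentences:
        # Each contained entry is counted once.
        score = sum(v for e, v in entry_scores.items() if e in s)
        scored.append((s, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


best = rank_sentences(
    ["the cat sat", "cat and dog", "dog ran"], {"cat": 2.0, "dog": 1.5}, k=1
)
```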

Preferably, the method further comprises: S16 (not shown), performing semantic analysis on the screened sentences, and assigning preset weight values to the screened sentences according to the semantic analysis results;

wherein the step S15 includes: calculating the sentence importance scores of the screened sentences based on the importance scores of the M entries and the preset weight values, and determining one or more sentences with the highest sentence importance scores as document key sentences.

In this embodiment, semantic analysis is also performed on the screened sentences, for example by analyzing whether each sentence contains a subject, a predicate, and an object, and preset weight values are then assigned to the screened sentences according to the semantic analysis results. For example, a sentence containing a subject, a predicate, and an object is given the preset weight value Q, a sentence containing only a subject and a predicate is given the preset weight value Y, and all other cases are given the weight value 0, where Q > Y; the specific values are not limited. This assignment of weights is merely exemplary, and other existing or future assignments, where applicable to the present application, are also intended to be included herein by reference.

In this embodiment, the sentence importance score may be calculated from the preset weight value of each sentence in addition to the importance scores of the M entries. For example, the importance scores of the one or more of the M entries contained in a sentence may be added together and the sum then multiplied by the weight value, where the weight value is greater than one; alternatively, the importance scores of the one or more of the M entries contained in a sentence may be added together and the sum then added to the weight value. These ways of calculating the sentence importance score are only examples; other existing or future calculation methods, where applicable to the present application, also fall within the scope of the present application and are incorporated herein by reference.
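The two combination rules can be sketched side by side. The semantic analysis itself would require a parser, which is outside this sketch; the boolean flags passed to `preset_weight` stand in for its output, and the values Q = 2.0 and Y = 1.0 are hypothetical placeholders that only satisfy Q > Y.

```python
def preset_weight(has_subject, has_predicate, has_object, q=2.0, y=1.0):
    """Map a (hypothetical) semantic analysis result to a preset weight."""
    if has_subject and has_predicate and has_object:
        return q          # subject, predicate, and object present
    if has_subject and has_predicate:
        return y          # only subject and predicate present
    return 0.0            # all other cases


def sentence_score(entry_score_sum, weight, mode="add"):
    """Combine the summed entry scores with the preset weight."""
    if mode == "add":
        return entry_score_sum + weight
    if mode == "multiply":
        return entry_score_sum * weight
    raise ValueError("mode must be 'add' or 'multiply'")


additive = sentence_score(3.5, preset_weight(True, True, True))
multiplied = sentence_score(3.5, preset_weight(True, True, True), mode="multiply")
```

In the multiplicative variant the text requires a weight greater than one; a weight of 0 would cancel the entry scores entirely, so that variant needs a different weight scale from the additive one.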

Preferably, the method further comprises: S17 (not shown), acquiring D public documents as a basic corpus set, wherein D is a preset value; performing word segmentation processing on the documents in the basic corpus set to obtain basic entries;

wherein the step S11 includes: and performing word segmentation processing on the document based on the text content in the document, and acquiring a plurality of entries corresponding to the document based on the basic entries.

In this embodiment, a basic entry library is built by obtaining published documents and segmenting them. For example, 300,000 news and information articles from various news websites are collected as the basic corpus set, and word segmentation is performed on the basic corpus set to obtain the basic entries; that is, an entry dictionary is created to facilitate subsequent word segmentation of the document. The specific value of D is not limited; the larger D is, the more comprehensive the constructed basic entry library. The flow of the word segmentation process is shown in FIG. 2.
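Building the entry dictionary can be sketched as collecting the distinct entries of the segmented corpus. The sketch assumes the corpus documents have already been segmented by some existing segmenter, and it also returns the longest entry length as a candidate for a maximum word length setting; both the function name and that second return value are illustrative assumptions.

```python
def build_base_entries(segmented_corpus):
    """Collect the distinct entries of an already segmented corpus.

    segmented_corpus -- a list of documents, each a list of entries
    Returns (base_entries, longest), where longest is the length of the
    longest entry, or 0 for an empty corpus.
    """
    base_entries = set()
    for doc_entries in segmented_corpus:
        base_entries.update(doc_entries)
    longest = max((len(e) for e in base_entries), default=0)
    return base_entries, longest


base_entries, longest = build_base_entries(
    [["文档", "关键句"], ["识别", "文档"]]
)
```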

In an embodiment of the present application, the following steps are included for identifying the key sentences in the document:

Step one: preparing a basic corpus, for example, collecting D news and information articles from various news websites;

Step two: performing word segmentation on the basic corpus, with the processing flow shown in FIG. 2, where the segmentation result is represented as the set $W = \{\{d_1, (w_1, w_2, w_3, \ldots, w_n)\}, \{d_2, (w_1, w_2, w_3, \ldots, w_n)\}, \ldots, \{d_n, (w_1, w_2, w_3, \ldots, w_n)\}\}$, where $d_i$ represents a document and $w_i$ represents an entry;

Step three: denoting the document requiring key sentence identification as X, first performing word segmentation processing on the title and the text of X according to FIG. 2, and recording the word segmentation results as:

title segmentation result $W_t = \{(w_{t1}, n_1), (w_{t2}, n_2), \ldots, (w_{tn}, n_n)\}$,

text segmentation result $W_c = \{(w_{c1}, n_1), (w_{c2}, n_2), \ldots, (w_{cn}, n_n)\}$,

where $w_i$ is an entry and $n_i$ is the word frequency of that entry;

Step four: applying the weight a to $W_t$, that is, the title segmentation result after the preset weight a is applied is: $W_{ta} = \{(w_{t1}, a \cdot n_1), (w_{t2}, a \cdot n_2), \ldots, (w_{tn}, a \cdot n_n)\}$;

Step five: using the weighted $W_{ta}$ and $W_c$, calculating the entry importance score of each entry in X, where the entry importance score formula is:

$f_i = tf_{i,j} \times idf_i$,

where $tf_{i,j}$ is the term frequency of entry $t_i$ in document $d_j$ and $idf_i$ is the inverse document frequency of entry $t_i$; $n$ represents the number of times an entry appears in a document, $D$ is the number of documents in the basic corpus set, and $|\{j : t_i \in d_j\}|$ represents the number of documents in the basic corpus set that contain the entry.

The M entries with the highest entry importance scores are determined as the set $top_m = \{(w_{t1}, f_1), (w_{t2}, f_2), \ldots, (w_{tm}, f_m)\}$, where $w_i$ is an entry and $f_i$ is the entry importance score of the corresponding entry;

Step six: performing sentence splitting processing on the document X according to punctuation marks to obtain a sentence set S, traversing S, and screening out the sentences that contain any one or more of the entries in $top_m$; the screened sentence set is denoted $S_t$;

Step seven: for each sentence $S_i$ in $S_t$, calculating its sentence importance score $F_i$ in the document, where $F_i$ is the sum of the entry importance scores of the one or more of the M entries contained in the sentence; the result is recorded as $S_{tf} = \{(S_1, F_1), (S_2, F_2), \ldots, (S_n, F_n)\}$;

Step eight: performing semantic analysis on each sentence $S_i$ in $S_t$. If $S_i$ contains a subject, a predicate, and an object, its weight is set to Q; if it contains only a subject and a predicate, its weight is set to Y, where Q > Y; otherwise its weight is set to 0. Denoting the weight corresponding to $S_i$ as $E_i$, the weights of $S_t$ are recorded as $S_{te} = \{(S_1, E_1), (S_2, E_2), \ldots, (S_n, E_n)\}$;

Step nine: calculating the sentence importance scores of the screened sentences based on the importance scores of the M entries and the preset weight values to obtain $S_{tfe} = \{(S_1, E_1 + F_1), (S_2, E_2 + F_2), \ldots, (S_n, E_n + F_n)\}$, and determining the one or more sentences with the highest sentence importance scores as the key sentences of the document.
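Steps three to nine can be sketched end to end as below. This is a simplified illustration under several assumptions, not the implementation of the application: a lowercase word tokenizer stands in for the segmentation of FIG. 2, the TF-IDF form (count divided by total, and the logarithm of D over one plus the document frequency) is assumed, weighted title frequencies are added to body frequencies, and the semantic weights of step eight are supplied by the caller as a hypothetical mapping from sentence index to weight.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Stand-in for the segmentation of FIG. 2: lowercase word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def key_sentences(title, body, corpus, m=3, a=2.0, k=1, weights=None):
    weights = weights or {}                      # sentence index -> E_i
    # Steps three and four: segment title and body, weight title counts by a.
    counts = Counter(tokenize(body))
    for entry, n in Counter(tokenize(title)).items():
        counts[entry] += a * n
    total = sum(counts.values())
    # Step five: assumed TF-IDF score, then keep the top M entries.
    corpus_sets = [set(tokenize(doc)) for doc in corpus]
    d = len(corpus_sets)
    scores = {}
    for entry, n in counts.items():
        df = sum(1 for doc in corpus_sets if entry in doc)
        scores[entry] = (n / total) * math.log(d / (1 + df))
    ranked = sorted(scores.items(), key=lambda p: p[1], reverse=True)
    top_m = dict(ranked[:m])
    # Step six: split at punctuation and screen the sentences.
    sentences = [s.strip() for s in re.split(r"[.!?;]+", body) if s.strip()]
    results = []
    for i, sentence in enumerate(sentences):
        tokens = set(tokenize(sentence))
        hits = [e for e in top_m if e in tokens]
        if not hits:
            continue
        f_i = sum(top_m[e] for e in hits)        # step seven: F_i
        e_i = weights.get(i, 0.0)                # step eight: E_i
        results.append((sentence, f_i + e_i))    # step nine: E_i + F_i
    results.sort(key=lambda p: p[1], reverse=True)
    return results[:k]


# Hypothetical inputs for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "the weather is nice today",
]
body = ("The weather is nice. Solar panels convert sunlight into power. "
        "The cat likes the sun.")
result = key_sentences("Solar panels", body, corpus)
```

Because the title entries are rare in the corpus and carry the weight a, the sentence that contains them ranks first in this example.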

Furthermore, the embodiment of the present application also provides a computer readable medium, on which computer readable instructions are stored, and the computer readable instructions can be executed by a processor to implement the foregoing method.

The embodiment of the present application further provides an apparatus for identifying a document key sentence, where the apparatus includes:

one or more processors; and

a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the foregoing method.

For example, the computer readable instructions, when executed, cause the one or more processors to: performing word segmentation processing on a document based on the text content in the document to obtain a plurality of entries corresponding to the document;

calculating the entry importance score of each entry, and determining M entries with the entry importance scores ranked at the top, wherein M is a preset value; performing sentence splitting processing on the document to obtain a sentence set related to the document; traversing the sentence set, and screening out sentences containing one or more of the M entries; and calculating the sentence importance scores of the screened sentences based on the entry importance scores of the M entries, and determining one or more sentences with the highest sentence importance scores as document key sentences.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. A method for document key sentence identification, wherein the method comprises:

2. The method of claim 1, wherein the performing word segmentation processing on the document based on the text content in the document to obtain a plurality of entries corresponding to the document comprises:

acquiring a title and a text of the document;

wherein the method further comprises:

3. The method according to claim 1 or 2, wherein the method further comprises:

4. The method of any of claims 1-3, wherein the method further comprises:

5. The method of claim 4, wherein the formula for calculating the importance score of each entry is:

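$f_i = tf_{i,j} \times idf_i$,

where $tf_{i,j}$ is the term frequency of entry $t_i$ in document $d_j$ and $idf_i$ is the inverse document frequency of entry $t_i$; $n$ represents the number of times an entry appears in a document, $D$ is the number of documents in the basic corpus set, and $|\{j : t_i \in d_j\}|$ represents the number of documents in the basic corpus set that contain the entry.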

6. The method of claim 5, wherein the formula for calculating the sentence importance scores of the screened sentences based on the term importance scores of the M terms is:

$F_i$ is the sum of the entry importance scores of the one or more of the M entries contained in the sentence.

7. The method of claim 6, wherein based on the importance scores of the M entries and the preset weight values, the formula for calculating the sentence importance scores of the screened sentences is:

8. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 7.

9. An apparatus for document key sentence recognition, wherein the apparatus comprises:

one or more processors; and

a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 7.