WO2015033341A1 - Polytope based summarization method - Google Patents

Polytope based summarization method

Info

Publication number
WO2015033341A1
WO2015033341A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentences
sentence
polytope
objective function
defining
Prior art date
Application number
PCT/IL2014/050791
Other languages
French (fr)
Inventor
Marina Litvak
Natalia VANETIK
Original Assignee
Sami Shamoon College Of Engineering (R.A.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sami Shamoon College Of Engineering (R.A.)
Publication of WO2015033341A1
Priority to IL244470B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users

Definitions

  • The objective function of Equation (12) for the text of Example 1 has the form: min [(sf1 - 0.25)^2 + (sf2 - 0.3125)^2 + (sf3 - 0.1875)^2 + (sf4 - 0.125)^2 + (sf5 - 0.125)^2]
  • Example 9:
  • An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
  • An n-gram could be any combination of text units, letters or terms/words.
  • An n-gram of size 1 is referred to as a "unigram";
  • An n-gram of size 2 is called a "bigram".
  • the method of the invention considers any contiguous pair of words, or pair of words separated by stopwords, to be a bigram.
  • weight wjk denotes the number of times a bigram appears in a sentence.
  • Example 10.1: The text of Example 1 contains the following bigrams: "fat, cat", "cat, eat", "eat, fat", "fat, meat", "eat, fish", "fish, meat". The bigram variables are b12, b23, b31, b15, b34, b45.
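The bigram list of Example 10.1 can be reproduced with a short sketch. The "at most one intervening stopword" threshold, the toy suffix stemmer, and the stop-word list are illustrative assumptions chosen to match the listed bigrams; the source does not spell out these details:

```python
# Illustrative stop-word list covering the running example (not the patent's list).
STOPWORDS = {"a", "the", "is", "that", "my", "but", "he", "all", "and"}

def stem(w):
    # Toy stemmer for the running example: "cats" -> "cat", "eats" -> "eat".
    return w[:-1] if w.endswith("s") else w

def bigrams(sentence, max_gap=1):
    # Pairs of content words that are adjacent or separated by at most
    # max_gap stopwords (an assumed reading of the patent's bigram rule).
    words = sentence.lower().replace(".", "").split()
    found = set()
    for i, w in enumerate(words):
        if w in STOPWORDS:
            continue
        gap = 0
        for v in words[i + 1:]:
            if v in STOPWORDS:
                gap += 1
                if gap > max_gap:
                    break
                continue
            found.add((stem(w), stem(v)))
            break
    return found

text = ["A fat cat is a cat that eats fat meat.",
        "My cat eats fish but he is a fat cat.",
        "All fat cats eat fish and meat."]
all_bigrams = set().union(*(bigrams(s) for s in text))
print(sorted(all_bigrams))
```

Under these assumptions the union over the three sentences yields exactly the six bigrams listed in Example 10.1.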
  • the well-known automatic summarization evaluation package ROUGE was used to evaluate the effectiveness of the present technique against other summarizers.
  • the recall scores of ROUGE-N for N ∈ {1, 2}, ROUGE-L, ROUGE-W-1.2, and ROUGE-SU4, which are based on N-gram, Longest Common Subsequence (LCS), Weighted Longest Common Subsequence (WLCS), and skip-bigram plus unigram (with maximum skip-distance of 4) matching between system summaries and reference summaries, respectively, are reported in Tables 1-4 below.
  • the technique of the present invention, given the best-performing objective functions, received better scores than four other techniques that participated in the DUC 2002 competition in terms of ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-W-1.2, and than 3 systems in terms of ROUGE-SU4.
  • Position-based weights behave as expected on said dataset: closeness to the beginning of a document is the best indication of relevance, while closeness to the end of a document is, conversely, the worst one.
  • the evaluation results on the MultiLing'13 English (the second experiment), Hebrew and Arabic data can be seen in Tables 2, 3 and 4, respectively. As can be seen from the results, only 3 to 4 systems are outperformed by the present technique in English, 6 to 7 systems in Hebrew, and 5 to 8 systems in Arabic.
  • the present technique performs better on Hebrew and Arabic than on English data. Possible reasons for this situation are as follows: Hebrew and Arabic, unlike English, have simple sentence splitting rules, where particular punctuation marks indicate sentence boundaries. Also, normalization of terms (stopwords removal and stemming) was not performed for Hebrew and Arabic. Apparently, the information filtered out during this stage causes accuracy loss in distance measurement.
  • the present invention provides a summarization technique having the following advantages:
  • the method of the invention is an unsupervised method that does not require any annotated data and training sources.
  • the method of the invention considers all possible extracts and constructs a summary in polynomial time.
  • the method of the invention defines a novel text representation model independently from the objective functions that describe the summary quality. As such, one can easily add more functions without modifying the model. This is in contrast to most prior art techniques, which use linear programming for extracting sentences and embed objective functions into the model.


Abstract

The invention relates to a method for finding an optimal set of k sentences that summarize an article of n sentences, which comprises the steps of (a) defining a sentence vector S = (S1, ..., Sn) which relates to all the sentences in the article; (b) defining a terms vector T = (T1, ..., Tm) which relates to all the terms in the article; (c) preparing a sentence-term matrix which defines the number of appearances of each term Tj within each sentence Si; (d) representing each sentence as a hyperplane which divides a space into lower and upper half-spaces, such that each intersection of several hyperplanes represents a summary which comprises said several sentences, respectively; (e) defining a polytope P as a body which is formed from a plurality of hypersurfaces representing sentences; and (f) defining an objective function on said polytope, and finding a point on a surface of said polytope having an optimal value of said objective function; sentences that are closest to that point are added to a summary in a greedy manner, until the maximal predefined summary length is reached.

Description

POLYTOPE BASED SUMMARIZATION METHOD
Field of the Invention
The invention relates to the field of document summarization. More particularly, the invention relates to a document summarization method which is based on linear programming over rationals.
Background of the Invention
Automated text summarization is an active field of research which attracts much attention from both the academic and industrial communities. Extractive summarization is the task of selecting a small subset of sentences from a document, thereby creating a summary which best reflects the content of the original document. In many cases, this task is expanded to the creation of a single summary from a collection of many related documents. The field of summarization is important for IR (Information Retrieval) systems, since it enables efficient access to large repositories of textual data by determining the real essence of each document, indexing the repository, and creating a compact version of each original document. Therefore, summarization saves time for a user who needs to efficiently reach conclusions and decisions relating to the data content. The field distinguishes between an automatically generated extract (the most salient fragments of the input document, e.g., sentences, paragraphs, etc.) and an abstract (a re-formulated synopsis expressing the main idea of the input document). Since the generation of an abstract requires a deep linguistic analysis of the input document (or documents), most prior art summarizing techniques deal with extractive summarization. Moreover, there are known extractive summarization techniques that can be applied to cross-lingual and multilingual domains.
Extractive summarization is considered as an optimization problem in a very natural way - extraction of maximum information in a minimal number of words. Unfortunately, this problem is known to be NP-hard, and there is no known polynomial algorithm which can tell, given a solution, whether it is optimal or not. Over the last decade many researchers have formulated the summarization task as an optimization problem and solved it using approximation techniques such as the standard hill-climbing algorithm, the A* search algorithm, regression models, and evolutionary algorithms.
Some prior art techniques measure information by means of text units such as terms, N-grams, etc., and reduce the summarization task to a maximum coverage problem. The maximum coverage model extracts sentences that cover as many terms or N-grams as possible. Despite a relatively good performance in the summarization field, the maximum coverage problem is a special case of a general optimization task, and it is known to be NP-hard. Some other prior art techniques attempt to find a near-optimal solution by applying a greedy approach. Linear programming helps in finding a more accurate solution to the maximum coverage problem, and it has become very popular in the summarization field. The prior art techniques typically use Integer Linear Programming. However, these prior art techniques suffer from several drawbacks, such as high computational complexity and a high number of constraints.
It is therefore an object of the present invention to provide a novel summarization method which requires a minimal level of linguistic analysis, and which is therefore easily adaptable to multiple languages.
It is another object of the present invention to provide a summarization method which has a polynomial running time and a small number of constraints.
It is another object of the invention to provide a summarization method which can be easily adapted to different user requirements relating to the summary quality. It is still another object of the present invention to provide a summarization method which can be applied to a single document, as well as to a collection of documents.
Other objects and advantages of the present invention will become apparent as the description proceeds.
Summary of the Invention
The invention relates to a method for finding an optimal set of k sentences that summarize an article of n sentences, which comprises the steps of (a) defining a sentence vector S = (S1, ..., Sn) which relates to all the sentences in the article; (b) defining a terms vector T = (T1, ..., Tm) which relates to all the terms in the article; (c) preparing a sentence-term matrix which defines the number of appearances of each term Tj within each sentence Si; (d) representing each sentence as a hyperplane which divides a space into lower and upper half-spaces, such that each intersection of several hyperplanes represents a summary which comprises said several sentences, respectively; (e) defining a polytope P as a body which is formed from a plurality of hypersurfaces representing sentences; and (f) defining an objective function on said polytope, and finding a point on a surface of said polytope having an optimal value of said objective function; sentences that are closest to that point are added to a summary in a greedy manner, until the maximal predefined summary length is reached.
Preferably, said objective function can be defined as a distance function.
Preferably, said maximal predefined summary length is expressed by an additional hyperplane and its lower half space.
Preferably, the method further comprises definition of a minimal summary length, which is expressed by an additional hyperplane and its upper half space.
Preferably, the method further comprises definition of constraints to ensure that said polytope is bounded.
Preferably, the method further comprises a preprocessing stage, which in turn comprises a step of sentence splitting and a step of tokenization.
Preferably, said preprocessing stage further comprises one or more of: (a) stemming and (b) synonym resolution.
Preferably, the objective function is a weighted sum, selected from several measures of terms importance.
Brief Description of the Drawings
In the Drawings :
Fig. 1 illustrates a two-dimensional projection of hyperplanes H1, H2, H3, where each hyperplane Hi represents a sentence within the document, and their intersections.
Detailed Description of Preferred Embodiments
In general, the invention relates to a summarization method for finding an optimal set of k sentences that summarize an article of n sentences, which comprises:
a. defining a sentence vector S = (S1, ..., Sn) which relates to all the sentences in the article;
b. defining a terms vector T = (T1, ..., Tm) which relates to all the terms in the article;
c. preparing a sentence-term matrix which defines the number of appearances of each term Tj within each sentence Si;
d. representing each sentence as a hyperplane which divides a space into lower and upper half-spaces, such that each intersection of several hyperplanes represents a summary which comprises said several sentences, respectively;
e. defining a polytope P as a body which is formed from a plurality of hypersurfaces representing sentences; and
f. defining an objective function on said polytope, and finding a point on a surface of said polytope having an optimal value of said objective function; sentences that are closest to that point are added to a summary in a greedy manner, until the maximal predefined summary length is reached.
The description below elaborates on said method steps.
The method of the invention involves the provision of a novel model for text representation, making it possible to represent an exponential number of summaries without computing them explicitly, and determining an optimal one by optimizing an objective function in polynomial time.
Definitions
Problem Setting: The invention assumes a given set of sentences S1, ..., Sn derived from a document or a cluster of related documents. Meaningful words in these sentences are entirely described by terms T1, ..., Tm. The quality of the summary must meet the requirements as predefined by an objective function. A specific object of the invention is to find a subset of sentences such that:
1. The summary (i.e., the subset of sentences) size is bounded to N words.
2. The quality of the summary is optimized in terms of said objective function.
Text Preprocessing - In order to construct said sentence-term matrix, it is necessary to perform a basic text preprocessing, as follows:
a. Sentence splitting - i.e., determination of where each sentence begins and where it ends;
b. Tokenization - determination where each word begins and where it ends; and optionally
c. Stop-words removal - ignoring meaningless words, such as "a", "the", "of", etc.;
d. Stemming - a process for reducing inflected words to their base or root form.
e. Synonym resolution - treating synonyms as same words.
The text preprocessing step reduces the matrix dimensionality, and results in a more compact and efficient model.
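Steps a-c above can be sketched in plain Python. The regex-based splitting, the lowercasing, and the small stop-word list are illustrative assumptions rather than the patent's exact procedure; the optional stemming and synonym resolution steps are omitted:

```python
import re

# Illustrative stop-word list (a hypothetical subset, not the patent's list).
STOPWORDS = {"a", "the", "of", "is", "that", "my", "but", "he", "all", "and"}

def preprocess(text):
    # a. Sentence splitting: naive split on sentence-final punctuation.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    # b. Tokenization and c. stop-word removal (stemming and synonym
    #    resolution are omitted from this sketch).
    tokenized = []
    for s in sentences:
        tokens = [w.lower() for w in re.findall(r"[A-Za-z]+", s)]
        tokenized.append([w for w in tokens if w not in STOPWORDS])
    return tokenized

text = ("A fat cat is a cat that eats fat meat. "
        "My cat eats fish but he is a fat cat. "
        "All fat cats eat fish and meat.")
for sent in preprocess(text):
    print(sent)
```

The remaining content words per sentence are exactly what the sentence-term matrix of the next section counts.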
The Matrix Model - The matrix model of the invention represents text data; this model is in fact a sentence-term matrix A. The matrix A is a real matrix, A = (aij), of size m×n, where aij = k if term Ti appears in sentence Sj precisely k times.
The columns of A describe sentences, and the rows of A describe terms. The total number of words in the document (i.e., those remaining after the preprocessing step), denoted by S, can be computed from the matrix A as follows:
S = Σ_{i=1}^{m} Σ_{j=1}^{n} aij    (1)
Example 1: Given the following text of n = 3 sentences S1, S2, S3 and m = 5 terms T1 = fat, T2 = cat, T3 = eat, T4 = fish, T5 = meat:
S1 = A fat cat is a cat that eats fat meat.
S2 = My cat eats fish but he is a fat cat.
S3 = All fat cats eat fish and meat.
The matrix A corresponding to the text above has the following structure:
        S1  S2  S3
fat      2   1   1
cat      2   2   1
eat      1   1   1
fish     0   1   1
meat     1   0   1
The total count of words in this matrix is:
S = Σ_{i=1}^{5} Σ_{j=1}^{3} aij = 16
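The matrix of Example 1 and the word count of equation (1) can be reproduced with a few lines of numpy (a sketch; the row order simply follows T1 to T5):

```python
import numpy as np

# Sentence-term matrix A of Example 1; rows are terms T1..T5
# (fat, cat, eat, fish, meat), columns are sentences S1..S3.
A = np.array([[2, 1, 1],   # fat
              [2, 2, 1],   # cat
              [1, 1, 1],   # eat
              [0, 1, 1],   # fish
              [1, 0, 1]])  # meat

S = A.sum()                # equation (1): total word count after preprocessing
print(S)                   # 16

# Column sums give the right-hand sides of the sentence hyperplanes H1..H3.
print(A.sum(axis=0))       # [6 5 5]
```

The column sums 6, 5, 5 reappear below as the right-hand sides of the hyperplane equations in Example 3.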
Hyperplanes and Half-spaces - As noted above, the present invention uses a representation model in which each sentence of the document is represented by a single hyperplane. All the possible subsets of sentences derived from the document, and hence all the possible extractive summaries of the document, are represented by hyperplane intersections. The surface of the resulting polytope represents all the extracts that can be generated from the given document.
The method of the present invention views each column of the sentence-term matrix A as a linear constraint representing a hyperplane in a multi-dimensional real space. An occurrence of term Ti in sentence Sj is represented by variable xij.
Example 2: This example demonstrates the variables corresponding to the 5x3 matrix A of Example 1:
        S1   S2   S3
fat    x11  x12  x13
cat    x21  x22  x23
eat    x31  x32  x33
fish   x41  x42  x43
meat   x51  x52  x53
Each sentence in the document is a hyperplane expressed by the columns:
A[][j] = [a1j, ..., amj]
and the variables:
xj = [x1j, ..., xmj] for all 1 ≤ j ≤ n.
Note that A[][j] denotes the j-th column of A.
The method of the invention further defines a system of linear inequalities:
A[][j] · xj^T ≤ Σ_{i=1}^{m} aij, for all 1 ≤ j ≤ n    (2)
Every inequality in (2) defines a lower half-space of a hyperplane Hj. This hyperplane describes sentence Sj, and it is defined as follows:
Σ_{i=1}^{m} aij xij = Σ_{i=1}^{m} aij
To express the fact that every term is either present or absent from the chosen summary, the following limitations are added:
0 ≤ xij ≤ 1    (3)
The intersection of two hyperplanes, say Hi and Hj, represents a summary which is composed of sentences Si and Sj. Similarly, a subset of r sentences is represented by the intersection of r hyperplanes.
Example 3: The sentence-term matrix A of Example 1 defines the following hyperplane equations:
H1: 2x11 + 2x21 + x31 + x51 = 2+2+1+1 = 6
H2: x12 + 2x22 + x32 + x42 = 1+2+1+1 = 5
H3: x13 + x23 + x33 + x43 + x53 = 1+1+1+1+1 = 5
Here, a summary consisting of the first and the second sentence is expressed by the intersection of hyperplanes Hi and H2.
Fig. 1 illustrates a two-dimensional projection of hyperplanes Hi, H2, H3 and their intersections.
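The intersection property of Example 3 can be checked numerically: setting every variable of sentences S1 and S2 to 1 (and the rest to 0) yields a point lying on both H1 and H2 but not on H3. A sketch (the variable layout mirrors Example 2 and is an illustrative choice):

```python
import numpy as np

# Sentence-term matrix A of Example 1 (rows: fat, cat, eat, fish, meat).
A = np.array([[2, 1, 1], [2, 2, 1], [1, 1, 1], [0, 1, 1], [1, 0, 1]])
m, n = A.shape

# x[i, j] = 1 means the occurrence of term T_i in sentence S_j is in the summary.
x = np.zeros((m, n))
x[:, 0] = x[:, 1] = 1.0   # take sentences S1 and S2 in full

# A point lies on hyperplane H_j exactly when sum_i a_ij x_ij equals sum_i a_ij.
on_h = [bool((A[:, j] * x[:, j]).sum() == A[:, j].sum()) for j in range(n)]
print(on_h)               # on H1 and H2, but not on H3
```

This is precisely the summary "S1 and S2" expressed as the intersection of hyperplanes H1 and H2.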
Summary constraints - The length constraint on the number of words in the summary can be easily expressed as a constraint on the sum of these variables.
In order to limit the number of words and terms within the summary, the invention defines summarization constraints in a form of linear inequalities. The limit on the minimal and maximal number of terms in the summary is expressed as follows:
Tmin ≤ Σ_{i=1}^{m} Σ_{j=1}^{n} xij ≤ Tmax    (4)
Tmin stands for the minimal number of terms in the resulting summary, and Tmax stands for the maximal number of terms in the resulting summary.
Example 4: For Example 1, given Tmin = 6 and Tmax = 10, Equation (4) has the following form: 6 ≤ x11+x21+x31+x51 + x12+x22+x32+x42 + x13+x23+x33+x43+x53 ≤ 10
The limit on the minimal and maximal number of words in the summary is expressed as follows:
Wmin ≤ Σ_{i=1}^{m} Σ_{j=1}^{n} aij xij ≤ Wmax    (5)
Wmin stands for the minimal number of words in the resulting summary, and Wmax stands for the maximal number of words in the resulting summary.
The difference between the number of terms and the number of words in a summary is that a single term can appear more than once in a sentence.
Example 5: Equation (5) for the sentence-term matrix of Example 1, with Wmin = 6 and Wmax = 11, has the form:
6 ≤ 2x11+2x21+x31+x51 + x12+2x22+x32+x42 + x13+x23+x33+x43+x53 ≤ 11
The Polytope Model - So far, the method defined linear inequalities that describe all the sentences of the document, length constraints on the summary, and bounding constraints on the variables that are used in said inequalities. The method continues by combining all the above inequalities into one system, which in turn describes the polytope. This is done in the following manner:
Σ_{i=1}^{m} aij xij ≤ Σ_{i=1}^{m} aij, for all 1 ≤ j ≤ n
Tmin ≤ Σ_{i=1}^{m} Σ_{j=1}^{n} xij ≤ Tmax
Wmin ≤ Σ_{i=1}^{m} Σ_{j=1}^{n} aij xij ≤ Wmax
0 ≤ xij ≤ 1    (6)
The first n inequalities describe sentences S1, ..., Sn, while the next two inequalities describe constraints on the total number of terms and words in the summary, and the last inequality determines upper and lower boundaries for all the sentence-term variables.
Since every inequality in (6) is linear, the entire system in (6) describes a closed convex polyhedron (polytope) P. The faces of P are defined by the intersections of its bounding hyperplanes.
Defining an objective function - As said above, the surface of the polyhedron P is a representation of all the possible sentence subsets (the number of vertices of P can reach O(2^n)). According to the method of the invention, there is no necessity to scan the entire set of facets of P. More specifically, the method finds a point on P which optimizes (minimizes or maximizes) a selected objective function.
The method of the invention may comprise several objective functions, while one objective function may be selected according to the user needs. The present application discusses the considerations for defining the objective function. The application also provides examples for various objective functions.
Example 6:
This example discusses an objective function that maximizes the Weighted Term Sum, as follows:
max Σ_{i=1}^{m} wi ti    (7)
In equation (7), each ti is a variable which represents all appearances of term Ti in sentences S1, ..., Sn, as follows:
ti = Σ_{j=1}^{n} xij, 1 ≤ i ≤ m    (8)
where the variables xij are taken from equation (6).
In equation (7), each wi denotes the weight of term Ti. The present method suggests six exemplary weight types, as follows:
1. POS_EQ: unweighted sum, where all the terms Ti are equally important: wi = 1 for all i;
2. POS_F: closeness to the beginning of the document - the term Ti is more important if it first appears closer to the beginning of the respective document: wi = 1/app(i), where app(i) is the index of the sentence in the document where the term Ti first appears;
3. POS_L: closeness to the end of the document - the term Ti is more important if it first appears closer to the end of the document: wi = app(i);
4. POS_B: closeness to the borders of the document - the term Ti is more important if it first appears closer to either the beginning or the end of the document;
5. TF: normalized term frequency: wi = tf(i), where tf(i) is the normalized term frequency of term Ti, computed as the number of times that the term appears in the document divided by the total number of appearances of all the terms in the document;
6. TF-ISF: term frequency multiplied by inverse sentence frequency. In this case, the method gives weight to term Ti in inverse proportion to the number of sentences in which it appears:

wi = tf(i) · isf(i)

where tf(i) is the normalized term frequency of Ti, and isf(i) is the inverse sentence frequency of Ti, defined as

isf(i) = 1 − log n(i) / log n

where n(i) is the number of sentences that contain the term Ti.
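The weight types above can be sketched in code. The snippet below is an illustrative sketch only — the sample sentences, the term list, and the app() helper are assumptions, not the patented Java implementation, and POS_B is omitted because its exact formula appears only as an image in the original. TF-ISF follows the isf(i) = 1 − log n(i)/log n definition above.

```python
import math

# Hypothetical tokenized document: each sentence is a list of terms.
sentences = [
    ["fat", "cat", "eat", "fat", "meat"],
    ["cat", "eat", "fish"],
    ["fish", "eat", "meat"],
]
terms = sorted({t for s in sentences for t in s})
n = len(sentences)

def app(term):
    """1-based index of the first sentence in which the term appears."""
    return next(i for i, s in enumerate(sentences, start=1) if term in s)

total = sum(len(s) for s in sentences)                        # all term appearances
tf = {t: sum(s.count(t) for s in sentences) / total for t in terms}
nf = {t: sum(1 for s in sentences if t in s) for t in terms}  # sentence frequency

weights = {
    "POS_EQ": {t: 1.0 for t in terms},                 # all terms equally important
    "POS_F":  {t: 1.0 / app(t) for t in terms},        # near the beginning
    "POS_L":  {t: float(app(t)) for t in terms},       # near the end
    "TF":     tf,                                      # normalized term frequency
    "TF_ISF": {t: tf[t] * (1 - math.log(nf[t]) / math.log(n)) for t in terms},
}
```

Note that a term occurring in every sentence gets a TF-ISF weight of zero, which matches the intent of weighting terms in inverse proportion to their sentence frequency.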
Example 6.1: For the text of Example 1, the term variables are defined as follows:

t1 = x11 + x12 + x13
t2 = x21 + x22 + x23
t3 = x31 + x32 + x33
t4 = x41 + x42 + x43
t5 = x51 + x52 + x53

In this case, the "weighted term sum" objective function has the following form:

max w1·t1 + w2·t2 + w3·t3 + w4·t4 + w5·t5

With the weight type POS_EQ, this objective function takes the following form:

max t1 + t2 + t3 + t4 + t5

Example 7:
This example discusses an objective function that minimizes the Euclidean distance between the term vector t = (t1, ..., tm) and a point p = (p1, ..., pm), as follows:

min Σi=1..m (ti − pi)²    (9)

The method uses the term variables t1, ..., tm defined in equation (8).
One example for the point p is the point which contains all the terms precisely once, thus minimizing term repetition while increasing term coverage in the summary:

p = (1, ..., 1)    (10)
As an alternative to the Euclidean distance, the method may use the Manhattan distance instead, as follows:

min Σi=1..m |ti − pi|    (11)
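Once each |ti − pi| is bounded by an auxiliary variable, the Manhattan objective (11) becomes a plain linear program. The sketch below, using scipy.optimize.linprog, replaces the sentence polytope of equation (6) with simple box constraints for brevity, and the term counts are illustrative — it demonstrates only the absolute-value reformulation, not the patented model.

```python
import numpy as np
from scipy.optimize import linprog

# Equation (11) with p = (1, ..., 1): minimize sum |ti - 1| over 0 <= ti <= ci,
# where ci is the number of appearances of term Ti (illustrative values).
c = np.array([4.0, 5.0, 3.0, 2.0, 2.0])
m = len(c)

# Decision vector [t_1..t_m, e_1..e_m]; minimize sum of e, where e_i >= |t_i - 1|.
objective = np.concatenate([np.zeros(m), np.ones(m)])
# e_i >= t_i - 1   ->   t_i - e_i <= 1
# e_i >= 1 - t_i   ->  -t_i - e_i <= -1
A_ub = np.block([[np.eye(m), -np.eye(m)], [-np.eye(m), -np.eye(m)]])
b_ub = np.concatenate([np.ones(m), -np.ones(m)])
bounds = [(0, ci) for ci in c] + [(0, None)] * m

res = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
t = res.x[:m]  # optimum covers every term exactly once: t = (1, ..., 1)
```

In the full model, the same auxiliary-variable trick applies unchanged; only the constraint set grows to the polytope of equation (6).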
Example 7.1: For the text of Example 1, the Euclidean distance objective function has the form:

min Σi=1..5 (ti − 1)²
Example 8:
This example discusses an objective function that minimizes the Euclidean distance between the term frequency vector sf = (sf1, ..., sfm) of the summary and the document term frequency vector df = (df1, ..., dfm), as follows:

min Σi=1..m (sfi − dfi)²    (12)
where dfi is the normalized term frequency of term Ti in the document, computed as the number of appearances of Ti divided by the total number of appearances of all the terms in the document.
The term frequency vector of the summary is defined as follows:

sfi = ti / Σj=1..m tj

where the term variables ti are defined by equation (8).
As an alternative to the Euclidean distance, the method may use the Manhattan distance instead, as follows:

min Σi=1..m |sfi − dfi|    (13)
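As a hedged sketch, the document term frequency vector df defined above can be computed by simple counting. The term multiset below is illustrative, not Example 1's actual text; its counts are chosen so that the result reproduces the df vector shown in Example 8.1 below (16 term appearances in total).

```python
from collections import Counter

# Illustrative multiset of the document's term appearances (assumed data).
doc_terms = ["fat"] * 4 + ["cat"] * 5 + ["eat"] * 3 + ["fish"] * 2 + ["meat"] * 2
counts = Counter(doc_terms)
total = sum(counts.values())  # total appearances of all terms in the document

# dfi = appearances of Ti divided by the total appearances of all terms.
df = {t: counts[t] / total for t in ["fat", "cat", "eat", "fish", "meat"]}
# df == {'fat': 0.25, 'cat': 0.3125, 'eat': 0.1875, 'fish': 0.125, 'meat': 0.125}
```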
Example 8.1: For the text of Example 1, the term frequency vector of the document is as follows:

df = (0.25, 0.3125, 0.1875, 0.125, 0.125)

The objective function of equation (12) has the form:

min (sf1 − 0.25)² + (sf2 − 0.3125)² + (sf3 − 0.1875)² + (sf4 − 0.125)² + (sf5 − 0.125)²

Example 9:
This example discusses an objective function that minimizes the sentence overlap in the summary, as follows:

min Σ1≤j<k≤n ovljk    (14)

where the variables ovljk express the similarity between sentence Sj and sentence Sk. The overlap variable ovljk for the pair of sentences Sj, Sk is defined as the sentence intersection divided by the sentence union, according to the Jaccard similarity coefficient formula, as follows:

ovljk = |Sj ∩ Sk| / |Sj ∪ Sk|    (15)
For every term Ti that appears in both sentences, the weight w(aij, aik) of the sentence overlap is defined as:

w(aij, aik) = 1 if Ti is present in both Sj and Sk, and 0 otherwise.
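For intuition, the Jaccard coefficient behind equation (15) can be evaluated on two fixed, already-selected sentences; in the model itself the overlaps are expressed through LP variables rather than computed directly. The sample sentences are illustrative.

```python
def jaccard_overlap(sent_j, sent_k):
    """Jaccard similarity of two sentences' term sets: |intersection| / |union|."""
    a, b = set(sent_j), set(sent_k)
    return len(a & b) / len(a | b) if (a | b) else 0.0

s_j = ["fat", "cat", "eat", "meat"]
s_k = ["cat", "eat", "fish"]
# shared terms {cat, eat}, union of 5 terms -> overlap 2/5
print(jaccard_overlap(s_j, s_k))  # 0.4
```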
Example 10:
This example discusses an objective function that maximizes the sum of bigrams in the summary. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of text units, letters, or terms/words. An n-gram of size 1 is referred to as a "unigram"; an n-gram of size 2 is called a "bigram". The method of the invention considers any contiguous pair of words, or words separated only by stopwords, to be a bigram.
Said objective function is defined as follows:

max Σi,j bij    (16)

where, for all i, j: 0 ≤ bij ≤ 1. A bigram variable bij is defined for every bigram consisting of terms Ti and Tj that appears in the document.
The system of equations (6) is extended by adding expressions that describe each sentence as a weighted sum of the bigrams contained within it:

si = Σj,k wjk·bjk    (17)

where the sum runs over the bigrams bjk contained in sentence Si, and the weight wjk denotes the number of times the bigram appears in the sentence.
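The bigram rule stated above — any contiguous pair of words, or words separated only by stopwords — can be sketched by dropping stopwords before pairing adjacent words. The stopword list below is an illustrative assumption.

```python
STOPWORDS = {"a", "is", "that", "the", "my"}  # illustrative stopword list

def bigrams(tokens):
    """Pairs of adjacent content words; stopwords are skipped, not pair-breaking."""
    content = [w for w in tokens if w not in STOPWORDS]
    return list(zip(content, content[1:]))

tokens = "a fat cat is a cat that eats fat meat".split()
print(bigrams(tokens))
# [('fat', 'cat'), ('cat', 'cat'), ('cat', 'eats'), ('eats', 'fat'), ('fat', 'meat')]
```

Note how "cat is a cat" yields the bigram ('cat', 'cat'): the intervening stopwords do not break the pair.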
Example 10.1: The text of Example 1 contains the following bigrams: "fat, cat", "cat, eat", "eat, fat", "fat, meat", "eat, fish", "fish, meat". The corresponding bigram variables are b12, b23, b31, b15, b34, b45.
Equalities (17) have the form:

s1 = b12 + b23 + b31 + b15
s2 = [equation image: s2 as the sum of its bigram variables]
s3 = b12 + b23 + b34 + b45

The "bigram sum" objective function is defined as:

max b12 + b23 + b31 + b15 + b34 + b45
Extracting The Summary: The above discussion has described the following steps:
(a) The preprocessing step in which sentences and terms were recognized;
(b) Construction of sentence-term matrix;
(c) Construction of polytope model for the text according to equation (6);
(d) Defining of an objective function according to the user requirements;
(e) Having performed steps (a)-(d), the optimal value of the defined objective function on the polytope is obtained. This value is attained at a point x = (xij) on the polytope. The method uses this point in order to construct the summary, as follows:

1. Determining a normalized distance dk from x to the hyperplane of each sentence Sk:
dk = Σi=1..m (aik − xik) / Σi=1..m aik

The point x lies in the lower half-spaces of the sentence hyperplanes.
2. Sorting the distances in increasing order and obtaining a sorted list:

dj1 ≤ dj2 ≤ ... ≤ djn
3. Adding sentences Sj1, ..., Sjk, where k ≤ n, to the summary until the maximal summary length is reached.
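Steps 1-3 above can be sketched as follows. Since the exact normalization of the distance is given only as an image in the original, a simple normalized L1 gap between the optimum point and each sentence's row of the sentence-term matrix is assumed here, and the sample data is illustrative.

```python
def extract_summary(A, x, sentences, max_words):
    """Greedy extraction: rank sentences by distance to the optimum point x."""
    def distance(i):
        # Normalized L1 gap between row i of the sentence-term matrix and x
        gap = sum(abs(a - v) for a, v in zip(A[i], x[i]))
        return gap / max(sum(A[i]), 1)

    order = sorted(range(len(sentences)), key=distance)  # step 2: sort distances
    summary, used = [], 0
    for i in order:                                      # step 3: greedy fill
        words = len(sentences[i].split())
        if used + words > max_words:
            break
        summary.append(sentences[i])
        used += words
    return summary

summary = extract_summary(A=[[1, 1], [1, 0]], x=[[1, 1], [0, 0]],
                          sentences=["fat cat eats", "cat sleeps"], max_words=3)
# The optimum coincides with sentence 1's row, so only it fits the 3-word cap.
```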
Experiments
Experiment Setup: The method of the present invention was compared to several summarizing techniques that participated in the generic multi-document summarization tasks of the DUC 2002 (DUC, 2002) and MultiLing 2013 (MultiLing, 2013) competitions. The method was implemented in Java using the lpsolve software. The experiment used the following objective functions:
1. The maximal weighted term sum function, defined in equation (7) and denoted by OBJ1^weight_type, where weight_type is one of POS_EQ, POS_L, POS_F, POS_B, TF, TF-ISF.
2. The minimal distance function, denoted by OBJ2 and defined by equation (11).
3. The minimal distance to term frequency function, denoted by OBJ3 and defined by equation (13).
4. The minimal sentence overlap function, denoted by OBJ4 and defined by equation (14).
5. The maximal bigram sum function, denoted by OBJ5 and defined by equation (16).
Two experiments were performed on publicly available datasets: (a) the first experiment was conducted on datasets from the Document Understanding Conference (DUC) run by NIST in 2002. The DUC 2002 dataset contains 59 document collections, each having about 10 documents and two manually created abstracts of lengths 50, 100, and 200 words. The experiment generated summaries of 200 words, and the quality of those summaries was compared to Gold Standard summaries of the same size.
(b) the second experiment was conducted on the MultiLing 2013 dataset. Said MultiLing dataset is composed of 15 document sets, each of 10 related documents, with three manually created abstracts of 250 words in length. The same length constraint was applied to the generated summaries. The dataset contains parallel corpora in several languages.
The well-known automatic summarization evaluation package ROUGE was used to evaluate the effectiveness of the present technique against other summarizers. Tables 1-4 below report the recall scores of ROUGE-N for N ∈ {1, 2}, ROUGE-L, ROUGE-W-1.2, and ROUGE-SU4, which match system summaries against reference summaries based on N-grams, Longest Common Subsequence (LCS), Weighted Longest Common Subsequence (WLCS), and skip-bigram plus unigram (with a maximum skip-distance of 4), respectively.
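For context, the recall flavor of ROUGE-N can be sketched minimally for N = 1 (unigram overlap). The real ROUGE package additionally handles stemming, stopword options, and aggregation over multiple reference summaries, so this sketch is not a substitute for it.

```python
from collections import Counter

def rouge_1_recall(system, reference):
    """Fraction of reference unigrams (with multiplicity) matched by the system."""
    sys_counts = Counter(system.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

print(rouge_1_recall("the fat cat eats meat", "the fat cat eats fish"))  # 0.8
```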
Experimental Results
As can be seen from Table 1, the technique of the present invention, with its best-performing objective functions, received better scores than four other techniques that participated in said DUC 2002 competition in terms of ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-W-1.2, and than three systems in terms of ROUGE-SU4. Position-based weights behave as expected on this dataset: closeness to the beginning of a document is the best indication of relevance, while closeness to the end of a document is, conversely, the worst. The evaluation results on the MultiLing 2013 English (the second experiment), Hebrew, and Arabic data can be seen in Tables 2, 3 and 4, respectively. As can be seen from the results, only 3 to 4 systems are outperformed by the present technique in English, 6 to 7 systems in Hebrew, and 5 to 8 systems in Arabic.
Table 1: Evaluation results. Dataset of DUC 2002. English. [table image]
Table 2: Evaluation results. Dataset of MultiLing 2013. English. [table image]
Table 3: Evaluation results. Dataset of MultiLing 2013. Hebrew. [table image]
Table 4: Evaluation results. Dataset of MultiLing 2013. Arabic. [table image]
The experiments, conducted on multiple datasets and languages, show at least partial superiority of the present technique over other techniques. The following conclusions are drawn from the obtained results.
1. The present technique performs better on Hebrew and Arabic than on English data. Possible reasons are as follows: Hebrew and Arabic, unlike English, have simple sentence splitting rules, where particular punctuation marks indicate sentence boundaries. Also, normalization of terms (stopword removal and stemming) was not performed for Hebrew and Arabic; apparently, the information filtered out during this stage causes accuracy loss in distance measurement.
2. The lack of deep morphological analysis and NLP techniques may also affect the quality of the generated summaries, while permitting multilingual text processing.
3. Striving to preserve the term collection of the summarized documents as much as possible does not perform well when the results are compared to Gold Standard abstracts. Despite better coverage of document terms, the resulting extracts may contain different vocabulary, which affects the ROUGE scores.
4. Adding entire sentences to summaries decreases precision and recall metrics due to the "garbage" information they carry. Therefore, sentence compression is required in order to see the "actual" results of the optimization procedure.
As shown, the present invention provides a summarization technique having the following advantages:
1. The method of the invention is an unsupervised method that does not require any annotated data and training sources.
2. The method of the invention considers all possible extracts and constructs a summary in polynomial time.
3. The method of the invention defines a novel text representation model independently of the objective functions that describe the summary quality. As such, one can easily add more functions without modifying the model. This is in contrast to most prior art techniques that use linear programming for extracting sentences and embed the objective functions into the model.
4. Since the technique of the present invention does not require any morphological analysis, this technique is language-independent.
While some embodiments of the invention have been described by way of illustration, it will be apparent that the invention can be carried into practice with many modifications, variations and adaptations, and with the use of numerous equivalents or alternative solutions that are within the scope of persons skilled in the art, without departing from the spirit of the invention or exceeding the scope of the claims.

Claims

1. Method for finding an optimal set of k sentences that summarize an article of n sentences, which comprises:
a. defining a sentence vector S = (S1, ..., Sn) which relates to all the sentences in the article;
b. defining a terms vector T = (T1, ..., Tm) which relates to all the terms in the article;
c. preparing a sentence-term matrix which defines the number of appearances of each term Tj within each sentence Si;
d. representing each sentence as a hyperplane which divides a space into lower and upper half-spaces, wherein each intersection of several hyperplanes represents a summary which comprises said several sentences, respectively;
e. defining a polytope P as a body which is formed from a plurality of hypersurfaces representing sentences; and
f. defining an objective function on said polytope, and finding a point on a surface of said polytope having an optimal value of said objective function,
g. adding sentences that are closest to said point to a summary in a greedy manner, until the maximal predefined summary length is reached.
2. Method according to claim 1, wherein said objective function can be defined as a distance function.
3. Method according to claim 1, wherein said maximal predefined summary length is expressed by an additional hyperplane and its lower half space.
4. Method according to claim 1, further comprising definition of a minimal summary length, which is expressed by an additional hyperplane and its upper half space.
5. Method according to claim 1, further comprising definition of constraints to ensure that said polytope is bounded.
6. Method according to claim 1, further comprising a preprocessing stage, which in turn comprises a step of sentence splitting and a step of tokenization.
7. Method according to claim 6, wherein said preprocessing stage further comprises one or more of: (a) stemming and (b) synonym resolution.
8. Method according to claim 1, wherein the objective function is a weighted sum, selected from several measures of terms importance.
9. Method according to claim 1, wherein the objective function is a weighted sum of bigrams.
10. Method according to claim 1, further comprising definition of a distance function between a sentence and a point on a surface of said polytope having an optimal value of said objective function.