WO2015033341A1 - Polytope based summarization method - Google Patents

Polytope based summarization method

Info

Publication number
WO2015033341A1
WO2015033341A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentences
sentence
polytope
objective function
defining
Prior art date
Application number
PCT/IL2014/050791
Other languages
French (fr)
Inventor
Marina Litvak
Natalia VANETIK
Original Assignee
Sami Shamoon College Of Engineering (R.A.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sami Shamoon College Of Engineering (R.A.)
Publication of WO2015033341A1
Priority to IL244470B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users

Definitions

  • The objective function of Equation (12) for the text of Example 1 has the form: min [(sf1 - 0.25)^2 + (sf2 - 0.3125)^2 + (sf3 - 0.1875)^2 + (sf4 - 0.125)^2 + (sf5 - 0.125)^2]
  • Example 9:
  • An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
  • An n-gram could be any combination of text units, letters or terms/words.
  • An n-gram of size 1 is referred to as a "unigram";
  • An n-gram of size 2 is called a "bigram".
  • the method of the invention considers any contiguous pair of words, or pair of words separated by stopwords, to be a bigram.
  • weight wjk denotes the number of times a bigram appears in a sentence.
  • Example 10.1: The text of Example 1 contains the following bigrams: "fat, cat", "cat, eat", "eat, fat", "fat, meat", "eat, fish", "fish, meat". The bigram variables are b12, b23, b31, b15, b34, b45.
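The bigram list of Example 10.1 can be reproduced with a short sketch. The "at most one intervening stopword" threshold, the toy suffix stemmer, and the stop-word list are illustrative assumptions chosen to match the listed bigrams; the source does not spell out these details:

```python
# Illustrative stop-word list covering the running example (not the patent's list).
STOPWORDS = {"a", "the", "is", "that", "my", "but", "he", "all", "and"}

def stem(w):
    # Toy stemmer for the running example: "cats" -> "cat", "eats" -> "eat".
    return w[:-1] if w.endswith("s") else w

def bigrams(sentence, max_gap=1):
    # Pairs of content words that are adjacent or separated by at most
    # max_gap stopwords (an assumed reading of the patent's bigram rule).
    words = sentence.lower().replace(".", "").split()
    found = set()
    for i, w in enumerate(words):
        if w in STOPWORDS:
            continue
        gap = 0
        for v in words[i + 1:]:
            if v in STOPWORDS:
                gap += 1
                if gap > max_gap:
                    break
                continue
            found.add((stem(w), stem(v)))
            break
    return found

text = ["A fat cat is a cat that eats fat meat.",
        "My cat eats fish but he is a fat cat.",
        "All fat cats eat fish and meat."]
all_bigrams = set().union(*(bigrams(s) for s in text))
print(sorted(all_bigrams))
```

Under these assumptions the union over the three sentences yields exactly the six bigrams listed in Example 10.1.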
  • the well-known automatic summarization evaluation package ROUGE was used to evaluate the effectiveness of the present technique against other summarizers.
  • the recall scores of ROUGE-N for N ∈ {1, 2}, ROUGE-L, ROUGE-W-1.2, and ROUGE-SU4, which are based on N-gram, Longest Common Subsequence (LCS), Weighted Longest Common Subsequence (WLCS), and skip-bigram plus unigram (with maximum skip-distance of 4) matching between system summaries and reference summaries, respectively, are reported in Tables 1-4 below.
  • the technique of the present invention, given the best-performing objective functions, received better scores than four other techniques that participated in the DUC 2002 competition in terms of ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-W-1.2, and than 3 systems in terms of ROUGE-SU4.
  • Position-based weights behave as expected on said dataset: closeness to the beginning of a document is the best indication of relevance, while closeness to the end of a document is, conversely, the worst one.
  • the evaluation results on the MultiLing'13 English (the second experiment), Hebrew and Arabic data can be seen in Tables 2, 3 and 4, respectively. As can be seen from the results, only 3 to 4 systems are outperformed by the present technique in English, 6 to 7 systems in Hebrew, and 5 to 8 systems in Arabic.
  • the present technique performs better on Hebrew and Arabic than on English data. Possible reasons for this situation are as follows: Hebrew and Arabic, unlike English, have simple sentence splitting rules, where particular punctuation marks indicate sentence boundaries. Also, normalization of terms (stopwords removal and stemming) was not performed for Hebrew and Arabic. Apparently, the information filtered out during this stage causes accuracy loss in distance measurement.
  • the present invention provides a summarization technique having the following advantages:
  • the method of the invention is an unsupervised method that does not require any annotated data and training sources.
  • the method of the invention considers all possible extracts and constructs a summary in polynomial time.
  • the method of the invention defines a novel text representation model independently from the objective functions that describe the summary quality. As such, one can easily add more functions without modifying the model. This is in contrast to most prior art techniques, which use linear programming for extracting sentences and embed objective functions into the model.


Abstract

The invention relates to a method for finding an optimal set of k sentences that summarize an article of n sentences, which comprises the steps of (a) defining a sentence vector S = (S1, ..., Sn) which relates to all the sentences in the article; (b) defining a terms vector T = (T1, ..., Tm) which relates to all the terms in the article; (c) preparing a sentence-term matrix which defines the number of appearances of each term Tj within each sentence Si; (d) representing each sentence as a hyperplane which divides a space into lower and upper half-spaces, such that each intersection of several hyperplanes represents a summary which comprises said several sentences, respectively; (e) defining a polytope P as a body which is formed from a plurality of hypersurfaces representing sentences; and (f) defining an objective function on said polytope, and finding a point on a surface of said polytope having an optimal value of said objective function; sentences that are closest to that point are added to a summary in a greedy manner, until the maximal predefined summary length is reached.

Description

POLYTOPE BASED SUMMARIZATION METHOD
Field of the Invention
The invention relates to the field of document summarization. More particularly, the invention relates to a document summarization method which is based on linear programming over rationals.
Background of the Invention
Automated text summarization is an active field of research which attracts much attention from both the academic and industrial communities. Extractive summarization is the task of selecting a small subset of sentences from a document, thereby creating a summary which best reflects the content of the original document. In many cases, this task is expanded to the creation of a single summary from a collection of many related documents. The field of summarization is important for IR (Information Retrieval) systems, since it enables efficient access to large repositories of textual data by determining the real essence of each document, indexing the repository, and creating a compact version of each original document. Therefore, summarization saves time for a user who needs to efficiently reach conclusions and decisions relating to the data content. The field distinguishes between an automatically generated extract (the most salient fragments of the input document, e.g., sentences, paragraphs, etc.) and an abstract (a re-formulated synopsis expressing the main idea of the input document). Since the generation of an abstract requires a deep linguistic analysis of the input document (or documents), most prior art summarizing techniques deal with extractive summarization. Moreover, there are known extractive summarization techniques that can be applied to cross-lingual and multilingual domains.
Extractive summarization is considered as an optimization problem in a very natural way - extraction of maximum information in a minimal number of words. Unfortunately, this problem is known to be NP-hard, and there is no known polynomial algorithm which can tell, given a solution, whether it is optimal or not. Over the last decade many researchers have formulated the summarization task as an optimization problem and solved it using approximation techniques such as the standard hill-climbing algorithm, the A* search algorithm, regression models, and evolutionary algorithms.
Some prior art techniques measure information by means of text units such as terms, N-grams, etc., and reduce the summarization task to a maximum coverage problem. The maximum coverage model extracts sentences that cover as many terms or N-grams as possible. Despite a relatively good performance in the summarization field, the maximum coverage problem is a special case of a general optimization task, and it is known to be NP-hard. Some other prior art techniques attempt to find a near-optimal solution by applying a greedy approach. Linear programming helps in finding a more accurate solution to the maximum coverage problem, and it has become very popular in the summarization field. The prior art techniques typically use Integer Linear Programming. However, these prior art techniques suffer from several drawbacks, such as high computational complexity and a high number of constraints.
It is therefore an object of the present invention to provide a novel summarization method which requires a minimal level of linguistic analysis, and which is therefore easily adaptable to multiple languages.
It is another object of the present invention to provide a summarization method which has a polynomial running time and a small number of constraints.
It is another object of the invention to provide a summarization method which can be easily adapted to different user requirements relating to the summary quality. It is still another object of the present invention to provide a summarization method which can be applied to a single document, as well as to a collection of documents.
Other objects and advantages of the present invention will become apparent as the description proceeds.
Summary of the Invention
The invention relates to a method for finding an optimal set of k sentences that summarize an article of n sentences, which comprises the steps of (a) defining a sentence vector S = (S1, ..., Sn) which relates to all the sentences in the article; (b) defining a terms vector T = (T1, ..., Tm) which relates to all the terms in the article; (c) preparing a sentence-term matrix which defines the number of appearances of each term Tj within each sentence Si; (d) representing each sentence as a hyperplane which divides a space into lower and upper half-spaces, such that each intersection of several hyperplanes represents a summary which comprises said several sentences, respectively; (e) defining a polytope P as a body which is formed from a plurality of hypersurfaces representing sentences; and (f) defining an objective function on said polytope, and finding a point on a surface of said polytope having an optimal value of said objective function; sentences that are closest to that point are added to a summary in a greedy manner, until the maximal predefined summary length is reached.
Preferably, said objective function can be defined as a distance function.
Preferably, said maximal predefined summary length is expressed by an additional hyperplane and its lower half space.
Preferably, the method further comprises definition of a minimal summary length, which is expressed by an additional hyperplane and its upper half space.
Preferably, the method further comprises definition of constraints to ensure that said polytope is bounded.
Preferably, the method further comprises a preprocessing stage, which in turn comprises a step of sentence splitting and a step of tokenization.
Preferably, said preprocessing stage further comprises one or more of: (a) stemming and (b) synonym resolution.
Preferably, the objective function is a weighted sum, selected from several measures of terms importance.
Brief Description of the Drawings
In the Drawings :
Fig. 1 illustrates a two-dimensional projection of hyperplanes H1, H2, H3, where each hyperplane Hi represents a sentence within the document, and their intersections.
Detailed Description of Preferred Embodiments
In general, the invention relates to a summarization method for finding an optimal set of k sentences that summarize an article of n sentences, which comprises:
a. defining a sentence vector S = (S1, ..., Sn) which relates to all the sentences in the article;
b. defining a terms vector T = (T1, ..., Tm) which relates to all the terms in the article;
c. preparing a sentence-term matrix which defines the number of appearances of each term Tj within each sentence Si;
d. representing each sentence as a hyperplane which divides a space into lower and upper half-spaces, such that each intersection of several hyperplanes represents a summary which comprises said several sentences, respectively;
e. defining a polytope P as a body which is formed from a plurality of hypersurfaces representing sentences; and
f. defining an objective function on said polytope, and finding a point on a surface of said polytope having an optimal value of said objective function; sentences that are closest to that point are added to a summary in a greedy manner, until the maximal predefined summary length is reached.
The description below elaborates on said method steps.
The method of the invention involves the provision of a novel model for text representation, making it possible to represent an exponential number of summaries without computing them explicitly, and determining an optimal one by optimizing an objective function in polynomial time.
Definitions
Problem Setting: The invention assumes a given set of sentences S1, ..., Sn derived from a document or a cluster of related documents. Meaningful words in these sentences are entirely described by terms T1, ..., Tm. The quality of the summary must meet the requirements as predefined by an objective function. A specific object of the invention is to find a subset of sentences such that:
1. The summary (i.e., the subset of sentences) size is bounded to N words.
2. The quality of the summary is optimized in terms of said objective function.
Text Preprocessing - In order to construct said sentence-term matrix, it is necessary to perform a basic text preprocessing, as follows:
a. Sentence splitting - i.e., determination of where each sentence begins and where it ends;
b. Tokenization - determination where each word begins and where it ends; and optionally
c. Stop-words removal - ignoring meaningless words, such as "a", "the", "of", etc.;
d. Stemming - a process for reducing inflected words to their base or root form.
e. Synonym resolution - treating synonyms as same words.
The text preprocessing step reduces the matrix dimensionality, and results in a more compact and efficient model.
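Steps a-c above can be sketched in plain Python. The regex-based splitting, the lowercasing, and the small stop-word list are illustrative assumptions rather than the patent's exact procedure; the optional stemming and synonym resolution steps are omitted:

```python
import re

# Illustrative stop-word list (a hypothetical subset, not the patent's list).
STOPWORDS = {"a", "the", "of", "is", "that", "my", "but", "he", "all", "and"}

def preprocess(text):
    # a. Sentence splitting: naive split on sentence-final punctuation.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    # b. Tokenization and c. stop-word removal (stemming and synonym
    #    resolution are omitted from this sketch).
    tokenized = []
    for s in sentences:
        tokens = [w.lower() for w in re.findall(r"[A-Za-z]+", s)]
        tokenized.append([w for w in tokens if w not in STOPWORDS])
    return tokenized

text = ("A fat cat is a cat that eats fat meat. "
        "My cat eats fish but he is a fat cat. "
        "All fat cats eat fish and meat.")
for sent in preprocess(text):
    print(sent)
```

The remaining content words per sentence are exactly what the sentence-term matrix of the next section counts.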
The Matrix Model - The matrix model of the invention represents text data; this model is in fact a sentence-term matrix A. The matrix A is a real matrix, A = (aij), of size m×n, where aij = k if term Ti appears in sentence Sj precisely k times.
The columns of A describe sentences, and the rows of A describe terms. The total number of words in the document (i.e., those remaining after the preprocessing step), denoted by S, can be computed from the matrix A as follows:
S = Σ_{i=1}^{m} Σ_{j=1}^{n} aij    (1)
Example 1: Given the following text of n = 3 sentences S1, S2, S3 and m = 5 terms T1 = fat, T2 = cat, T3 = eat, T4 = fish, T5 = meat:
S1 = A fat cat is a cat that eats fat meat.
S2 = My cat eats fish but he is a fat cat.
S3 = All fat cats eat fish and meat.
The matrix A corresponding to the text above has the following structure:
        S1  S2  S3
fat      2   1   1
cat      2   2   1
eat      1   1   1
fish     0   1   1
meat     1   0   1
The total count of words in this matrix is:
S = Σ_{i=1}^{5} Σ_{j=1}^{3} aij = 16
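The matrix of Example 1 and the word count of equation (1) can be reproduced with a few lines of numpy (a sketch; the row order simply follows T1 to T5):

```python
import numpy as np

# Sentence-term matrix A of Example 1; rows are terms T1..T5
# (fat, cat, eat, fish, meat), columns are sentences S1..S3.
A = np.array([[2, 1, 1],   # fat
              [2, 2, 1],   # cat
              [1, 1, 1],   # eat
              [0, 1, 1],   # fish
              [1, 0, 1]])  # meat

S = A.sum()                # equation (1): total word count after preprocessing
print(S)                   # 16

# Column sums give the right-hand sides of the sentence hyperplanes H1..H3.
print(A.sum(axis=0))       # [6 5 5]
```

The column sums 6, 5, 5 reappear below as the right-hand sides of the hyperplane equations in Example 3.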
Hyperplanes and Half-spaces - As noted above, the present invention uses a representation model in which each sentence of the document is represented by a single hyperplane. All the possible subsets of sentences derived from the document, and hence all the possible extractive summaries of the document, are represented by hyperplane intersections. The surface of the resulting polytope represents all the extracts that can be generated from the given document.
The method of the present invention views each column of the sentence-term matrix A as a linear constraint representing a hyperplane in a multi-dimensional real space. An occurrence of term Ti in sentence Sj is represented by variable xij.
Example 2: This example demonstrates the variables corresponding to the 5x3 matrix A of Example 1:
        S1   S2   S3
fat    x11  x12  x13
cat    x21  x22  x23
eat    x31  x32  x33
fish   x41  x42  x43
meat   x51  x52  x53
Each sentence in the document is a hyperplane expressed by the columns:
A[][j] = [a1j, ..., amj]
and the variables:
xj = [x1j, ..., xmj] for all 1 ≤ j ≤ n.
Note that A[][j] denotes the j-th column of A.
The method of the invention further defines a system of linear inequalities:
A[][j] · xj^T ≤ Σ_{i=1}^{m} aij, for all 1 ≤ j ≤ n    (2)
Every inequality in (2) defines a lower half-space of a hyperplane Hj. This hyperplane describes sentence Sj, and it is defined as follows:
Σ_{i=1}^{m} aij xij = Σ_{i=1}^{m} aij
To express the fact that every term is either present or absent from the chosen summary, the following limitations are added:
0 ≤ xij ≤ 1    (3)
The intersection of two hyperplanes, say Hi and Hj, represents a summary which is composed of sentences Si and Sj. Similarly, a subset of r sentences is represented by the intersection of r hyperplanes.
Example 3: The sentence-term matrix A of Example 1 defines the following hyperplane equations:
H1: 2x11 + 2x21 + x31 + x51 = 2+2+1+1 = 6
H2: x12 + 2x22 + x32 + x42 = 1+2+1+1 = 5
H3: x13 + x23 + x33 + x43 + x53 = 1+1+1+1+1 = 5
Here, a summary consisting of the first and the second sentence is expressed by the intersection of hyperplanes Hi and H2.
Fig. 1 illustrates a two-dimensional projection of hyperplanes Hi, H2, H3 and their intersections.
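The intersection property of Example 3 can be checked numerically: setting every variable of sentences S1 and S2 to 1 (and the rest to 0) yields a point lying on both H1 and H2 but not on H3. A sketch (the variable layout mirrors Example 2 and is an illustrative choice):

```python
import numpy as np

# Sentence-term matrix A of Example 1 (rows: fat, cat, eat, fish, meat).
A = np.array([[2, 1, 1], [2, 2, 1], [1, 1, 1], [0, 1, 1], [1, 0, 1]])
m, n = A.shape

# x[i, j] = 1 means the occurrence of term T_i in sentence S_j is in the summary.
x = np.zeros((m, n))
x[:, 0] = x[:, 1] = 1.0   # take sentences S1 and S2 in full

# A point lies on hyperplane H_j exactly when sum_i a_ij x_ij equals sum_i a_ij.
on_h = [bool((A[:, j] * x[:, j]).sum() == A[:, j].sum()) for j in range(n)]
print(on_h)               # on H1 and H2, but not on H3
```

This is precisely the summary "S1 and S2" expressed as the intersection of hyperplanes H1 and H2.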
Summary constraints - The length constraint on the number of words in the summary can be easily expressed as a constraint on the sum of these variables.
In order to limit the number of words and terms within the summary, the invention defines summarization constraints in a form of linear inequalities. The limit on the minimal and maximal number of terms in the summary is expressed as follows:
Tmin ≤ Σ_{i=1}^{m} Σ_{j=1}^{n} xij ≤ Tmax    (4)
Tmin stands for the minimal number of terms in the resulting summary, and Tmax stands for the maximal number of terms in the resulting summary.
Example 4: For Example 1, given Tmin = 6 and Tmax = 10, Equation (4) has the following form: 6 ≤ x11+x21+x31+x51 + x12+x22+x32+x42 + x13+x23+x33+x43+x53 ≤ 10
The limit on the minimal and maximal number of words in the summary is expressed as follows:
Wmin ≤ Σ_{i=1}^{m} Σ_{j=1}^{n} aij xij ≤ Wmax    (5)
Wmin stands for the minimal number of words in the resulting summary, and Wmax stands for the maximal number of words in the resulting summary.
The difference between the number of terms and the number of words in a summary is that a single term can appear more than once in a sentence.
Example 5: Equation (5) for the sentence-term matrix of Example 1, with Wmin = 6 and Wmax = 11, has the form:
6 ≤ 2x11+2x21+x31+x51 + x12+2x22+x32+x42 + x13+x23+x33+x43+x53 ≤ 11
The Polytope Model - So far, the method defined linear inequalities that describe all the sentences of the document, length constraints on the summary, and bounding constraints on the variables that are used in said inequalities. The method continues by combining all the above inequalities into one system, which in turn describes the polytope. This is done in the following manner:
Σ_{i=1}^{m} aij xij ≤ Σ_{i=1}^{m} aij, for all 1 ≤ j ≤ n
Tmin ≤ Σ_{i=1}^{m} Σ_{j=1}^{n} xij ≤ Tmax
Wmin ≤ Σ_{i=1}^{m} Σ_{j=1}^{n} aij xij ≤ Wmax
0 ≤ xij ≤ 1    (6)
The first n inequalities describe sentences S1, ..., Sn, while the next two inequalities describe constraints on the total number of terms and words in the summary, and the last inequality determines upper and lower boundaries for all the sentence-term variables.
Since every inequality in (6) is linear, the entire system in (6) describes a closed convex polyhedron (polytope) P. The faces of P are defined by the intersections of its bounding hyperplanes.
Defining an objective function - As said above, the surface of the polyhedron P is a representation of all the possible sentence subsets (the number of vertices of P can reach O(2^n)). According to the method of the invention, there is no necessity to scan the entire set of facets of P. More specifically, the method finds a point on P which optimizes (minimizes or maximizes) a selected objective function.
The method of the invention may comprise several objective functions, while one objective function may be selected according to the user needs. The present application discusses the considerations for defining the objective function. The application also provides examples for various objective functions.
Example 6:
This example discusses an objective function that maximizes the Weighted Term Sum, as follows:
max Σ_{i=1}^{m} wi ti    (7)
In equation (7), each ti is a variable which represents all appearances of term Ti in sentences S1, ..., Sn, as follows:
ti = Σ_{j=1}^{n} xij, 1 ≤ i ≤ m    (8)
where the variables xij are taken from equation (6).
In equation (7), each wi denotes the weight of term Ti. The present method suggests six exemplary weight types, as follows:
1. POS_EQ: unweighted sum, where all the terms Ti are equally important: wi = 1 for all i;
2. POS_F: closeness to the beginning of the document - the term Ti is more important if it first appears closer to the beginning of the respective document: wi = 1/app(i), where app(i) is the index of the sentence in the document where the term Ti first appears;
3. POS_L: closeness to the end of the document - the term Ti is more important if it first appears closer to the end of the document: wi = app(i);
4. POS_B: closeness to the borders of the document - the term Ti is more important if it first appears closer to either the beginning or the end of the document;
5. TF: normalized term frequency: wi = tf(i), where tf(i) is the normalized term frequency of term Ti, computed as the number of times that the term appears in the document divided by the total number of appearances of all the terms in the document;
6. TF-ISF: term frequency multiplied by inverse sentence frequency. In this case, the method gives weight to term Ti in inverse proportion to the number of sentences in which it appears:

wi = tf(i) · isf(i)

where tf(i) is the normalized term frequency of Ti, and isf(i) is the inverse sentence frequency of Ti, defined as

isf(i) = 1 − log n(i) / log n

where n(i) is the number of sentences that contain the term Ti.
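The weight types above can be sketched in code. The snippet below is an illustrative sketch only — the sample sentences, the term list, and the app() helper are assumptions, not the patented Java implementation, and POS_B is omitted because its exact formula appears only as an image in the original. TF-ISF follows the isf(i) = 1 − log n(i)/log n definition above.

```python
import math

# Hypothetical tokenized document: each sentence is a list of terms.
sentences = [
    ["fat", "cat", "eat", "fat", "meat"],
    ["cat", "eat", "fish"],
    ["fish", "eat", "meat"],
]
terms = sorted({t for s in sentences for t in s})
n = len(sentences)

def app(term):
    """1-based index of the first sentence in which the term appears."""
    return next(i for i, s in enumerate(sentences, start=1) if term in s)

total = sum(len(s) for s in sentences)                        # all term appearances
tf = {t: sum(s.count(t) for s in sentences) / total for t in terms}
nf = {t: sum(1 for s in sentences if t in s) for t in terms}  # sentence frequency

weights = {
    "POS_EQ": {t: 1.0 for t in terms},                 # all terms equally important
    "POS_F":  {t: 1.0 / app(t) for t in terms},        # near the beginning
    "POS_L":  {t: float(app(t)) for t in terms},       # near the end
    "TF":     tf,                                      # normalized term frequency
    "TF_ISF": {t: tf[t] * (1 - math.log(nf[t]) / math.log(n)) for t in terms},
}
```

Note that a term occurring in every sentence gets a TF-ISF weight of zero, which matches the intent of weighting terms in inverse proportion to their sentence frequency.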
Example 6.1: For the text of Example 1, the term variables are defined as follows:

t1 = x11 + x12 + x13
t2 = x21 + x22 + x23
t3 = x31 + x32 + x33
t4 = x41 + x42 + x43
t5 = x51 + x52 + x53

In this case, the "weighted term sum" objective function has the following form:

max w1·t1 + w2·t2 + w3·t3 + w4·t4 + w5·t5

With the weight type POS_EQ, this objective function takes the following form:

max t1 + t2 + t3 + t4 + t5

Example 7:
This example discusses an objective function that minimizes the Euclidean distance between the term vector t = (t1, ..., tm) and a point p = (p1, ..., pm), as follows:

min Σi=1..m (ti − pi)²    (9)

The method uses the term variables t1, ..., tm defined in equation (8).
One example for the point p is the point which contains all the terms precisely once, thus minimizing term repetition while increasing term coverage in the summary:

p = (1, ..., 1)    (10)
As an alternative to the Euclidean distance, the method may use the Manhattan distance instead, as follows:

min Σi=1..m |ti − pi|    (11)
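Once each |ti − pi| is bounded by an auxiliary variable, the Manhattan objective (11) becomes a plain linear program. The sketch below, using scipy.optimize.linprog, replaces the sentence polytope of equation (6) with simple box constraints for brevity, and the term counts are illustrative — it demonstrates only the absolute-value reformulation, not the patented model.

```python
import numpy as np
from scipy.optimize import linprog

# Equation (11) with p = (1, ..., 1): minimize sum |ti - 1| over 0 <= ti <= ci,
# where ci is the number of appearances of term Ti (illustrative values).
c = np.array([4.0, 5.0, 3.0, 2.0, 2.0])
m = len(c)

# Decision vector [t_1..t_m, e_1..e_m]; minimize sum of e, where e_i >= |t_i - 1|.
objective = np.concatenate([np.zeros(m), np.ones(m)])
# e_i >= t_i - 1   ->   t_i - e_i <= 1
# e_i >= 1 - t_i   ->  -t_i - e_i <= -1
A_ub = np.block([[np.eye(m), -np.eye(m)], [-np.eye(m), -np.eye(m)]])
b_ub = np.concatenate([np.ones(m), -np.ones(m)])
bounds = [(0, ci) for ci in c] + [(0, None)] * m

res = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
t = res.x[:m]  # optimum covers every term exactly once: t = (1, ..., 1)
```

In the full model, the same auxiliary-variable trick applies unchanged; only the constraint set grows to the polytope of equation (6).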
Example 7.1: For the text of Example 1, the Euclidean distance objective function has the form:

min Σi=1..5 (ti − 1)²
Example 8:
This example discusses an objective function that minimizes the Euclidean distance between the term frequency vector sf = (sf1, ..., sfm) of the summary and the document term frequency vector df = (df1, ..., dfm), as follows:

min Σi=1..m (sfi − dfi)²    (12)
where dfi is the normalized term frequency of term Ti in the document, computed as the number of appearances of Ti divided by the total number of appearances of all the terms in the document.
The term frequency vector of the summary is defined as follows:

sfi = ti / Σj=1..m tj

where the term variables ti are defined by equation (8).
As an alternative to the Euclidean distance, the method may use the Manhattan distance instead, as follows:

min Σi=1..m |sfi − dfi|    (13)
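As a hedged sketch, the document term frequency vector df defined above can be computed by simple counting. The term multiset below is illustrative, not Example 1's actual text; its counts are chosen so that the result reproduces the df vector shown in Example 8.1 below (16 term appearances in total).

```python
from collections import Counter

# Illustrative multiset of the document's term appearances (assumed data).
doc_terms = ["fat"] * 4 + ["cat"] * 5 + ["eat"] * 3 + ["fish"] * 2 + ["meat"] * 2
counts = Counter(doc_terms)
total = sum(counts.values())  # total appearances of all terms in the document

# dfi = appearances of Ti divided by the total appearances of all terms.
df = {t: counts[t] / total for t in ["fat", "cat", "eat", "fish", "meat"]}
# df == {'fat': 0.25, 'cat': 0.3125, 'eat': 0.1875, 'fish': 0.125, 'meat': 0.125}
```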
Example 8.1: For the text of Example 1, the term frequency vector of the document is as follows:

df = (0.25, 0.3125, 0.1875, 0.125, 0.125)

The objective function of equation (12) has the form:

min (sf1 − 0.25)² + (sf2 − 0.3125)² + (sf3 − 0.1875)² + (sf4 − 0.125)² + (sf5 − 0.125)²

Example 9:
This example discusses an objective function that minimizes the sentence overlap in the summary, as follows:

min Σ1≤j<k≤n ovljk    (14)

where the variables ovljk express the similarity between sentence Sj and sentence Sk. The overlap variable ovljk for the pair of sentences Sj, Sk is defined as the sentence intersection divided by the sentence union, according to the Jaccard similarity coefficient formula, as follows:

ovljk = |Sj ∩ Sk| / |Sj ∪ Sk|    (15)
For every term Ti that appears in both sentences, the weight w(aij, aik) of the sentence overlap is defined as:

w(aij, aik) = 1 if Ti is present in both Sj and Sk, and 0 otherwise.
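For intuition, the Jaccard coefficient behind equation (15) can be evaluated on two fixed, already-selected sentences; in the model itself the overlaps are expressed through LP variables rather than computed directly. The sample sentences are illustrative.

```python
def jaccard_overlap(sent_j, sent_k):
    """Jaccard similarity of two sentences' term sets: |intersection| / |union|."""
    a, b = set(sent_j), set(sent_k)
    return len(a & b) / len(a | b) if (a | b) else 0.0

s_j = ["fat", "cat", "eat", "meat"]
s_k = ["cat", "eat", "fish"]
# shared terms {cat, eat}, union of 5 terms -> overlap 2/5
print(jaccard_overlap(s_j, s_k))  # 0.4
```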
Example 10:
This example discusses an objective function that maximizes the sum of bigrams in the summary. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of text units, letters, or terms/words. An n-gram of size 1 is referred to as a "unigram"; an n-gram of size 2 is called a "bigram". The method of the invention considers any contiguous pair of words, or words separated only by stopwords, to be a bigram.
Said objective function is defined as follows:

max Σi,j bij    (16)

where, for all i, j: 0 ≤ bij ≤ 1. A bigram variable bij is defined for every bigram consisting of terms Ti and Tj that appears in the document.
The system of equations (6) is extended by adding expressions that describe each sentence as a weighted sum of the bigrams contained within it:

si = Σj,k wjk·bjk    (17)

where the sum runs over the bigrams bjk contained in sentence Si, and the weight wjk denotes the number of times the bigram appears in the sentence.
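The bigram rule stated above — any contiguous pair of words, or words separated only by stopwords — can be sketched by dropping stopwords before pairing adjacent words. The stopword list below is an illustrative assumption.

```python
STOPWORDS = {"a", "is", "that", "the", "my"}  # illustrative stopword list

def bigrams(tokens):
    """Pairs of adjacent content words; stopwords are skipped, not pair-breaking."""
    content = [w for w in tokens if w not in STOPWORDS]
    return list(zip(content, content[1:]))

tokens = "a fat cat is a cat that eats fat meat".split()
print(bigrams(tokens))
# [('fat', 'cat'), ('cat', 'cat'), ('cat', 'eats'), ('eats', 'fat'), ('fat', 'meat')]
```

Note how "cat is a cat" yields the bigram ('cat', 'cat'): the intervening stopwords do not break the pair.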
Example 10.1: The text of Example 1 contains the following bigrams: "fat, cat", "cat, eat", "eat, fat", "fat, meat", "eat, fish", "fish, meat". The corresponding bigram variables are b12, b23, b31, b15, b34, b45.
Equalities (17) have the form:

s1 = b12 + b23 + b31 + b15
s2 = [equation image: s2 as the sum of its bigram variables]
s3 = b12 + b23 + b34 + b45

The "bigram sum" objective function is defined as:

max b12 + b23 + b31 + b15 + b34 + b45
Extracting The Summary: The above discussion has described the following steps:
(a) The preprocessing step in which sentences and terms were recognized;
(b) Construction of sentence-term matrix;
(c) Construction of polytope model for the text according to equation (6);
(d) Defining of an objective function according to the user requirements;
(e) Having performed steps (a)-(d), the optimal value of the defined objective function on the polytope is obtained. This value is attained at a point x = (xij) on the polytope. The method uses this point in order to construct the summary, as follows:

1. Determining a normalized distance dk from x to the hyperplane of each sentence Sk:
dk = Σi=1..m (aik − xik) / Σi=1..m aik

The point x lies in the lower half-spaces of the sentence hyperplanes.
2. Sorting the distances in increasing order and obtaining a sorted list:

dj1 ≤ dj2 ≤ ... ≤ djn
3. Adding sentences Sj1, ..., Sjk, where k ≤ n, to the summary until the maximal summary length is reached.
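Steps 1-3 above can be sketched as follows. Since the exact normalization of the distance is given only as an image in the original, a simple normalized L1 gap between the optimum point and each sentence's row of the sentence-term matrix is assumed here, and the sample data is illustrative.

```python
def extract_summary(A, x, sentences, max_words):
    """Greedy extraction: rank sentences by distance to the optimum point x."""
    def distance(i):
        # Normalized L1 gap between row i of the sentence-term matrix and x
        gap = sum(abs(a - v) for a, v in zip(A[i], x[i]))
        return gap / max(sum(A[i]), 1)

    order = sorted(range(len(sentences)), key=distance)  # step 2: sort distances
    summary, used = [], 0
    for i in order:                                      # step 3: greedy fill
        words = len(sentences[i].split())
        if used + words > max_words:
            break
        summary.append(sentences[i])
        used += words
    return summary

summary = extract_summary(A=[[1, 1], [1, 0]], x=[[1, 1], [0, 0]],
                          sentences=["fat cat eats", "cat sleeps"], max_words=3)
# The optimum coincides with sentence 1's row, so only it fits the 3-word cap.
```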
Experiments
Experiment Setup: The method of the present invention was compared to several summarizing techniques that participated in the generic multi-document summarization tasks of the DUC 2002 (DUC, 2002) and MultiLing 2013 (MultiLing, 2013) competitions. The method was implemented in Java using the lpsolve software. The experiment used the following objective functions:
1. The maximal weighted term sum function, defined in equation (7) and denoted by OBJ1^weight_type, where weight_type is one of POS_EQ, POS_L, POS_F, POS_B, TF, TF-ISF.
2. The minimal distance function, denoted by OBJ2 and defined by equation (11).
3. The minimal distance to term frequency function, denoted by OBJ3 and defined by equation (13).
4. The minimal sentence overlap function, denoted by OBJ4 and defined by equation (14).
5. The maximal bigram sum function, denoted by OBJ5 and defined by equation (16).
Two experiments were performed on publicly available datasets: (a) the first experiment was conducted on datasets from the Document Understanding Conference (DUC) run by NIST in 2002. The DUC 2002 dataset contains 59 document collections, each having about 10 documents and two manually created abstracts of lengths 50, 100, and 200 words. The experiment generated summaries of 200 words, and the quality of those summaries was compared to Gold Standard summaries of the same size.
(b) the second experiment was conducted on the MultiLing 2013 dataset. Said MultiLing dataset is composed of 15 document sets, each of 10 related documents, with three manually created abstracts of 250 words in length. The same length constraint was applied to the generated summaries. The dataset contains parallel corpora in several languages.
The well-known automatic summarization evaluation package ROUGE was used to evaluate the effectiveness of the present technique against other summarizers. Tables 1-4 below report the recall scores of ROUGE-N for N ∈ {1, 2}, ROUGE-L, ROUGE-W-1.2, and ROUGE-SU4, which match system summaries against reference summaries based on N-grams, Longest Common Subsequence (LCS), Weighted Longest Common Subsequence (WLCS), and skip-bigram plus unigram (with a maximum skip-distance of 4), respectively.
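For context, the recall flavor of ROUGE-N can be sketched minimally for N = 1 (unigram overlap). The real ROUGE package additionally handles stemming, stopword options, and aggregation over multiple reference summaries, so this sketch is not a substitute for it.

```python
from collections import Counter

def rouge_1_recall(system, reference):
    """Fraction of reference unigrams (with multiplicity) matched by the system."""
    sys_counts = Counter(system.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

print(rouge_1_recall("the fat cat eats meat", "the fat cat eats fish"))  # 0.8
```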
Experimental Results
As can be seen from Table 1, the technique of the present invention, with its best-performing objective functions, received better scores than four other techniques that participated in said DUC 2002 competition in terms of ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-W-1.2, and than three systems in terms of ROUGE-SU4. Position-based weights behave as expected on this dataset: closeness to the beginning of a document is the best indication of relevance, while closeness to the end of a document is, conversely, the worst. The evaluation results on the MultiLing 2013 English (the second experiment), Hebrew, and Arabic data can be seen in Tables 2, 3 and 4, respectively. As can be seen from the results, only 3 to 4 systems are outperformed by the present technique in English, 6 to 7 systems in Hebrew, and 5 to 8 systems in Arabic.
Table 1: Evaluation results. Dataset of DUC 2002. English. [table image]
Table 2: Evaluation results. Dataset of MultiLing 2013. English. [table image]
Table 3: Evaluation results. Dataset of MultiLing 2013. Hebrew. [table image]
Table 4: Evaluation results. Dataset of MultiLing 2013. Arabic. [table image]
The experiments, conducted on multiple datasets and languages, show at least partial superiority of the present technique over other techniques. The following conclusions are drawn from the obtained results.
1. The present technique performs better on Hebrew and Arabic than on English data. Possible reasons are as follows: Hebrew and Arabic, unlike English, have simple sentence splitting rules, where particular punctuation marks indicate sentence boundaries. Also, normalization of terms (stopword removal and stemming) was not performed for Hebrew and Arabic; apparently, the information filtered out during this stage causes accuracy loss in distance measurement.
2. The lack of deep morphological analysis and NLP techniques may also affect the quality of the generated summaries, while permitting multilingual text processing.
3. Striving to preserve the term collection of the summarized documents as much as possible does not perform well when the results are compared to Gold Standard abstracts. Despite better coverage of document terms, the resulting extracts may contain different vocabulary, which affects the ROUGE scores.
4. Adding entire sentences to summaries decreases precision and recall metrics due to the "garbage" information they carry. Therefore, sentence compression is required in order to see the "actual" results of the optimization procedure.
As shown, the present invention provides a summarization technique having the following advantages:
1. The method of the invention is an unsupervised method that does not require any annotated data and training sources.
2. The method of the invention considers all possible extracts and constructs a summary in polynomial time.
3. The method of the invention defines a novel text representation model independently of the objective functions that describe the summary quality. As such, one can easily add more functions without modifying the model. This is in contrast to most prior art techniques that use linear programming for extracting sentences and embed the objective functions into the model.
4. Since the technique of the present invention does not require any morphological analysis, this technique is language-independent.
While some embodiments of the invention have been described by way of illustration, it will be apparent that the invention can be carried into practice with many modifications, variations and adaptations, and with the use of numerous equivalents or alternative solutions that are within the scope of persons skilled in the art, without departing from the spirit of the invention or exceeding the scope of the claims.

Claims

1. Method for finding an optimal set of k sentences that summarize an article of n sentences, which comprises:
a. defining a sentence vector S = (S1, ..., Sn) which relates to all the sentences in the article;
b. defining a terms vector T = (T1, ..., Tm) which relates to all the terms in the article;
c. preparing a sentence-term matrix which defines the number of appearances of each term Tj within each sentence Si;
d. representing each sentence as a hyperplane which divides a space into lower and upper half-spaces, wherein each intersection of several hyperplanes represents a summary which comprises said several sentences, respectively;
e. defining a polytope P as a body which is formed from a plurality of hypersurfaces representing sentences; and
f. defining an objective function on said polytope, and finding a point on a surface of said polytope having an optimal value of said objective function,
g. adding sentences that are closest to said point to a summary in a greedy manner, until the maximal predefined summary length is reached.
2. Method according to claim 1, wherein said objective function can be defined as a distance function.
3. Method according to claim 1, wherein said maximal predefined summary length is expressed by an additional hyperplane and its lower half space.
4. Method according to claim 1, further comprising definition of a minimal summary length, which is expressed by an additional hyperplane and its upper half space.
5. Method according to claim 1, further comprising definition of constraints to ensure that said polytope is bounded.
6. Method according to claim 1, further comprising a preprocessing stage, which in turn comprises a step of sentence splitting and a step of tokenization.
7. Method according to claim 6, wherein said preprocessing stage further comprises one or more of: (a) stemming and (b) synonym resolution.
8. Method according to claim 1, wherein the objective function is a weighted sum, selected from several measures of terms importance.
9. Method according to claim 1, wherein the objective function is a weighted sum of bigrams.
10. Method according to claim 1, further comprising definition of a distance function between a sentence and a point on a surface of said polytope having an optimal value of said objective function.