SG193995A1 - A method, an apparatus and a computer-readable medium for indexing a document for document retrieval - Google Patents


Info

Publication number
SG193995A1
Authority
SG
Singapore
Prior art keywords
document
semantic
term
vector
terms
Prior art date
Application number
SG2013072921A
Inventor
Huang Chien-Lin
Ma Bin
Li Haizhou
Original Assignee
Agency Science Tech & Res
Priority date
Filing date
Publication date
Application filed by Agency Science Tech & Res filed Critical Agency Science Tech & Res
Priority to SG2013072921A priority Critical patent/SG193995A1/en
Publication of SG193995A1 publication Critical patent/SG193995A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/685 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Various embodiments provide a method for indexing a document for document retrieval. The method may include: generating a document vector indicating if each of a plurality of terms is present in the document; calculating a document semantic inference vector for each of the plurality of terms present in the document using the document vector and a semantic relation matrix, the semantic relation matrix identifying semantic relationships between different terms of the plurality of terms; and indexing the document using a document semantic context inference vector calculated based on the document semantic inference vectors. Various embodiments provide a corresponding apparatus and computer readable medium.

Description

A METHOD, AN APPARATUS AND A COMPUTER-READABLE MEDIUM FOR INDEXING A DOCUMENT FOR DOCUMENT RETRIEVAL

TECHNICAL FIELD
Various embodiments relate to a method, an apparatus and a computer-readable medium for indexing a document for document retrieval.
BACKGROUND
Speech is the most convenient medium for human-to-human and human-to-machine interaction. Applications of spoken document retrieval (SDR) in education, business and entertainment are growing rapidly.
Successful examples include multilingual oral history archives access.
Conventional approaches focus on retrieving the information and attempt to satisfy user requirements. Due to the variation in speech, it is difficult to directly compare a speech query with the spoken documents in the database. In order to construct an efficient and effective retrieval system, state-of-the-art spoken document retrieval (SDR) technologies adopt transcriptions obtained from automatic speech recognition for indexing. Vector space models and probabilistic models rely on similarity functions that assume a document is more likely to be relevant to a query if it contains more occurrences of the query terms.
The indexing techniques of text-based information retrieval have been widely adopted in spoken document retrieval. However, due to imperfect speech recognition, out-of-vocabulary terms, homophone ambiguity and word tokenization errors, conventional text-based indexing techniques are not always appropriate for spoken document retrieval. Transcription errors may cause undesired semantic and syntactic expression, and thus inadequate indexing. Several approaches have been proposed to address these problems with various indexing units such as word, sub-word, phone, and so on.
SUMMARY
Various embodiments provide a method for indexing a document for document retrieval, the method comprising: generating a document vector indicating if each of a plurality of terms is present in the document; calculating a document semantic inference vector for one or more of the plurality of terms present in the document using the document vector and a semantic relation matrix, the semantic relation matrix identifying semantic relationships between different terms of the plurality of terms; and indexing the document using a document semantic context inference vector calculated based on each document semantic inference vector.
In an embodiment, a document semantic inference vector is calculated for each of the plurality of terms present in the document using the document vector and a semantic relation matrix.

In an embodiment, the document semantic context inference vector is calculated by summing together the document semantic inference vectors.
In an embodiment, the method further comprises generating the semantic relation matrix by: generating a term-by-document matrix using a plurality of documents, the term-by-document matrix identifying if each of the plurality of terms exists in each of the plurality of documents; and generating a term-by-term matrix by performing singular value decomposition of the term-by-document matrix, the term-by-term matrix being the semantic relation matrix.
In an embodiment, a term weighting scheme is applied to the term-by-document matrix to suppress noisy terms.
In an embodiment, the term weighting scheme is applied in accordance with the following expression:

a_k^d = (tf(a_k, d) + 1) / n_d × log(D / (df(a_k) + 1)),   n_d = Σ_k tf(a_k, d)

where a_k^d are the weighted terms of the term-by-document matrix W; D denotes a total number of documents in the plurality of documents; K is the number of terms in the plurality of terms; tf(a_k, d) represents a number of occurrences of term a_k in the document d; and df(a_k) is a number of documents that contain at least one occurrence of the term a_k.
In an embodiment, the term-by-term matrix is generated according to the following expression:

Ŵ = W W^T

where Ŵ is the term-by-term matrix; W is the term-by-document matrix; and T denotes a matrix transposition.

In an embodiment, singular value decomposition of the term-by-term matrix is performed according to the following expression:

Ŵ = U Σ V^T

where Ŵ is the term-by-term matrix; U is a left singular matrix; V is a right singular matrix; Σ is an R × R diagonal matrix whose nonnegative entries are R singular values in descending order, where R is the order of the decomposition; and T denotes a matrix transposition.
In an embodiment, dimensionality of the term-by-term matrix is reduced based on the following expression:

(1/F) Σ_{r=1..R} σ_r ≥ θ,   F = Σ_{k=1..K} σ_k

where θ is a threshold empirically adopted to select the eigenvectors U = [u_1, u_2, ..., u_R] based on the singular values Σ = [σ_1, σ_2, ..., σ_K] with the first R dimensions, where R < K denotes the projected dimensions of the original term vector in the eigenspace.
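For illustration, the selection of the order R from the singular values can be sketched as follows. This is a minimal Python sketch; the function name and the example threshold value are not from the embodiment, which leaves the threshold to empirical choice.

```python
def choose_order(sigmas, theta):
    """Return the smallest R such that the leading R singular values
    account for at least a fraction theta of the total F = sum(sigmas).
    `sigmas` must be in descending order, as produced by the SVD."""
    F = sum(sigmas)
    running = 0.0
    for r, sigma in enumerate(sigmas, start=1):
        running += sigma
        if running / F >= theta:
            return r
    return len(sigmas)
```

For singular values [5, 3, 1, 1] and a threshold of 0.8, the first two values already account for 8/10 of the total, so R = 2 is selected.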
In an embodiment, the term-by-term matrix is generated according to the following expression:

Ŵ = (U_R Σ_R)(U_R Σ_R)^T

where Ŵ is the term-by-term semantic relation matrix, U_R and Σ_R retain the first R dimensions of U and Σ, and T denotes the matrix transposition.
In an embodiment, the method further comprises: receiving a search query; and, retrieving the document based on a comparison using the document semantic context inference vector and the search-query.
In an embodiment, retrieving the document further comprises: generating a search-query vector indicating if each of the plurality of terms is present in the search-query; calculating a search-query semantic inference vector for one or more of the plurality of terms present in the search-query using the search-query vector and the semantic relation matrix; calculating a search-query semantic context inference vector based on each search-query semantic inference vector; and retrieving the document based on a comparison between the document semantic context inference vector and the search-query semantic context inference vector.
In an embodiment, a search-query semantic inference vector is calculated for each of the plurality of terms present in the search-query using the search-query vector and a semantic relation matrix.
In an embodiment, the search-query semantic context inference vector is calculated by summing together the search-query semantic inference vectors.
In an embodiment, the comparison between the document semantic context inference vector and the search-query semantic context inference vector is performed in accordance with the following expression:

sim(q, d) = (q · d) / (‖q‖ ‖d‖) = Σ_e (q_e × d_e) / (√(Σ_e q_e²) × √(Σ_e d_e²))

where q and d denote the semantic context inference vectors of search-query q and document d; and e indexes the dimensions of the semantic context inference vector.
In an embodiment, each document is a spoken document.
In an embodiment, a term is a word.
Various embodiments provide an apparatus for indexing a document for document retrieval, the apparatus comprising: at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: generate a document vector indicating if each of a plurality of terms is present in the document; calculate a document semantic inference vector for one or more of the plurality of terms present in the document using the document vector and a semantic relation matrix, the semantic relation matrix identifying semantic relationships between different terms of the plurality of terms; and index the document using a document semantic context inference vector calculated based on each document semantic inference vector.
Various embodiments provide a computer readable medium for indexing a document for document retrieval, the computer readable medium having stored thereon computer program code which when executed by a computer causes the computer to perform at least the following: generating a document vector indicating if each of a plurality of terms is present in the document; calculating a document semantic inference vector for one or more of the plurality of terms present in the document using the document vector and a semantic relation matrix, the semantic relation matrix identifying semantic relationships between different terms of the plurality of terms; and indexing the document using a document semantic context inference vector calculated based on each document semantic inference vector.
The additional features and advantages stated in respect of the above-described method are equally applicable to, and are restated here in respect of, the above-described apparatus and computer readable medium.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:

Figure 1A illustrates a functional structure of an apparatus for indexing spoken documents according to an embodiment, whereas Figure 1B is a flow diagram of a corresponding method according to an embodiment;
Figure 2A illustrates a functional structure of an apparatus for retrieving spoken documents according to an embodiment, whereas Figure 2B is a flow diagram of a corresponding method according to an embodiment;
Figure 3 illustrates a method of singular value decomposition according to an embodiment;

Figure 4A illustrates a method of generating a semantic context inference vector in accordance with an embodiment, whereas Figure 4B is a flow diagram of the method;
Figure 5 illustrates an exemplary computer interface for document retrieval according to an embodiment;
Figures 6 to 8 are experimental results from a simulation of an embodiment; and
Figure 9 illustrates the physical structure of an apparatus according to an embodiment.
DETAILED DESCRIPTION
Some portions of the description which follow are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.

Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "scanning", "calculating", "determining", "replacing", "generating", "initializing", "outputting", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
The present specification also discloses apparatuses for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.
In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.
Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or a wireless medium such as exemplified in the GSM mobile telephone system. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
Figure 1A shows an exemplary apparatus 2 for indexing a document for document retrieval. It is to be understood that Figure 1A illustrates the functional structure of the apparatus. Figure 1B provides a flow diagram of a corresponding method S2 for indexing a document for document retrieval.
The following description relates equally to the apparatus of Figure 1A and the method of Figure 1B.
In an embodiment, the apparatus 2 comprises a spoken document database 4 for storing spoken documents (S4); an automatic speech recognition processor 6 for performing automatic speech recognition (S6); a recognized results processor 8 for recognising terms in text (S8); a semantic relation matrix processor 10 for generating a semantic relation matrix (S10); a semantic context inference vector processor 12 for generating semantic context inference vectors (S12); and an indexing database 14 for providing an index of spoken documents (S14).
In an embodiment, the spoken document database 4 stores a plurality of spoken documents (S4). For example, each document may be stored on the database as an audio file, such as, for example, a WAV file, an MP3 file or the like. In an embodiment, the database 4 may comprise one or more databases. In some embodiments spoken documents may be provided in other formats, such as, for example, as a video file, such as, for example, a WMV file or an MP4 file.
In an embodiment, the automatic speech recognition processor 6 performs automatic speech recognition (S6) in respect of the spoken documents stored in database 4. Specifically, the processor 6 may convert spoken documents from audio speech into written text.
In an embodiment, the recognized results processor 8 analyses the text generated by the processor 6 to identify or recognise terms (S8). In an embodiment, a term may be a word. However, in some other embodiments, a term may be smaller than a word, such as, for example, a syllable or a letter.
Additionally or alternatively, a term may be larger than a word, such as, for example, a phrase or a sentence.
In an embodiment, the semantic relation matrix processor 10 generates a semantic relation matrix (S10) using the terms recognised by processor 8.
The semantic relation matrix identifies semantic relationships between different recognized terms. Stated differently, the semantic relation matrix may identify the related meaning between a pair of recognized terms. For example, one thousand terms may have been recognized by processor 8. Accordingly,
the processor 10 may generate a matrix having one thousand rows and one thousand columns, wherein each recognized term corresponds to one row and one column. Accordingly, the matrix may be populated with values to indicate a semantic relationship between pairs of recognized terms. In an embodiment, the size of the value may indicate the strength of the relationship. For example, a semantic relationship between the two terms 'house' and 'home' may be high and, therefore, a value indicating this relationship may either be present or correspondingly high. Alternatively, a semantic relationship between the two terms 'house' and 'door' may be lower and, therefore, a value indicating this relationship may either be present or correspondingly lower. Alternatively, a semantic relationship between the two terms 'house' and 'writing' may be low and, therefore, a value indicating this relationship may either be absent or correspondingly low.
In an embodiment, the value indicating the semantic relationship between a pair of terms is determined on the semantic similarity of the two terms. For example, the semantic similarity of the two terms may be analysed from all spoken documents stored on database 4. Stated differently, each spoken document may be analysed to identify whether or not both terms are present.
Additionally or alternatively, if both terms are present in a spoken document, the spoken document may be further analysed to identify how semantically close together both terms are, i.e. how many intervening terms are present in between the two terms. In an embodiment, a value indicating the pair's semantic relationship may be added to the semantic relation matrix in dependence on one or more of these factors. Additionally or alternatively, the magnitude of the value may be set in dependence on one or more of these factors.
In an embodiment, the semantic context inference vector processor 12 generates semantic context inference vectors (S12) for spoken documents transcribed to text by processor 6. In an embodiment, one semantic context inference vector relates to one document. Specifically, a transcribed document may be represented by a corresponding document vector. In an embodiment, the document vector comprises a list of all the recognized terms.
Also, the document vector may indicate which recognised terms are present in the document. For instance, considering the above example, the document vector may comprise a vector of one thousand values, wherein each value corresponds to one of the one thousand terms recognized by processor 8. If any of the one thousand terms is present in the document then its corresponding value in the document vector may be updated to indicate its presence. In an embodiment, the magnitude of the value may be proportional to the number of occurrences of that term in the document. For example, if a term appears fifty times in a document, then in the document vector corresponding to that document the vector value corresponding to the term may be set to fifty. Accordingly, the document vector provides an indication of which recognised terms are present in the document corresponding to that vector.
Additionally, the document vector provides an indication of which recognised terms are not present in the document corresponding to that vector.
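The construction of such a document vector can be sketched in Python as follows (a minimal illustration; the function and variable names are not taken from the embodiment):

```python
def document_vector(doc_terms, vocabulary):
    """Build a document vector over a shared vocabulary of recognized
    terms: entry k holds the number of occurrences of vocabulary[k]
    in the document, and 0 when the term is absent."""
    counts = {}
    for term in doc_terms:
        counts[term] = counts.get(term, 0) + 1
    return [counts.get(term, 0) for term in vocabulary]
```

For the transcript ['house', 'home', 'house'] and the vocabulary ['door', 'home', 'house', 'writing'], the resulting vector is [0, 1, 2, 0].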
In an embodiment, all document vectors represent the same list of recognised terms. In an embodiment, all document vectors represent the same sequence of recognised terms. Accordingly, two document vectors may be analysed to identify which terms are common to both documents corresponding to the two document vectors.

In an embodiment, once the document vector has been generated it is combined with the semantic relation matrix to generate one document semantic inference vector for each recognised term present in the corresponding document. In an embodiment, a semantic context inference vector is then generated using each of the generated semantic inference vectors. The semantic context inference vector relates to the document corresponding to the document vector. According to this operation, a semantic context inference vector is generated for each document stored on database 4 and processed by processors 6 and 8.
It is to be understood that in some other embodiments, a semantic inference vector may only be generated for one or more of the recognised terms present in the document corresponding to the document vector. For example, in an embodiment, a semantic inference vector may only be generated for two, three, four or any predefined number of recognised terms present in the document. Further, the semantic context inference vector may be generated based on any predefined number of semantic inference vectors.
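One way the combination described above might be realised is sketched below. The passage does not spell out the combination rule, so the following is an assumption made for illustration only: the semantic inference vector of a present term k is taken to be its document-vector value times row k of the semantic relation matrix S, and the semantic context inference vector is the sum of these per-term vectors.

```python
def semantic_context_inference(doc_vec, S):
    """Illustrative combination of a document vector with a K x K
    semantic relation matrix S (assumed rule, not quoted from the text):
    sum, over terms present, of doc_vec[k] * S[k]."""
    K = len(doc_vec)
    context = [0.0] * K
    for k in range(K):
        if doc_vec[k]:  # term k is present in the document
            for j in range(K):
                context[j] += doc_vec[k] * S[k][j]
    return context
```

Under this rule, terms absent from the document still receive nonzero context weight whenever they are semantically related to terms that are present, which is what lets the index tolerate recognition errors.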
In an embodiment, the indexing database 14 indexes or arranges spoken documents stored on database 4 according to their respective semantic context inference vectors (S14). In an embodiment, the database 14 may store data pairs comprising a document identifier together with a corresponding semantic context inference vector. In an embodiment, the indexing database 14 may be the same as, or part of, the spoken document database 4. Accordingly, each document may be stored in the combined database then indexed and identified by its semantic context inference vector.
According to the above-described method of operation, a plurality of documents may be indexed.
Figure 2A shows an exemplary apparatus 18 for retrieving a document indexed by the apparatus 2. It is to be understood that Figure 2A illustrates the functional structure of the apparatus 18. Figure 2B provides a flow diagram of a corresponding method (S18) for retrieving an indexed document.
The following description relates equally to the apparatus of Figure 2A and the method of Figure 2B.
It is noted that the apparatus 18 may comprise some or all of the same components as the apparatus 2. Accordingly, a single apparatus may provide both the apparatus 2 and the apparatus 18. Specifically, the apparatus 18 may comprise the automatic speech recognition processor 6, the recognized results processor 8, the semantic relation matrix processor 10, the semantic context inference vector processor 12, and an indexing database 14.
In an embodiment, the automatic speech recognition processor 6 is configured to receive a spoken search query (S6), such as, for example, from a human user. In an example, the apparatus 2 and the apparatus 18 may be installed in a library and a library customer may provide the spoken search query. The automatic speech recognition processor 6, the recognized results processor 8, the semantic relation matrix processor 10, and the semantic context inference vector processor 12 all operate (S6, S8, S10 and S12) in an analogous way to that described above. Accordingly, a semantic context inference vector is generated (S12) for the spoken search query. In an embodiment, the semantic relation matrix used to generate the search query semantic context inference vector is the same as the one used to generate the document semantic context inference vector.
Additionally or alternatively, in an embodiment, the semantic context inference processor 12 may be configured to directly receive the search query in text form (S12), as indicated in Figures 2A and 2B. In this case, the semantic context inference processor 12 may convert the text query into a search query vector indicating which recognized terms are present. A search query semantic context inference vector may then be generated as described above. In an embodiment, a text document may be provided in an analogous fashion.
In an embodiment, once the search query semantic context inference vector has been generated (S12) by the semantic context inference processor 12, a search is performed to identify one or more appropriate documents. In an embodiment, a comparison is performed using the document semantic context inference vectors and the search-query to identify one or more appropriate documents. More specifically, the search query semantic context inference vector may be compared to the document semantic context inference vectors to identify one or more appropriate documents. For example, the search may look for the document semantic context inference vector(s) which most closely match the search query semantic context inference vector. In any case, the document semantic context inference vectors may be obtained (S14) from the indexing database 14 by the semantic context inference processor 12.
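The closeness comparison between the query vector and each document vector can be sketched as a cosine similarity between the two semantic context inference vectors (a minimal Python sketch; the function name is illustrative):

```python
def cosine_similarity(q, d):
    """sim(q, d) = (q . d) / (|q| |d|); returns 0.0 for a zero vector."""
    dot = sum(qe * de for qe, de in zip(q, d))
    nq = sum(qe * qe for qe in q) ** 0.5
    nd = sum(de * de for de in d) ** 0.5
    return dot / (nq * nd) if nq and nd else 0.0
```

Ranking the indexed documents by descending similarity to the query vector then yields the retrieval result.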
According to the above-described method of operation, one or more documents relevant to a search query may be identified using the above- described indexing method.
The following describes in greater detail the formation of the semantic relation matrix in an embodiment.
Firstly, the formation of a document-by-term matrix may be considered.
In an embodiment, the spoken document database 4 comprises an accumulation of spoken documents. A spoken document may be represented by a row vector of terms V_d = [a_1^d, a_2^d, ..., a_K^d] derived from the statistics of transcription with weighted terms a_k^d. D denotes the total number of spoken documents for indexing. K is the dimension of the indexing term vector. From this information a document-by-term matrix W = [V_1, V_2, ..., V_D] may be derived. The document-by-term matrix may indicate which terms are present in each spoken document. For example, each row of the document-by-term matrix may represent a different document and each column may represent a different term. Accordingly, it may be possible to use the matrix to identify which terms are present in each document.
Specifically, the d-th spoken document may be represented by a row vector of terms V_d = [a_1^d, a_2^d, ..., a_K^d] derived from the statistics of transcriptions with weighted terms a_k^d. In an embodiment, the transcriptions are generated by the automatic speech recognition processor 6. K is the dimension of the indexing term vector. Stated differently, K is the total number of terms recognised in the documents, and the row vector indicates which ones of these are present in the document corresponding to that row vector. For example, those not present are indicated by a value of '0', whereas those present are indicated by a value other than '0'.
The following describes how a term weighting scheme may be applied to the document-by-term matrix in an embodiment.
Due to imperfect speech recognition and the redundancy of transcription, not all of the recognized terms are valid and meaningful. To eliminate noisy terms, terms which have low frequency in a document and occur in few documents may be discarded by the following term weighting scheme:

a_k^d = (tf(a_k, d) + 1) / n_d × log(D / (df(a_k) + 1)),   n_d = Σ_k tf(a_k, d)

where tf(a_k, d) may represent the number of occurrences of recognized term a_k in the spoken document d, and df(a_k) may be the number of documents in the spoken document database 4 that contain at least one occurrence of the term a_k.
An advantage of the term weighting scheme is to provide useful information about how important a term is to a document in the spoken document database. Accordingly, it is possible to suppress terms which occur very infrequently in the documents, such as, for example, typographical errors. Also, it is possible to suppress terms which occur very frequently in the documents, such as, for example, 'and', 'of' and other such terms which are unlikely to indicate underlying concepts of a document. Therefore, the document-by-term matrix may be enhanced by the application of the term weighting scheme.
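The weighting scheme above can be sketched in Python. Per-document term counts are taken as dictionaries; the function and variable names are illustrative only.

```python
import math

def weight_terms(term_counts, D):
    """Apply the term weighting scheme described above.

    term_counts -- list of dicts, one per document: term_counts[d][k]
                   is tf(a_k, d), the count of term k in document d
    D           -- total number of documents
    Returns a list of dicts holding the weighted terms a_k^d.
    """
    # df(a_k): number of documents containing term k at least once
    df = {}
    for counts in term_counts:
        for k in counts:
            df[k] = df.get(k, 0) + 1

    weighted = []
    for counts in term_counts:
        n_d = sum(counts.values())  # total term occurrences in document d
        row = {}
        for k, tf in counts.items():
            # log(D / (df + 1)) shrinks the weight of terms occurring in
            # many documents ('and', 'of', ...); the +1 terms smooth the
            # raw counts
            row[k] = (tf + 1) / n_d * math.log(D / (df[k] + 1))
        weighted.append(row)
    return weighted
```

A term occurring in nearly every document receives a weight near (or below) zero, while a moderately frequent, document-specific term keeps a positive weight.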
In an embodiment, the semantic relation matrix is generated from the weighted document-by-term matrix as follows.
The semantic relation matrix is a term-by-term matrix, rather than a document-by-term matrix. Stated differently, the semantic relation matrix defines the semantic relationships between different pairs of terms. On the other hand, the document-by-term matrix defines the relationships between documents and terms, i.e. whether or not a document includes a term. The term-by-term semantic relation matrix may be used to describe co-relations between terms through a collection of documents.
In an embodiment, to construct a term-by-term matrix, a covariance estimation is performed in accordance with the following expression:

Ŵ = W^T W

where W is the document-by-term matrix mentioned above and T denotes the matrix transposition. In this embodiment, Ŵ is a term-by-term matrix used to describe co-relations between terms through documents. The diagonal of the matrix Ŵ corresponds to the self-terms and shows the highest co-relation scores.
Stated differently, the closest relationship is found between two terms which are identical.
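This covariance step may be sketched on invented data as follows (the helper name and matrix are for illustration; W is a document-by-term count matrix with one row per document):

```python
# Toy sketch: the term-by-term co-relation matrix as W^T W, where W is the
# document-by-term matrix (rows = documents, columns = terms).
def term_by_term(W):
    D, K = len(W), len(W[0])
    return [[sum(W[d][i] * W[d][j] for d in range(D))   # dot product of term columns
             for j in range(K)] for i in range(K)]

W = [[1, 2, 0, 0],
     [1, 0, 1, 1]]
C = term_by_term(W)
print(C)
# [[2, 2, 1, 1], [2, 4, 0, 0], [1, 0, 1, 1], [1, 0, 1, 1]]
# C[1][1] = 4 is the self-term of term 1 and is the largest score in its row,
# illustrating that the diagonal carries the highest co-relation scores.
```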
In an embodiment, the next step is to perform singular value decomposition (SVD), which finds an optimal projection to explore term co-occurrence patterns. SVD is related to eigenvector decomposition and factor analysis. Figure 3 illustrates the process of performing SVD.

As can be seen in Figure 3, in an embodiment, the SVD of the matrix Ŵ is performed as follows:

Ŵ = U Σ V^T

where U is a left singular matrix and V is a right singular matrix. Both U and V are orthogonal. Σ is an R × R diagonal matrix whose nonnegative entries are R singular values in descending order, i.e. σ_1 ≥ σ_2 ≥ ... ≥ σ_R > 0. R is the order of the decomposition and R ≤ K.
In an embodiment, the column vectors of U and V each form an orthonormal basis for the space of dimension R spanned by UΣ and VΣ. This leads to a representation of documents and terms in a continuous vector space of low dimension, namely the latent semantic indexing (LSI) space. To find co-occurrences between terms, a term-by-term (K × K) matrix may be generated as follows.
In an embodiment, the SVD can be used to project all the dimensions of the term vectors onto a latent information space with significantly reduced dimensionality. This has the advantage of reducing the size of the term vectors through the removal of less important factors. In an embodiment, the SVD is applied to select the major factors based on a threshold δ:

(1/σ̄) Σ_{r=1}^{R} σ_r ≥ δ, where σ̄ = Σ_{k=1}^{K} σ_k

where δ is empirically adopted to select the eigenvector U = [u_1, u_2, ..., u_R] based on the eigenvalues Σ = [σ_1, σ_2, ..., σ_R] with the first R dimensions, where R < K denotes the projected dimensions of the original term vector in the eigenspace. In an embodiment, this eigenvector U is treated as a transform basis in LSI.
In view of the above, the larger σ_k, the more important or significant the term which corresponds to σ_k is. For example, the value of σ_k may be relatively small for terms such as 'and', 'of', 'for', whereas the value of σ_k may be relatively large for terms such as 'Australia', 'money', 'house'. Accordingly, eigenvectors may be ranked in order of their eigenvalues. Then, eigenvectors with eigenvalues below a threshold may be disregarded. In this way, significant terms may be considered whereas insignificant terms may be ignored. By selecting eigenvectors on the basis of their eigenvalues it is possible to consider only the relatively significant terms.
As a result of the above, the semantic relation matrix W̃ may be reconstructed from the selected factors as follows:

W̃ = U Σ U^T

where U and Σ retain only the first R dimensions.
Different from the matrix Ŵ, the matrix W̃ may remove noisy factors and capture the most important term-to-term associations or relationships. The matrix W̃, which contains all of the term-to-term dot products, is a representation of the co-occurrences and semantic relations among terms. The co-relation scores of the matrix W̃ are estimated globally based on the similarity between concepts.
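The factor selection and reconstruction described above may be illustrated with a deliberately small example. The 2×2 matrix, its closed-form eigenpairs, the threshold value and all function names are invented for the sketch; since a symmetric term-by-term matrix has an SVD that coincides with its eigendecomposition, no general-purpose SVD routine is needed here:

```python
import math

# Toy sketch of threshold-based factor selection and low-rank reconstruction.
W_hat = [[2.0, 1.0],
         [1.0, 2.0]]
r2 = 1.0 / math.sqrt(2.0)
eigvals = [3.0, 1.0]                    # known eigenvalues, descending order
eigvecs = [[r2, r2], [r2, -r2]]         # eigvecs[r] pairs with eigvals[r]

def select_rank(eigvals, delta):
    """Smallest R whose leading eigenvalues cover a fraction delta of the total."""
    total, running = sum(eigvals), 0.0
    for R, val in enumerate(eigvals, start=1):
        running += val
        if running / total >= delta:
            return R
    return len(eigvals)

def reconstruct(eigvals, eigvecs, R):
    """W_tilde = sum over the first R factors of sigma_r * u_r u_r^T."""
    n = len(eigvecs[0])
    return [[sum(eigvals[r] * eigvecs[r][i] * eigvecs[r][j] for r in range(R))
             for j in range(n)] for i in range(n)]

R = select_rank(eigvals, delta=0.7)     # 3 / (3 + 1) = 0.75 >= 0.7, so R = 1
W_tilde = reconstruct(eigvals, eigvecs, R)
print(R)                                # 1
# every entry of W_tilde is ~1.5: the weaker (noisier) factor was discarded
```

The reconstructed matrix keeps only the dominant factor, mirroring the removal of noisy factors described above.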
In an embodiment, each recognized term a_k in the spoken document d can be mapped onto a semantic inference vector v̂_k through the semantic relation matrix W̃ = [v̂_1, v̂_2, ..., v̂_K]. In an embodiment, the semantic inference vector v̂_k is actually a representation of the associated terms of term a_k.
The following describes this process in accordance with the illustration of Figure 4A and the flow diagram of Figure 4B.
In an embodiment, at S30, a document vector is calculated, as described above with reference to Figures 1A, 1B, 2A and 2B. As mentioned above, the document vector may be a vector which identifies all recognised terms and also identifies which recognised terms are present in a corresponding document. A weighting vector 50 may represent the document vector. Also, the weighting vector 50 may represent a document vector to which a term weighting scheme has been applied, as described above.
Shading of a cell of the weighting vector 50 indicates that the term corresponding to the cell is present in the document corresponding to the weighting vector. In this example, only a first cell 52 and a fourth cell 54 are shaded thereby indicating that only the first and fourth recognised terms are present in the document.
In an embodiment, at S32, a semantic inference vector is generated for each shaded cell using the semantic relation matrix, i.e. a semantic inference vector is generated for each recognised term present in the document.
Accordingly, two semantic inference vectors 56 and 58 are generated. The vector 56 corresponds to the recognised term in cell 52, whereas the vector 58 corresponds to the recognised term in cell 54. As can be seen from the shading, the vector 56 comprises two values, meaning that the term in cell 52 has a semantic relationship with two recognised terms. Also, the vector 58 comprises three values, meaning that the term in cell 54 has a semantic relationship with three recognised terms.
In an embodiment, the sequence of terms represented by successive cells of the weighting vector 50 is the same as the sequence of terms represented by successive columns of the semantic relation matrix. For example, the top cell of weighting vector 50 may represent the same term as the leftmost column of the semantic relation matrix, whereas the bottom cell of weighting vector 50 may represent the same term as the rightmost column of the semantic relation matrix. Accordingly, the cells of a diagonal of the semantic relation matrix may always indicate a semantic relation, or a strong semantic relation, since the cells of the diagonal relate to the semantic relationship between a pair of identical terms. For example, considering the above example, the cells of the top-left to bottom-right diagonal of the semantic relation matrix may each indicate the presence of a semantic relation, or a strong semantic relation. This can be seen more particularly on Figure 4A, wherein the leftmost cell of vector 56, which corresponds to the top cell of vector 50, is shaded. Also, the fourth cell from the left of vector 58, which corresponds to the fourth cell from the top of vector 50, is shaded.
In an embodiment, at S34, all the semantic inference vectors for the spoken document d are summed to obtain the semantic context inference vector 60 as follows:

v_d = Σ_k v̂_k
As can be seen from Figure 4A, the vector 60 comprises four values as a result of the sum operation. In an embodiment, the four recognised terms represented by these four values provide a means for identifying and indexing the document which corresponds to the weighting vector 50. Furthermore, by virtue of the summing operation, if a recognised term is represented in multiple semantic inference vectors then its associated value is increased, i.e. reinforced. Alternatively, if a recognised term is represented in only a single semantic inference vector it is not reinforced. Accordingly, in an embodiment, the semantic context inference vector not only indicates which recognised terms are relevant to a document, it also indicates a level of relevance for each term.
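The mapping and summing of steps S32 and S34 may be sketched as follows; the semantic relation matrix, document vector and function name are invented for the example:

```python
# Toy sketch: each term present in the document picks up its semantic
# inference vector (a row of the semantic relation matrix), and the
# vectors are summed into the semantic context inference vector.
W_tilde = [
    [1.0, 0.0, 0.5, 0.0],   # term 0 also relates to term 2
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],   # term 3 also relates to term 2
]
doc_vector = [1, 0, 0, 1]   # terms 0 and 3 present (cf. cells 52 and 54)

def semantic_context_inference(doc_vector, W_tilde):
    K = len(doc_vector)
    sci = [0.0] * K
    for k in range(K):
        if doc_vector[k]:                 # one inference vector per present term
            for j in range(K):
                sci[j] += W_tilde[k][j]   # summing reinforces shared related terms
    return sci

sci = semantic_context_inference(doc_vector, W_tilde)
print(sci)   # term 2 is reinforced by both present terms (0.5 + 0.2 ≈ 0.7)
```

Note how term 2, related to both present terms, receives a reinforced value, while term 1 stays at zero.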
In an embodiment, the semantic context inference vector can be regarded as a re-weighting indexing vector which expands the indexing terms based on the related terms in the semantic inference vectors v̂_k. Normally, the semantic inference (i.e. the underlying concept) of the terms in a spoken document is associated with the same topic. With the semantic context inference, the terms which are present in many inference vectors are reinforced, and the terms with fewer occurrences are weakened. Since the semantic inferences of the wrongly recognized terms are diverse, the effect due to speech recognition errors can be averaged and thus eliminated. Furthermore, the procedure for deriving the semantic context inference vector can be entirely data driven without any pre-defined knowledge, such as WordNet and HowNet, which require a pre-defined concept or knowledge database.
It is to be understood that in some other embodiments, an alternative operation to the sum operation may be performed. For example, the semantic inference vectors may be multiplied together or averaged. Furthermore, in some other embodiments, some but not all of the semantic inference vectors may be used to generate the semantic context inference vector. For example, only semantic inference vectors having more than a certain number of values may be used.
According to some above-described embodiments, the proposed semantic context inference (SCI) is different from the latent semantic indexing (LSI). Specifically, a different basis U may be used for LSI, whereas the semantic relation matrix W̃ may be used for SCI. LSI aims to reduce the data dimensionality to a low-dimensional space, and to project the elements in the document-by-term matrix to the orthogonal axes using the basis U. In contrast, SCI takes into account the semantic relation matrix W̃, which shows term-by-term associations.
In an embodiment, the search query and spoken documents are represented as semantic context inference vectors for highly efficient retrieval. Each component in the semantic context inference vector may be estimated using the above-proposed latent semantic inference from the query and the spoken documents. A cosine measure may then be used to estimate the similarity between search query q and spoken document d as follows:

Sim(q, d) = (q · d) / (||q|| ||d||) = (Σ_e q_e × d_e) / (√(Σ_e q_e²) × √(Σ_e d_e²))

where q and d denote the semantic context inference vectors of query q and spoken document d, and e denotes the dimension of the semantic context inference vectors. Retrieval results may then be ranked according to the similarities obtained in the retrieval process.

According to the above operation, a ranked list of spoken documents may be provided in dependence on a spoken or text search-query. Since the search is performed on the basis of inference, speech recognition errors are less problematic. Specifically, some terms may be incorrectly recognised or completely missed; however, the underlying concept or inference of the document or search-query should still be identifiable. Stated differently, term recognition errors may be smoothed. Accordingly, various embodiments provide an improved technique for indexing documents for document retrieval.
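The cosine-based ranking may be sketched as follows on invented vectors (the document names and values are purely illustrative):

```python
import math

# Toy sketch: cosine similarity between semantic context inference vectors,
# then ranking the documents by similarity to the query.
def cosine(q, d):
    dot = sum(qe * de for qe, de in zip(q, d))
    norm_q = math.sqrt(sum(qe * qe for qe in q))
    norm_d = math.sqrt(sum(de * de for de in d))
    return dot / (norm_q * norm_d)

query = [1.0, 0.0, 0.7, 1.0]
docs = {"doc_a": [1.0, 0.0, 0.7, 1.0],   # same inference vector as the query
        "doc_b": [0.0, 1.0, 0.0, 0.0]}   # no shared semantic terms
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)                            # ['doc_a', 'doc_b']
```

The document whose inference vector matches the query's ranks first, which is the behaviour the retrieval process relies on.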
Various embodiments provide the following advantage. A semantic relation matrix representing term-by-term associations is generated using a document-by-term dataset. In order to remove noisy factors caused by speech recognition errors and capture the most important term-to-term associations, only the eigenvectors with higher eigenvalues are used to estimate the semantic relation matrix. With a semantic relation matrix that reflects an expansion of the semantic relations among terms, each term in a spoken document or a search query can be mapped onto a semantic inference vector that represents the co-occurrences and semantic relations between the specific term and all other terms.
Various embodiments use latent semantic indexing to infer related concept terms for spoken document retrieval. The term significance is used to weight a term sequence of the document considering the recognition confidence and TF-IDF score. The latent semantic indexing is used to construct a term-by-term matrix for inference. The recognized term string is automatically mapped onto a set of semantic vectors through the inference matrix. Finally, the semantic indexing is estimated by the sum of the mapped semantic vectors of the document. Latent semantic inference has a number of advantages. For example, it can learn related terms and apply them as the new representations of documents. Further, the procedure of latent semantic inference is entirely data driven.
Various embodiments provide the following advantage. Based on the estimated semantic relation matrix, a re-weighted indexing vector for a spoken document or a query is generated using (e.g. by summing together) all the semantic inference vectors related to the terms in the spoken document or the search query. Accordingly, the semantic concepts in the spoken document or search query are reinforced by promoting the terms which are likely to be valid and demoting the terms which are likely to be invalid.
According to some above-described embodiments, spoken document retrieval is based on semantic context inference for speech indexing. Each recognized term in a spoken document is mapped onto a semantic inference vector containing a bag of semantic terms through a semantic relation matrix. The semantic context inference vector is then constructed by summing up all the semantic inference vectors. Semantic term expansion and re-weighting make the semantic context inference vector a suitable representation for speech indexing and substantially improve the performance of spoken document retrieval.
According to some above-described embodiments, the concept of mapping and context expansion of spoken documents is introduced by using the Semantic Context Inference (SCI). Firstly, the term association for inference is determined. Then, the semantic relation matrix is constructed considering the term-by-term association by the document-by-term dataset. Then, each recognized term is mapped into a bag of semantically related terms based on the semantic relation matrix. With the semantic term expansion and re-weighting indexing, the above-described embodiments deal with problems resulting from speech recognition errors by reinforcing the correctly recognized terms.
Conventional approaches only take into account multiple candidates of recognised terms or types indexing to enhance the retrieved information. The semantic content and semantic relation of speech, which play an important role in the way humans perceive speech transcriptions and measure their similarity, are not well considered. In contrast, the above-described embodiments consider semantic content and the semantic relation of speech.
Various embodiments provide the following advantage. Semantic inference considers ontology, i.e. terms are interpreted at a conceptual level. A common ontology database is HowNet for Chinese and WordNet for English.
However, an ontology approach may need a pre-defined knowledge database. In various embodiments, the knowledge database is provided by the plurality of spoken documents stored in the database 4.
According to some above-described embodiments, each recognized term is automatically mapped onto a set of semantically related terms using the semantic relation matrix. As a result, one term can be represented as a semantic inference vector, i.e. a vector of semantically related terms. Finally, semantic indexing may be estimated by summing together the semantic inference vectors of one document.
According to some above-described embodiments, semantic context inference is used to explore the latent semantic information and extend the semantically related terms to speech indexing. The semantic context inference vector can be regarded as a re-weighting indexing vector which organizes relationships between document terms and the semantic terms associated with those document terms.
Various embodiments provide the following advantage. To alleviate the effect of recognition errors, the above-described embodiments use the semantic context inference representation by finding the semantic relation between terms, and suggesting semantic term expansion for speech indexing. These associated terms are re-weighted as the new representation of queries and documents for spoken document retrieval.
As discussed above, previous SDR systems are based on a speech recognition system with various indexing transcriptions. While speech content can be recognized from speech signals to text transcriptions, significant terms and semantic knowledge of the transcribed terms are not well adopted for spoken document retrieval. Due to the redundant property of spontaneous speech and recognition errors from Large-Vocabulary Continuous Speech Recognition (LVCSR), transcriptions contaminated by redundant/noisy data are adopted in spoken document retrieval, degrading the retrieval performance. Various embodiments aim to solve these shortcomings by indexing and retrieving spoken documents based on the semantic content of the documents. Stated differently, indexing and retrieval is based on the underlying concepts of documents rather than only their terms.
It is an advantage of various embodiments that speech indexing is performed using latent semantic inference which considers a term significance score and latent semantic inference score. Various embodiments estimate the term significance with speech recognition confidence and TF-IDF score to obtain the term weighting. Based on the term significance score, the latent semantic indexing is used to build a term-by-term matrix for semantic inference.
Exploiting co-occurrences between terms is an instance of semantic inference.
Figure 5 illustrates an embodiment of a computer interface used to perform spoken document retrieval using semantic inference speech indexing.
Specifically, a search query may be provided at search box 100 and a search button activated. The search results may then be presented in a results box 102. As can be seen, the results box 102 may provide a list of spoken documents ranked in order of relevance considering the search query.
Further, a document box 104 may be provided so that any one of the spoken documents in the results box 102 may be selected and played.
The following describes experiments performed in respect of an embodiment to determine spoken document retrieval performance. In summary, the experimental results show that the speech indexing using the semantic context inference (SCI) embodiment outperforms the conventional TF-IDF word vector and LSI indexing schemes.
To validate the abovementioned approach, standard Mel-frequency cepstral coefficients (MFCCs) may be applied for speech recognition. Each frame of the speech data may be represented by a 36-dimensional feature vector, consisting of 12 MFCCs, along with their deltas and double-deltas. The features may be normalized to zero mean and unit variance for improving discrimination ability. The speech recognition system may be based on the statistical hidden Markov model (HMM) and the phonetic structure of Mandarin Chinese with 137 sub-syllables, including 100 right-context-dependent INITIALs and 37 context-independent FINALs as the basic units.
The decision-based state-tying context-dependent sub-syllable units are used for acoustic modeling. The number of Gaussian mixtures per state of the acoustic HMMs ranged from two to 32, depending on the quantity of the training data. Each sub-syllable unit was modeled with three states for the
INITIALs and four states for the FINALs. The silence model was a one-state
HMM with 64 Gaussian mixtures trained with the non-speech segments.
The spoken document corpus was acquired from a published Mandarin Chinese broadcast news corpus ('MATBN'). The corpus contains a total of 198 hours of broadcast news with the corresponding transcripts. 1550 anchor news stories ranging over three years were extracted for experiments. The average news story length was 16.38 seconds with an average of 51.85 words. The speech data in MATBN were recognized by the speech recognition system with a word accuracy of 78.92%.
Moreover, a Topic Detection and Tracking collection (TDT2) was also used for the validation. 2112 Mandarin Chinese audio news stories from another publicly available source were used in the experiments. The average document length of TDT2 was 174.20 words. The word accuracy of TDT2 was about 75.49%. For TDT2, the speech recognition transcription was provided by LDC.
To measure the accuracy of retrieved documents and the ranking position of relevant documents, the mean average precision (mAP) is estimated as follows:

mAP = (1/N_q) Σ_{i=1}^{N_q} (1/N_i) Σ_{j=1}^{N_i} j / rank_ij

where N_q denotes the number of search queries, N_i represents the number of relevant documents contained in the retrieved documents for query i, and rank_ij denotes the rank of the j-th relevant document for the i-th query q_i. In order to evaluate the robustness of speech indexing based on the semantic context inference, the same pool of 164 keyword queries (from two to four Chinese characters) was used for both MATBN and TDT2. The average length of queries was 3.02 Chinese characters. There are on average 15.71 and 21.20 relevant spoken documents per query in MATBN and TDT2, respectively.
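The mean average precision measure may be sketched as follows; the rank values below are invented purely to exercise the formula:

```python
# Toy sketch of mean average precision: rank_ij is the rank position of the
# j-th relevant document retrieved for query i.
def mean_average_precision(ranks_per_query):
    total = 0.0
    for ranks in ranks_per_query:        # one list of relevant-document ranks per query
        n_i = len(ranks)                 # N_i: relevant documents for this query
        ap = sum((j + 1) / rank          # j-th relevant doc found at position rank
                 for j, rank in enumerate(sorted(ranks))) / n_i
        total += ap
    return total / len(ranks_per_query)

# query 1: relevant docs at ranks 1 and 2 -> AP = (1/1 + 2/2) / 2 = 1.0
# query 2: relevant docs at ranks 2 and 4 -> AP = (1/2 + 2/4) / 2 = 0.5
print(mean_average_precision([[1, 2], [2, 4]]))   # 0.75
```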
To remove the noisy factors in eigen decomposition, a threshold δ was chosen for keeping the major factors. A higher value of δ indicates that more eigenvectors are used for latent semantic analysis as well as the reconstruction of the semantic relation matrix. The experimental results shown in Figure 6 were obtained with the MATBN broadcast news corpus using different thresholds for the indexing of LSI and SCI (an embodiment), while the popular word vector indexing (TF-IDF) was used as the baseline, which achieved 69.56% mAP. The experiments showed that the complete LSI space does not give as good performance as the reduced-dimensional LSI space. It is shown that the best results can be achieved when a threshold of 80% for LSI and 70% for SCI is selected. The results confirm that a better performance can be achieved by removing the noisy factors. The experimental results also show that the embodiment SCI outperforms both TF-IDF and LSI indexing approaches.
To evaluate the effect of the semantic context inference, an embodiment was applied to the TDT2 and MATBN corpora using both automatic speech recognition results (ASR scripts) and perfect text (text scripts). The experimental results, shown in Figure 7, indicate that consistent spoken document retrieval improvements have been obtained on TDT2 and MATBN based on SCI indexing, compared with TF-IDF indexing. To understand the upper bound of spoken document retrieval, the indexing by perfect text scripts was evaluated as the reference. There was a gap (about 15%~20% mAP) between indexing using speech and text scripts because of imperfect speech recognition.
In the conditions of noisy environments, spontaneous speech, and low-quality recording devices, the resulting speech transcriptions were far from perfect. Figure 8 summarizes the experiments with various speech recognition word accuracies. To study the impact of speech recognition accuracy variance on the semantic context inference, different settings of a speech recognition system were used. Experiments were conducted on MATBN broadcast news. Compared with imperfect speech recognition results, the correct transcriptions were derived manually and the retrieval was treated as text document retrieval. With the text-based document retrieval, the proposed semantic context inference approach still performed well with a minor improvement compared with the conventional word vector retrieval (TF-IDF) method. The embodiment SCI indexing showed a 4.72% improvement from 69.56% mAP to 74.28% mAP when the word accuracy of speech recognition is 80%. Indeed, the word accuracy is important to construct the semantic relation matrix for context inference. Figure 8 shows that the improvement narrows down when the word accuracy becomes lower. In summary, speech indexing by the embodiment SCI shows better retrieval effectiveness than LSI or TF-IDF.
The following describes an exemplary physical structure of an apparatus used to perform various embodiments. The above-described method and functional apparatus of the example embodiments can be implemented on a computer system 800, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 800, and instructing the computer system 800 to conduct the method of the example embodiment.
The computer system 800 comprises a computer module 802, input modules such as a keyboard 804 and mouse 806, and a plurality of output devices such as a display 808 and printer 810. The computer module 802 is connected to a computer network 812 via a suitable transceiver device 814, to enable access to e.g. the Internet or other network systems such as a Local Area Network (LAN) or Wide Area Network (WAN).
The computer module 802 in the example includes a processor 818, a
Random Access Memory (RAM) 820 and a Read Only Memory (ROM) 822. The computer module 802 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 824 to the display 808, and I/O interface 826 to the keyboard 804.
The components of the computer module 802 typically communicate via an interconnected bus 828 and in a manner known to the person skilled in the relevant art.
The application program is typically supplied to the user of the computer system 800 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilizing a corresponding data storage medium drive of a data storage device 830. The application program is read and controlled in its execution by the processor 818. Intermediate storage of program data may be accomplished using RAM 820.
It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

Claims (1)

1. A method for indexing a document for document retrieval, the method comprising: generating a document vector indicating if each of a plurality of terms are present in the document; calculating a document semantic inference vector for one or more of the plurality of terms present in the document using the document vector and a semantic relation matrix, the semantic relation matrix identifying semantic relationships between different terms of the plurality of terms; and indexing the document using a document semantic context inference vector calculated based on each document semantic inference vector.
2. The method of claim 1, wherein a document semantic inference vector is calculated for each of the plurality of terms present in the document using the document vector and a semantic relation matrix.
3. The method of claim 2, wherein the document semantic context inference vector is calculated by summing together the document semantic inference vectors.

4. The method of any one of claims 1 to 3, further comprising generating the semantic relation matrix by: generating a term-by-document matrix using a plurality of documents, the term-by-document matrix identifying if each of the plurality of terms exists in each of the plurality of documents; and, generating a term-by-term matrix by performing singular value decomposition of the term-by-document matrix, the term-by-term matrix being the semantic relation matrix.
5. The method of claim 4, wherein a term weighting scheme is applied to the term-by-document matrix to suppress noisy terms.
6. The method of claim 5, wherein the term weighting scheme is applied in accordance with the following expression:

a_k^d = ((tf(a_k, d) + 1) / n_d) × log(D / (df(a_k) + 1))

n_d = Σ_k tf(a_k, d)

where a_k^d are weighted terms of the term-by-document matrix W; D denotes a total number of documents in the plurality of documents; K is the number of terms in the plurality of terms; tf(a_k, d) represents a number of occurrences of term a_k in the document d; and df(a_k) is a number of documents that contain at least one occurrence of the term a_k.
7. The method of any one of claims 4 to 6, wherein the term-by-term matrix is generated according to the following expression:

Ŵ = W W^T

where Ŵ is the term-by-term matrix; W is the term-by-document matrix; and T denotes a matrix transposition.
8. The method of claim 7, wherein singular value decomposition of the term-by-term matrix is performed according to the following expression:

Ŵ = U Σ V^T

where Ŵ is the term-by-term matrix; U is a left singular matrix; V is a right singular matrix; Σ is an R × R diagonal matrix whose nonnegative entries are R singular values in a descending order, where R is the order of the decomposition; and T denotes a matrix transposition.
9. The method of claim 8, wherein dimensionality of the term-by-document matrix is reduced based on the following expression:

(1/σ̄) Σ_{r=1}^{R} σ_r ≥ δ, where σ̄ = Σ_{k=1}^{K} σ_k

where δ is a threshold empirically adopted to select the eigenvector U = [u_1, u_2, ..., u_R] based on the eigenvalues Σ = [σ_1, σ_2, ..., σ_R] with the first R dimensions, where R < K denotes the projected dimensions of the original term vector in the eigenspace.

10. The method of claim 9, wherein the term-by-term matrix is generated according to the following expression:

W̃ = U Σ U^T

where W̃ is the term-by-term semantic relation matrix and T denotes the matrix transposition.
11. The method of any preceding claim, further comprising: receiving a search query; and, retrieving the document based on a comparison using the document semantic context inference vector and the search-query.
12. The method of claim 11, wherein retrieving the document further . comprises:
generating a search-query vector indicating if each of the plurality of terms are present in the search-query; calculating a search-query semantic inference vector for one or more of the plurality of terms present in the search-query using the search-query vector and the semantic relation matrix; calculating a search-query semantic context inference vector based on each search-query semantic inference vector; and retrieving the document based on a comparison between the document semantic context inference vector and the search-query semantic context inference vector.
13. The method of claim 12, wherein a search-query semantic inference vector is calculated for each of the plurality of terms present in the search- query using the search-query vector and a semantic relation matrix.
14. The method of claim 13, wherein the search-query semantic context inference vector is calculated by summing together the search-query semantic inference vectors.
15. The method of any one of claims 12 to 14, wherein the comparison between the document semantic context inference vector and the search-query semantic context inference vector is performed in accordance with the following expression:

Sim(q, d) = (q · d) / (||q|| ||d||) = (Σ_e q_e × d_e) / (√(Σ_e q_e²) × √(Σ_e d_e²))

where q and d denote the semantic context inference vectors of search-query q and document d; and e denotes the dimension of the semantic context inference vector.
16. The method of any preceding claim, wherein each document is a spoken document.

17. The method of any preceding claim, wherein a term is a word.
18. An apparatus for indexing a document for document retrieval, the apparatus comprising: at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform:
generate a document vector indicating if each of a plurality of terms are present in the document;
calculate a document semantic inference vector for one or more of the plurality of terms present in the document using the document vector and a semantic relation matrix, the semantic relation matrix identifying semantic relationships between different terms of the plurality of terms; and
index the document using a document semantic context inference vector calculated based on each document semantic inference vector.
19. A computer readable medium for indexing a document for document retrieval, the computer readable medium having stored thereon computer program code which when executed by a computer causes the computer to perform at least the following:
generating a document vector indicating if each of a plurality of terms are present in the document;
calculating a document semantic inference vector for one or more of the plurality of terms present in the document using the document vector and a semantic relation matrix, the semantic relation matrix identifying semantic relationships between different terms of the plurality of terms; and
indexing the document using a document semantic context inference vector calculated based on each document semantic inference vector.
SG2013072921A 2011-03-28 2012-03-28 A method, an apparatus and a computer-readable medium for indexing a document for document retrieval SG193995A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SG2013072921A SG193995A1 (en) 2011-03-28 2012-03-28 A method, an apparatus and a computer-readable medium for indexing a document for document retrieval

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
SG2011021763 2011-03-28
SG2013072921A SG193995A1 (en) 2011-03-28 2012-03-28 A method, an apparatus and a computer-readable medium for indexing a document for document retrieval
PCT/SG2012/000106 WO2012134396A1 (en) 2011-03-28 2012-03-28 A method, an apparatus and a computer-readable medium for indexing a document for document retrieval

Publications (1)

Publication Number Publication Date
SG193995A1 true SG193995A1 (en) 2013-11-29

Family

ID=59011936

Family Applications (1)

Application Number Title Priority Date Filing Date
SG2013072921A SG193995A1 (en) 2011-03-28 2012-03-28 A method, an apparatus and a computer-readable medium for indexing a document for document retrieval

Country Status (3)

Country Link
CN (1) CN103548015B (en)
SG (1) SG193995A1 (en)
WO (1) WO2012134396A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102192678B1 (en) * 2015-10-16 2020-12-17 삼성전자주식회사 Apparatus and method for normalizing input data of acoustic model, speech recognition apparatus
CN108334611B (en) * 2018-02-07 2020-04-24 清华大学 Time sequence visual media semantic index precision enhancing method based on non-negative tensor decomposition
US11397776B2 (en) 2019-01-31 2022-07-26 At&T Intellectual Property I, L.P. Systems and methods for automated information retrieval
CN111524502B (en) * 2020-05-27 2024-04-30 科大讯飞股份有限公司 Language detection method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329867A (en) * 2007-06-21 2008-12-24 西门子(中国)有限公司 Method and device for playing speech on demand
CN101593519B (en) * 2008-05-29 2012-09-19 夏普株式会社 Method and device for detecting speech keywords as well as retrieval method and system thereof
CN101364222B (en) * 2008-09-02 2010-07-28 浙江大学 Two-stage audio search method
CN102023995B (en) * 2009-09-22 2013-01-30 株式会社理光 Speech retrieval apparatus and speech retrieval method
CN101833986B (en) * 2010-05-20 2011-10-05 哈尔滨工业大学 Method for creating three-stage audio index and audio retrieval method

Also Published As

Publication number Publication date
CN103548015B (en) 2017-05-17
CN103548015A (en) 2014-01-29
WO2012134396A1 (en) 2012-10-04

Similar Documents

Publication Publication Date Title
US9330661B2 (en) Accuracy improvement of spoken queries transcription using co-occurrence information
JP6066354B2 (en) Method and apparatus for reliability calculation
Malandrakis et al. Distributional semantic models for affective text analysis
JP6310150B2 (en) Intent understanding device, method and program
US11016968B1 (en) Mutation architecture for contextual data aggregator
US20110224982A1 (en) Automatic speech recognition based upon information retrieval methods
WO2003010754A1 (en) Speech input search system
US20070005345A1 (en) Generating Chinese language couplets
US10872601B1 (en) Natural language processing
Yamamoto et al. Topic segmentation and retrieval system for lecture videos based on spontaneous speech recognition.
CN109033066A (en) A kind of abstract forming method and device
SG193995A1 (en) A method, an apparatus and a computer-readable medium for indexing a document for document retrieval
Juhár et al. Recent progress in development of language model for Slovak large vocabulary continuous speech recognition
Lee et al. Improved spoken term detection using support vector machines based on lattice context consistency
Huang et al. Speech Indexing Using Semantic Context Inference.
Chien Association pattern language modeling
TWI270792B (en) Speech-based information retrieval
Masumura et al. Training a Language Model Using Webdata for Large Vocabulary Japanese Spontaneous Speech Recognition.
Lee et al. Improved open-vocabulary spoken content retrieval with word and subword lattices using acoustic feature similarity
Lee et al. Voice-based Information Retrieval—how far are we from the text-based information retrieval?
Hsu Language modeling for limited-data domains
KR100277690B1 (en) Speech Recognition Using Speech Act Information
Staš et al. Semantic indexing and document retrieval for personalized language modeling
Chiang et al. On jointly learning the parameters in a character-synchronous integrated speech and language model
Lee et al. Improved spoken term detection by discriminative training of acoustic models based on user relevance feedback.