WO2014189399A1 - A mixed-structure n-gram language model - Google Patents
- Publication number
- WO2014189399A1 (PCT/RS2013/000009)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
Definitions
- Figure 1 represents a block diagram of a speech recognition system.
- the acoustic feature vector is used in the acoustic recognition level 101 which relies on the information provided by the acoustic models 102 and the pronunciation (lexical) model 103.
- the result of the acoustic recognition is a set of word sequence hypotheses which are additionally scored on the linguistic recognition level 104 which relies on the information provided by the language model 105.
- Figure 2 shows the training process for the mixed-structure language model proposed by this invention.
- the initial training corpus 200 which contains textual information is used in the block for creating the mixed-structure corpus 201 during which process a record of the sizes of lexemes (corresponding to lemmas) and morphologic classes is kept.
- a POS tagging and lemmatization tool 202 is required for assigning the lemmas and morphologic classes to all the words from the initial training corpus.
- the POS tagger and the lemmatizer rely on the information provided by the morphologic dictionary 203.
- the mixed-structure corpus is forwarded to the block for combining words, lemmas and morphologic classes into mixed-structure n-grams 204, which also records the counts of all n-grams seen in the corpus.
- the initial counts are then smoothed and some probability mass is reserved for "unseen" events by applying the discounting 205, after which the counts are forwarded to the block for calculating the initial probability estimates 206 for all the n-grams.
- These probability estimates are then scaled by the sizes of lexemes (corresponding to lemmas) and morphologic classes contained in the n-grams in the probability recalculation block 207, and the resulting mixed-structure language model 105 is exported to the output textual document.
- Figure 3 shows how the mixed-structure model can be used in the speech recognition phase.
- the mixed structures are sent to the block for generating the list of n-grams 300.
- the block for searching for the most appropriate n-gram probability 301, which relies on the language model 105, applies the first back-off stage, or the second back-off stage if no n-grams are initially found in the model.
- the resulting probability is the estimate which is used in the speech recognition system for scoring the word sequence hypotheses.
- Figure 4 shows how a list of mixed-structure 3-grams is obtained by mixing data corresponding to the three words from the original training corpus. All the combinations are considered and for the given example a list of 27 mixed-structure trigrams is given.
- This invention describes an n-gram language model containing n-grams of mixed structure.
- Each n-gram may contain from 0 to n words in their original (inflected) forms, but also canonical word forms (lemmas) and morphologic classes.
- This invention relies on the existence of a morphologic dictionary, a part-of-speech tagging tool and a lemmatizer, which are language-dependent.
- the mixed-structure language model can, however, be used for different languages, and it is especially useful for highly inflective languages and domain-specific applications, in which cases the lack of training data can degrade the performance of word-based n-gram models.
- the mixed-structure modeling technique ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when small amounts of data are available or when models need to be small (e.g. for applications in mobile phones).
- This type of language model can improve the accuracy of speech recognition systems and it can also introduce improvements into software for spell checking, automatic translation or other tools that use the information about word collocation probabilities.
Abstract
The invention relates to a language model based on mixed-structure n-grams and to a method of determining a word sequence probability based on this type of model. The mixed structure, comprising lemma and morphologic class information for all the words of an n-gram, enables a modeling technique which ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when a small amount of data is available. Also, different pruning techniques may be used in order to reduce the number of n-grams included in the model if a large amount of data is available for training.
Description
A MIXED-STRUCTURE N-GRAM LANGUAGE MODEL
Technical Field
The invention belongs to the field of natural language processing (NLP), specifically to statistical language modeling. It is related to speech recognition and could be applied in other fields such as spell checking and language translation.

Background Art
Language models (LMs) have been used in applications where there is a need to predict the next word based on some context, or to score word sequence hypotheses in order to determine the most probable input sequence, as in speech recognition systems. Statistical n-gram language modeling has been the predominant language modeling framework for decades, and n-gram models play an important role in speech recognition systems. Large vocabulary continuous speech recognition (LVCSR) systems are developed for many languages, and for each of them training corpora need to be prepared in order to obtain high-quality language models. For highly inflective languages, as well as for domain-specific applications, there is usually insufficient training data. Furthermore, for many purposes LMs of acceptable quality can become impractically large. The main object of statistical language modeling is to create a good language representation by using a relatively small number of n-grams. One way to implement such a LM is by grouping words into classes according to some criteria, usually correlated with morphologic or semantic information. Class n-gram models created in this way show promising results, since the problem of insufficient training data is partly solved. Furthermore, class n-gram models are generally of acceptable size. The problem with these models lies in the fact that many words are not adequately represented by the classes they belong to. In addition to defining the classes of n-gram LMs, history clustering techniques are used to create relatively small LMs with good accuracy. Combining n-gram models with context-free grammars (CFGs) has also shown good results. In these cases LM sizes can be reduced, usually by applying the entropy-based pruning technique, while still giving satisfactory quality.
Different types of models can be combined in order to obtain better results for particular applications. This is usually done by combining the probabilities returned by different LMs for a given word sequence with appropriate weighting coefficients which are usually determined empirically.
Some of the protected solutions and research papers that represent the prior art of the proposed invention are listed below.
The patent EP1290676 B1, filed May 23rd 2001, entitled "Creating a unified task dependent language models with information retrieval techniques", relates to a method for creating a language model from a task-independent corpus for a language processing system. The language model includes a plurality of context-free grammars and a hybrid n-gram model.

The patent EP1046157 B1, filed October 11th 1999, entitled "Method of determining parameters of a statistical language model", relates to a method of determining parameters of a statistical language model for automatic speech recognition where elements of a vocabulary are combined so as to form context-independent vocabulary element categories.
The patent application EP1320086 A1, filed December 13th 2001, entitled "Method for generating and/or adapting language models", relates to a method for generating and/or adapting language models for continuously spoken speech recognition purposes, comprising two steps: first, generating an initial language model (ILM), and second, adapting said initial language model (ILM) by using a specially defined vocabulary.

The patent application EP1528539 A1, filed October 29th 2004, entitled "A system and method of using Meta-Data in language modeling", relates to a system and method using meta-data for building language models to improve speech processing.
The patent US5640487 A, filed June 7th 1995, entitled "Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models", relates to a system and method for building scalable n-gram language models in such a way that each n-gram is aligned with one of "n" non-intersecting classes. A count is determined for each n-gram, representing the number of times the n-gram occurred in the training data. The n-grams are separated into classes and complement counts are determined. Using these counts and complement counts, factors are determined, and the language model probability is determined using these factors.
The patent application US2011/0161072 A1, filed August 20th 2009, entitled "Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium", relates to a natural language processing technique and to a language model creation technique used in speech recognition and character recognition.
The patent US6154722 A, filed December 18th 1997, entitled "Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an n-gram probability", describes a method and an apparatus for a speech recognition system that uses a language model based on an integrated finite state grammar probability.
The patent US7020606 B1, filed December 2nd 1998, entitled "Voice recognition using a grammar or n-gram procedures", relates to a method for voice recognition wherein a grammar method is combined with an n-gram voice model with statistical word sequence evaluation.
The patent US6073091, filed August 6th 1997, entitled "Apparatus and method for forming a filtered inflected language model for automatic speech recognition", relates to methods and apparatus for forming an inflected language model component of an automatic speech recognizer (ASR) that produces a compact and efficiently accessible set of 2-gram and 3-gram language model probabilities.
The patent EP0801786 B1, filed November 4th 1995, entitled "Method and apparatus for adapting the language model's size in a speech recognition system", relates to a method and an apparatus for adapting, particularly reducing, the size of a language model which comprises word n-grams in a speech recognition system. The patent introduces a mechanism which discards n-grams for which the acoustic part of the system requires less support from the language model in order to recognize correctly.

The scientific paper "Hybrid n-gram Probability Estimation in Morphologically Rich Languages" proposes a hybrid method that joins word-based and morpheme-based language modeling.
The scientific paper "Morphology-Based Language Modeling for Conversational Arabic Speech Recognition" proposes joining a set of morphologic information (chosen by using a generative algorithm) with the words and implementing a back-off procedure in which the morphologic data are dropped one by one if there are not enough training instances in the corpus.
The main difference between the proposed invention and all of the solutions mentioned above is that the invention proposes the combination of three kinds of n-gram constituents, namely words, lemmas and morphologic classes, in order to obtain a mixed-structure language model.
Disclosure of the Invention
The subject of this invention is a language model consisting of mixed-structure n-grams. Namely, three kinds of n-gram constituents are used: words, lemmas and morphologic classes. In order to obtain such a mixed-structure LM, a training corpus has to be created first. For this purpose, a part-of-speech (POS) tagging tool, a lemmatizer and a morphologic dictionary are needed. The morphologic dictionary contains the information about the morphologic categories and the canonical forms for the words appearing in the training corpus. The POS tagging software assigns morphologic classes and the lemmatizer assigns lemmas to the words, creating a training corpus of triples. In this way, a sentence from the original training corpus consisting of e.g. three words (w1 w2 w3) becomes a sentence of three triples ([w1 l1 c1] [w2 l2 c2] [w3 l3 c3]), where the letter l marks the lemmas and c marks the morphologic classes. Let us consider a sentence consisting of three words as an example:
Maja voli cvece. (Maja likes flowers.)
The corresponding string of lemmas would be:
Maja voleti cvet. (Maja like flower.), and the corresponding string of morphologic classes would be:
C1 C2 C3, where C1 represents e.g. all the proper, feminine nouns in nominative singular case, C2 represents all the verbs in the third person singular of the present tense, and C3 represents the neuter gender nouns in nominative plural case.
The string of triples would then look like this:
[Maja Maja C1] [voli voleti C2] [cvece cvet C3].
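The triple-building step can be sketched as follows. This is a minimal illustration in Python, assuming a toy dictionary in place of a real POS tagger, lemmatizer and morphologic dictionary; the entries and class labels are taken from the example sentence above.

```python
# A minimal sketch of the triple-building step.  A real system would use
# a POS tagger, a lemmatizer and a morphologic dictionary; here a toy
# dictionary maps each inflected word to its (lemma, morphologic class).
MORPH_DICT = {
    "Maja":  ("Maja",   "C1"),  # proper feminine noun, nominative singular
    "voli":  ("voleti", "C2"),  # verb, third person singular present
    "cvece": ("cvet",   "C3"),  # neuter noun, nominative plural
}

def to_triples(sentence):
    """Turn a tokenized sentence into (word, lemma, class) triples."""
    return [(w,) + MORPH_DICT[w] for w in sentence]

print(to_triples(["Maja", "voli", "cvece"]))
# [('Maja', 'Maja', 'C1'), ('voli', 'voleti', 'C2'), ('cvece', 'cvet', 'C3')]
```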
When building the LM, the n-grams to be included are created by combining the constituents of the triples. All combinations are taken into account. Such a training principle produces very large language models even when small training corpora are used. For example, if the sentence "Maja voli cvece" appears in the original training corpus, it results in the following trigrams, which are added to the mixed-structure LM (lemmas are marked by curly brackets):
Maja voli cvece
Maja voli {cvet}
Maja voli C3
Maja {voleti} cvece
Maja {voleti} {cvet}
Maja {voleti} C3
Maja C2 cvece
Maja C2 {cvet}
Maja C2 C3
{Maja} voli cvece
{Maja} voli {cvet}
{Maja} voli C3
{Maja} {voleti} cvece
{Maja} {voleti} {cvet}
{Maja} {voleti} C3
{Maja} C2 cvece
{Maja} C2 {cvet}
{Maja} C2 C3
C1 voli cvece
C1 voli {cvet}
C1 voli C3
C1 {voleti} cvece
C1 {voleti} {cvet}
C1 {voleti} C3
C1 C2 cvece
C1 C2 {cvet}
C1 C2 C3
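The exhaustive combination step above can be sketched in a few lines. The curly-bracket notation for lemmas follows the example list, and `itertools.product` generates all 3^n variants of a single n-gram.

```python
from itertools import product

def mixed_ngrams(triples):
    """Expand one n-gram of (word, lemma, class) triples into all 3**n
    mixed-structure n-grams.  Lemmas are marked with curly brackets,
    following the example list above."""
    variants = [(w, "{" + l + "}", c) for (w, l, c) in triples]
    return [" ".join(combo) for combo in product(*variants)]

grams = mixed_ngrams([("Maja", "Maja", "C1"),
                      ("voli", "voleti", "C2"),
                      ("cvece", "cvet", "C3")])
print(len(grams))  # 27
```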
Fortunately, different pruning techniques can be applied in order to significantly reduce the initial models with a relatively small decrease in modeling accuracy. It has been shown that entropy-based pruning gives good results when reducing the number of n-grams in word-based models, but in the case of mixed-structure n-grams this method is not adequate, since it favors morphologic class n-grams, which generally appear more frequently in the training corpus than the n-grams containing lemmas or words. Thus it is more appropriate to set the numbers of particular n-gram structures to be included in the model. This can be done empirically for concrete applications, but it can also be done according to the ratios of the numbers of word, lemma and morphologic class types appearing in the training corpus, or by some other criteria.
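A structure-quota pruning pass of the kind described above might look like the following sketch. The quota values and the token-classification heuristic (class labels of the form C1, C2, ...; lemmas in curly brackets) are illustrative assumptions, not part of the patent text.

```python
from collections import defaultdict

def signature(ngram):
    """Classify each token of an n-gram as word (w), lemma (l) or
    morphologic class (c).  Assumes lemmas are wrapped in curly brackets
    and class labels look like C1, C2, ... (illustrative conventions)."""
    def kind(tok):
        if tok.startswith("{"):
            return "l"
        if tok.startswith("C") and tok[1:].isdigit():
            return "c"
        return "w"
    return tuple(kind(tok) for tok in ngram.split())

def prune_by_structure(ngram_counts, quotas):
    """Keep only the most frequent n-grams of each structure; `quotas`
    maps a signature such as ('w', 'w', 'l') to the number of n-grams
    of that structure allowed in the model."""
    by_structure = defaultdict(list)
    for ngram, count in ngram_counts.items():
        by_structure[signature(ngram)].append((count, ngram))
    kept = {}
    for sig, items in by_structure.items():
        items.sort(reverse=True)  # most frequent first
        for count, ngram in items[:quotas.get(sig, 0)]:
            kept[ngram] = count
    return kept
```

With quotas of one n-gram per allowed structure, only the most frequent representative of each structure survives, and structures with no quota are dropped entirely.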
The main advantage of the mixed-structure n-gram concept is the possibility of including only the most important and reliable information in the model. For example, when observing 3-grams, some word w3 can appear with different word histories, but these histories can all be assigned to a single morphologic class (or lemma) history c1c2 (l1l2). Therefore, if the word appears frequently, it is useful to include the n-gram consisting of the morphologic classes comprising the history (context) and the final word in its original (inflected) form, c1c2w3. Besides that, the 3-gram consisting of the same history and the final word replaced by its corresponding morphologic class (c1c2c3) may be included in the model as a separate entry. Therefore, the mixed-structure n-gram model keeps the information about the most frequent morphologic structures appearing in the training corpus, but also keeps the information about particular words (or lemmas) that stand out as common constituents of some of the structures. The mixed-structure LM thus takes advantage of all three types of information carriers contained in the training corpus in a way that represents a compromise between robustness to the lack of training data and modeling accuracy. Once created, such a LM is easier to use than e.g. the combined models of words, lemmas and morphologic classes, because only one document is searched to obtain the resulting probability for a given textual content.
The described language model can be used to estimate the probability of some textual content in a way that resembles the Katz back-off algorithm. The probability of a word sequence is commonly calculated by multiplying the probabilities of the n-grams consisting of each word from the sequence and its corresponding history. The order of n-grams which is commonly used is 3 (trigrams). The Katz algorithm implies using probabilities of n-grams included in the model to estimate the probability of the input text, but if some of the relevant n-gram probabilities are not included in the model, lower order n-grams ((n-1)-grams) are used instead. The switching to the lower order n-grams is done iteratively when needed, but it is penalized by back-off coefficients. The main difference between the Katz algorithm and the algorithm used to find the probability of the input text based on the mixed-structure language model is that the latter algorithm implies two stages of back-off.
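The standard sequence-scoring step can be sketched as follows. The sentence-start padding tokens and the function names are illustrative assumptions; the trigram probability function is treated as a black box that would internally perform the back-off search described in the text.

```python
import math

def sequence_log_probability(words, trigram_prob):
    """Score a word sequence by summing the log-probabilities of each
    word given its two-word history, as in standard trigram modeling.
    `trigram_prob` is any callable returning P(w3 | w1 w2); the <s>
    padding tokens are an illustrative convention."""
    padded = ["<s>", "<s>"] + list(words)
    return sum(math.log(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
               for i in range(2, len(padded)))
```

With a uniform trigram probability of 0.1, the three-word example sentence scores 3 log 0.1, since three trigram probabilities are multiplied.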
The first back-off stage refers to the choice of the optimal n-gram from a set of n-grams corresponding to the input word sequence of length n. For example, assuming that a trigram LM is used, if the input word sequence consists of the words w1 w2 w3, there may be more than one corresponding trigram included in the LM, such as w1 w2 w3, {l1} w2 w3, c1 w2 w3, and so on. When choosing the optimal trigram probability, it is best to first search for a trigram consisting of words. If it does not exist, trigrams containing lemmas and/or morphologic classes are taken into account. The hierarchy of the trigram structures can be defined in a variety of ways, but it is best to consider the size of the training corpus when defining it. The probabilities of word n-grams are naturally lower than the probabilities of n-grams containing the corresponding lemmas and/or morphologic classes (besides words), since all the structures are treated as equal in the training phase even though lemmas and morphologic classes appear more frequently than the actual words. In a situation when the model does not contain the word n-gram for which the probability estimate is needed and an appropriate n-gram containing lemmas and/or morphologic classes exists, the probability of the existing n-gram must be scaled in order to obtain an adequate word n-gram probability estimate. One way to do this is by dividing the probability by the sizes of the morphologic classes and lexemes corresponding to the lemmas contained in the n-gram, where the size of a lexeme (or morphologic class) is determined as the number of types (different words) it is assigned to in the training corpus. For example, if a morphologic class is defined as a noun in the genitive case and masculine gender, and if 200 different words found in the training corpus fall into that category, then the size of this morphologic class is 200. Furthermore, if an n-gram w1 l2 c3 consists of one word, one lemma representing a lexeme of size 5 and one morphologic class of size 200, the probability of this n-gram should be divided by a factor of 5 × 200 = 1000. Of course, other ways of scaling the initial probabilities may be defined. In any case, the scaling should be done in the training phase so that the resulting model contains probabilities which are ready for use.
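The scaling rule from the worked example (dividing by 5 × 200 = 1000) can be sketched as follows. The curly-bracket lemma notation and the two size dictionaries are illustrative assumptions.

```python
def scaled_probability(initial_prob, tokens, lexeme_sizes, class_sizes):
    """Scale a mixed n-gram probability down by the sizes of the lexemes
    and morphologic classes it contains.  `lexeme_sizes` and
    `class_sizes` hold the number of word types assigned to each lemma
    and class in the training corpus (illustrative data layout)."""
    factor = 1
    for tok in tokens:
        if tok.startswith("{"):          # lemma: divide by its lexeme size
            factor *= lexeme_sizes[tok.strip("{}")]
        elif tok in class_sizes:         # morphologic class
            factor *= class_sizes[tok]
    return initial_prob / factor

# The n-gram from the worked example: one word, a lexeme of size 5 and a
# morphologic class of size 200, so the divisor is 5 * 200 = 1000.
p = scaled_probability(0.4, ["w1", "{l2}", "C3"], {"l2": 5}, {"C3": 200})
```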
The second back-off stage refers to resorting to (n-1)-gram probabilities when no n-gram probabilities correspond with the input word sequence (which is analogous to the back-off procedure described by the Katz algorithm). The switch to a lower-order n-gram should be penalized, and then the first back-off stage can be applied to the set of (n-1)-grams found in the model, assuming that at least one (n-1)-gram is found. If not, the process is repeated iteratively. The lower-order back-off penalization for the mixed-structure model would be complicated and computationally inefficient if weights analogous to the Katz back-off coefficients for word n-gram models were to be calculated. Instead, the penalization can be done simply by
multiplying the acquired (n-1)-gram probability with the default n-gram probability obtained during the training process. This default value represents the probability mass reserved for unseen events, calculated during the Good-Turing (GT) discounting which is the initial step in statistical LM training (although other discounting techniques may be used, e.g. Kneser-Ney or absolute discounting). If no entries corresponding to the input word sequence are found in the model, a default GT value for unigrams is returned. This may be a very small probability, but it is never zero, which is important for further calculations.
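The two back-off stages can be combined into a single lookup, sketched below. The data structures and names are illustrative assumptions; in a real recognizer this would be integrated with the decoder and the model's storage format:

```python
# Illustrative sketch of the two back-off stages. `model` maps
# mixed-structure n-gram tuples to scaled probabilities; `default_prob[k]`
# is the Good-Turing mass reserved for unseen k-grams during training.

def estimate(model, default_prob, candidates, n):
    # candidates[k]: the mixed-structure k-grams built for the input
    # sequence, ordered by the chosen structure hierarchy (word k-gram
    # first, then k-grams with lemmas and/or morphologic classes).
    penalty = 1.0
    for k in range(n, 0, -1):
        # First back-off stage: best structure among same-order k-grams.
        for ngram in candidates.get(k, []):
            if ngram in model:
                return penalty * model[ngram]
        # Second back-off stage: drop to order k-1, penalized by the
        # default probability reserved for unseen k-grams.
        penalty *= default_prob[k]
    # Nothing matched at any order: return the unigram default, which is
    # very small but never zero.
    return default_prob[1]
```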
Brief Description of the Drawings
Figure 1 - shows a block diagram of a large-vocabulary continuous speech recognition system and illustrates the role of the language model.
Figure 2 - contains details about the mixed-structure language model training.
Figure 3 - describes how a mixed-structure language model is used to estimate a word-sequence probability.
Figure 4 - illustrates how the words, lemmas and morphologic classes are combined to create the mixed-structure n-grams. The example shows the list of 3-grams acquired by mixing data corresponding to three words from the original training corpus.
Best Mode for Carrying Out the Invention
This invention presents a language model based on mixed-structure n-grams; the following figures illustrate it in detail.
Figure 1 represents a block diagram of a speech recognition system. The object of the system is to find a word sequence W = w1...wn which maximizes the product of two probabilities, P(W)P(X|W), where X represents the acoustic feature vector extracted from the speech signal carrying the message W in the feature extraction block 100. The acoustic feature vector is used on the acoustic recognition level 101, which relies on the information provided by the acoustic models 102 and the pronunciation (lexical) model 103. The result of the acoustic recognition is a set of word sequence hypotheses which are additionally scored on the linguistic recognition level 104, which relies on the information provided by the language model 105.
Figure 2 shows the training process for the mixed-structure language model proposed by this invention. The initial training corpus 200 which contains textual information is used in the block for creating the mixed-structure corpus 201 during which process a record of the sizes of lexemes (corresponding to lemmas) and morphologic classes is kept. In order to create mixed-structures, a POS tagging and lemmatization tool 202 is required for assigning the
lemmas and morphologic classes to all the words from the initial training corpus. The POS tagger and the lemmatizer rely on the information provided by the morphologic dictionary 203. The mixed-structure corpus is forwarded to the block for combining words, lemmas and morphologic classes into mixed-structure n-grams 204, which also records the counts of all n-grams seen in the corpus. The initial counts are then smoothed and some probability mass is reserved for "unseen" events by applying the discounting 205, after which the counts are forwarded to the block for calculating the initial probability estimates 206 for all the n-grams. These probability estimates are then scaled by the sizes of the lexemes (corresponding to lemmas) and morphologic classes contained in the n-grams in the probability recalculation block 207, and the resulting mixed-structure language model 105 is exported to the output textual document.
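The training pipeline of Figure 2 can be sketched as follows. This is a toy illustration, not the patented implementation: a real system would use a POS tagger, a lemmatizer and a proper discounting method, whereas here a simple absolute discount stands in, and lemmas and classes are assumed to be strings distinct from the surface words:

```python
from collections import Counter
from itertools import product

def train(tagged_corpus, n=2, discount=0.5):
    # tagged_corpus: list of (word, lemma, morphologic class) triples.
    # Sizes of lexemes and morphologic classes (block 201): the number of
    # distinct word types assigned to each lemma / class.
    types = {}
    for word, lemma, mclass in tagged_corpus:
        types.setdefault(lemma, set()).add(word)
        types.setdefault(mclass, set()).add(word)
    size = {key: len(words) for key, words in types.items()}

    # Mixed-structure n-grams and their counts (block 204).
    counts = Counter()
    for i in range(len(tagged_corpus) - n + 1):
        for ngram in product(*tagged_corpus[i:i + n]):
            counts[ngram] += 1
    total = sum(counts.values())

    # Discounted initial estimates (blocks 205, 206), then scaling by the
    # sizes of the constituent lemmas and classes (block 207); surface
    # words contribute a scaling factor of 1.
    model = {}
    for ngram, count in counts.items():
        prob = (count - discount) / total
        factor = 1
        for token in ngram:
            factor *= size.get(token, 1)
        model[ngram] = prob / factor
    return model
```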
Figure 3 shows how the mixed-structure model can be used in the speech recognition phase. For the input word sequence, lemmas and morphologic classes first have to be determined by the POS tagging and lemmatization tool 202, which uses the morphologic dictionary 203. The mixed structures are sent to the block for generating the list of n-grams 300. The block for searching for the most appropriate n-gram probability 301, which relies on the language model 105, applies the first back-off stage, or the second back-off stage if no n-grams are initially found in the model. The resulting probability is the estimate which is used in the speech recognition system for scoring the word sequence hypotheses.
Figure 4 shows how a list of mixed-structure 3-grams is obtained by mixing data corresponding to the three words from the original training corpus. All the combinations are considered, and for the given example a list of 27 mixed-structure trigrams is given.
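The combination step of Figure 4 amounts to a Cartesian product over the per-word alternatives. The word/lemma/class triples below are invented for illustration, not data from the patent:

```python
from itertools import product

# Each position carries (word, lemma, morphologic class); the analyses
# below are invented examples.
analyses = [
    ("houses", "house", "N-PL"),
    ("were", "be", "V-PAST-PL"),
    ("built", "build", "V-PART"),
]

# Combining the three alternatives per position in all possible ways
# yields the mixed-structure trigrams: 3 x 3 x 3 = 27 of them.
mixed_trigrams = list(product(*analyses))
print(len(mixed_trigrams))  # 27
```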
Industrial Applicability
This invention describes an n-gram language model containing n-grams of mixed structures. Each n-gram may contain from 0 to n words in their original (inflected) forms, but also canonical word forms (lemmas) and morphologic classes. The invention relies on the existence of a morphologic dictionary, a part-of-speech tagging tool and a lemmatizer, which are language-dependent. The mixed-structure language model can, however, be used for different languages, and it is especially useful for highly inflective languages and domain-specific applications, in which cases the lack of training data can degrade the performance of word-based n-gram models. The mixed-structure modeling technique ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when small amounts of data are available or when models need to be small (e.g. for applications in mobile phones). This type of language model can improve the accuracy of speech recognition systems, and it can also introduce improvements into software for spell checking, automatic translation or other tools that use information about word collocation probabilities.
Claims
1. A method of creating a mixed-structure n-gram language model comprising steps of: assigning lemmas and morphologic classes to all the words from the training corpus (202, 203);
calculating the sizes of lexemes corresponding to lemmas and the sizes of morphologic classes (201);
creating mixed-structure n-grams (204);
calculating the corresponding counts of all different n-grams;
applying a discounting technique (205) on the calculated counts of all different n-grams; determining the n-gram probabilities (206) for all said different n-grams; and
scaling said n-gram probabilities (207), characterized by said mixed-structure n-grams created by combination of triples of words, lemmas and morphologic classes in all possible ways and said scaling of said n-gram probabilities by the sizes of the lemmas and morphologic classes contained by the n-grams.
2. A method as claimed in claim 1, characterized in that the size is calculated as the number of different word forms which are assigned to the same lexeme or morphologic class.
3. A method as claimed in claim 1, characterized in that the step of applying a discounting technique (205) means the smoothing of the corresponding counts and reserving some probability mass for unseen events.
4. A method as claimed in claim 1, characterized in that the step of applying a discounting technique (205) may apply any of the following algorithms: Good-Turing, absolute, Jelinek-Mercer, Laplace, Kneser-Ney and Bayesian.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RS2013/000009 WO2014189399A1 (en) | 2013-05-22 | 2013-05-22 | A mixed-structure n-gram language model |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014189399A1 true WO2014189399A1 (en) | 2014-11-27 |
Family
ID=48747698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/RS2013/000009 WO2014189399A1 (en) | 2013-05-22 | 2013-05-22 | A mixed-structure n-gram language model |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2014189399A1 (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5640487A (en) | 1993-02-26 | 1997-06-17 | International Business Machines Corporation | Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models |
US6073091A (en) | 1997-08-06 | 2000-06-06 | International Business Machines Corporation | Apparatus and method for forming a filtered inflected language model for automatic speech recognition |
EP0801786B1 (en) | 1995-11-04 | 2000-06-28 | International Business Machines Corporation | Method and apparatus for adapting the language model's size in a speech recognition system |
US6154722A (en) | 1997-12-18 | 2000-11-28 | Apple Computer, Inc. | Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an N-gram probability |
EP1320086A1 (en) | 2001-12-13 | 2003-06-18 | Sony International (Europe) GmbH | Method for generating and/or adapting language models |
EP1046157B1 (en) | 1998-10-21 | 2004-03-10 | Koninklijke Philips Electronics N.V. | Method of determining parameters of a statistical language model |
EP1528539A1 (en) | 2003-10-30 | 2005-05-04 | AT&T Corp. | A system and method of using Meta-Data in language modeling |
US7020606B1 (en) | 1997-12-11 | 2006-03-28 | Harman Becker Automotive Systems Gmbh | Voice recognition using a grammar or N-gram procedures |
EP1290676B1 (en) | 2000-06-01 | 2006-10-18 | Microsoft Corporation | Creating a unified task dependent language models with information retrieval techniques |
US20110161072A1 (en) | 2008-08-20 | 2011-06-30 | Nec Corporation | Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium |
US20120278060A1 (en) * | 2011-04-27 | 2012-11-01 | Xerox Corporation | Method and system for confidence-weighted learning of factored discriminative language models |
Non-Patent Citations (4)
Title |
---|
BROWN P F ET AL: "CLASS-BASED N-GRAM MODELS OF NATURAL LANGUAGE", COMPUTATIONAL LINGUISTICS, CAMBRIDGE, MA, US, vol. 18, no. 4, 1 December 1992 (1992-12-01), pages 467 - 479, XP000892488 * |
KIRCHHOFF K ET AL: "Morphology-based language modeling for conversational Arabic speech recognition", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 20, no. 4, 1 October 2006 (2006-10-01), pages 589 - 608, XP024930246, ISSN: 0885-2308, [retrieved on 20061001], DOI: 10.1016/J.CSL.2005.10.001 * |
STEVAN OSTROGONAC ET AL: "A language model for highly inflective non-agglutinative languages", INTELLIGENT SYSTEMS AND INFORMATICS (SISY), 2012 IEEE 10TH JUBILEE INTERNATIONAL SYMPOSIUM ON, IEEE, 20 September 2012 (2012-09-20), pages 177 - 181, XP032265283, ISBN: 978-1-4673-4751-8, DOI: 10.1109/SISY.2012.6339510 * |
TOMAS BRYCHCIN ET AL: "Morphological based language models for inflectional languages", INTELLIGENT DATA ACQUISITION AND ADVANCED COMPUTING SYSTEMS (IDAACS), 2011 IEEE 6TH INTERNATIONAL CONFERENCE ON, IEEE, 15 September 2011 (2011-09-15), pages 560 - 563, XP031990283, ISBN: 978-1-4577-1426-9, DOI: 10.1109/IDAACS.2011.6072829 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109871534A (en) * | 2019-01-10 | 2019-06-11 | 北京海天瑞声科技股份有限公司 | Generation method, device, equipment and the storage medium of China and Britain's mixing corpus |
CN109871534B (en) * | 2019-01-10 | 2020-03-24 | 北京海天瑞声科技股份有限公司 | Method, device and equipment for generating Chinese-English mixed corpus and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13734523 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13734523 Country of ref document: EP Kind code of ref document: A1 |