WO2014189399A1 - A mixed-structure n-gram language model - Google Patents
- Publication number
- WO2014189399A1 (PCT/RS2013/000009)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
Definitions
- Figure 1 represents a block diagram of a speech recognition system.
- the acoustic feature vector is used in the acoustic recognition level 101 which relies on the information provided by the acoustic models 102 and the pronunciation (lexical) model 103.
- the result of the acoustic recognition is a set of word sequence hypotheses which are additionally scored on the linguistic recognition level 104 which relies on the information provided by the language model 105.
- Figure 2 shows the training process for the mixed-structure language model proposed by this invention.
- the initial training corpus 200 which contains textual information is used in the block for creating the mixed-structure corpus 201 during which process a record of the sizes of lexemes (corresponding to lemmas) and morphologic classes is kept.
- a POS tagging and lemmatization tool 202 is required for assigning the lemmas and morphologic classes to all the words from the initial training corpus.
- the POS tagger and the lemmatizer rely on the information provided by the morphologic dictionary 203.
- the mixed-structure corpus is forwarded to the block for combining words, lemmas and morphologic classes into mixed-structure n-grams 204, which also records the counts of all n-grams seen in the corpus.
- the initial counts are then smoothed and some probability mass is reserved for "unseen" events by applying the discounting 205, after which the counts are forwarded to the block for calculating the initial probability estimates 206 for all the n-grams.
- These probability estimates are then scaled by the sizes of lexemes (corresponding to lemmas) and morphologic classes contained in the n-grams in the probability recalculation block 207, and the resulting mixed-structure language model 105 is exported to the output textual document.
- Figure 3 shows how the mixed-structure model can be used in the speech recognition phase.
- the mixed structures are sent to the block for generating the list of n-grams 300.
- the block for searching for the most appropriate n-gram probability 301, which relies on the language model 105, applies the first back-off stage, or the second back-off stage if no n-grams are initially found in the model.
- the resulting probability is the estimate which is used in the speech recognition system for scoring the word sequence hypotheses.
- Figure 4 shows how a list of mixed-structure 3-grams is obtained by mixing data corresponding to the three words from the original training corpus. All the combinations are considered and for the given example a list of 27 mixed-structure trigrams is given.
- This invention describes an n-gram language model containing n-grams of mixed structure.
- Each n-gram may contain from 0 to n words in their original (inflected) forms, but also canonical word forms (lemmas) and morphologic classes.
- This invention relies on the existence of a morphologic dictionary, a part-of-speech tagging tool and a lemmatizer, which are language-dependent.
- the mixed-structure language model can, however, be used for different languages, and it is especially useful for highly inflective languages and domain-specific applications, in which cases the lack of training data can degrade the performance of word-based n-gram models.
- the mixed-structure modeling technique ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when small amounts of data are available or when models need to be small (e.g. for applications in mobile phones).
- This type of language model can improve the accuracy of speech recognition systems and it can also introduce improvements into software for spell checking, automatic translation or other tools that use the information about word collocation probabilities.
Abstract
The invention relates to a language model based on mixed-structure n-grams and to a method of determining a word sequence probability based on this type of model. The mixed structure, comprising lemma and morphologic class information for all the words of an n-gram, enables a modeling technique which ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when a small amount of data is available. Also, different pruning techniques may be used in order to reduce the number of n-grams included in the model if a large amount of data is available for training.
Description
A MIXED-STRUCTURE N-GRAM LANGUAGE MODEL
Technical Field
The invention belongs to the field of natural language processing (NLP), specifically to statistical language modeling. It is related to speech recognition and could be applied in other fields such as spell checking and language translation.

Background Art
Language models (LMs) have been used in applications where there is a need to predict the next word based on some context, or to score word sequence hypotheses in order to determine the most probable input sequence, as in speech recognition systems. Statistical n-gram language modeling has been the predominant language modeling framework for decades, and n-gram models play an important role in speech recognition systems. Large vocabulary continuous speech recognition (LVCSR) systems are developed for many languages, and for each of them training corpora need to be prepared in order to obtain high-quality language models. For highly inflective languages, as well as for domain-specific applications, there is usually insufficient training data. Furthermore, for many purposes LMs of acceptable quality can become impractically large. The main object of statistical language modeling is to create a good language representation by using a relatively small number of n-grams. One way to implement such a LM is by grouping words into classes according to some criteria, usually correlated with morphologic or semantic information. Class n-gram models created in this way show promising results, since the problem of insufficient training data is partly solved. Furthermore, class n-gram models are generally of acceptable size. The problem with these models lies in the fact that many words are not adequately represented by the classes they belong to. In addition to defining the classes of n-gram LMs, history clustering techniques are used to create relatively small LMs with good accuracy. Combining n-gram models with context-free grammars (CFGs) has also shown good results. In these cases LM sizes can be reduced, usually by applying the entropy-based pruning technique, while still giving satisfactory quality.
Different types of models can be combined in order to obtain better results for particular applications. This is usually done by combining the probabilities returned by different LMs for a given word sequence with appropriate weighting coefficients which are usually determined empirically.
Some of the protected solutions and research papers that represent the prior art of the proposed invention are listed below.
The patent EP1290676 B1, filed May 23rd 2001, entitled "Creating a unified task dependent language models with information retrieval techniques", relates to a method for creating a language model from a task-independent corpus for a language processing system. The language model includes a plurality of context-free grammars and a hybrid n-gram model.

The patent EP1046157 B1, filed October 11th 1999, entitled "Method of determining parameters of a statistical language model", relates to a method of determining parameters of a statistical language model for automatic speech recognition where elements of a vocabulary are combined so as to form context-independent vocabulary element categories.
The patent application EP1320086 A1, filed December 13th 2001, entitled "Method for generating and/or adapting language models", relates to a method for generating and/or adapting language models for continuously spoken speech recognition purposes, comprising two steps: first, generating an initial language model (ILM), and second, adapting said initial language model (ILM) by using a specially defined vocabulary.

The patent application EP1528539 A1, filed October 29th 2004, entitled "A system and method of using Meta-Data in language modeling", relates to a system and method using meta-data for building language models to improve speech processing.
The patent US5640487 A, filed June 7th 1995, entitled "Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models", relates to a system and method for building scalable n-gram language models in such a way that each n-gram is aligned with one of "n" non-intersecting classes. A count is determined for each n-gram, representing the number of times the n-gram occurred in the training data. The n-grams are separated into classes and complement counts are determined. Using these counts and complement counts, factors are determined, and the language model probability is determined using these factors.
The patent application US2011/0161072 A1, filed August 20th 2009, entitled "Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium", relates to a natural language processing technique and to a language model creation technique used in speech recognition and character recognition.
The patent US6154722 A, filed December 18th 1997, entitled "Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an n-gram probability", describes a method and an apparatus for a speech recognition system that uses a language model based on an integrated finite state grammar probability.
The patent US7020606 B1, filed December 2nd 1998, entitled "Voice recognition using a grammar or n-gram procedures", relates to a method for voice recognition wherein a grammar method is combined with an n-gram voice model with statistical word sequence evaluation.
The patent US6073091, filed August 6th 1997, entitled "Apparatus and method for forming a filtered inflected language model for automatic speech recognition", relates to methods and apparatus for forming an inflected language model component of an automatic speech recognizer (ASR) that produces a compact and efficiently accessible set of 2-gram and 3-gram language model probabilities.
The patent EP0801786 B1, filed November 4th 1995, entitled "Method and apparatus for adapting the language model's size in a speech recognition system", relates to a method and an apparatus for adapting, particularly reducing, the size of a language model which comprises word n-grams in a speech recognition system. The patent introduces a mechanism which discards n-grams for which the acoustic part of the system requires less support from the language model in order to recognize correctly.

The scientific paper "Hybrid n-gram Probability Estimation in Morphologically Rich Languages" proposes a hybrid method that joins word-based and morpheme-based language modeling.
The scientific paper "Morphology-Based Language Modeling for Conversational Arabic Speech Recognition" proposes joining a set of morphologic information (chosen by using a generative algorithm) with the words and implementing a back-off procedure in which the morphologic data are dropped one by one if there are not enough training instances in the corpus.
The main difference between the proposed invention and all of the solutions mentioned above is that the invention proposes the combination of three kinds of n-gram constituents, namely words, lemmas and morphologic classes, in order to obtain a mixed-structure language model.
Disclosure of the Invention
The subject of this invention is a language model consisting of mixed-structure n-grams. Namely, three kinds of n-gram constituents are used: words, lemmas and morphologic classes. In order to obtain such a mixed-structure LM, a training corpus has to be created first. For this purpose, a part-of-speech (POS) tagging tool, a lemmatizer and a morphologic dictionary are needed. The morphologic dictionary contains the information about the morphologic categories and the canonical forms for the words appearing in the training corpus. The POS tagging software assigns morphologic classes and the lemmatizer assigns lemmas to the words, creating a training corpus of triples. In this way, a sentence from the original training corpus consisting of e.g. three words (w1 w2 w3) becomes a sentence of three triples ([w1 l1 c1] [w2 l2 c2] [w3 l3 c3]), where the letter l marks the lemmas and c marks the morphologic classes. Let us consider a sentence consisting of three words as an example:
Maja voli cvece. (Maja likes flowers.)
The corresponding string of lemmas would be:
Maja voleti cvet. (Maja like flower.), and the corresponding string of morphologic classes would be:
C1 C2 C3, where C1 represents e.g. all the proper, feminine nouns in nominative singular case, C2 represents all the verbs in the third person singular of the present tense, and C3 represents the neuter gender nouns in nominative plural case.
The string of triples would then look like this:
[Maja Maja C1] [voli voleti C2] [cvece cvet C3].
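The triple-building step can be sketched as follows. This is a minimal illustration in Python, assuming a toy dictionary in place of a real POS tagger, lemmatizer and morphologic dictionary; the entries and class labels are taken from the example sentence above.

```python
# A minimal sketch of the triple-building step.  A real system would use
# a POS tagger, a lemmatizer and a morphologic dictionary; here a toy
# dictionary maps each inflected word to its (lemma, morphologic class).
MORPH_DICT = {
    "Maja":  ("Maja",   "C1"),  # proper feminine noun, nominative singular
    "voli":  ("voleti", "C2"),  # verb, third person singular present
    "cvece": ("cvet",   "C3"),  # neuter noun, nominative plural
}

def to_triples(sentence):
    """Turn a tokenized sentence into (word, lemma, class) triples."""
    return [(w,) + MORPH_DICT[w] for w in sentence]

print(to_triples(["Maja", "voli", "cvece"]))
# [('Maja', 'Maja', 'C1'), ('voli', 'voleti', 'C2'), ('cvece', 'cvet', 'C3')]
```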
When building the LM, the n-grams to be included are created by combining the constituents of the triples. All combinations are taken into account. Such a training principle produces very large language models even when small training corpora are used. For example, if the sentence "Maja voli cvece" appears in the original training corpus, it results in the following trigrams, which are added to the mixed-structure LM (lemmas are marked by curly brackets):
Maja voli cvece
Maja voli {cvet}
Maja voli C3
Maja {voleti} cvece
Maja {voleti} {cvet}
Maja {voleti} C3
Maja C2 cvece
Maja C2 {cvet}
Maja C2 C3
{Maja} voli cvece
{Maja} voli {cvet}
{Maja} voli C3
{Maja} {voleti} cvece
{Maja} {voleti} {cvet}
{Maja} {voleti} C3
{Maja} C2 cvece
{Maja} C2 {cvet}
{Maja} C2 C3
C1 voli cvece
C1 voli {cvet}
C1 voli C3
C1 {voleti} cvece
C1 {voleti} {cvet}
C1 {voleti} C3
C1 C2 cvece
C1 C2 {cvet}
C1 C2 C3
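The exhaustive combination step above can be sketched in a few lines. The curly-bracket notation for lemmas follows the example list, and `itertools.product` generates all 3^n variants of a single n-gram.

```python
from itertools import product

def mixed_ngrams(triples):
    """Expand one n-gram of (word, lemma, class) triples into all 3**n
    mixed-structure n-grams.  Lemmas are marked with curly brackets,
    following the example list above."""
    variants = [(w, "{" + l + "}", c) for (w, l, c) in triples]
    return [" ".join(combo) for combo in product(*variants)]

grams = mixed_ngrams([("Maja", "Maja", "C1"),
                      ("voli", "voleti", "C2"),
                      ("cvece", "cvet", "C3")])
print(len(grams))  # 27
```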
Fortunately, different pruning techniques can be applied in order to significantly reduce the initial models with a relatively small decrease in modeling accuracy. It has been shown that entropy-based pruning gives good results when reducing the number of n-grams in word-based models, but in the case of mixed-structure n-grams this method is not adequate, since it favors morphologic class n-grams, which generally appear more frequently in the training corpus than the n-grams containing lemmas or words. Thus it is more appropriate to set the numbers of particular n-gram structures to be included in the model. This can be done empirically for concrete applications, but it can also be done according to the ratios of the numbers of word, lemma and morphologic class types appearing in the training corpus, or by some other criteria.
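A structure-quota pruning pass of the kind described above might look like the following sketch. The quota values and the token-classification heuristic (class labels of the form C1, C2, ...; lemmas in curly brackets) are illustrative assumptions, not part of the patent text.

```python
from collections import defaultdict

def signature(ngram):
    """Classify each token of an n-gram as word (w), lemma (l) or
    morphologic class (c).  Assumes lemmas are wrapped in curly brackets
    and class labels look like C1, C2, ... (illustrative conventions)."""
    def kind(tok):
        if tok.startswith("{"):
            return "l"
        if tok.startswith("C") and tok[1:].isdigit():
            return "c"
        return "w"
    return tuple(kind(tok) for tok in ngram.split())

def prune_by_structure(ngram_counts, quotas):
    """Keep only the most frequent n-grams of each structure; `quotas`
    maps a signature such as ('w', 'w', 'l') to the number of n-grams
    of that structure allowed in the model."""
    by_structure = defaultdict(list)
    for ngram, count in ngram_counts.items():
        by_structure[signature(ngram)].append((count, ngram))
    kept = {}
    for sig, items in by_structure.items():
        items.sort(reverse=True)  # most frequent first
        for count, ngram in items[:quotas.get(sig, 0)]:
            kept[ngram] = count
    return kept
```

With quotas of one n-gram per allowed structure, only the most frequent representative of each structure survives, and structures with no quota are dropped entirely.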
The main advantage of the mixed-structure n-gram concept is the possibility of including only the most important and reliable information in the model. For example, when observing 3-grams, some word w3 can appear with different word histories, but these histories can all be assigned to a single morphologic class (or lemma) history c1c2 (l1l2). Therefore, if the word appears frequently, it is useful to include the n-gram consisting of the morphologic classes comprising the history (context) and the final word in its original (inflected) form, c1c2w3. Besides that, the 3-gram consisting of the same history and the final word replaced by its corresponding morphologic class (c1c2c3) may be included in the model as a separate entry. Therefore, the mixed-structure n-gram model keeps the information about the most frequent morphologic structures appearing in the training corpus, but also keeps the information about particular words (or lemmas) that stand out as common constituents of some of the structures. The mixed-structure LM thus takes advantage of all three types of information carriers contained in the training corpus in a way that represents a compromise between robustness to the lack of training data and modeling accuracy. Once created, such a LM is easier to use than e.g. the combined models of words, lemmas and morphologic classes, because only one document is searched to obtain the resulting probability for a given textual content.
The described language model can be used to estimate the probability of some textual content in a way that resembles the Katz back-off algorithm. The probability of a word sequence is commonly calculated by multiplying the probabilities of the n-grams consisting of each word from the sequence and its corresponding history. The order of n-grams which is commonly used is 3 (trigrams). The Katz algorithm implies using probabilities of n-grams included in the model to estimate the probability of the input text, but if some of the relevant n-gram probabilities are not included in the model, lower order n-grams ((n-1)-grams) are used instead. The switching to the lower order n-grams is done iteratively when needed, but it is penalized by back-off coefficients. The main difference between the Katz algorithm and the algorithm used to find the probability of the input text based on the mixed-structure language model is that the latter algorithm implies two stages of back-off.
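The standard sequence-scoring step can be sketched as follows. The sentence-start padding tokens and the function names are illustrative assumptions; the trigram probability function is treated as a black box that would internally perform the back-off search described in the text.

```python
import math

def sequence_log_probability(words, trigram_prob):
    """Score a word sequence by summing the log-probabilities of each
    word given its two-word history, as in standard trigram modeling.
    `trigram_prob` is any callable returning P(w3 | w1 w2); the <s>
    padding tokens are an illustrative convention."""
    padded = ["<s>", "<s>"] + list(words)
    return sum(math.log(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
               for i in range(2, len(padded)))
```

With a uniform trigram probability of 0.1, the three-word example sentence scores 3 log 0.1, since three trigram probabilities are multiplied.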
The first back-off stage refers to the choice of the optimal n-gram from a set of n-grams corresponding to the input word sequence of length n. For example, assuming that a trigram LM is used, if the input word sequence consists of the words w1 w2 w3, there may be more than one corresponding trigram included in the LM, such as w1 w2 w3, {l1} w2 w3, c1 w2 w3, and so on. When choosing the optimal trigram probability, it is best to first search for a trigram consisting of words. If it does not exist, trigrams containing lemmas and/or morphologic classes are taken into account. The hierarchy of the trigram structures can be defined in a variety of ways, but it is best to consider the size of the training corpus when defining it. The probabilities of word n-grams are naturally lower than the probabilities of n-grams containing the corresponding lemmas and/or morphologic classes (besides words), since all the structures are treated as equal in the training phase even though lemmas and morphologic classes appear more frequently than the actual words. In a situation when the model does not contain the word n-gram for which the probability estimate is needed and an appropriate n-gram containing lemmas and/or morphologic classes exists, the probability of the existing n-gram must be scaled in order to obtain an adequate word n-gram probability estimate. One way to do this is by dividing the probability by the sizes of the morphologic classes and lexemes corresponding to the lemmas contained in the n-gram, where the size of a lexeme (or morphologic class) is determined as the number of types (different words) it is assigned to in the training corpus. For example, if a morphologic class is defined as a noun in the genitive case and masculine gender, and if 200 different words found in the training corpus fall into that category, then the size of this morphologic class is 200. Furthermore, if an n-gram w1 l2 c3 consists of one word, one lemma representing a lexeme of size 5 and one morphologic class of size 200, the probability of this n-gram should be divided by a factor of 5 × 200 = 1000. Of course, other ways of scaling the initial probabilities may be defined. In any case, the scaling should be done in the training phase so that the resulting model contains probabilities which are ready for use.
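The scaling rule from the worked example (dividing by 5 × 200 = 1000) can be sketched as follows. The curly-bracket lemma notation and the two size dictionaries are illustrative assumptions.

```python
def scaled_probability(initial_prob, tokens, lexeme_sizes, class_sizes):
    """Scale a mixed n-gram probability down by the sizes of the lexemes
    and morphologic classes it contains.  `lexeme_sizes` and
    `class_sizes` hold the number of word types assigned to each lemma
    and class in the training corpus (illustrative data layout)."""
    factor = 1
    for tok in tokens:
        if tok.startswith("{"):          # lemma: divide by its lexeme size
            factor *= lexeme_sizes[tok.strip("{}")]
        elif tok in class_sizes:         # morphologic class
            factor *= class_sizes[tok]
    return initial_prob / factor

# The n-gram from the worked example: one word, a lexeme of size 5 and a
# morphologic class of size 200, so the divisor is 5 * 200 = 1000.
p = scaled_probability(0.4, ["w1", "{l2}", "C3"], {"l2": 5}, {"C3": 200})
```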
The second back-off stage refers to resorting to (n-1)-gram probabilities when no n-gram probabilities correspond with the input word sequence (which is analogous to the back-off procedure described by the Katz algorithm). The switch to a lower-order n-gram should be penalized, and then the first back-off stage can be applied to the set of (n-1)-grams found in the model, assuming that at least one (n-1)-gram is found. If not, the process is repeated iteratively. The lower-order back-off penalization for the mixed-structure model would be complicated and computationally inefficient if weights analogous to the Katz back-off coefficients for word n-gram models were to be calculated. Instead, the penalization can be done simply by
multiplying the acquired (n-1)-gram probability with the default n-gram probability obtained during the training process. This default value represents the probability mass reserved for unseen events, calculated during the Good-Turing (GT) discounting which is the initial step in statistical LM training (although other discounting techniques may be used, e.g. Kneser-Ney or absolute discounting). If no entries corresponding to the input word sequence are found in the model, a default GT value for unigrams is returned. This may be a very small probability, but it is never zero, which is important for further calculations.
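The two back-off stages can be combined into a single lookup, sketched below. The data structures and names are illustrative assumptions; in a real recognizer this would be integrated with the decoder and the model's storage format:

```python
# Illustrative sketch of the two back-off stages. `model` maps
# mixed-structure n-gram tuples to scaled probabilities; `default_prob[k]`
# is the Good-Turing mass reserved for unseen k-grams during training.

def estimate(model, default_prob, candidates, n):
    # candidates[k]: the mixed-structure k-grams built for the input
    # sequence, ordered by the chosen structure hierarchy (word k-gram
    # first, then k-grams with lemmas and/or morphologic classes).
    penalty = 1.0
    for k in range(n, 0, -1):
        # First back-off stage: best structure among same-order k-grams.
        for ngram in candidates.get(k, []):
            if ngram in model:
                return penalty * model[ngram]
        # Second back-off stage: drop to order k-1, penalized by the
        # default probability reserved for unseen k-grams.
        penalty *= default_prob[k]
    # Nothing matched at any order: return the unigram default, which is
    # very small but never zero.
    return default_prob[1]
```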
Brief Description of the Drawings
Figure 1 - shows a block diagram of a large-vocabulary continuous speech recognition system and illustrates the role of the language model.
Figure 2 - contains details about the mixed-structure language model training.
Figure 3 - describes how a mixed-structure language model is used to estimate a word-sequence probability.
Figure 4 - illustrates how the words, lemmas and morphologic classes are combined to create the mixed-structure n-grams. The example shows the list of 3-grams acquired by mixing data corresponding to three words from the original training corpus.
Best Mode for Carrying Out the Invention
This invention presents a language model based on mixed-structure n-grams; the following figures illustrate it in detail.
Figure 1 represents a block diagram of a speech recognition system. The object of the system is to find a word sequence W = w1...wn which maximizes the product of two probabilities, P(W)P(X|W), where X represents the acoustic feature vector extracted from the speech signal carrying the message W in the feature extraction block 100. The acoustic feature vector is used on the acoustic recognition level 101, which relies on the information provided by the acoustic models 102 and the pronunciation (lexical) model 103. The result of the acoustic recognition is a set of word sequence hypotheses which are additionally scored on the linguistic recognition level 104, which relies on the information provided by the language model 105.
Figure 2 shows the training process for the mixed-structure language model proposed by this invention. The initial training corpus 200 which contains textual information is used in the block for creating the mixed-structure corpus 201 during which process a record of the sizes of lexemes (corresponding to lemmas) and morphologic classes is kept. In order to create mixed-structures, a POS tagging and lemmatization tool 202 is required for assigning the
lemmas and morphologic classes to all the words from the initial training corpus. The POS tagger and the lemmatizer rely on the information provided by the morphologic dictionary 203. The mixed-structure corpus is forwarded to the block for combining words, lemmas and morphologic classes into mixed-structure n-grams 204, which also records the counts of all n-grams seen in the corpus. The initial counts are then smoothed and some probability mass is reserved for "unseen" events by applying the discounting 205, after which the counts are forwarded to the block for calculating the initial probability estimates 206 for all the n-grams. These probability estimates are then scaled by the sizes of the lexemes (corresponding to lemmas) and morphologic classes contained in the n-grams in the probability recalculation block 207, and the resulting mixed-structure language model 105 is exported to the output textual document.
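The training pipeline of Figure 2 can be sketched as follows. This is a toy illustration, not the patented implementation: a real system would use a POS tagger, a lemmatizer and a proper discounting method, whereas here a simple absolute discount stands in, and lemmas and classes are assumed to be strings distinct from the surface words:

```python
from collections import Counter
from itertools import product

def train(tagged_corpus, n=2, discount=0.5):
    # tagged_corpus: list of (word, lemma, morphologic class) triples.
    # Sizes of lexemes and morphologic classes (block 201): the number of
    # distinct word types assigned to each lemma / class.
    types = {}
    for word, lemma, mclass in tagged_corpus:
        types.setdefault(lemma, set()).add(word)
        types.setdefault(mclass, set()).add(word)
    size = {key: len(words) for key, words in types.items()}

    # Mixed-structure n-grams and their counts (block 204).
    counts = Counter()
    for i in range(len(tagged_corpus) - n + 1):
        for ngram in product(*tagged_corpus[i:i + n]):
            counts[ngram] += 1
    total = sum(counts.values())

    # Discounted initial estimates (blocks 205, 206), then scaling by the
    # sizes of the constituent lemmas and classes (block 207); surface
    # words contribute a scaling factor of 1.
    model = {}
    for ngram, count in counts.items():
        prob = (count - discount) / total
        factor = 1
        for token in ngram:
            factor *= size.get(token, 1)
        model[ngram] = prob / factor
    return model
```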
Figure 3 shows how the mixed-structure model can be used in the speech recognition phase. For the input word sequence, lemmas and morphologic classes first have to be determined by the POS tagging and lemmatization tool 202, which uses the morphologic dictionary 203. The mixed structures are sent to the block for generating the list of n-grams 300. The block for searching for the most appropriate n-gram probability 301, which relies on the language model 105, applies the first back-off stage, or the second back-off stage if no n-grams are initially found in the model. The resulting probability is the estimate which is used in the speech recognition system for scoring the word sequence hypotheses.
Figure 4 shows how a list of mixed-structure 3-grams is obtained by mixing data corresponding to the three words from the original training corpus. All the combinations are considered, and for the given example a list of 27 mixed-structure trigrams is given.
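The combination step of Figure 4 amounts to a Cartesian product over the per-word alternatives. The word/lemma/class triples below are invented for illustration, not data from the patent:

```python
from itertools import product

# Each position carries (word, lemma, morphologic class); the analyses
# below are invented examples.
analyses = [
    ("houses", "house", "N-PL"),
    ("were", "be", "V-PAST-PL"),
    ("built", "build", "V-PART"),
]

# Combining the three alternatives per position in all possible ways
# yields the mixed-structure trigrams: 3 x 3 x 3 = 27 of them.
mixed_trigrams = list(product(*analyses))
print(len(mixed_trigrams))  # 27
```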
Industrial Applicability
This invention describes an n-gram language model containing n-grams of mixed structures. Each n-gram may contain from 0 to n words in their original (inflected) forms, but also canonical word forms (lemmas) and morphologic classes. The invention relies on the existence of a morphologic dictionary, a part-of-speech tagging tool and a lemmatizer, which are language-dependent. The mixed-structure language model can, however, be used for different languages, and it is especially useful for highly inflective languages and domain-specific applications, in which cases the lack of training data can degrade the performance of word-based n-gram models. The mixed-structure modeling technique ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when small amounts of data are available or when models need to be small (e.g. for applications in mobile phones). This type of language model can improve the accuracy of speech recognition systems, and it can also introduce improvements into software for spell checking, automatic translation or other tools that use information about word collocation probabilities.
Claims
1. A method of creating a mixed-structure n-gram language model comprising steps of: assigning lemmas and morphologic classes to all the words from the training corpus (202, 203);
calculating the sizes of lexemes corresponding to lemmas and the sizes of morphologic classes (201);
creating mixed-structure n-grams (204);
calculating the corresponding counts of all different n-grams;
applying a discounting technique (205) on the calculated counts of all different n-grams; determining the n-gram probabilities (206) for all said different n-grams; and
scaling said n-gram probabilities (207), characterized by said mixed-structure n-grams created by combination of triples of words, lemmas and morphologic classes in all possible ways and said scaling of said n-gram probabilities by the sizes of the lemmas and morphologic classes contained by the n-grams.
2. A method as claimed in claim 1, characterized in that the size is calculated as the number of different word forms which are assigned to the same lexeme or morphologic class.
3. A method as claimed in claim 1, characterized in that the step of applying a discounting technique (205) means the smoothing of the corresponding counts and reserving some probability mass for unseen events.
4. A method as claimed in claim 1, characterized in that the step of applying a discounting technique (205) may apply any of the following algorithms: Good-Turing, absolute, Jelinek-Mercer, Laplace, Kneser-Ney and Bayesian.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RS2013/000009 WO2014189399A1 (en) | 2013-05-22 | 2013-05-22 | A mixed-structure n-gram language model |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014189399A1 true WO2014189399A1 (en) | 2014-11-27 |
Family
ID=48747698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/RS2013/000009 WO2014189399A1 (en) | 2013-05-22 | 2013-05-22 | A mixed-structure n-gram language model |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2014189399A1 (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5640487A (en) | 1993-02-26 | 1997-06-17 | International Business Machines Corporation | Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models |
US6073091A (en) | 1997-08-06 | 2000-06-06 | International Business Machines Corporation | Apparatus and method for forming a filtered inflected language model for automatic speech recognition |
EP0801786B1 (en) | 1995-11-04 | 2000-06-28 | International Business Machines Corporation | Method and apparatus for adapting the language model's size in a speech recognition system |
US6154722A (en) | 1997-12-18 | 2000-11-28 | Apple Computer, Inc. | Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an N-gram probability |
EP1320086A1 (en) | 2001-12-13 | 2003-06-18 | Sony International (Europe) GmbH | Method for generating and/or adapting language models |
EP1046157B1 (en) | 1998-10-21 | 2004-03-10 | Koninklijke Philips Electronics N.V. | Method of determining parameters of a statistical language model |
EP1528539A1 (en) | 2003-10-30 | 2005-05-04 | AT&T Corp. | A system and method of using Meta-Data in language modeling |
US7020606B1 (en) | 1997-12-11 | 2006-03-28 | Harman Becker Automotive Systems Gmbh | Voice recognition using a grammar or N-gram procedures |
EP1290676B1 (en) | 2000-06-01 | 2006-10-18 | Microsoft Corporation | Creating a unified task dependent language models with information retrieval techniques |
US20110161072A1 (en) | 2008-08-20 | 2011-06-30 | Nec Corporation | Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium |
US20120278060A1 (en) * | 2011-04-27 | 2012-11-01 | Xerox Corporation | Method and system for confidence-weighted learning of factored discriminative language models |
Non-Patent Citations (4)
Title |
---|
BROWN P F ET AL: "CLASS-BASED N-GRAM MODELS OF NATURAL LANGUAGE", COMPUTATIONAL LINGUISTICS, CAMBRIDGE, MA, US, vol. 18, no. 4, 1 December 1992 (1992-12-01), pages 467 - 479, XP000892488 * |
KIRCHHOFF K ET AL: "Morphology-based language modeling for conversational Arabic speech recognition", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 20, no. 4, 1 October 2006 (2006-10-01), pages 589 - 608, XP024930246, ISSN: 0885-2308, [retrieved on 20061001], DOI: 10.1016/J.CSL.2005.10.001 * |
STEVAN OSTROGONAC ET AL: "A language model for highly inflective non-agglutinative languages", INTELLIGENT SYSTEMS AND INFORMATICS (SISY), 2012 IEEE 10TH JUBILEE INTERNATIONAL SYMPOSIUM ON, IEEE, 20 September 2012 (2012-09-20), pages 177 - 181, XP032265283, ISBN: 978-1-4673-4751-8, DOI: 10.1109/SISY.2012.6339510 * |
TOMAS BRYCHCIN ET AL: "Morphological based language models for inflectional languages", INTELLIGENT DATA ACQUISITION AND ADVANCED COMPUTING SYSTEMS (IDAACS), 2011 IEEE 6TH INTERNATIONAL CONFERENCE ON, IEEE, 15 September 2011 (2011-09-15), pages 560 - 563, XP031990283, ISBN: 978-1-4577-1426-9, DOI: 10.1109/IDAACS.2011.6072829 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109871534A (en) * | 2019-01-10 | 2019-06-11 | 北京海天瑞声科技股份有限公司 | Generation method, device, equipment and the storage medium of China and Britain's mixing corpus |
CN109871534B (en) * | 2019-01-10 | 2020-03-24 | 北京海天瑞声科技股份有限公司 | Method, device and equipment for generating Chinese-English mixed corpus and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13734523 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13734523 Country of ref document: EP Kind code of ref document: A1 |