WO2014189399A1 - A mixed-structure n-gram language model - Google Patents

A mixed-structure n-gram language model

Info

Publication number
WO2014189399A1
Authority
WO
WIPO (PCT)
Prior art keywords
grams
gram
morphologic
mixed
lemmas
Prior art date
Application number
PCT/RS2013/000009
Other languages
French (fr)
Inventor
Stevan OSTROGONAC
Milan SEČUJSKI
Vlado Delić
Dragiša MIŠKOVIĆ
Nikša JAKOVLJEVIĆ
Nataša VUJNOVIĆ SEDLAR
Original Assignee
Axon Doo
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Axon Doo filed Critical Axon Doo
Priority to PCT/RS2013/000009 priority Critical patent/WO2014189399A1/en
Publication of WO2014189399A1 publication Critical patent/WO2014189399A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/183Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197Probabilistic grammars, e.g. word n-grams

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a language model based on mixed-structure n-grams and to a method of determining a word sequence probability based on this type of model. The mixed structure, comprising lemma and morphologic class information for all the words of an n-gram, enables a modeling technique which ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when a small amount of data is available. Also, different pruning techniques may be used in order to reduce the number of n-grams included in the model if a large amount of data is available for training.

Description

A MIXED-STRUCTURE N-GRAM LANGUAGE MODEL
Technical Field
The invention belongs to the field of natural language processing (NLP), specifically to statistical language modeling. It is related to speech recognition and could be applied in other fields such as spell checking and language translation.

Background Art
Language models (LMs) have been used in applications where there is a need to predict the next word based on some context, or to score word sequence hypotheses in order to determine the most probable input sequence, as in speech recognition systems. Statistical n-gram language modeling has been the predominant language modeling framework for decades. The n-gram models play an important role in speech recognition systems. Large vocabulary continuous speech recognition (LVCSR) systems are developed for many languages, and for each of them training corpora need to be prepared in order to obtain high-quality language models. For highly inflective languages, as well as for domain-specific applications, there is usually insufficient training data. Furthermore, for many purposes LMs of acceptable quality can become impractically large. The main object of statistical language modeling is to create a good language representation by using a relatively small number of n-grams. One way to implement such an LM is by grouping words into classes according to some criteria, usually correlated with morphologic or semantic information. Class n-gram models created in this way show promising results since the problem of insufficient training data is partly solved. Furthermore, class n-gram models are generally of acceptable size. The problem with these models lies in the fact that many words are not adequately represented by the classes they belong to. In addition to defining classes for n-gram LMs, history clustering techniques are used to create relatively small LMs with good accuracy. Combining n-gram models with context-free grammars (CFGs) has also shown good results. In these cases LM sizes can be reduced, usually by applying an entropy-based pruning technique, while still giving satisfactory quality.
Different types of models can be combined in order to obtain better results for particular applications. This is usually done by combining the probabilities returned by different LMs for a given word sequence with appropriate weighting coefficients which are usually determined empirically.
Some of the protected solutions and research papers that represent the prior art of the proposed invention are listed below. The patent EP1290676 B1 filed May 23rd 2001, entitled "Creating a unified task dependent language models with information retrieval techniques", relates to a method for creating a language model from a task-independent corpus for a language processing system. The language model includes a plurality of context-free grammars and a hybrid n-gram model. The patent EP1046157 B1 filed October 11th 1999, entitled "Method of determining parameters of a statistical language model", relates to a method of determining parameters of a statistical language model for automatic speech recognition where elements of a vocabulary are combined so as to form context-independent vocabulary element categories.
The patent application EP1320086 A1 filed December 13th 2001, entitled "Method for generating and/or adapting language models", relates to a method for generating and/or adapting language models for continuously spoken speech recognition purposes, comprising two steps: first generating an initial language model (ILM) and second adapting said initial language model (ILM) by using a specially defined vocabulary.
The patent application EP1528539 A1 filed October 29th 2004, entitled "A system and method of using Meta-Data in language modeling", relates to a system and method using meta-data for building language models to improve speech processing.
The patent US5640487 A filed June 7th 1995, entitled "Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models", relates to a system and method which involves building scalable n-gram language models in a way that each n-gram is aligned with one of "n" number of non-intersecting classes. A count is determined for each n-gram representing the number of times each n-gram occurred in the training data. The n-grams are separated into classes and complement counts are determined. Using these counts and complement counts, factors are determined. The language model probability is determined using these factors.
The patent application US2011/0161072 A1 filed August 20th 2009, entitled "Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium", relates to a natural language processing technique and to a language model creation technique used in speech recognition and character recognition.
The patent US6154722 A filed December 18th 1997, entitled "Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an n-gram probability", describes a method and an apparatus for a speech recognition system that uses a language model based on an integrated finite state grammar probability. The patent US7020606 B1 filed December 2nd 1998, entitled "Voice recognition using a grammar or n-gram procedures", relates to a method for voice recognition, wherein a grammar method is combined with an n-gram voice model with statistical word sequence evaluation.
The patent US6073091 filed August 6th 1997, entitled "Apparatus and method for forming a filtered inflected language model for automatic speech recognition", relates to methods and apparatus for forming an inflected language model component of an automatic speech recognizer (ASR) that produces a compact and efficiently accessible set of 2-gram and 3-gram language model probabilities.
The patent EP0801786 B1 filed November 4th 1995, entitled "Method and apparatus for adapting the language model's size in a speech recognition system", relates to a method and an apparatus for adapting, particularly reducing, the size of a language model which comprises word n-grams in a speech recognition system. The patent introduces a mechanism which discards n-grams for which the acoustic part of the system requires less support from the language model for correct recognition. The scientific paper "Hybrid n-gram Probability Estimation in Morphologically Rich Languages" proposes a hybrid method that joins word-based and morpheme-based language modeling.
Scientific paper "Morphology-Based Language Modeling for Conversational Arabic Speech Recognition" proposes joining some set of morphologic information (chosen by using a generative algorithm) with the words and implementing a back-off procedure in which this morphologic data is dropped one by one if there aren't enough training instances in the corpus.
The main difference between the proposed invention and all of the mentioned solutions is that the invention proposes the combination of three kinds of n-gram constituents (words, lemmas and morphologic classes) in order to obtain a mixed-structure language model.
Disclosure of the Invention
The subject of this invention is a language model consisting of mixed-structure n-grams. Namely, three kinds of n-gram constituents are used: words, lemmas and morphologic classes. In order to obtain such a mixed-structure LM, a training corpus has to be created first. For this purpose, a part-of-speech (POS) tagging tool, a lemmatizer and a morphologic dictionary are needed. The morphologic dictionary contains the information about the morphologic categories and the canonical forms for the words appearing in the training corpus. The POS tagging software assigns morphologic classes and the lemmatizer assigns lemmas to the words, which creates a training corpus of triples. In this way, a sentence from the original training corpus consisting of e.g. three words (w1 w2 w3) becomes a sentence of three triples ([w1 l1 c1] [w2 l2 c2] [w3 l3 c3]), where the letter l marks the lemmas and c marks the morphologic classes. Let us consider a sentence consisting of three words as an example: Maja voli cvece. (Maja likes flowers.)
The corresponding string of lemmas would be:
Maja voleti cvet. (Maja like flower.), and the corresponding string of morphologic classes would be:
C1 C2 C3, where C1 represents e.g. all the proper, feminine nouns in nominative singular case, C2 represents all the verbs in the third person singular of the present tense, and C3 represents the neuter gender nouns in nominative plural case.
The string of triples would then look like this:
[Maja Maja C1] [voli voleti C2] [cvece cvet C3].
When building the LM, the n-grams to be included are created by combining the constituents of the triples. All combinations are taken into account. Such a training principle produces very large language models even when small training corpora are used. For example, if the sentence "Maja voli cvece" appears in the original training corpus, it would result in the following trigrams, which would be added to the mixed-structure LM (lemmas are marked by curly brackets; an illustrative sketch of this combination step is given after the list):
Maja voli cvece
Maja voli {cvet}
Maja voli C3
Maja {voleti} cvece
Maja {voleti} {cvet}
Maja {voleti} C3
Maja C2 cvece
Maja C2 {cvet}
Maja C2 C3
{Maja} voli cvece
{Maja} voli {cvet}
{Maja} voli C3
{Maja} {voleti} cvece
{Maja} {voleti} {cvet}
{Maja} {voleti} C3
{Maja} C2 cvece
{Maja} C2 {cvet}
{Maja} C2 C3
C1 voli cvece
C1 voli {cvet}
C1 voli C3
C1 {voleti} cvece
C1 {voleti} {cvet}
C1 {voleti} C3
C1 C2 cvece
C1 C2 {cvet}
C1 C2 C3
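The combination step can be illustrated with a short sketch. This is one illustrative reading of the scheme above, not the patented implementation: each tagged word is represented as a (word, lemma, class) triple, and every way of picking one constituent per position yields one mixed-structure n-gram. Wrapping lemmas in curly brackets and labeling the classes C1, C2, C3 follows the notation of the example.

```python
from itertools import product

def mixed_ngrams(triples, n=3):
    """Yield every mixed-structure n-gram for a sentence given as
    (word, lemma, morphologic class) triples; lemmas are wrapped in
    curly brackets to distinguish them from inflected word forms."""
    marked = [(w, "{" + lemma + "}", c) for (w, lemma, c) in triples]
    for i in range(len(marked) - n + 1):
        window = marked[i:i + n]
        for combo in product(*window):      # 3**n combinations per window
            yield " ".join(combo)

# The example sentence "Maja voli cvece." as triples:
sentence = [("Maja", "Maja", "C1"), ("voli", "voleti", "C2"), ("cvece", "cvet", "C3")]
trigrams = list(mixed_ngrams(sentence))
assert len(trigrams) == 27                  # the 27 trigrams listed above
```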
Fortunately, different pruning techniques can be applied in order to significantly reduce the initial models with a relatively small decrease in modeling accuracy. It has been shown that entropy-based pruning gives good results when reducing the number of n-grams in word-based models, but in the case of mixed-structure n-grams this method is not adequate since it favors morphologic class n-grams, which generally appear more frequently in the training corpus than the n-grams containing lemmas or words. Thus it is more appropriate to set the numbers of particular n-gram structures to be included in the model. This can be done empirically for concrete applications, but it can also be done according to the ratios of the numbers of word, lemma and morphologic class types appearing in the training corpus, or by some other criteria.
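One possible reading of the structure-quota pruning just described is sketched below; the count-based ranking, the quota values and the way a constituent's type is detected (curly brackets for lemmas, a "C" prefix for class labels) are illustrative assumptions, not parts of the invention.

```python
from collections import defaultdict

def prune_by_structure_quota(ngram_counts, quotas):
    """Keep only the most frequent n-grams of each structure type.
    `ngram_counts` maps an n-gram string to its training-corpus count;
    `quotas` maps a structure signature such as ('word', 'class', 'class')
    to the number of n-grams of that structure to retain."""
    def kind(token):
        if token.startswith("{"):
            return "lemma"
        if token.startswith("C"):            # illustrative class-label convention
            return "class"
        return "word"

    grouped = defaultdict(list)
    for ngram, count in ngram_counts.items():
        grouped[tuple(kind(t) for t in ngram.split())].append((count, ngram))

    kept = {}
    for signature, items in grouped.items():
        items.sort(reverse=True)             # most frequent n-grams first
        for count, ngram in items[:quotas.get(signature, len(items))]:
            kept[ngram] = count
    return kept
```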
The main advantage of the mixed-structure n-gram concept is the possibility of including only the most important and reliable information in the model. For example, when observing 3-grams, some word w3 can appear with different word histories, but these histories can all be assigned to a single morphologic class (or lemma) history c1c2 (l1l2). Therefore, if the word appears frequently, it is useful to include the n-gram consisting of the morphologic classes comprising the history (context) and the final word in its original (inflected) form, c1c2w3. Besides that, the 3-gram consisting of the same history and the final word replaced by its corresponding morphologic class (c1c2c3) may be included in the model as a separate entry. Therefore, the mixed-structure n-gram model keeps the information about the most frequent morphologic structures appearing in the training corpus, but also keeps the information about particular words (or lemmas) that stand out as common constituents of some of the structures. The mixed-structure LM thus takes advantage of all three types of information carriers contained in the training corpus in a way that represents a compromise between robustness to the lack of training data and modeling accuracy. Once created, such an LM is easier to use than e.g. the combined models of words, lemmas and morphologic classes, because only one document is searched to obtain the resulting probability for a given textual content.
The described language model can be used to estimate the probability of some textual content in a way that resembles the Katz back-off algorithm. The probability of a word sequence is commonly calculated by multiplying the probabilities of the n-grams consisting of each word from the sequence and its corresponding history. The n-gram order most commonly used is 3 (trigrams). The Katz algorithm implies using the probabilities of n-grams included in the model to estimate the probability of the input text, but if some of the relevant n-gram probabilities are not included in the model, lower order n-grams ((n-1)-grams) are used instead. The switching to the lower order n-grams is done iteratively when needed, but it is penalized by back-off coefficients. The main difference between the Katz algorithm and the algorithm used to find the probability of the input text based on the mixed-structure language model is that the latter algorithm implies two stages of back-off.
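For reference, the word-sequence computation referred to above can be written, for a trigram model, as

$$P(w_1 w_2 \dots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-2}\, w_{i-1}),$$

with the histories shortened appropriately at the beginning of the sequence.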
The first back-off stage refers to the choice of the optimal n-gram from a set of n-grams corresponding to the input word sequence of length n. For example, assuming that a trigram LM is used, if the input word sequence consists of the words w1w2w3, there may be more than one corresponding trigram included in the LM, such as w1w2w3, w1w2l3, w1w2c3 and so on. When choosing the optimal trigram probability, it is best to first search for a trigram consisting of words. If it does not exist, trigrams containing lemmas and/or morphologic classes are taken into account. The hierarchy of the trigram structures can be defined in a variety of ways, but it is best to consider the size of the training corpus when defining it. The probabilities of word n-grams are naturally lower than the probabilities of n-grams containing the corresponding lemmas and/or morphologic classes (besides words), since all the structures are treated as equal in the training phase even though lemmas and morphologic classes appear more frequently than the actual words. In a situation where the model does not contain the word n-gram for which the probability estimate is needed and an appropriate n-gram containing lemmas and/or morphologic classes exists, the probability of the existing n-gram must be scaled in order to obtain an adequate word n-gram probability estimate. One way to do this is by dividing the probability by the sizes of the morphologic classes and of the lexemes corresponding to the lemmas contained in the n-gram, where the size of a lexeme (morphologic class) is determined as the number of types (different words) it is assigned to in the training corpus. For example, if a morphologic class is defined as a noun in genitive case and masculine gender, and if 200 different words found in the training corpus fall into that category, then the size of this morphologic class is 200. Furthermore, if an n-gram w1 l2 c3 consists of one word, one lemma representing a lexeme of size 5 and one morphologic class of size 200, the probability of this n-gram should be divided by a factor of 5 × 200 = 1000. Of course, other ways of scaling the initial probabilities may be defined. In any case, the scaling should be done in the training phase so that the resulting model contains probabilities which are ready for use.
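A minimal sketch of such scaling, using the worked example above (one word, a lexeme of size 5, a morphologic class of size 200); the probability value 0.02, the token-marking convention and the function name are illustrative assumptions:

```python
def scale_probability(p, constituents, lexeme_size, class_size):
    """Rescale a mixed n-gram probability toward a word n-gram estimate by
    dividing it by the sizes of all lexemes (for lemma constituents) and
    morphologic classes it contains, as described above."""
    factor = 1
    for token in constituents:
        if token.startswith("{"):            # lemma -> size of its lexeme
            factor *= lexeme_size[token]
        elif token.startswith("C"):          # morphologic class -> its size
            factor *= class_size[token]
    return p / factor

# Worked example from the text: w1 l2 c3 with lexeme size 5 and class size 200,
# so an (illustrative) probability of 0.02 is divided by 5 * 200 = 1000.
p = scale_probability(0.02, ("w1", "{l2}", "C3"),
                      lexeme_size={"{l2}": 5}, class_size={"C3": 200})
assert abs(p - 0.00002) < 1e-12
```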
The second back-off stage refers to resorting to (n-1)-gram probabilities when no n-gram probabilities correspond to the input word sequence (which is analogous to the back-off procedure described by the Katz algorithm). The switching to a lower order n-gram should be penalized, and then the first back-off stage can be applied to the set of (n-1)-grams found in the model, assuming that at least one (n-1)-gram is found. If not, the process is repeated iteratively. The lower order back-off penalization for the mixed-structure model would be complicated and computationally inefficient if weights analogous to the Katz back-off coefficients for word n-gram models were to be calculated. Instead, the penalization can be done simply by multiplying the acquired (n-1)-gram probability by the default n-gram probability obtained during the training process. This default value represents the probability mass reserved for unseen events, calculated during the Good-Turing (GT) discounting which is the initial step in statistical LM training (although other discounting techniques may be used, e.g. Kneser-Ney or absolute discounting). If no entries corresponding to the input word sequence are found in the model, a default GT value for unigrams is returned. This may be a very small probability, but it is never zero, which is important for further calculation.
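The two-stage look-up described in the last two paragraphs could be sketched as follows, reusing the mixed_ngrams helper from the earlier sketch. The candidate ranking (word-only n-grams first, then n-grams containing lemmas and/or classes) and the handling of the default probabilities are one possible concrete reading under the stated assumptions, not a verbatim transcription of the invention.

```python
def ngram_probability(window_triples, lm, defaults):
    """Two-stage back-off look-up for one position of the input sequence.
    `lm` maps mixed n-gram strings to (already scaled) probabilities and
    `defaults[k]` is the probability mass reserved for unseen k-grams
    during discounting."""
    penalty = 1.0
    order = len(window_triples)
    while order >= 1:
        window = window_triples[-order:]
        # First back-off stage: prefer n-grams made of words only, then those
        # containing lemmas and/or morphologic classes.
        candidates = sorted(
            mixed_ngrams(window, n=order),
            key=lambda g: sum(t.startswith(("{", "C")) for t in g.split()))
        for gram in candidates:
            if gram in lm:
                return penalty * lm[gram]
        # Second back-off stage: drop to (order-1)-grams, penalised by the
        # default probability obtained during training.
        penalty *= defaults[order]
        order -= 1
    return defaults[1]          # default Good-Turing unigram value, never zero
```

A word-sequence estimate would then be obtained by multiplying these per-position values, as in the standard n-gram computation.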
Brief Description of the Drawings

Figure 1 - shows a block scheme of a large-vocabulary continuous speech recognition system and illustrates the role of the language model.
Figure 2 - contains details about the mixed-structure language model training.
Figure 3 - describes how a mixed-structure language model is used to estimate a word-sequence probability.

Figure 4 - illustrates how the words, lemmas and morphologic classes are combined to create the mixed-structure n-grams. The example shows the list of 3-grams acquired by mixing data corresponding to three words from the original training corpus.
Best Mode for Carrying Out the Invention

This invention presents a language model based on mixed-structure n-grams, and the explanation of the following figures illustrates it in detail.
Figure 1 represents a block diagram of a speech recognition system. The object of the system is finding a word sequence W = w1...wn which maximizes the product of two probabilities, P(W)P(X|W), where X represents the acoustic feature vector extracted from the speech signal carrying the message W in the feature extraction block 100. The acoustic feature vector is used at the acoustic recognition level 101, which relies on the information provided by the acoustic models 102 and the pronunciation (lexical) model 103. The result of the acoustic recognition is a set of word sequence hypotheses which are additionally scored at the linguistic recognition level 104, which relies on the information provided by the language model 105.
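Written out, the decoding objective of the block diagram is

$$\hat{W} = \operatorname*{arg\,max}_{W} P(W)\, P(X \mid W),$$

where the language model 105 supplies P(W) and the acoustic and lexical models contribute to P(X | W).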
Figure 2 shows the training process for the mixed-structure language model proposed by this invention. The initial training corpus 200, which contains textual information, is used in the block for creating the mixed-structure corpus 201, during which process a record of the sizes of lexemes (corresponding to lemmas) and morphologic classes is kept. In order to create the mixed structures, a POS tagging and lemmatization tool 202 is required for assigning the lemmas and morphologic classes to all the words from the initial training corpus. The POS tagger and the lemmatizer rely on the information provided by the morphologic dictionary 203. The mixed-structure corpus is forwarded to the block for combining words, lemmas and morphologic classes into mixed-structure n-grams 204, which also records the counts of all n-grams seen in the corpus. The initial counts are then smoothed and some probability mass is reserved for "unseen" events by applying the discounting 205, after which the counts are forwarded to the block for calculating the initial probability estimates 206 for all the n-grams. These probability estimates are then scaled by the sizes of the lexemes (corresponding to lemmas) and morphologic classes contained in the n-grams in the probability recalculation block 207, and the resulting mixed-structure language model 105 is exported to the output textual document.
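The discounting block 205 can be illustrated with a classic Good-Turing estimate of the reserved probability mass; this is one standard formulation, and the description notes that other discounting techniques may be used instead.

```python
from collections import Counter

def good_turing_unseen_mass(ngram_counts):
    """Return the probability mass reserved for unseen n-grams, N1 / N,
    where N1 is the number of distinct n-grams seen exactly once and N is
    the total number of n-gram tokens in the corpus."""
    count_of_counts = Counter(ngram_counts.values())
    total = sum(ngram_counts.values())
    return count_of_counts[1] / total if total else 0.0
```

Values of this kind would serve as the default penalties used in the back-off sketch given earlier.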
Figure 3 shows how the mixed-structure model can be used in the speech recognition phase. For the input word sequence, lemmas and morphologic classes first have to be determined by the POS tagging and lemmatization tool 202, which uses the morphologic dictionary 203. The mixed structures are sent to the block for generating the list of n-grams 300. The block for searching for the most appropriate n-gram probability 301, which relies on the language model 105, applies the first back-off stage, or the second back-off stage if no n-grams are initially found in the model. The resulting probability is the estimate which is used in the speech recognition system for scoring the word sequence hypotheses. Figure 4 shows how a list of mixed-structure 3-grams is obtained by mixing data corresponding to three words from the original training corpus. All the combinations are considered, and for the given example a list of 27 mixed-structure trigrams is given.
Industrial Applicability
This invention describes an n-gram language model containing n-grams of mixed structures. Each n-gram may contain from 0 to n words in their original (inflected) forms, as well as canonical word forms (lemmas) and morphologic classes. This invention relies on the existence of a morphologic dictionary, a part-of-speech tagging tool and a lemmatizer, which are language-dependent. The mixed-structure language model can, however, be used for different languages, and it is especially useful for highly inflective languages and domain-specific applications, in which cases the lack of training data can degrade the performance of word-based n-gram models. The mixed-structure modeling technique ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when small amounts of data are available or when models need to be small (e.g. for applications in mobile phones). This type of language model can improve the accuracy of speech recognition systems, and it can also introduce improvements into software for spell checking, automatic translation or other tools that use information about word collocation probabilities.

Claims

1. A method of creating a mixed-structure n-gram language model comprising the steps of: assigning lemmas and morphologic classes to all the words from the training corpus (202, 203);
calculating the sizes of lexemes corresponding to lemmas and the sizes of morphologic classes (201);
creating mixed-structure n-grams (204);
calculating the corresponding counts of all different n-grams;
applying a discounting technique (205) to the calculated counts of all different n-grams; determining the n-gram probabilities (206) for all said different n-grams; and
scaling said n-gram probabilities (207), characterized by said mixed-structure n-grams being created by combining triples of words, lemmas and morphologic classes in all possible ways, and by said scaling of said n-gram probabilities by the sizes of the lemmas and morphologic classes contained in the n-grams.
2. A method as claimed in claim 1, characterized in that the size is calculated as the number of different word forms which are assigned to the same lexeme or morphologic class.
3. A method as claimed in claim 1, characterized in that the step of applying a discounting technique (205) means the smoothing of the corresponding counts and reserving some probability mass for unseen events.
4. A method as claimed in claim 1, characterized in that the step of applying a discounting technique (205) may apply any of the following algorithms: Good-Turing, absolute, Jelinek-Mercer, Laplace, Kneser-Ney and Bayesian.
PCT/RS2013/000009 2013-05-22 2013-05-22 A mixed-structure n-gram language model WO2014189399A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RS2013/000009 WO2014189399A1 (en) 2013-05-22 2013-05-22 A mixed-structure n-gram language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RS2013/000009 WO2014189399A1 (en) 2013-05-22 2013-05-22 A mixed-structure n-gram language model

Publications (1)

Publication Number Publication Date
WO2014189399A1 true WO2014189399A1 (en) 2014-11-27

Family

ID=48747698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RS2013/000009 WO2014189399A1 (en) 2013-05-22 2013-05-22 A mixed-structure n-gram language model

Country Status (1)

Country Link
WO (1) WO2014189399A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871534A (en) * 2019-01-10 2019-06-11 北京海天瑞声科技股份有限公司 Method, device, equipment and storage medium for generating a Chinese-English mixed corpus

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5640487A (en) 1993-02-26 1997-06-17 International Business Machines Corporation Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models
US6073091A (en) 1997-08-06 2000-06-06 International Business Machines Corporation Apparatus and method for forming a filtered inflected language model for automatic speech recognition
EP0801786B1 (en) 1995-11-04 2000-06-28 International Business Machines Corporation Method and apparatus for adapting the language model's size in a speech recognition system
US6154722A (en) 1997-12-18 2000-11-28 Apple Computer, Inc. Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an N-gram probability
EP1320086A1 (en) 2001-12-13 2003-06-18 Sony International (Europe) GmbH Method for generating and/or adapting language models
EP1046157B1 (en) 1998-10-21 2004-03-10 Koninklijke Philips Electronics N.V. Method of determining parameters of a statistical language model
EP1528539A1 (en) 2003-10-30 2005-05-04 AT&T Corp. A system and method of using Meta-Data in language modeling
US7020606B1 (en) 1997-12-11 2006-03-28 Harman Becker Automotive Systems Gmbh Voice recognition using a grammar or N-gram procedures
EP1290676B1 (en) 2000-06-01 2006-10-18 Microsoft Corporation Creating a unified task dependent language models with information retrieval techniques
US20110161072A1 (en) 2008-08-20 2011-06-30 Nec Corporation Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium
US20120278060A1 (en) * 2011-04-27 2012-11-01 Xerox Corporation Method and system for confidence-weighted learning of factored discriminative language models

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5640487A (en) 1993-02-26 1997-06-17 International Business Machines Corporation Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models
EP0801786B1 (en) 1995-11-04 2000-06-28 International Business Machines Corporation Method and apparatus for adapting the language model's size in a speech recognition system
US6073091A (en) 1997-08-06 2000-06-06 International Business Machines Corporation Apparatus and method for forming a filtered inflected language model for automatic speech recognition
US7020606B1 (en) 1997-12-11 2006-03-28 Harman Becker Automotive Systems Gmbh Voice recognition using a grammar or N-gram procedures
US6154722A (en) 1997-12-18 2000-11-28 Apple Computer, Inc. Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an N-gram probability
EP1046157B1 (en) 1998-10-21 2004-03-10 Koninklijke Philips Electronics N.V. Method of determining parameters of a statistical language model
EP1290676B1 (en) 2000-06-01 2006-10-18 Microsoft Corporation Creating a unified task dependent language models with information retrieval techniques
EP1320086A1 (en) 2001-12-13 2003-06-18 Sony International (Europe) GmbH Method for generating and/or adapting language models
EP1528539A1 (en) 2003-10-30 2005-05-04 AT&T Corp. A system and method of using Meta-Data in language modeling
US20110161072A1 (en) 2008-08-20 2011-06-30 Nec Corporation Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium
US20120278060A1 (en) * 2011-04-27 2012-11-01 Xerox Corporation Method and system for confidence-weighted learning of factored discriminative language models

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BROWN P F ET AL: "CLASS-BASED N-GRAM MODELS OF NATURAL LANGUAGE", COMPUTATIONAL LINGUISTICS, CAMBRIDGE, MA, US, vol. 18, no. 4, 1 December 1992 (1992-12-01), pages 467 - 479, XP000892488 *
KIRCHHOFF K ET AL: "Morphology-based language modeling for conversational Arabic speech recognition", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 20, no. 4, 1 October 2006 (2006-10-01), pages 589 - 608, XP024930246, ISSN: 0885-2308, [retrieved on 20061001], DOI: 10.1016/J.CSL.2005.10.001 *
STEVAN OSTROGONAC ET AL: "A language model for highly inflective non-agglutinative languages", INTELLIGENT SYSTEMS AND INFORMATICS (SISY), 2012 IEEE 10TH JUBILEE INTERNATIONAL SYMPOSIUM ON, IEEE, 20 September 2012 (2012-09-20), pages 177 - 181, XP032265283, ISBN: 978-1-4673-4751-8, DOI: 10.1109/SISY.2012.6339510 *
TOMAS BRYCHCIN ET AL: "Morphological based language models for inflectional languages", INTELLIGENT DATA ACQUISITION AND ADVANCED COMPUTING SYSTEMS (IDAACS), 2011 IEEE 6TH INTERNATIONAL CONFERENCE ON, IEEE, 15 September 2011 (2011-09-15), pages 560 - 563, XP031990283, ISBN: 978-1-4577-1426-9, DOI: 10.1109/IDAACS.2011.6072829 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871534A (en) * 2019-01-10 2019-06-11 北京海天瑞声科技股份有限公司 Method, device, equipment and storage medium for generating a Chinese-English mixed corpus
CN109871534B (en) * 2019-01-10 2020-03-24 北京海天瑞声科技股份有限公司 Method, device and equipment for generating Chinese-English mixed corpus and storage medium

Similar Documents

Publication Publication Date Title
US8719021B2 (en) Speech recognition dictionary compilation assisting system, speech recognition dictionary compilation assisting method and speech recognition dictionary compilation assisting program
US5878390A (en) Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
JP2003505778A (en) Phrase-based dialogue modeling with specific use in creating recognition grammars for voice control user interfaces
Sak et al. Morphology-based and sub-word language modeling for Turkish speech recognition
US8255220B2 (en) Device, method, and medium for establishing language model for expanding finite state grammar using a general grammar database
EP2950306A1 (en) A method and system for building a language model
Kipyatkova et al. Recurrent neural network-based language modeling for an automatic Russian speech recognition system
Tachbelie et al. Syllable-based and hybrid acoustic models for amharic speech recognition
WO2014189399A1 (en) A mixed-structure n-gram language model
Tanigaki et al. A hierarchical language model incorporating class-dependent word models for OOV words recognition
EP4295358A1 (en) Lookup-table recurrent language model
Al-Anzi et al. Performance evaluation of sphinx and HTK speech recognizers for spoken Arabic language
Maskey et al. A phrase-level machine translation approach for disfluency detection using weighted finite state transducers
Donaj et al. Context-dependent factored language models
KR20050101694A (en) A system for statistical speech recognition with grammatical constraints, and method thereof
Sakti et al. Unsupervised determination of efficient Korean LVCSR units using a Bayesian Dirichlet process model
Smaïli et al. An hybrid language model for a continuous dictation prototype
Hasegawa-Johnson et al. Fast transcription of speech in low-resource languages
Duchateau et al. Handling disfluencies in spontaneous language models
Alumae Sentence-adapted factored language model for transcribing Estonian speech
Isotani et al. Speech recognition using a stochastic language model integrating local and global constraints
Sas et al. Pipelined language model construction for Polish speech recognition
Ogawa et al. Word class modeling for speech recognition with out-of-task words using a hierarchical language model.
Brugnara et al. Techniques for approximating a trigram language model
Bahrani et al. Building statistical language models for persian continuous speech recognition systems using the peykare corpus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13734523

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13734523

Country of ref document: EP

Kind code of ref document: A1