CN106782510B - Place name voice signal recognition method based on continuous Gaussian mixture HMM model

Info

Publication number: CN106782510B
Application number: CN201611177818.6A
Authority: CN
Inventors: 蔡熙; 聂腾云; 赖雪军; 谢巍; 车松勋
Original assignee: Shanghai Yunda Freight Co ltd; Suzhou Jinfeng Iot Technology Co ltd
Current assignee: Shanghai Yunda Hi Tech Co ltd
Priority date: 2016-12-19
Filing date: 2016-12-19
Publication date: 2020-06-02
Anticipated expiration: 2036-12-19
Also published as: CN106782510A

Abstract

The invention discloses a place name speech signal recognition method based on a continuous Gaussian mixture HMM model, in which the continuous Gaussian mixture HMM model is trained as follows: defining and initializing an HMM model; substituting the feature matrix of a class of place name voice signals into the model for training; computing the probability of occurrence of the place name voice signals from the trained model parameters; comparing this probability with the output probability before training and judging whether the relative error meets the output condition; if it does, outputting the HMM model corresponding to the place name voice signals; if not, judging whether the number of training iterations has reached the maximum training threshold; if the threshold has not been reached, training again, and if it has been reached, outputting the HMM model; and substituting the feature matrices of multiple classes of place name voice signals into the model to obtain a plurality of HMM models corresponding to different place names, which together form a place name speech recognition model library. The invention yields HMM models and a model library suited to isolated-word place name speech recognition, and creates the conditions for accurate place name speech recognition.

Description

Place name voice signal recognition method based on continuous Gaussian mixture HMM model

Technical Field

The invention relates to a place name voice signal recognition method, in particular to a place name voice signal recognition method based on a continuous Gaussian mixture HMM model.

Background

With the rapid development of the economy and the increasingly prominent trend of globalization, the modern logistics industry has developed at an unprecedented pace in developed countries and has generated huge economic and social benefits. Logistics resources such as transportation, storage, sorting, packaging, and distribution are spread across many fields, including manufacturing, agriculture, and the circulation industry.

At the present stage, sorting is still done largely by hand. Workers spend long periods in a noisy environment, which inevitably causes physical and mental fatigue, and the monotony and repetitiveness of the task make them inattentive. Sorting accuracy therefore declines and irrecoverable sorting errors become more frequent. Manual inspection of product sorting on a production line can no longer meet the requirements of modern industry.

Speech recognition has developed to the point where, as an important interface for human-computer interaction, it has changed people's lives in many ways. From voice control systems in smart homes to in-vehicle speech recognition systems, it has brought people much convenience. Integrating speech recognition technology into the logistics sorting link is therefore an inevitable requirement for the development of the logistics industry.

One key to combining the logistics sorting link with speech recognition technology is how to recognize place name voice signals accurately, which provides the technical support for automatically and accurately sorting articles to their designated destinations. At present, technologies for speech recognition of isolated-word place names are rarely seen, so research and development of place name speech recognition technology is urgently needed.

Disclosure of Invention

The invention aims to solve the problems in the prior art and provides a place name speech signal recognition method based on a continuous Gaussian mixture HMM model.

The purpose of the invention is realized by the following technical scheme:

The place name voice signal recognition method based on the continuous Gaussian mixture HMM model comprises a training process of the continuous Gaussian mixture HMM model and a place name voice recognition process, wherein the training process of the continuous Gaussian mixture HMM model comprises the following steps:

S1, defining a continuous Gaussian mixture HMM model comprising the following parameters, $\lambda = (N, M, A, \pi, B)$, wherein:

N, the number of model states, is 4;

M, the number of Gaussian functions corresponding to each state: each state comprises 3 39-dimensional Gaussian functions, and the number of Gaussian functions is the same for each of the N states in one model;

A, the state transition probability matrix, $A = \{a_{ij}\}$, $a_{ij} = P[q_{t+1} = j \mid q_t = i]$, $1 \le i, j \le N$, where $q_t = i$ denotes being in state i at time t, $q_{t+1} = j$ denotes being in state j at time t+1, and $a_{ij}$ as a whole denotes the probability of transitioning from state i to state j;

π, the initial probability distribution over the states, $\pi = \{\pi_i\}$, $\pi_i = P[q_1 = i]$, $1 \le i \le N$, where $\pi_i$ denotes the probability of starting from state i, and the subscript i indexes the starting probability of each state;

B, the output probability density function, $B = \{b_j(o)\}$,

$$b_j(o) = \sum_{l=1}^{M} c_{jl}\,\mathcal{N}(o;\ \mu_{jl},\ U_{jl}), \quad 1 \le j \le N,$$

where o is an observation vector and M is the number of Gaussian elements contained in each state; $c_{jl}$ is the weight of the l-th Gaussian mixture function of the j-th state, $\mathcal{N}$ is the normal (Gaussian) probability density function, $\mu_{jl}$ is the mean vector of the l-th Gaussian mixture element of the j-th state, and $U_{jl}$ is the covariance matrix of the l-th Gaussian mixture element of the j-th state;

S2, model initialization: the initial state distribution $\pi = \{\pi_i\}$ is set to the vector (1, 0, 0, 0); in the state transition matrix A, the probability of a state transitioning to itself and the probability of transitioning to the next state are each 0.5; each Gaussian function is a 39-dimensional function with mean 0 and variance 1, and each weight is 1/3;

S3, substituting the feature matrix of a class of place name voice signals into the model, and performing one round of model parameter training using the Baum-Welch iterative algorithm; a class of place name voice signals is obtained by putting together the feature matrix data of all sample voice signals of one place name and clustering them with the k-means clustering method into 4 classes corresponding to the 4 states;

S4, calculating the probability of occurrence of the class of place name voice signals using the Viterbi algorithm according to the calculated model parameters;

S5, comparing this probability with the output probability before training, and judging whether the relative error between the two meets the output condition;

S6, if the output condition is met, outputting the continuous Gaussian mixture HMM model corresponding to the class of place name voice signals;

S7, if the output condition is not met, judging whether the number of training iterations has reached the maximum training threshold;

S8, if the number of training iterations has not reached the maximum training threshold, repeating steps S3-S7; if it has reached the maximum training threshold, terminating the training and outputting the continuous Gaussian mixture HMM model;

S9, substituting the feature matrices of multiple classes of place name voice signals into the model and repeating steps S3-S8 to obtain a plurality of continuous Gaussian mixture HMM models corresponding to different place names, all of which together form the place name voice recognition model library.
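The control flow of steps S3 to S8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `reestimate` and `score` are hypothetical callbacks standing in for one Baum-Welch pass (S3) and the Viterbi output probability (S4), the relative error is assumed to be |new − old| / |old|, and the default `max_iters` value is an arbitrary placeholder because the source does not give the maximum training threshold.

```python
from typing import Callable, Tuple, TypeVar

Model = TypeVar("Model")


def train_until_converged(
    model: Model,
    reestimate: Callable[[Model], Model],  # hypothetical: one Baum-Welch pass (S3)
    score: Callable[[Model], float],       # hypothetical: Viterbi output probability (S4)
    rel_tol: float = 1e-6,                 # output condition from the source
    max_iters: int = 100,                  # assumed value; not given in the source
) -> Tuple[Model, int, bool]:
    """Return (model, iterations used, whether the output condition was met)."""
    prev = score(model)                    # output probability before training
    for it in range(1, max_iters + 1):
        model = reestimate(model)          # S3
        cur = score(model)                 # S4
        # S5: relative error, assumed here to be |new - old| / |old|
        rel_err = abs(cur - prev) / max(abs(prev), 1e-300)
        if rel_err < rel_tol:              # S6: converged, output the model
            return model, it, True
        prev = cur                         # S7/S8: keep training
    return model, max_iters, False         # S8: threshold reached, output anyway
```

Step S9 would then call this function once per place name, each time with the feature data of that place name, and store the returned models in a dictionary keyed by place name.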

Preferably, in the place name speech signal recognition method based on the continuous Gaussian mixture HMM model, the process of calculating the model parameters with the Baum-Welch algorithm in step S3 is as follows:

S31, constructing an objective optimization function Q by the Lagrange multiplier method, with all parameters of the continuous Gaussian mixture HMM model as variables;

S32, setting the partial derivative of Q with respect to each variable to 0, and deriving the relationship between the new HMM parameters and the old HMM parameters when Q reaches its extremum, thereby obtaining the estimate of each HMM parameter;

S33, iterating repeatedly with the functional relationship between the new and old HMM model parameters until the HMM model parameters converge.
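As a worked illustration of S31 and S32, the standard textbook form of this derivation for the transition probabilities is given below. The source does not spell out its formulas, so this should be read as the usual Baum-Welch result rather than as the patent's exact expressions. The symbols $\bar{\lambda}$ (new parameters), $\eta_i$ (Lagrange multiplier), $J_i$, $\xi_t$ and $\gamma_t$ are introduced here and do not appear in the source; n is the number of frames.

```latex
% S31: auxiliary function, maximized over the new parameters \bar{\lambda}
Q(\lambda, \bar{\lambda}) = \sum_{q} P(q \mid O, \lambda)\, \log P(O, q \mid \bar{\lambda})

% Lagrangian for row i of A, enforcing \sum_j \bar{a}_{ij} = 1
J_i = Q(\lambda, \bar{\lambda}) + \eta_i \Bigl( 1 - \sum_{j=1}^{N} \bar{a}_{ij} \Bigr)

% S32: setting \partial J_i / \partial \bar{a}_{ij} = 0 and eliminating \eta_i
\bar{a}_{ij} = \frac{\sum_{t=1}^{n-1} \xi_t(i,j)}{\sum_{t=1}^{n-1} \gamma_t(i)},
\qquad
\xi_t(i,j) = P(q_t = i,\ q_{t+1} = j \mid O, \lambda),
\qquad
\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)
```

The mixture weights, mean vectors and covariance matrices are obtained in the same way, each with its own constraint, and S33 repeats the update with the new parameters taking the place of the old ones.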

Preferably, the place name speech signal recognition method based on the continuous mixture gaussian HMM model, wherein: in the step S6, if the relative error is less than 0.000001, it indicates that the model training has converged and the output condition is satisfied.

Preferably, the place name speech signal recognition method based on the continuous mixture gaussian HMM model, wherein: the place name voice recognition process is as follows:

S10, substituting a 39-dimensional place name voice signal feature matrix into the established place name voice recognition model library, computing with the Viterbi algorithm the output probability of the continuous Gaussian mixture HMM model corresponding to each class of place name voice signals, and recognizing the feature matrix as the class with the maximum output probability.

S110, inputting the n×39 feature matrix of an unknown place name voice signal into the continuous Gaussian mixture HMM model corresponding to one class of place name voice signals in the established place name voice recognition model library, and denoting it as the observation sequence $O = (o_1, o_2, \ldots, o_n)$; let $P_{in}$ denote the probability of being in state i after the n consecutive frames have been input, $p_{in}$ denote the probability of observing the n-th frame signal in state i, and $a_{ij}$ denote the probability of transitioning from state i to state j;

when the 1st frame signal is input, $p_{i1} = f_i(o_1)$, $1 \le i \le 4$, where $f_i(o_1)$ denotes the probability of the first frame vector occurring in state i;

since the initial state is 1, $P_{11} = p_{11}$; $P_{21} = 0$; $P_{31} = 0$; $P_{41} = 0$;

when the 2nd frame signal is input, $p_{i2} = f_i(o_2)$, $1 \le i \le 4$;

then $P_{i2} = \max_{1 \le j \le 4} \{P_{j1}\, a_{ji}\, p_{i2}\}$, where $a_{ji}$ denotes the probability of transitioning from state j to state i;

and so on;

when the n-th frame signal is input, $p_{in} = f_i(o_n)$, $1 \le i \le 4$;

$P_{in} = \max_{1 \le j \le 4} \{P_{j(n-1)}\, a_{ji}\, p_{in}\}$, where n is the number of frames in a segment of voice signal;

when all frame signals of the unknown place name voice signal have been input, $P_{1n}, P_{2n}, P_{3n}, P_{4n}$ are obtained, and the largest of them is the probability that the unknown place name voice signal occurs in the continuous Gaussian mixture HMM model corresponding to that class of place name voice signals;

S120, substituting the feature matrix of the unknown place name voice signal into the continuous Gaussian mixture HMM models corresponding to all other classes of place name voice signals to obtain the probability of the unknown place name voice signal occurring in each model, and assigning the unknown place name voice signal to the class whose continuous Gaussian mixture HMM model gives the highest probability.
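The recursion of step S110 can be sketched directly in Python. This is a minimal illustration that assumes the per-frame state probabilities $p_{in} = f_i(o_n)$ have already been computed and stored in an array; the function name, the array layout and the numbers in the usage example are hypothetical, and the last row of the example transition matrix (state 4 staying in state 4) is an assumption, since the source does not state it.

```python
import numpy as np


def viterbi_probability(emis: np.ndarray, A: np.ndarray) -> float:
    """Best-path probability of one observation sequence under one model.

    emis[t, i] holds p_{i,t+1} = f_i(o_{t+1}); A[j, i] holds a_ji.
    The path is forced to start in state 1 (index 0), as in the source.
    """
    n_frames, n_states = emis.shape
    P = np.zeros(n_states)
    P[0] = emis[0, 0]                      # P_11 = p_11, P_21 = P_31 = P_41 = 0
    for t in range(1, n_frames):
        # P_i,t = max_j { P_j,t-1 * a_ji } * p_i,t
        P = (P[:, None] * A).max(axis=0) * emis[t]
    return float(P.max())                  # largest of P_1n ... P_4n


# Illustrative numbers only (not from the source).
A = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 1.0]])      # last row is an assumption
emis = np.array([[0.2, 0.1, 0.1, 0.1],
                 [0.3, 0.4, 0.1, 0.1]])
print(viterbi_probability(emis, A))
```

With these two frames the best path moves from state 1 to state 2, giving 0.2 × 0.5 × 0.4 = 0.04. For real utterances with many frames the product of probabilities becomes very small, so a practical implementation would likely work with logarithms instead.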

The technical scheme of the invention has the advantages that:

The method is ingeniously designed and its process is reasonable. By collecting a large number of place name voice samples and applying a sound algorithm and optimized training conditions, it can effectively train continuous Gaussian mixture HMM models suited to isolated-word place name speech recognition and build a place name voice recognition model library, thereby laying the foundation for subsequent place name speech recognition and supporting accurate place name recognition.

The invention exploits the characteristics of place name voice signals: the selected continuous Gaussian mixture model has 4 states, each state comprises 3 39-dimensional Gaussian functions, and the feature matrix of the place name voice signals is likewise 39-dimensional, so the amount of computation is greatly reduced and both model training and speech recognition are faster.

Drawings

FIG. 1 is a schematic process diagram of the present invention;

FIG. 2 is a schematic diagram of the hidden Markov chain of the present invention.

Detailed Description

Objects, advantages and features of the present invention will be illustrated and explained by the following non-limiting description of preferred embodiments. The embodiments are merely examples of applying the technical solution of the present invention; any technical solution formed by equivalent substitution or equivalent transformation falls within the claimed scope of the present invention.

The invention discloses a place name speech signal recognition method based on a continuous Gaussian mixture HMM model, which comprises a training process of the continuous Gaussian mixture HMM model and a place name speech recognition process, wherein as shown in the attached figure 1, the training process of the continuous Gaussian mixture HMM model comprises the following steps:

N, the number of model states, is 4;

A, the state transition probability matrix, $A = \{a_{ij}\}$, $a_{ij} = P[q_{t+1} = j \mid q_t = i]$, $1 \le i, j \le N$, where $q_t = i$ denotes being in state i at time t, $q_{t+1} = j$ denotes being in state j at time t+1, and $a_{ij}$ as a whole denotes the probability of transitioning from state i to state j;

B, the output probability density function, $B = \{b_j(o)\}$,

$$b_j(o) = \sum_{l=1}^{M} c_{jl}\,\mathcal{N}(o;\ \mu_{jl},\ U_{jl}), \quad 1 \le j \le N,$$

where o is an observation vector and M is the number of Gaussian elements contained in each state; $c_{jl}$ is the weight of the l-th Gaussian mixture function of the j-th state, $\mathcal{N}$ is the normal (Gaussian) probability density function, $\mu_{jl}$ is the mean vector of the l-th Gaussian mixture element of the j-th state, and $U_{jl}$ is the covariance matrix of the l-th Gaussian mixture element of the j-th state.
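As a minimal sketch of how the output density $b_j(o)$ of one state could be evaluated, the following Python code computes its logarithm for a weighted sum of Gaussians. It assumes diagonal covariance matrices (the source speaks only of a covariance matrix $U_{jl}$), and the function names are hypothetical. Working in the log domain is a numerical choice made here, not something the source prescribes.

```python
import numpy as np


def log_gaussian_diag(o, mean, var):
    """Log of a Gaussian density with diagonal covariance (assumption)."""
    d = o.shape[0]
    return -0.5 * (d * np.log(2.0 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((o - mean) ** 2 / var))


def log_output_density(o, weights, means, variances):
    """log b_j(o) for one state j, with b_j(o) = sum_l c_jl * N(o; mu_jl, U_jl)."""
    log_terms = np.array([
        np.log(c) + log_gaussian_diag(o, m, v)
        for c, m, v in zip(weights, means, variances)
    ])
    peak = log_terms.max()
    # log-sum-exp keeps the sum stable when the individual densities are tiny
    return float(peak + np.log(np.sum(np.exp(log_terms - peak))))


# One state with M = 3 mixtures of 39-dimensional Gaussians.
o = np.zeros(39)
weights = np.full(3, 1.0 / 3.0)
means = np.zeros((3, 39))
variances = np.ones((3, 39))
print(log_output_density(o, weights, means, variances))
```

In 39 dimensions the density values themselves are very small numbers, which is why the sketch returns the logarithm rather than $b_j(o)$ directly.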

S2, after the model is defined, the model parameters are initialized. Specifically, the initial state distribution $\pi = \{\pi_i\}$ is set to the vector (1, 0, 0, 0); in the state transition matrix A, the probability of a state transitioning to itself and the probability of transitioning to the next state are each 0.5; each Gaussian function is 39-dimensional with mean 0 and variance 1, and each weight is 1/3.
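The initialization in S2 can be written out as a short Python sketch. Two points are assumptions rather than statements from the source: the variances are stored per dimension (a diagonal covariance), and the last state is given a self-transition probability of 1.0, because the source does not say where state 4 transitions. The function name and dictionary keys are hypothetical.

```python
import numpy as np

N_STATES, N_MIX, DIM = 4, 3, 39


def init_model():
    """Initial parameters as described in S2 (dictionary layout is hypothetical)."""
    pi = np.array([1.0, 0.0, 0.0, 0.0])          # always start in state 1
    A = np.zeros((N_STATES, N_STATES))
    for i in range(N_STATES - 1):
        A[i, i] = 0.5                            # stay in the same state
        A[i, i + 1] = 0.5                        # move to the next state
    A[-1, -1] = 1.0                              # assumption: last state absorbs
    weights = np.full((N_STATES, N_MIX), 1.0 / N_MIX)
    means = np.zeros((N_STATES, N_MIX, DIM))     # mean 0 in every dimension
    variances = np.ones((N_STATES, N_MIX, DIM))  # variance 1 in every dimension
    return {"pi": pi, "A": A, "weights": weights,
            "means": means, "variances": variances}


model = init_model()
print(model["A"])
```

Every row of A and every row of the weight matrix sums to 1, so the starting point is a valid left-to-right HMM before any training.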

S3, substituting the feature matrix of a class of place name voice signals into the model and performing one round of model parameter training using the Baum-Welch iterative algorithm. Here, a class of place name voice signals means that the feature matrix data of all sample voice signals of one place name are put together and clustered with the k-means clustering method, so that vectors close to one another are grouped into one class; there are 4 classes, corresponding to the 4 states. Four classes are chosen because too few states make the result inaccurate, while too many states make the amount of computation large. The Baum-Welch iterative algorithm is in effect an application of the maximum likelihood (ML) criterion and uses iterative optimization; the detailed process is as follows:

S31, constructing an objective optimization function Q by the Lagrange multiplier method, with all parameters of the continuous Gaussian mixture HMM model as variables;

S4, calculating the probability of occurrence of the class of place name voice signals using the Viterbi algorithm according to the calculated model parameters.

S5, comparing the probability calculated in step S4 with the output probability before training, and judging whether the relative error between the two meets the output condition; the loop ends when the output condition is met.

S6, if the output condition is met, that is, the relative error is less than 0.000001, the model training has converged, and the continuous Gaussian mixture HMM model corresponding to the class of place name voice signals is output.

S7, if the output condition is not met, that is, the relative error is greater than 0.000001, judging whether the number of training iterations has reached the maximum training threshold. The maximum training threshold is set because, when there are few training samples, the training process may fall into an infinite loop; with a maximum number of training iterations, training terminates normally, whereas otherwise it could continue indefinitely.

S8, if the number of training iterations has not reached the maximum training threshold, repeating steps S3-S7; if it has reached the maximum training threshold, terminating the training and outputting the continuous Gaussian mixture HMM model.
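The k-means grouping mentioned in step S3 can be illustrated with a plain NumPy sketch. The source names k-means but gives no details, so the random initialization from the data, the Euclidean distance, the iteration limit and the function name are all assumptions made for this example.

```python
import numpy as np


def kmeans(frames: np.ndarray, k: int = 4, n_iters: int = 50, seed: int = 0):
    """Group feature vectors into k classes (one per HMM state).

    frames has shape (number of frames, 39); returns (labels, centers).
    """
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct frames picked at random.
    centers = frames[rng.choice(len(frames), size=k, replace=False)].astype(float)
    labels = np.zeros(len(frames), dtype=int)
    for it in range(n_iters):
        # Distance from every frame to every center, shape (frames, k).
        dists = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                                   # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = frames[labels == c]
            if len(members):                        # leave empty clusters unchanged
                centers[c] = members.mean(axis=0)
    return labels, centers


# Synthetic stand-in for the pooled 39-dimensional frames of one place name.
rng = np.random.default_rng(1)
frames = rng.normal(size=(200, 39))
labels, centers = kmeans(frames)
print(centers.shape)
```

One plausible use of the result is to seed the mean vectors of the 4 states with the cluster centers before Baum-Welch training, although the source does not say how the clusters are used beyond their correspondence to the states.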

After the place name voice recognition model library is formed, the feature matrix obtained by feature extraction from any place name voice signal is input into the model library for recognition, and the process is as follows:

In detail, among the continuous Gaussian mixture HMM models corresponding to the different place names, each model corresponds to a hidden Markov chain as shown in FIG. 2, and its parameters comprise a 4-state transition matrix and the Gaussian functions of the four states 1-4. Recognition of an unknown place name voice signal therefore proceeds as follows:

S110, inputting the n×39 feature matrix of an unknown place name voice signal into the continuous Gaussian mixture HMM model corresponding to one class of place name voice signals in the established place name voice recognition model library, and denoting it as the observation sequence $O = (o_1, o_2, \ldots, o_n)$; let $P_{in}$ denote the probability of being in state i after the n consecutive frames have been input, $p_{in}$ denote the probability of observing the n-th frame signal in state i, and $a_{ij}$ denote the probability of transitioning from state i to state j;

since the initial state is defined as state 1 and no other state, only the probability for state 1 is calculated, so $P_{11} = p_{11}$; $P_{21} = 0$; $P_{31} = 0$; $P_{41} = 0$;

when the 2nd frame signal is input, $p_{i2} = f_i(o_2)$, $1 \le i \le 4$;

then $P_{i2} = \max_{1 \le j \le 4} \{P_{j1}\, a_{ji}\, p_{i2}\}$, where $P_{j1}$ denotes the probability of being in state j after the first frame signal and $a_{ji}$ denotes the probability of transitioning from state j to state i;

and so on;

when the n-th frame signal is input, $p_{in} = f_i(o_n)$, $1 \le i \le 4$;

$P_{in} = \max_{1 \le j \le 4} \{P_{j(n-1)}\, a_{ji}\, p_{in}\}$, where n is the number of frames in a segment of voice signal;

when all frame signals of the unknown place name voice signal have been input, the last frame can only end in one of states 1-4, so only 4 probabilities are obtained, namely $P_{1n}, P_{2n}, P_{3n}, P_{4n}$; the largest of them is the probability that the unknown place name voice signal occurs in the continuous Gaussian mixture HMM model corresponding to that class of place name voice signals;

S120, substituting the feature matrix of the unknown place name voice signal into the continuous Gaussian mixture HMM models corresponding to all other classes of place name voice signals to obtain the probability of the unknown place name voice signal occurring in each model, and assigning the unknown place name voice signal to the class whose continuous Gaussian mixture HMM model gives the highest probability.
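Steps S110 and S120 together amount to scoring the unknown signal against every model in the library and picking the best. The Python sketch below shows that selection in the log domain, which avoids numerical underflow when many frame probabilities are multiplied; the log-domain formulation is a choice made for this example and is not taken from the source. The function names, the dictionary layout and the place name labels are hypothetical, and the per-frame log probabilities $\log f_i(o_t)$ are assumed to be precomputed for each model.

```python
from typing import Dict

import numpy as np


def viterbi_log_score(log_emis: np.ndarray, log_A: np.ndarray) -> float:
    """Log of the best-path probability, starting in state 1 (index 0).

    log_emis[t, i] = log f_i(o_{t+1}); log_A[j, i] = log a_ji (-inf where a_ji = 0).
    """
    n_frames, n_states = log_emis.shape
    logP = np.full(n_states, -np.inf)
    logP[0] = log_emis[0, 0]
    for t in range(1, n_frames):
        # Products in the source's recursion become sums of logarithms.
        logP = (logP[:, None] + log_A).max(axis=0) + log_emis[t]
    return float(logP.max())


def recognize(log_emis_by_name: Dict[str, np.ndarray],
              log_A_by_name: Dict[str, np.ndarray]) -> str:
    """Return the place name whose model gives the highest score (S120)."""
    scores = {name: viterbi_log_score(log_emis_by_name[name], log_A_by_name[name])
              for name in log_emis_by_name}
    return max(scores, key=scores.get)


# Illustrative numbers only; the labels are hypothetical.
A = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 1.0]])       # last row is an assumption
log_A = np.where(A > 0, np.log(np.where(A > 0, A, 1.0)), -np.inf)
log_emis = {"place_a": np.log(np.full((3, 4), 0.5)),
            "place_b": np.log(np.full((3, 4), 0.1))}
print(recognize(log_emis, {"place_a": log_A, "place_b": log_A}))
```

Because the logarithm is monotonic, the model with the largest log score is the same one that has the largest probability in the source's formulation.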

The invention has various embodiments, and all technical solutions formed by equivalent substitution or equivalent transformation fall within the protection scope of the invention.

Claims

1. The place name speech signal recognition method based on the continuous Gaussian mixture HMM model is characterized by comprising the following steps: the method comprises a training process of a continuous Gaussian mixture HMM model and a place name speech recognition process, wherein the training process of the continuous Gaussian mixture HMM model comprises the following steps:

N, the number of model states, is 4;

π, the initial probability distribution over the states, $\pi = \{\pi_i\}$, $\pi_i = P[q_1 = i]$, $1 \le i \le N$, where $\pi_i$ denotes the probability of starting from state i, and the subscript i indexes the starting probability of each state;

B, the output probability density function, $B = \{b_j(o)\}$,

$$b_j(o) = \sum_{l=1}^{M} c_{jl}\,\mathcal{N}(o;\ \mu_{jl},\ U_{jl}),$$

wherein o is an observation vector and M is the number of Gaussian functions contained in each state; $c_{jl}$ is the weight of the l-th Gaussian mixture function of the j-th state, $\mathcal{N}$ is the normal (Gaussian) probability density function, $\mu_{jl}$ is the mean vector of the l-th Gaussian mixture element of the j-th state, and $U_{jl}$ is the covariance matrix of the l-th Gaussian mixture element of the j-th state;

S2, model initialization: the initial state distribution $\pi = \{\pi_i\}$ is set to the vector (1, 0, 0, 0); in the state transition matrix A, the probability of a state transitioning to itself and the probability of transitioning to the next state are each 0.5; each Gaussian function is a 39-dimensional function with mean 0 and variance 1, and each weight is 1/3;

S6, if the output condition is met, outputting the continuous Gaussian mixture HMM model corresponding to the class of place name voice signals; if the relative error is less than 0.000001, the model training has converged and the output condition is met;

2. The method of recognizing place name speech signal based on continuous mixture gaussian HMM model according to claim 1, wherein: in the step S3, the process of calculating the model parameters by using the Baum-Welch algorithm is as follows:

3. The method of recognizing place name speech signal based on continuous mixture gaussian HMM model according to claim 1, wherein: the place name voice recognition process is as follows:

4. The method of recognizing place name speech signal based on continuous mixture gaussian HMM model according to claim 1, wherein: the place name voice recognition process is as follows:

when the 2nd frame signal is input, $p_{i2} = f_i(o_2)$, $1 \le i \le 4$;

and so on;

when the n-th frame signal is input, $p_{in} = f_i(o_n)$, $1 \le i \le 4$;