CN112466299B - Voice theme recognition method - Google Patents

Voice theme recognition method

Info

Publication number
CN112466299B
Authority
CN
China
Prior art keywords
mfcc feature
feature vector
audio sample
vector
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011347449.7A
Other languages
Chinese (zh)
Other versions
CN112466299A (en)
Inventor
冯鹏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202011347449.7A priority Critical patent/CN112466299B/en
Publication of CN112466299A publication Critical patent/CN112466299A/en
Application granted granted Critical
Publication of CN112466299B publication Critical patent/CN112466299B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a voice theme recognition method comprising the following steps: extracting MFCC feature vectors from the audio samples; calculating the Euclidean distances between the MFCC feature vectors; clustering the MFCC feature vectors according to the Euclidean distances, dividing them into a plurality of classes, and obtaining the word vector corresponding to each MFCC feature vector; inputting the word vectors corresponding to the MFCC feature vectors into an LDA model and outputting the topic category of each audio sample; and inputting the MFCC feature vector sequence into an LSTM model, in which a Softmax function classifies the sequence to obtain the topic category of the audio sample. The application can accurately obtain the distribution of each topic in a sound, improves the accuracy of topic discrimination, largely eliminates the influence of possibly useless signals, and acquires more feature information with stronger timeliness.

Description

Voice theme recognition method
Technical Field
The application relates to the technical field of voice recognition, in particular to a voice theme recognition method.
Background
Voice recognition is a widely researched topic with extensive practical applications, and it has received increasing attention in the smart-home field in particular. In today's rapidly developing, economically networked towns, old-fashioned security systems and simple manually operated speaker devices can no longer meet people's needs. People increasingly rely on sound-recognition devices to help monitor home security, keep track of the state of household pets, or manage daily life: for example, monitoring a visitor's identity from the sound outside the door, detecting whether a pet is barking loudly and disturbing the neighbours, or calling an AI smart speaker to play songs, query the weather, or set an alarm clock. However, because such sounds cover a wide frequency range and different classes can be highly similar, recognition accuracy is often low and recognition may even fail.
At present, most existing voice recognition methods use spectrograms or raw audio signal features. When segmenting the audio and extracting signal features, however, they usually build the feature sequence purely in temporal order. In real life the voice signal may not be continuous: extracting features strictly in temporal order may yield long blank segments in which the target voice content is very low and noise dominates. Spectrograms or feature sequences containing long stretches of noise cannot represent the signal features of the target voice well, the analysis of the audio signal is too shallow, no judgment is made about the subject scene in which the voice occurs, and the recognizable voice types are too limited, with low accuracy and poor diversity.
Disclosure of Invention
The embodiment of the application provides a voice theme recognition method that can accurately obtain the distribution of each topic in a voice, improve the accuracy of topic discrimination, largely eliminate the influence of possibly useless signals, and acquire more feature information with stronger timeliness.
In view of this, a first aspect of the present application provides a method for identifying a voice subject, the method comprising:
acquiring a plurality of audio samples, dividing the plurality of audio samples into a training set and a testing set, wherein the plurality of audio samples respectively belong to a plurality of sound categories;
extracting a first MFCC feature vector of a first audio sample in the training set;
calculating a first Euclidean distance between the first MFCC feature vectors;
clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class;
calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample;
and acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into an LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject class of the second audio sample.
Optionally, before the extracting of the first MFCC feature vector of the first audio sample in the training set, the method further includes:
the first audio sample is segmented.
Optionally, the extracting of a first MFCC feature vector of a first audio sample in the training set specifically includes:
extracting the first MFCC feature vector of each segment of the first audio sample in the training set; for the φ-th segment of the first audio sample, after passing through N filters, the output of the ω-th filter at time t is E_S, and the MFCC feature vector at time t is calculated as follows:
Optionally, the clustering of the first MFCC feature vectors according to the value of the Euclidean distance, and the dividing of the first MFCC feature vectors into a plurality of classes, each class comprising a plurality of word vectors and each word vector corresponding to one sound class, is specifically:
clustering the first MFCC feature vectors according to the value of the Euclidean distance, wherein the first audio samples corresponding to the first MFCC feature vectors belong to a plurality of sound categories and each sound category corresponds to one word vector, so that a class contains a plurality of word vectors.
Optionally, the calculating of a second Euclidean distance between the first MFCC feature vector and the word vector, and the obtaining, when the value of the second Euclidean distance is minimum, of the word vector corresponding to the first MFCC feature vector, specifically includes:
calculating the second Euclidean distance ρ from the first MFCC feature vector (x_1, y_1) to the word vector (x_2, y_2):
ρ = sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2);
and the word vector with the minimum Euclidean distance with the first MFCC feature vector is the word vector corresponding to the first MFCC feature vector.
Optionally, the inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample specifically includes:
inputting the word vector corresponding to the first MFCC feature vector into an LDA model;
calculating the probability P of occurrence of the word vector in each first audio sample:
wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vector θ in each first audio sample, wherein the topic vector with the largest probability value gives the topic category of the first audio sample.
Optionally, the obtaining of a second MFCC feature vector sequence of the second audio sample in the test set, the inputting of the second MFCC feature vector sequence into an LSTM model, and the classifying of the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject class of the second audio sample specifically includes:
obtaining a second MFCC feature vector sequence for a second audio sample in the test set;
inputting the second MFCC feature vector sequence into the LSTM model, where the architecture of the LSTM model is given by:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C);
C_t = f_t * C_{t-1} + i_t * C̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
where W_f, W_i, W_C and W_o are weight parameters and b_f, b_i, b_C and b_o are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden state; f_t, i_t and o_t are the forget gate, input gate and output gate, respectively; σ denotes the sigmoid activation function, whose output is a number between 0 and 1; and C_{t-1} and C_t denote the cell states at time t-1 and time t in the LSTM structure, respectively.
Normalizing the output result by using a Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
From the above technical scheme, the application has the following advantages:
the application provides a voice theme recognition method, which comprises the steps of obtaining a plurality of audio samples, dividing the audio samples into a training set and a testing set, wherein the audio samples respectively belong to a plurality of voice categories; extracting a first MFCC feature vector of a first audio sample in the training set; calculating a first Euclidean distance between the first MFCC feature vectors; clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class; calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum; inputting word vectors corresponding to the first MFCC feature vectors into the LDA model, and outputting the topic category of each first audio sample; and acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into the LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject category of the second audio sample.
The audio sample is clustered by calculating Euclidean distance between MFCC feature vectors, and the corresponding relation between the MFCC feature vectors and word vectors is obtained by calculating Euclidean distance between the MFCC feature vectors and word vectors in the class; and analyzing and classifying the word vectors through the LDA model, so as to determine the topic categories of all the MFCC feature vectors in the audio sample, wherein the topic category with the largest proportion in the audio sample is the topic category of the audio sample. According to the application, the topic to which the MFCC feature vector of the audio sample belongs is analyzed and classified through the LDA model, and then the voice is identified by using the multilayer LSTM neural network, so that the fused model can mine more topic information contained in the audio features, thereby increasing the utilization rate of audio signals in the voice and improving the accuracy of voice identification.
Drawings
FIG. 1 is a flow chart of a method for identifying a voice subject according to one embodiment of the present application;
FIG. 2 is a flow chart of extracting MFCC features in an embodiment of the present application;
fig. 3 is a schematic diagram of an LSTM model structure in an embodiment of the present application.
Detailed Description
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Fig. 1 shows the method flow of an embodiment of the voice theme recognition method according to the present application. As shown in fig. 1, the method includes:
101. Acquiring a plurality of audio samples and dividing the plurality of audio samples into a training set and a test set, wherein the plurality of audio samples respectively belong to a plurality of sound categories.
It should be noted that a large number of audio samples may be obtained in the present application, and the audio samples may include audio samples selected from a plurality of sound types, for example, the audio samples may be alarm sounds, speaking sounds, music sounds, animal sounds, etc.; the acquired audio samples can be divided into a training set and a testing set, wherein the training set can be used for training the LDA model and the LSTM model, and the LSTM model is tested by adopting the testing set; and comparing the theme category output by the LDA model with a test result tested by adopting the LSTM model, thereby identifying the theme category to which the audio sample belongs.
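As a concrete illustration of this step, the sketch below splits a labelled collection of audio files into the two sets. The 50/50 split mirrors the worked example near the end of this description; the directory layout and the use of scikit-learn's train_test_split are assumptions of this sketch, not details given by the patent.

```python
import glob
from sklearn.model_selection import train_test_split

# Hypothetical layout: audio/<sound_category>/<clip>.wav for each sound category
# (e.g. alarm sounds, speaking sounds, music, animal sounds).
paths = sorted(glob.glob("audio/*/*.wav"))
labels = [p.split("/")[-2] for p in paths]   # folder name = sound category

# Half of the samples for training (LDA + LSTM training), half for testing.
train_paths, test_paths, train_labels, test_labels = train_test_split(
    paths, labels, test_size=0.5, stratify=labels, random_state=0)
```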
102. A first MFCC feature vector for a first audio sample in the training set is extracted.
It should be noted that, after pre-emphasis, framing, windowing and discrete Fourier transform are performed on the first audio sample in the training set, the processed first audio sample is filtered with a Mel filter bank and the audio signal features of the first audio sample are calculated; the detailed flow of extracting the MFCC features is shown in fig. 2.
In a specific embodiment, before extracting the first MFCC feature vector of the first audio sample in the training set, further comprises:
1020. the first audio sample is segmented.
It should be noted that, the present application may segment each first audio sample, for example, divide each first audio sample into 10 segments, and extract the first MFCC feature vector of each segment of first audio sample, where the 10 segments of first MFCC feature vectors form a first MFCC feature vector sequence of the first audio sample.
Specifically, for the φ-th segment of the first audio sample, after passing through N filters, the output of the ω-th filter at time t is E_S, and the MFCC feature vector at time t is calculated as follows:
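The formula itself appears as an image in the original publication and is not reproduced here. As a rough illustration only, the following sketch extracts one 30-dimensional MFCC feature vector per 0.5 s segment using the librosa library; the segment length and dimensionality are taken from the worked example later in this description, while librosa itself and the per-segment averaging are assumptions of this sketch rather than the patented formula.

```python
import numpy as np
import librosa

def segment_mfcc_features(path, seg_dur=0.5, n_mfcc=30, sr=16000):
    """Split an audio file into fixed-length segments and return one averaged
    MFCC feature vector per segment (illustrative sketch, not the patented formula)."""
    y, sr = librosa.load(path, sr=sr)
    seg_len = int(seg_dur * sr)
    vectors = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        # n_mfcc coefficients per frame, averaged over frames within the segment
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc)
        vectors.append(mfcc.mean(axis=1))
    return np.stack(vectors)   # shape: (num_segments, n_mfcc)
```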
103. a first euclidean distance between the first MFCC feature vectors is calculated.
It should be noted that the present application may calculate a first Euclidean distance between the first MFCC feature vectors of the first audio sample, and cluster the first MFCC feature vectors according to the magnitude of the first Euclidean distance.
104. Clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class.
It should be noted that, after clustering, one class may contain first MFCC feature vectors corresponding to first audio samples that belong to several different sound classes, so a single class contains a plurality of word vectors (each sound class corresponding to one word vector).
The first MFCC feature vectors are divided into a plurality of classes according to the distances between them using the K-means clustering method, and each first MFCC feature vector is matched to a word vector according to the obtained center point of each class. The application divides the first MFCC feature vector set into K classes whose sizes are N_1, N_2, …, N_K; the class center points a_1, a_2, …, a_K are not fixed, and in order to make the objective function as small as possible, the class center points need to be updated continuously. The objective function may be set as the squared-error function S, calculated as:
S = Σ_{k=1}^{K} Σ_{x_j ∈ N_k} ||x_j - a_k||^2
To make the squared error S as small as possible, an update rule for the class center points is required; taking the derivative of S with respect to a gives the update rule for the class center point a_k:
a_k = (1/n) Σ_{x_j ∈ N_k} x_j
where n represents the number of first MFCC feature vectors in a class, K represents the number of classes, x_j represents a first MFCC feature vector, and a represents a class center point.
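For illustration, here is a minimal K-means sketch over the pooled MFCC feature vectors. It relies on scikit-learn's KMeans (an assumption of this sketch) with K = 180 as in the worked example near the end of the description; the resulting cluster centers play the role of the word vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_word_vectors(mfcc_vectors, k=180, seed=0):
    """Cluster the pooled MFCC feature vectors (num_vectors, 30); the K cluster
    centers serve as the word vectors of the acoustic vocabulary."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(mfcc_vectors)
    return km.cluster_centers_   # shape: (k, 30)
```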
105. And calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum.
In the K-means method, cluster analysis is performed on each class center point, and the first MFCC feature vector is converted into a word vector capable of expressing sound features by calculating the Euclidean distance between the first MFCC feature vector and the word vector. Specifically, the word vector closest to the first MFCC feature vector is used as the word vector corresponding to the first MFCC feature vector.
The application may select a bag-of-words model as the word representation method and compute the similarity between two words, where the similarity measure is the Euclidean distance between two vectors, namely the MFCC feature vector of the audio and the word vector obtained by K-means clustering. Specifically, the second Euclidean distance ρ from a first MFCC feature vector (x_1, y_1) to a word vector (x_2, y_2) is:
ρ = sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2)
the word bag model analyzes the obtained second Euclidean distance and determines the corresponding condition of the first MFCC feature vector of the first audio sample and the word vector, so that the first MFCC feature vector corresponding to each word vector can be obtained.
106. And inputting the word vector corresponding to the first MFCC feature vector into the LDA model, and outputting the theme class of each first audio sample.
It should be noted that, word vectors corresponding to the first MFCC feature vectors are input into the LDA model;
calculating the probability P of occurrence of the word vector in each first audio sample:
wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vectors θ in each first audio sample, wherein the topic vector with the maximum probability value gives the topic category of the first audio sample.
Specifically, by inputting the word vectors corresponding to the first MFCC feature vectors into the LDA model, the distribution p(θ) of the topic vectors over a section of the first audio sample can be obtained. The LDA model introduces a Dirichlet distribution to model p(θ), that is, a distribution over probability distributions, so that the hyperparameter α is converted into a probability distribution over topic vectors; the topic with the largest probability represents the topic scene of the current first audio sample.
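As an illustration of this step, the sketch below fits an LDA model on the per-sample bag-of-words counts produced above and reads off the dominant topic for each audio sample. scikit-learn's LatentDirichletAllocation and the topic count of 8 (taken from the worked example below) are assumptions of this sketch.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def dominant_topics(bow_counts, n_topics=8, seed=0):
    """Fit LDA on per-sample acoustic bag-of-words counts and return, for each
    audio sample, the topic with the highest probability in p(theta)."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(bow_counts)   # each row approximates p(theta | sample)
    return doc_topic.argmax(axis=1), doc_topic
```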
107. And acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into the LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject category of the second audio sample.
It should be noted that the present application may sample the second audio samples in the test set in a skipping manner, for example obtaining one second audio sample section every preset interval, and then extract the second MFCC feature vector of each second audio sample section, so as to obtain the second MFCC feature vector sequence of the second audio sample; the obtained second MFCC feature vector sequence is used as the input data of the LSTM model. In the application, a four-layer LSTM model may be adopted to analyze the second MFCC feature vector sequence; a specific four-layer LSTM model structure is shown in fig. 3. An input gate, a forget gate and an output gate may be arranged in each neuron of the LSTM model, and the architecture is given by:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C);
C_t = f_t * C_{t-1} + i_t * C̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
where W_f, W_i, W_C and W_o are weight parameters and b_f, b_i, b_C and b_o are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden state; f_t, i_t and o_t are the forget gate, input gate and output gate, respectively; σ denotes the sigmoid activation function, whose output is a number between 0 and 1; and C_{t-1} and C_t denote the cell states at time t-1 and time t in the LSTM structure, respectively.
Normalizing the output by using the Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
Specifically, the output vector can be mapped to the range [0,1] by the multi-class Softmax function for normalization, so that the probabilities of the sound belonging to the various possible topic categories all lie between 0 and 1 and sum to 1; the topic category of the second audio sample is the topic category with the largest probability value, which gives the final sound recognition result.
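To make the gate equations above concrete, here is a minimal NumPy sketch of a single LSTM cell step under the stated notation; the weight shapes and the concatenation of h_{t-1} with x_t are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step. W and b are dicts with keys 'f', 'i', 'c', 'o';
    each W[k] has shape (hidden, hidden + input_dim), each b[k] shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state C_t
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state h_t
    return h_t, c_t
```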
In a specific embodiment, the present application further comprises:
training the LSTM model by adopting a training set to obtain a trained LSTM model;
testing the trained LSTM model by adopting a test set to obtain a test result, wherein the test result comprises a theme category corresponding to a second audio sample in the test set;
and comparing the output theme category of the LDA model with the test result to identify the theme category of the audio sample.
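A small sketch of this comparison step is given below; it simply measures how often the LSTM's predicted topic category agrees with the topic category assigned by the LDA model for the same samples (the variable names are hypothetical).

```python
import numpy as np

def topic_agreement(lda_topics, lstm_topics):
    """Fraction of samples for which the LDA topic category and the LSTM
    prediction coincide."""
    lda_topics = np.asarray(lda_topics)
    lstm_topics = np.asarray(lstm_topics)
    return float((lda_topics == lstm_topics).mean())
```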
The application clusters the sample set by calculating Euclidean distance between the MFCC feature vectors, and obtains the corresponding relation between the MFCC feature vectors and the word vectors by calculating Euclidean distance between the MFCC feature vectors and the word vectors in the class; and analyzing and classifying the word vectors through the LDA model, so as to determine the topic categories of all the MFCC feature vectors in the audio sample, wherein the topic category with the largest proportion in the audio sample is the topic category of the audio sample. According to the application, the topic to which the MFCC feature vector of the audio sample belongs is analyzed and classified through the LDA model, and then the voice is identified by using the multilayer LSTM neural network, so that the fused model can mine more topic information contained in the audio features, thereby increasing the utilization rate of audio signals in the voice and improving the accuracy of voice identification.
The application also comprises a specific application example, which is as follows:
The application may use a data set containing 16000 short audio samples whose sound types are divided into 8 classes, namely car horns, dog barks, gunshots, police sirens, knocking, speech, footsteps and music, with 2000 audio samples taken for each sound type; half of the audio sample data set is used as the training set and the other half as the test set, the duration of each audio sample is 5 s, and audio samples shorter than 5 s are padded with silence. To extract the signal features in the audio samples, the application segments the training-set audio samples in the feature extraction stage, for example into 0.5 s segments, and extracts a 30-dimensional MFCC feature vector from each segment to form a 30-dimensional MFCC feature vector sequence; the MFCC feature vectors are clustered by K-means clustering, and the center of each class is obtained from the clustered classes. The K value in the K-means clustering method is set to 180, and the number of topics of the LDA model may be set to 8; the audio samples corresponding to the input MFCC feature vectors are divided into 4 types of theme scenes, namely human voice, animal sound, daily-life sound and security alarm sound. In the test set, the MFCC feature vectors in the audio data need to be extracted in a skipping manner: the 5 s audio segment is divided into 0.1 s pieces and the MFCC features of all pieces are extracted; after the first segment of MFCC feature vectors is selected, one segment of MFCC feature vectors is selected every 0.1 s to form a new MFCC feature vector sequence, which is used as the input data of the subsequent four-layer LSTM model. Finally, in the LSTM neural network model, the number of LSTM layers is set to 4, the batch_size to 100, the learning rate to 0.002, the number of epochs to 200 with 2000 iterations per epoch, and the Dropout value to 0.5; the ReLU function is selected as the activation function, Adam is used as the optimizer, and the mean square error is used as the objective function. The Softmax function maps the input data passing through the LSTM layers to the range [0,1] and normalizes it, so that the probabilities of all the possible categories of the sound sum to 1, and the category with the largest probability value is the category to which the sound belongs. The specific structure of the LSTM model is shown in fig. 3.
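The LSTM configuration just described can be sketched roughly as follows. The Keras API, the hidden size of 128, and the input shape of 50 time steps × 30 MFCC dimensions (inferred from the 5 s / 0.1 s skipping scheme) are assumptions of this sketch; the batch size, learning rate, dropout, ReLU, Adam, mean-square-error objective and Softmax output follow the values listed above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lstm_classifier(time_steps=50, n_mfcc=30, n_classes=4):
    """Four-layer LSTM classifier over an MFCC feature vector sequence,
    ending in a Softmax over the theme-scene categories."""
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_mfcc)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
                  loss="mse", metrics=["accuracy"])
    return model

# Training with the hyperparameters from the example (labels one-hot encoded):
# model = build_lstm_classifier()
# model.fit(x_train, y_train, batch_size=100, epochs=200, validation_data=(x_test, y_test))
```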
The voice recognition model provided by the application extracts the original audio signal features and combines them with the LDA topic model and the LSTM neural network model, classifies the theme scenes of the sound, and then classifies and recognizes the sound, making better use of the representative feature information in the audio signal. Meanwhile, the method uses a four-layer LSTM neural network model structure, which strengthens the analysis of the temporal relationships within the sound, enables the model to recognize more feature information of the sound, and improves the accuracy of the model's sound recognition.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (3)

1. A method for identifying a voice subject, comprising:
acquiring a plurality of audio samples, dividing the plurality of audio samples into a training set and a testing set, wherein the plurality of audio samples respectively belong to a plurality of sound categories;
extracting a first MFCC feature vector of a first audio sample in the training set;
calculating a first Euclidean distance between the first MFCC feature vectors;
clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class;
calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample;
acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into an LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject class of the second audio sample;
wherein the extracting of the first MFCC feature vector of the first audio sample in the training set specifically comprises:
extracting the first MFCC feature vector of each segment of the first audio sample in the training set; for the φ-th segment of the first audio sample, after passing through N filters, the output of the ω-th filter at time t is E_S, and the MFCC feature vector at time t is calculated as follows:
the clustering of the first MFCC feature vectors according to the value of the Euclidean distance and the dividing of the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors and each word vector corresponds to one sound class, specifically comprises:
clustering the first MFCC feature vectors according to the value of the Euclidean distance, wherein the first audio samples corresponding to the first MFCC feature vectors belong to a plurality of sound categories and each sound category corresponds to one word vector, so that a class contains a plurality of word vectors;
and calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum, specifically:
calculating the second Euclidean distance ρ from the first MFCC feature vector (x_1, y_1) to the word vector (x_2, y_2):
ρ = sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2);
the word vector with the minimum Euclidean distance with the first MFCC feature vector is the word vector corresponding to the first MFCC feature vector;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample, specifically:
inputting the word vector corresponding to the first MFCC feature vector into an LDA model;
calculating the probability P of occurrence of the word vector in each first audio sample:
wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vector θ in each first audio sample, wherein the topic vector with the largest probability value gives the topic category of the first audio sample.
2. The method of claim 1, further comprising, prior to extracting a first MFCC feature vector for a first audio sample in a training set:
the first audio sample is segmented.
3. The method for recognizing a sound theme according to claim 2, wherein the obtaining a second MFCC feature vector sequence of a second audio sample in the test set inputs the second MFCC feature vector sequence into an LSTM model, and classifies the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the theme class of the second audio sample, specifically:
obtaining a second MFCC feature vector sequence for a second audio sample in the test set;
inputting the second MFCC feature vector sequence into the LSTM model, where the architecture of the LSTM model is given by:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C);
C_t = f_t * C_{t-1} + i_t * C̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
where W_f, W_i, W_C and W_o are weight parameters and b_f, b_i, b_C and b_o are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden state; f_t, i_t and o_t are the forget gate, input gate and output gate, respectively; σ denotes the sigmoid activation function, whose output is a number between 0 and 1; and C_{t-1} and C_t denote the cell states at time t-1 and time t in the LSTM structure, respectively;
normalizing the output result by using a Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
CN202011347449.7A 2020-11-26 2020-11-26 Voice theme recognition method Active CN112466299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011347449.7A CN112466299B (en) 2020-11-26 2020-11-26 Voice theme recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011347449.7A CN112466299B (en) 2020-11-26 2020-11-26 Voice theme recognition method

Publications (2)

Publication Number Publication Date
CN112466299A CN112466299A (en) 2021-03-09
CN112466299B true CN112466299B (en) 2023-11-17

Family

ID=74808503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011347449.7A Active CN112466299B (en) 2020-11-26 2020-11-26 Voice theme recognition method

Country Status (1)

Country Link
CN (1) CN112466299B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105702251A (en) * 2016-04-20 2016-06-22 中国科学院自动化研究所 Speech emotion identifying method based on Top-k enhanced audio bag-of-word model
CN108281146A (en) * 2017-12-29 2018-07-13 青岛真时科技有限公司 A kind of phrase sound method for distinguishing speek person and device
CN109448703A (en) * 2018-11-14 2019-03-08 山东师范大学 In conjunction with the audio scene recognition method and system of deep neural network and topic model
CN110120218A (en) * 2019-04-29 2019-08-13 东北大学 Expressway oversize vehicle recognition methods based on GMM-HMM
CN110570871A (en) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 TristouNet-based voiceprint recognition method, device and equipment

Also Published As

Publication number Publication date
CN112466299A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
Dennis et al. Overlapping sound event recognition using local spectrogram features and the generalised hough transform
Ullo et al. Hybrid computerized method for environmental sound classification
Cakir et al. Multi-label vs. combined single-label sound event detection with deep neural networks
Jancovic et al. Bird species recognition using unsupervised modeling of individual vocalization elements
Grzeszick et al. Bag-of-features methods for acoustic event detection and classification
CN110120218A (en) Expressway oversize vehicle recognition methods based on GMM-HMM
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
WO2019086118A1 (en) Segmentation-based feature extraction for acoustic scene classification
CN108806694A (en) A kind of teaching Work attendance method based on voice recognition
Ting Yuan et al. Frog sound identification system for frog species recognition
CN111816185A (en) Method and device for identifying speaker in mixed voice
Colonna et al. Feature evaluation for unsupervised bioacoustic signal segmentation of anuran calls
Yan et al. Birdsong classification based on multi-feature fusion
Podwinska et al. Acoustic event detection from weakly labeled data using auditory salience
Shreyas et al. Trends of sound event recognition in audio surveillance: a recent review and study
US11776532B2 (en) Audio processing apparatus and method for audio scene classification
CN113707175B (en) Acoustic event detection system based on feature decomposition classifier and adaptive post-processing
Rao et al. Exploring the impact of optimal clusters on cluster purity
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
CN112420056A (en) Speaker identity authentication method and system based on variational self-encoder and unmanned aerial vehicle
CN112466299B (en) Voice theme recognition method
Chaves et al. Katydids acoustic classification on verification approach based on MFCC and HMM
Wanare et al. Human Emotion recognition from speech
Luo et al. Polyphonic sound event detection based on CapsNet-RNN and post processing optimization
Nasiri et al. Audiomask: Robust sound event detection using mask r-cnn and frame-level classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant