CN112466299B - Voice theme recognition method - Google Patents
Voice theme recognition method
- Publication number: CN112466299B (application CN202011347449.7A)
- Authority
- CN
- China
- Legal status: Active
Classifications
- G10L15/16 — Speech classification or search using artificial neural networks
- G06F18/23213 — Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G10L15/04 — Segmentation; Word boundary detection
- G10L15/063 — Training of speech recognition systems
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/24 — Speech or voice analysis techniques in which the extracted parameters are the cepstrum
- G10L2015/0631 — Creating reference templates; Clustering
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a voice theme identification method, comprising the following steps: extracting MFCC feature vectors of audio samples; calculating the Euclidean distances between the MFCC feature vectors; clustering the MFCC feature vectors according to the Euclidean distances, dividing them into a plurality of classes, and obtaining the word vector corresponding to each MFCC feature vector; inputting the word vectors corresponding to the MFCC feature vectors into an LDA model and outputting the topic category of each audio sample; and inputting the MFCC feature vector sequence into an LSTM model and classifying it with a Softmax function in the LSTM model to obtain the topic category of the audio sample. The application can accurately obtain the distribution of each topic in a sound, improves the accuracy of topic discrimination, largely eliminates the influence of possible useless signals, and acquires more feature information with stronger temporal relevance.
Description
Technical Field
The application relates to the technical field of voice recognition, in particular to a voice theme recognition method.
Background
Voice recognition is a widely researched topic with broad practical applications, and it has received growing attention in the smart-home field in particular. In rapidly developing, network-connected towns, old security systems and simple manual speaker equipment can no longer meet people's needs. People increasingly want sound recognition devices to assist in daily life: watching over the security of the home, keeping track of the state of household pets at any time, or helping manage everyday tasks. For example, a visitor's identity may be monitored through the sound outside a door, a device may detect whether a household pet is barking loudly enough to disturb the neighbours, or an AI smart speaker may be called on to play songs, query the weather, or set an alarm clock. However, such sounds span a wide frequency range and individual classes are highly similar, which easily leads to low recognition accuracy or even recognition failure.
At present, most existing voice recognition methods use spectrograms or raw signal features of the audio, but when segmenting the audio and extracting signal features, most of them obtain the feature sequence purely in temporal order. In real life, voice signals are often not continuous: extracting features strictly in temporal order may yield long blank stretches in which the target voice content is very low and noise dominates. Spectrograms or feature sequences containing long stretches of noise can hardly represent the signal features of the target voice well. As a result, the analysis of the audio signal is too shallow, no judgment is made about the subject scene in which the voice occurs, and the recognizable voice types are too limited, with low accuracy and poor diversity.
Disclosure of Invention
The embodiment of the application provides a voice theme identification method, which can accurately obtain the distribution of each theme in a voice, improve the accuracy of theme discrimination, largely eliminate the influence of possible useless signals, and acquire more feature information with stronger temporal relevance.
In view of this, a first aspect of the present application provides a method for identifying a voice subject, the method comprising:
acquiring a plurality of audio samples, dividing the plurality of audio samples into a training set and a testing set, wherein the plurality of audio samples respectively belong to a plurality of sound categories;
extracting a first MFCC feature vector of a first audio sample in the training set;
calculating a first Euclidean distance between the first MFCC feature vectors;
clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class;
calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample;
and acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into an LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject class of the second audio sample.
Optionally, before extracting the first MFCC feature vector of the first audio sample in the training set, the method further includes:
the first audio sample is segmented.
Optionally, the extracting a first MFCC feature vector of a first audio sample in the training set specifically includes:
extracting the first MFCC feature vector of each segment of the first audio sample in the training set; for the φ-th segment of the first audio sample, after passing through N filters, the output of the ω-th filter at time t is E_s(t, ω), and the MFCC feature vector at time t is computed as the discrete cosine transform of the log filter-bank outputs:

MFCC(t, i) = Σ_{ω=1}^{N} log E_s(t, ω) · cos(i · π · (ω − 0.5) / N), i = 1, 2, …
Optionally, the clustering the first MFCC feature vectors according to the value of the first Euclidean distance and dividing them into a plurality of classes, wherein each class comprises a plurality of word vectors and each word vector corresponds to one sound class, specifically includes:
clustering the first MFCC feature vectors according to the value of the first Euclidean distance, wherein the first audio samples corresponding to the first MFCC feature vectors in a class may belong to a plurality of sound categories; since each sound category corresponds to one word vector, a plurality of word vectors are contained in the class.
Optionally, the calculating a second euclidean distance between the first MFCC feature vector and the word vector, and when the value of the second euclidean distance is minimum, obtaining the word vector corresponding to the first MFCC feature vector specifically includes:
calculating the second Euclidean distance ρ from the first MFCC feature vector (x_1, y_1) to the word vector (x_2, y_2):

ρ = √((x_1 − x_2)² + (y_1 − y_2)²)
and the word vector with the minimum Euclidean distance with the first MFCC feature vector is the word vector corresponding to the first MFCC feature vector.
Optionally, the inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample specifically includes:
inputting the word vector corresponding to the first MFCC feature vector into an LDA model;
calculating the probability P of the word vector occurring in each first audio sample:

P(w | sig) = Σ_θ p(w | θ) · p(θ | sig)

wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vector θ in each first audio sample, wherein the topic vector with the largest probability value gives the topic category of the first audio sample.
Optionally, the obtaining a second MFCC feature vector sequence of the second audio sample in the test set, inputting the second MFCC feature vector sequence into an LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject class of the second audio sample, specifically includes:
obtaining a second MFCC feature vector sequence for a second audio sample in the test set;
inputting the second MFCC feature vector sequence into the LSTM model, where the architecture of the LSTM model is composed of the following equations:
f_t = σ(W_f[h_{t-1}, x_t] + b_f);
i_t = σ(W_i[h_{t-1}, x_t] + b_i);
o_t = σ(W_o[h_{t-1}, x_t] + b_o);
C̃_t = tanh(W_C[h_{t-1}, x_t] + b_C);
C_t = f_t * C_{t-1} + i_t * C̃_t;
h_t = o_t * tanh(C_t);
where W_f, W_i and W_o are weight matrices and b_f, b_i and b_o are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden layer; f_t, i_t and o_t are the forget gate, input gate and output gate, respectively; σ represents the sigmoid activation function, whose output is a number between 0 and 1; C_{t-1} and C_t are the cell states at time t-1 and time t in the LSTM structure, respectively.
Normalizing the output result by using a Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
From the above technical scheme, the application has the following advantages:
the application provides a voice theme recognition method, which comprises the steps of obtaining a plurality of audio samples, dividing the audio samples into a training set and a testing set, wherein the audio samples respectively belong to a plurality of voice categories; extracting a first MFCC feature vector of a first audio sample in the training set; calculating a first Euclidean distance between the first MFCC feature vectors; clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class; calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum; inputting word vectors corresponding to the first MFCC feature vectors into the LDA model, and outputting the topic category of each first audio sample; and acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into the LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject category of the second audio sample.
The audio sample is clustered by calculating Euclidean distance between MFCC feature vectors, and the corresponding relation between the MFCC feature vectors and word vectors is obtained by calculating Euclidean distance between the MFCC feature vectors and word vectors in the class; and analyzing and classifying the word vectors through the LDA model, so as to determine the topic categories of all the MFCC feature vectors in the audio sample, wherein the topic category with the largest proportion in the audio sample is the topic category of the audio sample. According to the application, the topic to which the MFCC feature vector of the audio sample belongs is analyzed and classified through the LDA model, and then the voice is identified by using the multilayer LSTM neural network, so that the fused model can mine more topic information contained in the audio features, thereby increasing the utilization rate of audio signals in the voice and improving the accuracy of voice identification.
Drawings
FIG. 1 is a flow chart of a method for identifying a voice subject according to one embodiment of the present application;
FIG. 2 is a flow chart of extracting MFCC features in an embodiment of the present application;
fig. 3 is a schematic diagram of an LSTM model structure in an embodiment of the present application.
Detailed Description
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Fig. 1 shows the flow of an embodiment of the voice theme recognition method according to the present application; as shown in fig. 1, the method includes:
101. and acquiring a plurality of audio samples, dividing the plurality of audio samples into a training set and a testing set, wherein the plurality of audio samples respectively belong to a plurality of sound categories.
It should be noted that a large number of audio samples may be obtained, selected from a plurality of sound types, for example alarm sounds, speech, music and animal sounds. The acquired audio samples can be divided into a training set and a test set: the training set is used to train the LDA model and the LSTM model, and the test set is used to test the LSTM model. The topic category output by the LDA model is compared with the test result of the LSTM model, thereby identifying the topic category to which an audio sample belongs.
102. A first MFCC feature vector for a first audio sample in the training set is extracted.
It should be noted that the first audio samples in the training set are pre-emphasized, framed, windowed and transformed with a discrete Fourier transform; the resulting spectrum is then passed through a bank of Mel filters, from whose outputs the audio signal features of the first audio sample are calculated. A flowchart of the MFCC feature extraction is shown in fig. 2.
In a specific embodiment, before extracting the first MFCC feature vector of the first audio sample in the training set, further comprises:
1020. the first audio sample is segmented.
It should be noted that, the present application may segment each first audio sample, for example, divide each first audio sample into 10 segments, and extract the first MFCC feature vector of each segment of first audio sample, where the 10 segments of first MFCC feature vectors form a first MFCC feature vector sequence of the first audio sample.
Specifically, for the φ-th segment of the first audio sample, after passing through N filters, the output of the ω-th filter at time t is E_s(t, ω); the MFCC feature vector at time t is then computed as the discrete cosine transform of the log filter-bank outputs:

MFCC(t, i) = Σ_{ω=1}^{N} log E_s(t, ω) · cos(i · π · (ω − 0.5) / N), i = 1, 2, …
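The segmentation and cepstral step above can be sketched in numpy. This is a minimal illustration under stated assumptions, not the patent's implementation: the helper names are invented, the filter-bank energies are random stand-ins, and a real pipeline would first perform the pre-emphasis, framing, windowing, DFT and Mel filtering of fig. 2.

```python
import numpy as np

def segment_audio(signal, num_segments=10):
    """Split an audio sample into equal-length segments (trailing samples dropped)."""
    seg_len = len(signal) // num_segments
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

def mfcc_from_filterbank(energies, num_coeffs=13):
    """DCT of log filter-bank energies E_s(t, w) -> MFCC(t, i).

    energies: array of shape (T, N), the outputs of N Mel filters per frame.
    """
    T, N = energies.shape
    log_e = np.log(energies)
    i = np.arange(1, num_coeffs + 1)[:, None]   # cepstral coefficient index
    w = np.arange(1, N + 1)[None, :]            # filter index
    basis = np.cos(i * np.pi * (w - 0.5) / N)   # DCT basis, shape (num_coeffs, N)
    return log_e @ basis.T                      # shape (T, num_coeffs)

signal = np.random.default_rng(0).standard_normal(16000)   # 1 s at 16 kHz (stand-in)
segments = segment_audio(signal, 10)
energies = np.abs(np.random.default_rng(1).standard_normal((5, 26))) + 1.0
mfcc = mfcc_from_filterbank(energies, num_coeffs=13)
print(mfcc.shape)  # (5, 13)
```

With 10 segments per sample, the per-segment MFCC vectors form the feature vector sequence used in the following steps.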
103. a first euclidean distance between the first MFCC feature vectors is calculated.
It should be noted that the present application may calculate a first euclidean distance between the first MFCC feature vectors of the first audio sample, and cluster the first MFCC feature vectors according to the magnitude of the first euclidean distance.
104. Clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class.
It should be noted that, after clustering, since one class may include the first MFCC feature vector corresponding to the first audio sample belonging to the plurality of different sound classes, one class includes a plurality of word vectors (each sound class corresponds to one word vector).
The first MFCC feature vectors are divided into a plurality of classes by the K-means clustering method according to the distances between them, and each first MFCC feature vector is made to correspond to a word vector according to the center point of each resulting class. The application divides the first MFCC feature vector set into K classes, denoted N_1, N_2, …, N_K. The class center points a_1, a_2, …, a_K are not fixed; to make the objective function as small as possible, the class center points need to be updated continuously. The objective function may be set as the squared-error function S, calculated as:

S = Σ_{k=1}^{K} Σ_{x_j ∈ N_k} ||x_j − a_k||²

To make the squared error S as small as possible, the class center points must be updated; setting the derivative of S with respect to a_k to zero yields the update for the class center point a_k:

a_k = (1/n) Σ_{x_j ∈ N_k} x_j

where n represents the number of first MFCC feature vectors in the class, K represents the number of classes, x_j represents a first MFCC feature vector, and a_k represents the class center point.
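The K-means procedure above, with the squared-error objective S and the mean-based center update a_k, can be sketched as follows. This is a generic K-means sketch on toy 2-D data with invented names, not the patent's code:

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Plain K-means: assign by Euclidean distance, update centers a_k as class means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # distance of every vector x_j to every center a_k
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sq_error = ((X - centers[labels]) ** 2).sum()   # objective S
    return centers, labels, sq_error

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centers, labels, S = kmeans(X, K=2)
```

For the patent's setting, X would be the set of 30-dimensional first MFCC feature vectors and K would be 180.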
105. And calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum.
In the K-means method, cluster analysis is performed on each class center point, and the first MFCC feature vector is converted into a word vector capable of expressing sound features by calculating the Euclidean distance between the first MFCC feature vector and the word vector. Specifically, the word vector closest to the first MFCC feature vector is used as the word vector corresponding to the first MFCC feature vector.
The application may select the bag-of-words model as the word representation method and compute the similarity between two items as the Euclidean distance between two vectors: the MFCC feature vector of the audio and a word vector obtained by K-means clustering. Specifically, the second Euclidean distance ρ from a first MFCC feature vector (x_1, y_1) to a word vector (x_2, y_2) is:

ρ = √((x_1 − x_2)² + (y_1 − y_2)²)

The bag-of-words model analyzes the resulting second Euclidean distances and determines the correspondence between the first MFCC feature vectors of the first audio sample and the word vectors, so that the first MFCC feature vector corresponding to each word vector can be obtained.
106. And inputting the word vector corresponding to the first MFCC feature vector into the LDA model, and outputting the theme class of each first audio sample.
It should be noted that, word vectors corresponding to the first MFCC feature vectors are input into the LDA model;
calculating the probability P of the word vector occurring in each first audio sample:

P(w | sig) = Σ_θ p(w | θ) · p(θ | sig)

wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vector θ in each first audio sample, wherein the topic vector with the largest probability value gives the topic category of the first audio sample.
Specifically, by inputting the word vectors corresponding to the first MFCC feature vectors into the LDA model, the distribution p(θ) of topic vectors in a section of the first audio sample may be obtained. The LDA model introduces a Dirichlet distribution, a distribution over probability distributions, to model p(θ), so that the hyperparameter α can be converted into a probability distribution over topic vectors; the topic with the largest probability represents the topic scene of the current first audio sample.
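The topic-mixture step can be illustrated numerically. The per-topic word distributions and the segment's topic distribution below are made-up toy numbers, not values from the patent; the point is only the mixture P(w | sig) = Σ_θ p(w | θ)·p(θ | sig) and the argmax over topics:

```python
import numpy as np

# p(w | theta): per-topic word distributions, shape (num_topics, vocab_size)
p_w_given_theta = np.array([
    [0.7, 0.2, 0.1],   # topic 0 favours word 0
    [0.1, 0.2, 0.7],   # topic 1 favours word 2
])
# p(theta | sig): topic distribution of one audio segment
p_theta_given_sig = np.array([0.8, 0.2])

# P(w | sig) = sum over theta of p(w | theta) * p(theta | sig)
p_w_given_sig = p_theta_given_sig @ p_w_given_theta
topic = int(p_theta_given_sig.argmax())   # topic scene of the sample
print(topic)  # 0
```

In a real LDA fit, p(θ | sig) would itself be inferred from the sample's word vectors under the Dirichlet prior α.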
107. And acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into the LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject category of the second audio sample.
It should be noted that the present application may perform jump interception on the second audio samples in the test set, for example taking one second-audio segment at each preset interval, and then extract the second MFCC feature vector of each segment, thereby obtaining the second MFCC feature vector sequence of the second audio sample; the resulting second MFCC feature vector sequence is used as the input data of the LSTM model. In the application, a four-layer LSTM model may be adopted to analyze the second MFCC feature vector sequence; the specific four-layer LSTM model structure is shown in fig. 3. Each neuron in the LSTM model contains an input gate, a forget gate and an output gate, and the architecture is composed of the following equations:
f_t = σ(W_f[h_{t-1}, x_t] + b_f);
i_t = σ(W_i[h_{t-1}, x_t] + b_i);
o_t = σ(W_o[h_{t-1}, x_t] + b_o);
C̃_t = tanh(W_C[h_{t-1}, x_t] + b_C);
C_t = f_t * C_{t-1} + i_t * C̃_t;
h_t = o_t * tanh(C_t);
where W_f, W_i and W_o are weight matrices and b_f, b_i and b_o are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden layer; f_t, i_t and o_t are the forget gate, input gate and output gate, respectively; σ represents the sigmoid activation function, whose output is a number between 0 and 1; C_{t-1} and C_t are the cell states at time t-1 and time t in the LSTM structure, respectively.
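One LSTM step following these gate equations can be sketched in numpy. This is a single-cell illustration with invented helper names and tiny random weights, not the patent's four-layer trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: gates f_t, i_t, o_t, candidate cell state, cell and hidden update.

    W: dict of weight matrices W_f, W_i, W_o, W_c, each (hidden, hidden + input)
    b: dict of bias vectors b_f, b_i, b_o, b_c, each (hidden,)
    """
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde          # cell state update
    h_t = o_t * np.tanh(c_t)                    # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, n_in = 4, 3
W = {k: rng.standard_normal((hidden, hidden + n_in)) * 0.1 for k in "fioc"}
b = {k: np.zeros(hidden) for k in "fioc"}
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.standard_normal((5, n_in)):      # run a 5-step MFCC-like sequence
    h, c = lstm_step(x_t, h, c, W, b)
```

Because h_t = o_t * tanh(C_t) with o_t in (0, 1) and |tanh| < 1, every hidden-state component stays strictly inside (-1, 1).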
Normalizing the output by using the Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
Specifically, the output vector can be mapped into the range [0, 1] using the multi-class Softmax function for normalization, so that the probability of the sound belonging to each possible topic category lies between 0 and 1 and the probabilities sum to 1. The topic category of the second audio sample is the category with the largest probability value, which gives the final sound recognition result.
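The Softmax normalization can be sketched as follows; the four toy logits are made-up numbers standing in for the LSTM's output vector over the four theme scenes:

```python
import numpy as np

def softmax(logits):
    """Map outputs into [0, 1] with probabilities summing to 1 (numerically stable)."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

# four theme scenes: human voice, animal sound, daily-life sound, security alarm
logits = np.array([1.2, 0.3, -0.5, 2.0])
probs = softmax(logits)
predicted_topic = int(probs.argmax())
print(predicted_topic)  # 3, since the largest logit wins
```

Subtracting the maximum logit before exponentiating leaves the result unchanged but avoids overflow for large outputs.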
In a specific embodiment, the present application further comprises:
training the LSTM model by adopting a training set to obtain a trained LSTM model;
testing the trained LSTM model by adopting a test set to obtain a test result, wherein the test result comprises a theme category corresponding to a second audio sample in the test set;
and comparing the output theme category of the LDA model with the test result to identify the theme category of the audio sample.
The application clusters the sample set by calculating Euclidean distance between the MFCC feature vectors, and obtains the corresponding relation between the MFCC feature vectors and the word vectors by calculating Euclidean distance between the MFCC feature vectors and the word vectors in the class; and analyzing and classifying the word vectors through the LDA model, so as to determine the topic categories of all the MFCC feature vectors in the audio sample, wherein the topic category with the largest proportion in the audio sample is the topic category of the audio sample. According to the application, the topic to which the MFCC feature vector of the audio sample belongs is analyzed and classified through the LDA model, and then the voice is identified by using the multilayer LSTM neural network, so that the fused model can mine more topic information contained in the audio features, thereby increasing the utilization rate of audio signals in the voice and improving the accuracy of voice identification.
The application also comprises a specific application example, which is as follows:
The application can adopt a data set containing 16000 short audio samples. The sound types of the audio samples can be divided into 8 types, namely car horns, dog barking, gunshots, police sirens, knocking, speech, footsteps and music, with 2000 audio samples taken for each sound type. Half of the audio samples are used as a training set and the other half as a test set. The duration of each audio sample is 5 s, and audio samples shorter than 5 s are padded with blank sound.
To extract the signal characteristics of the audio samples, the application segments each audio sample of the training set in the feature extraction stage, for example into 0.5 s segments, and extracts a 30-dimensional MFCC feature vector from each segment to form a 30-dimensional MFCC feature vector sequence. The MFCC feature vectors are clustered by K-means clustering and the center of each cluster is obtained; the K value of the K-means clustering method is set to 180, and the number of LDA models can be set to 8. The audio sample corresponding to the input MFCC feature vectors is assigned to one of the 4 types of theme scenes, namely human voice, animal sound, daily-life sound and security alarm sound.
In the test set, jumping interception needs to be carried out on the MFCC feature vectors of the audio data: each 5 s audio sample is segmented at 0.1 s intervals and the MFCC features of all segments are extracted; after the first segment's MFCC feature vector is selected, one segment's MFCC feature vector is selected every 0.1 s to form a new MFCC feature vector sequence, which serves as the input data of the subsequent four-layer LSTM model.
Finally, in the LSTM neural network model, the number of LSTM layers is set to 4, the batch_size to 100, the learning rate to 0.002, the number of epochs to 200 with 2000 iterations per epoch, and the Dropout value to 0.5; the ReLU function is selected as the activation function, Adam as the optimizer, and the mean square error as the objective function. The Softmax function maps the output of the LSTM layers to the range [0, 1] and normalizes it so that the probabilities of all possible sound categories sum to 1; the category with the largest probability value is the category to which the sound belongs. The specific structure of the LSTM model is shown in fig. 3.
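As an illustrative sketch (not part of the patent), the test-stage "jumping interception" described above, 0.5 s analysis segments advanced in 0.1 s hops over a 5 s padded sample, reduces to simple index arithmetic. The sampling rate and the `fake_mfcc` stand-in for a real 30-dimensional MFCC extractor are assumptions for the sake of a runnable example:

```python
import numpy as np

SR = 16000           # assumed sampling rate (not specified in the source)
SEG = int(0.5 * SR)  # 0.5 s analysis segment
HOP = int(0.1 * SR)  # 0.1 s hop between selected segments
DUR = int(5.0 * SR)  # every sample is brought to 5 s

def pad_to_duration(signal, n=DUR):
    """Pad a short sample with blank sound (zeros) so every sample is 5 s."""
    out = np.zeros(n, dtype=np.float64)
    out[:min(len(signal), n)] = signal[:n]
    return out

def fake_mfcc(segment, n_coeff=30):
    """Hypothetical stand-in for a real 30-dimensional MFCC extractor."""
    spectrum = np.abs(np.fft.rfft(segment))[:n_coeff]
    return np.log1p(spectrum)

def jumping_interception(signal):
    """Cut overlapping 0.5 s segments every 0.1 s; one MFCC vector per segment."""
    signal = pad_to_duration(signal)
    starts = range(0, len(signal) - SEG + 1, HOP)
    return np.stack([fake_mfcc(signal[s:s + SEG]) for s in starts])

# a 4 s sample is padded to 5 s, then sliced into 46 overlapping segments
seq = jumping_interception(np.random.randn(4 * SR))
```

The resulting sequence of 46 segments, each a 30-dimensional vector, plays the role of the new MFCC feature vector sequence fed into the four-layer LSTM.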
The sound recognition model provided by the application extracts features from the original audio signal and combines them with the LDA topic model and the LSTM neural network model to classify the theme scene of the sound and to classify and recognize the sound, making better use of the representative feature information in the audio signal. Meanwhile, the method uses a four-layer LSTM neural network structure, which enhances the analysis of the temporal context of the sound, enables the model to recognize more feature information of the sound, and improves the model's sound recognition accuracy.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: A alone, B alone, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" or similar expressions means any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be single or plural.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (3)
1. A method for identifying a voice subject, comprising:
acquiring a plurality of audio samples, dividing the plurality of audio samples into a training set and a testing set, wherein the plurality of audio samples respectively belong to a plurality of sound categories;
extracting a first MFCC feature vector of a first audio sample in the training set;
calculating a first Euclidean distance between the first MFCC feature vectors;
clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class;
calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample;
acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into an LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the theme class of the second audio sample;
the first MFCC feature vector of the first audio sample in the training set is extracted specifically:
extracting the first MFCC feature vector of each segment of the first audio sample in the training set, wherein the output of the ω-th of N filters at time t for the φ-th segment of the first audio sample is E_S(t, ω); the calculation formula of the MFCC eigenvector at time t is:
MFCC(t, φ) = Σ_{ω=1}^{N} log(E_S(t, ω)) · cos(φπ(2ω − 1)/(2N));
clustering the first MFCC feature vectors according to the value of the Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class, specifically:
clustering the first MFCC feature vectors according to the value of the first Euclidean distance, wherein the first audio samples corresponding to the first MFCC feature vectors respectively belong to a plurality of sound categories, each sound category corresponds to one word vector, and each class comprises a plurality of word vectors;
and calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum, specifically:
calculating the second Euclidean distance ρ from the first MFCC feature vector (x_1, y_1) to the word vector (x_2, y_2):
ρ = √((x_2 − x_1)² + (y_2 − y_1)²);
the word vector with the minimum Euclidean distance with the first MFCC feature vector is the word vector corresponding to the first MFCC feature vector;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample, specifically:
inputting the word vector corresponding to the first MFCC feature vector into an LDA model;
calculating the probability P of occurrence of the word vector in each first audio sample:
P(w | sig) = Σ_θ p(w | θ) · p(θ | sig);
wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vector θ in each first audio sample, wherein the topic vector with the largest probability value is the topic category of the first audio sample.
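A minimal numerical sketch of the topic-assignment step of claim 1 — marginalising the word distribution over topics and taking the most probable topic — might look as follows. The matrices here are toy placeholder values, not trained LDA parameters:

```python
import numpy as np

# p_w_given_theta[k, v]: probability of word vector v under topic k (toy values)
p_w_given_theta = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.3, 0.6]])
# p_theta[k]: distribution p(theta) of the topic vector in one audio sample
p_theta = np.array([0.25, 0.75])

# P(w | sig) = sum over theta of p(w | theta) * p(theta | sig)
p_w = p_theta @ p_w_given_theta

# the topic vector with the largest probability value is the sample's topic category
topic = int(np.argmax(p_theta))
```

With these toy values the second topic dominates, so the sample would be assigned topic category 1.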
2. The method of claim 1, further comprising, prior to extracting a first MFCC feature vector for a first audio sample in a training set:
the first audio sample is segmented.
3. The method for recognizing a sound theme according to claim 2, wherein the obtaining a second MFCC feature vector sequence of a second audio sample in the test set inputs the second MFCC feature vector sequence into an LSTM model, and classifies the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the theme class of the second audio sample, specifically:
obtaining a second MFCC feature vector sequence for a second audio sample in the test set;
inputting the second MFCC feature vector sequence into the LSTM model, where the architecture of the LSTM model is composed of the formulas:
f_t = σ(W_f[h_{t-1}, x_t] + b_f);
i_t = σ(W_i[h_{t-1}, x_t] + b_i);
o_t = σ(W_o[h_{t-1}, x_t] + b_o);
C_t = f_t * C_{t-1} + i_t * tanh(W_C[h_{t-1}, x_t] + b_C);
h_t = o_t * tanh(C_t);
wherein W_f, W_i, W_o and W_C are weight parameters and b_f, b_i, b_o and b_C are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden state at time t−1; f_t, i_t and o_t are the forget gate, input gate and output gate respectively; σ denotes the sigmoid activation function, whose output is a number between 0 and 1; and C_{t-1} and C_t denote the cell states at times t−1 and t in the LSTM structure;
normalizing the output result by using a Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
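The gate equations and the Softmax normalization of claim 3 can be sketched as a single-layer LSTM step in numpy. The weights are random placeholders, and the candidate cell state uses the conventional tanh(W_C[h_{t-1}, x_t] + b_C) form, an assumption where the source omits the cell-state update:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_cls = 30, 16, 8   # 30-dim MFCC input, 8 sound categories

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

# one weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]
W = {g: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for g in "fioc"}
b = {g: np.zeros(n_hid) for g in "fioc"}

def lstm_step(x_t, h_prev, c_prev):
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ hx + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])       # input gate
    o_t = sigmoid(W["o"] @ hx + b["o"])       # output gate
    c_tilde = np.tanh(W["c"] @ hx + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # C_t
    h_t = o_t * np.tanh(c_t)                  # h_t = o_t * tanh(C_t)
    return h_t, c_t

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((46, n_in)):   # one MFCC vector per 0.1 s hop
    h, c = lstm_step(x_t, h, c)

W_out = rng.standard_normal((n_cls, n_hid)) * 0.1
probs = softmax(W_out @ h)                    # probabilities of all categories sum to 1
predicted = int(np.argmax(probs))             # category with the largest probability
```

The claimed four-layer model would stack four such cells, feeding each layer's h_t sequence to the next; the final Softmax step is identical.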
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011347449.7A CN112466299B (en) | 2020-11-26 | 2020-11-26 | Voice theme recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112466299A CN112466299A (en) | 2021-03-09 |
CN112466299B true CN112466299B (en) | 2023-11-17 |
Family
ID=74808503
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011347449.7A Active CN112466299B (en) | 2020-11-26 | 2020-11-26 | Voice theme recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112466299B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105702251A (en) * | 2016-04-20 | 2016-06-22 | 中国科学院自动化研究所 | Speech emotion identifying method based on Top-k enhanced audio bag-of-word model |
CN108281146A (en) * | 2017-12-29 | 2018-07-13 | 青岛真时科技有限公司 | A kind of phrase sound method for distinguishing speek person and device |
CN109448703A (en) * | 2018-11-14 | 2019-03-08 | 山东师范大学 | In conjunction with the audio scene recognition method and system of deep neural network and topic model |
CN110120218A (en) * | 2019-04-29 | 2019-08-13 | 东北大学 | Expressway oversize vehicle recognition methods based on GMM-HMM |
CN110570871A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | TristouNet-based voiceprint recognition method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||