CN112466299B - Voice theme recognition method - Google Patents

Voice theme recognition method

Info

Publication number
CN112466299B
Authority
CN
China
Prior art keywords
mfcc feature
feature vector
audio sample
vector
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011347449.7A
Other languages
Chinese (zh)
Other versions
CN112466299A (en)
Inventor
冯鹏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202011347449.7A priority Critical patent/CN112466299B/en
Publication of CN112466299A publication Critical patent/CN112466299A/en
Application granted granted Critical
Publication of CN112466299B publication Critical patent/CN112466299B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a voice theme recognition method comprising the following steps: extracting MFCC feature vectors from the audio samples; calculating the Euclidean distances between the MFCC feature vectors; clustering the MFCC feature vectors according to the Euclidean distances, dividing them into a plurality of classes, and obtaining the word vector corresponding to each MFCC feature vector; inputting the word vectors corresponding to the MFCC feature vectors into an LDA model and outputting the topic category of each audio sample; and inputting the MFCC feature vector sequence into an LSTM model, in which a Softmax function classifies the sequence to obtain the topic category of the audio sample. The application can accurately obtain the distribution of each topic in a sound, improves the accuracy of topic discrimination, largely eliminates the influence of possibly useless signals, and acquires more feature information with stronger timeliness.

Description

Voice theme recognition method
Technical Field
The application relates to the technical field of voice recognition, in particular to a voice theme recognition method.
Background
Voice recognition is a widely researched topic with extensive practical applications, and it has received increasing attention in the smart-home field in particular. In today's rapidly developing, economically networked towns, old-fashioned security systems and simple manually operated speaker devices can no longer meet people's needs. People increasingly rely on sound-recognition devices to help monitor home security, keep track of the state of household pets, or manage daily life: for example, monitoring a visitor's identity from the sound outside the door, detecting whether a pet is barking loudly and disturbing the neighbours, or calling an AI smart speaker to play songs, query the weather, or set an alarm clock. However, because such sounds cover a wide frequency range and different classes can be highly similar, recognition accuracy is often low and recognition may even fail.
At present, most existing voice recognition methods use spectrograms or raw audio signal features. When segmenting the audio and extracting signal features, however, they usually build the feature sequence purely in temporal order. In real life the voice signal may not be continuous: extracting features strictly in temporal order may yield long blank segments in which the target voice content is very low and noise dominates. Spectrograms or feature sequences containing long stretches of noise cannot represent the signal features of the target voice well, the analysis of the audio signal is too shallow, no judgment is made about the subject scene in which the voice occurs, and the recognizable voice types are too limited, with low accuracy and poor diversity.
Disclosure of Invention
The embodiment of the application provides a voice theme recognition method that can accurately obtain the distribution of each topic in a voice, improve the accuracy of topic discrimination, largely eliminate the influence of possibly useless signals, and acquire more feature information with stronger timeliness.
In view of this, a first aspect of the present application provides a method for identifying a voice subject, the method comprising:
acquiring a plurality of audio samples, dividing the plurality of audio samples into a training set and a testing set, wherein the plurality of audio samples respectively belong to a plurality of sound categories;
extracting a first MFCC feature vector of a first audio sample in the training set;
calculating a first Euclidean distance between the first MFCC feature vectors;
clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class;
calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample;
and acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into an LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject class of the second audio sample.
Optionally, before the extracting of the first MFCC feature vector of the first audio sample in the training set, the method further includes:
the first audio sample is segmented.
Optionally, the extracting of a first MFCC feature vector of a first audio sample in the training set specifically includes:
extracting the first MFCC feature vector of each segment of the first audio sample in the training set; for the φ-th segment of the first audio sample, after passing through N filters, the output of the ω-th filter at time t is E_S, and the MFCC feature vector at time t is calculated as follows:
Optionally, the clustering of the first MFCC feature vectors according to the value of the Euclidean distance, and the dividing of the first MFCC feature vectors into a plurality of classes, each class comprising a plurality of word vectors and each word vector corresponding to one sound class, is specifically:
clustering the first MFCC feature vectors according to the value of the Euclidean distance, wherein the first audio samples corresponding to the first MFCC feature vectors belong to a plurality of sound categories and each sound category corresponds to one word vector, so that a class contains a plurality of word vectors.
Optionally, the calculating of a second Euclidean distance between the first MFCC feature vector and the word vector, and the obtaining, when the value of the second Euclidean distance is minimum, of the word vector corresponding to the first MFCC feature vector, specifically includes:
calculating the second Euclidean distance ρ from the first MFCC feature vector (x_1, y_1) to the word vector (x_2, y_2):
ρ = sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2);
and the word vector with the minimum Euclidean distance with the first MFCC feature vector is the word vector corresponding to the first MFCC feature vector.
Optionally, the inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample specifically includes:
inputting the word vector corresponding to the first MFCC feature vector into an LDA model;
calculating the probability P of occurrence of the word vector in each first audio sample:
wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vector θ in each first audio sample, wherein the topic vector with the largest probability value gives the topic category of the first audio sample.
Optionally, the obtaining of a second MFCC feature vector sequence of the second audio sample in the test set, the inputting of the second MFCC feature vector sequence into an LSTM model, and the classifying of the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject class of the second audio sample specifically includes:
obtaining a second MFCC feature vector sequence for a second audio sample in the test set;
inputting the second MFCC feature vector sequence into the LSTM model, where the architecture of the LSTM model is given by:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C);
C_t = f_t * C_{t-1} + i_t * C̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
where W_f, W_i, W_C and W_o are weight parameters and b_f, b_i, b_C and b_o are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden state; f_t, i_t and o_t are the forget gate, input gate and output gate, respectively; σ denotes the sigmoid activation function, whose output is a number between 0 and 1; and C_{t-1} and C_t denote the cell states at time t-1 and time t in the LSTM structure, respectively.
Normalizing the output result by using a Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
From the above technical scheme, the application has the following advantages:
the application provides a voice theme recognition method, which comprises the steps of obtaining a plurality of audio samples, dividing the audio samples into a training set and a testing set, wherein the audio samples respectively belong to a plurality of voice categories; extracting a first MFCC feature vector of a first audio sample in the training set; calculating a first Euclidean distance between the first MFCC feature vectors; clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class; calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum; inputting word vectors corresponding to the first MFCC feature vectors into the LDA model, and outputting the topic category of each first audio sample; and acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into the LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject category of the second audio sample.
The audio sample is clustered by calculating Euclidean distance between MFCC feature vectors, and the corresponding relation between the MFCC feature vectors and word vectors is obtained by calculating Euclidean distance between the MFCC feature vectors and word vectors in the class; and analyzing and classifying the word vectors through the LDA model, so as to determine the topic categories of all the MFCC feature vectors in the audio sample, wherein the topic category with the largest proportion in the audio sample is the topic category of the audio sample. According to the application, the topic to which the MFCC feature vector of the audio sample belongs is analyzed and classified through the LDA model, and then the voice is identified by using the multilayer LSTM neural network, so that the fused model can mine more topic information contained in the audio features, thereby increasing the utilization rate of audio signals in the voice and improving the accuracy of voice identification.
Drawings
FIG. 1 is a flow chart of a method for identifying a voice subject according to one embodiment of the present application;
FIG. 2 is a flow chart of extracting MFCC features in an embodiment of the present application;
fig. 3 is a schematic diagram of an LSTM model structure in an embodiment of the present application.
Detailed Description
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Fig. 1 shows the method flow of an embodiment of the voice theme recognition method according to the present application. As shown in fig. 1, the method includes:
101. Acquiring a plurality of audio samples and dividing the plurality of audio samples into a training set and a test set, wherein the plurality of audio samples respectively belong to a plurality of sound categories.
It should be noted that a large number of audio samples may be obtained in the present application, and the audio samples may include audio samples selected from a plurality of sound types, for example, the audio samples may be alarm sounds, speaking sounds, music sounds, animal sounds, etc.; the acquired audio samples can be divided into a training set and a testing set, wherein the training set can be used for training the LDA model and the LSTM model, and the LSTM model is tested by adopting the testing set; and comparing the theme category output by the LDA model with a test result tested by adopting the LSTM model, thereby identifying the theme category to which the audio sample belongs.
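As a concrete illustration of this step, the sketch below splits a labelled collection of audio files into the two sets. The 50/50 split mirrors the worked example near the end of this description; the directory layout and the use of scikit-learn's train_test_split are assumptions of this sketch, not details given by the patent.

```python
import glob
from sklearn.model_selection import train_test_split

# Hypothetical layout: audio/<sound_category>/<clip>.wav for each sound category
# (e.g. alarm sounds, speaking sounds, music, animal sounds).
paths = sorted(glob.glob("audio/*/*.wav"))
labels = [p.split("/")[-2] for p in paths]   # folder name = sound category

# Half of the samples for training (LDA + LSTM training), half for testing.
train_paths, test_paths, train_labels, test_labels = train_test_split(
    paths, labels, test_size=0.5, stratify=labels, random_state=0)
```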
102. A first MFCC feature vector for a first audio sample in the training set is extracted.
It should be noted that, after pre-emphasis, framing, windowing and discrete Fourier transform are performed on the first audio sample in the training set, the processed first audio sample is filtered with a Mel filter bank and the audio signal features of the first audio sample are calculated; the detailed flow of extracting the MFCC features is shown in fig. 2.
In a specific embodiment, before extracting the first MFCC feature vector of the first audio sample in the training set, further comprises:
1020. the first audio sample is segmented.
It should be noted that, the present application may segment each first audio sample, for example, divide each first audio sample into 10 segments, and extract the first MFCC feature vector of each segment of first audio sample, where the 10 segments of first MFCC feature vectors form a first MFCC feature vector sequence of the first audio sample.
Specifically, for the φ-th segment of the first audio sample, after passing through N filters, the output of the ω-th filter at time t is E_S, and the MFCC feature vector at time t is calculated as follows:
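The formula itself appears as an image in the original publication and is not reproduced here. As a rough illustration only, the following sketch extracts one 30-dimensional MFCC feature vector per 0.5 s segment using the librosa library; the segment length and dimensionality are taken from the worked example later in this description, while librosa itself and the per-segment averaging are assumptions of this sketch rather than the patented formula.

```python
import numpy as np
import librosa

def segment_mfcc_features(path, seg_dur=0.5, n_mfcc=30, sr=16000):
    """Split an audio file into fixed-length segments and return one averaged
    MFCC feature vector per segment (illustrative sketch, not the patented formula)."""
    y, sr = librosa.load(path, sr=sr)
    seg_len = int(seg_dur * sr)
    vectors = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        # n_mfcc coefficients per frame, averaged over frames within the segment
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc)
        vectors.append(mfcc.mean(axis=1))
    return np.stack(vectors)   # shape: (num_segments, n_mfcc)
```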
103. a first euclidean distance between the first MFCC feature vectors is calculated.
It should be noted that the present application may calculate a first Euclidean distance between the first MFCC feature vectors of the first audio sample, and cluster the first MFCC feature vectors according to the magnitude of the first Euclidean distance.
104. Clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class.
It should be noted that, after clustering, one class may contain first MFCC feature vectors corresponding to first audio samples that belong to several different sound classes, so a single class contains a plurality of word vectors (each sound class corresponding to one word vector).
The first MFCC feature vectors are divided into a plurality of classes according to the distances between them using the K-means clustering method, and each first MFCC feature vector is matched to a word vector according to the obtained center point of each class. The application divides the first MFCC feature vector set into K classes whose sizes are N_1, N_2, …, N_K; the class center points a_1, a_2, …, a_K are not fixed, and in order to make the objective function as small as possible, the class center points need to be updated continuously. The objective function may be set as the squared-error function S, calculated as:
S = Σ_{k=1}^{K} Σ_{x_j ∈ N_k} ||x_j - a_k||^2
To make the squared error S as small as possible, an update rule for the class center points is required; taking the derivative of S with respect to a gives the update rule for the class center point a_k:
a_k = (1/n) Σ_{x_j ∈ N_k} x_j
where n represents the number of first MFCC feature vectors in a class, K represents the number of classes, x_j represents a first MFCC feature vector, and a represents a class center point.
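For illustration, here is a minimal K-means sketch over the pooled MFCC feature vectors. It relies on scikit-learn's KMeans (an assumption of this sketch) with K = 180 as in the worked example near the end of the description; the resulting cluster centers play the role of the word vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_word_vectors(mfcc_vectors, k=180, seed=0):
    """Cluster the pooled MFCC feature vectors (num_vectors, 30); the K cluster
    centers serve as the word vectors of the acoustic vocabulary."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(mfcc_vectors)
    return km.cluster_centers_   # shape: (k, 30)
```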
105. And calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum.
In the K-means method, cluster analysis is performed on each class center point, and the first MFCC feature vector is converted into a word vector capable of expressing sound features by calculating the Euclidean distance between the first MFCC feature vector and the word vector. Specifically, the word vector closest to the first MFCC feature vector is used as the word vector corresponding to the first MFCC feature vector.
The application may select a bag-of-words model as the word representation method and compute the similarity between two words, where the similarity measure is the Euclidean distance between two vectors, namely the MFCC feature vector of the audio and the word vector obtained by K-means clustering. Specifically, the second Euclidean distance ρ from a first MFCC feature vector (x_1, y_1) to a word vector (x_2, y_2) is:
ρ = sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2)
the word bag model analyzes the obtained second Euclidean distance and determines the corresponding condition of the first MFCC feature vector of the first audio sample and the word vector, so that the first MFCC feature vector corresponding to each word vector can be obtained.
106. And inputting the word vector corresponding to the first MFCC feature vector into the LDA model, and outputting the theme class of each first audio sample.
It should be noted that, word vectors corresponding to the first MFCC feature vectors are input into the LDA model;
calculating the probability P of occurrence of the word vector in each first audio sample:
wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vectors θ in each first audio sample, wherein the topic vector with the maximum probability value gives the topic category of the first audio sample.
Specifically, by inputting the word vectors corresponding to the first MFCC feature vectors into the LDA model, the distribution p(θ) of the topic vectors over a section of the first audio sample can be obtained. The LDA model introduces a Dirichlet distribution to model p(θ), that is, a distribution over probability distributions, so that the hyperparameter α is converted into a probability distribution over topic vectors; the topic with the largest probability represents the topic scene of the current first audio sample.
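As an illustration of this step, the sketch below fits an LDA model on the per-sample bag-of-words counts produced above and reads off the dominant topic for each audio sample. scikit-learn's LatentDirichletAllocation and the topic count of 8 (taken from the worked example below) are assumptions of this sketch.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def dominant_topics(bow_counts, n_topics=8, seed=0):
    """Fit LDA on per-sample acoustic bag-of-words counts and return, for each
    audio sample, the topic with the highest probability in p(theta)."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(bow_counts)   # each row approximates p(theta | sample)
    return doc_topic.argmax(axis=1), doc_topic
```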
107. And acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into the LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject category of the second audio sample.
It should be noted that the present application may sample the second audio samples in the test set in a skipping manner, for example obtaining one second audio sample section every preset interval, and then extract the second MFCC feature vector of each second audio sample section, so as to obtain the second MFCC feature vector sequence of the second audio sample; the obtained second MFCC feature vector sequence is used as the input data of the LSTM model. In the application, a four-layer LSTM model may be adopted to analyze the second MFCC feature vector sequence; a specific four-layer LSTM model structure is shown in fig. 3. An input gate, a forget gate and an output gate may be arranged in each neuron of the LSTM model, and the architecture is given by:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C);
C_t = f_t * C_{t-1} + i_t * C̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
where W_f, W_i, W_C and W_o are weight parameters and b_f, b_i, b_C and b_o are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden state; f_t, i_t and o_t are the forget gate, input gate and output gate, respectively; σ denotes the sigmoid activation function, whose output is a number between 0 and 1; and C_{t-1} and C_t denote the cell states at time t-1 and time t in the LSTM structure, respectively.
Normalizing the output by using the Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
Specifically, the output vector can be mapped to the range [0,1] by the multi-class Softmax function for normalization, so that the probabilities of the sound belonging to the various possible topic categories all lie between 0 and 1 and sum to 1; the topic category of the second audio sample is the topic category with the largest probability value, which gives the final sound recognition result.
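To make the gate equations above concrete, here is a minimal NumPy sketch of a single LSTM cell step under the stated notation; the weight shapes and the concatenation of h_{t-1} with x_t are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step. W and b are dicts with keys 'f', 'i', 'c', 'o';
    each W[k] has shape (hidden, hidden + input_dim), each b[k] shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state C_t
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state h_t
    return h_t, c_t
```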
In a specific embodiment, the present application further comprises:
training the LSTM model by adopting a training set to obtain a trained LSTM model;
testing the trained LSTM model by adopting a test set to obtain a test result, wherein the test result comprises a theme category corresponding to a second audio sample in the test set;
and comparing the output theme category of the LDA model with the test result to identify the theme category of the audio sample.
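A small sketch of this comparison step is given below; it simply measures how often the LSTM's predicted topic category agrees with the topic category assigned by the LDA model for the same samples (the variable names are hypothetical).

```python
import numpy as np

def topic_agreement(lda_topics, lstm_topics):
    """Fraction of samples for which the LDA topic category and the LSTM
    prediction coincide."""
    lda_topics = np.asarray(lda_topics)
    lstm_topics = np.asarray(lstm_topics)
    return float((lda_topics == lstm_topics).mean())
```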
The application clusters the sample set by calculating Euclidean distance between the MFCC feature vectors, and obtains the corresponding relation between the MFCC feature vectors and the word vectors by calculating Euclidean distance between the MFCC feature vectors and the word vectors in the class; and analyzing and classifying the word vectors through the LDA model, so as to determine the topic categories of all the MFCC feature vectors in the audio sample, wherein the topic category with the largest proportion in the audio sample is the topic category of the audio sample. According to the application, the topic to which the MFCC feature vector of the audio sample belongs is analyzed and classified through the LDA model, and then the voice is identified by using the multilayer LSTM neural network, so that the fused model can mine more topic information contained in the audio features, thereby increasing the utilization rate of audio signals in the voice and improving the accuracy of voice identification.
The application also comprises a specific application example, which is as follows:
The application may use a data set containing 16000 short audio samples whose sound types are divided into 8 classes, namely car horns, dog barks, gunshots, police sirens, knocking, speech, footsteps and music, with 2000 audio samples taken for each sound type; half of the audio sample data set is used as the training set and the other half as the test set, the duration of each audio sample is 5 s, and audio samples shorter than 5 s are padded with silence. To extract the signal features in the audio samples, the application segments the training-set audio samples in the feature extraction stage, for example into 0.5 s segments, and extracts a 30-dimensional MFCC feature vector from each segment to form a 30-dimensional MFCC feature vector sequence; the MFCC feature vectors are clustered by K-means clustering, and the center of each class is obtained from the clustered classes. The K value in the K-means clustering method is set to 180, and the number of topics of the LDA model may be set to 8; the audio samples corresponding to the input MFCC feature vectors are divided into 4 types of theme scenes, namely human voice, animal sound, daily-life sound and security alarm sound. In the test set, the MFCC feature vectors in the audio data need to be extracted in a skipping manner: the 5 s audio segment is divided into 0.1 s pieces and the MFCC features of all pieces are extracted; after the first segment of MFCC feature vectors is selected, one segment of MFCC feature vectors is selected every 0.1 s to form a new MFCC feature vector sequence, which is used as the input data of the subsequent four-layer LSTM model. Finally, in the LSTM neural network model, the number of LSTM layers is set to 4, the batch_size to 100, the learning rate to 0.002, the number of epochs to 200 with 2000 iterations per epoch, and the Dropout value to 0.5; the ReLU function is selected as the activation function, Adam is used as the optimizer, and the mean square error is used as the objective function. The Softmax function maps the input data passing through the LSTM layers to the range [0,1] and normalizes it, so that the probabilities of all the possible categories of the sound sum to 1, and the category with the largest probability value is the category to which the sound belongs. The specific structure of the LSTM model is shown in fig. 3.
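The LSTM configuration just described can be sketched roughly as follows. The Keras API, the hidden size of 128, and the input shape of 50 time steps × 30 MFCC dimensions (inferred from the 5 s / 0.1 s skipping scheme) are assumptions of this sketch; the batch size, learning rate, dropout, ReLU, Adam, mean-square-error objective and Softmax output follow the values listed above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lstm_classifier(time_steps=50, n_mfcc=30, n_classes=4):
    """Four-layer LSTM classifier over an MFCC feature vector sequence,
    ending in a Softmax over the theme-scene categories."""
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_mfcc)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
                  loss="mse", metrics=["accuracy"])
    return model

# Training with the hyperparameters from the example (labels one-hot encoded):
# model = build_lstm_classifier()
# model.fit(x_train, y_train, batch_size=100, epochs=200, validation_data=(x_test, y_test))
```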
The voice recognition model provided by the application extracts the original audio signal features and combines them with the LDA topic model and the LSTM neural network model, classifies the theme scenes of the sound, and then classifies and recognizes the sound, making better use of the representative feature information in the audio signal. Meanwhile, the method uses a four-layer LSTM neural network model structure, which strengthens the analysis of the temporal relationships within the sound, enables the model to recognize more feature information of the sound, and improves the accuracy of the model's sound recognition.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (3)

1. A method for identifying a voice subject, comprising:
acquiring a plurality of audio samples, dividing the plurality of audio samples into a training set and a testing set, wherein the plurality of audio samples respectively belong to a plurality of sound categories;
extracting a first MFCC feature vector of a first audio sample in the training set;
calculating a first Euclidean distance between the first MFCC feature vectors;
clustering the first MFCC feature vectors according to the first Euclidean distance, and dividing the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors, and each word vector corresponds to one sound class;
calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample;
acquiring a second MFCC feature vector sequence of a second audio sample in the test set, inputting the second MFCC feature vector sequence into an LSTM model, and classifying the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the subject class of the second audio sample;
wherein the extracting of the first MFCC feature vector of the first audio sample in the training set specifically comprises:
extracting the first MFCC feature vector of each segment of the first audio sample in the training set; for the φ-th segment of the first audio sample, after passing through N filters, the output of the ω-th filter at time t is E_S, and the MFCC feature vector at time t is calculated as follows:
the clustering of the first MFCC feature vectors according to the value of the Euclidean distance and the dividing of the first MFCC feature vectors into a plurality of classes, wherein each class comprises a plurality of word vectors and each word vector corresponds to one sound class, specifically comprises:
clustering the first MFCC feature vectors according to the value of the Euclidean distance, wherein the first audio samples corresponding to the first MFCC feature vectors belong to a plurality of sound categories and each sound category corresponds to one word vector, so that a class contains a plurality of word vectors;
and calculating a second Euclidean distance from the first MFCC feature vector to the word vector, and obtaining the word vector corresponding to the first MFCC feature vector when the value of the second Euclidean distance is minimum, specifically:
calculating the second Euclidean distance ρ from the first MFCC feature vector (x_1, y_1) to the word vector (x_2, y_2):
ρ = sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2);
the word vector with the minimum Euclidean distance with the first MFCC feature vector is the word vector corresponding to the first MFCC feature vector;
inputting the word vector corresponding to the first MFCC feature vector into an LDA model, and outputting a theme class of each first audio sample, specifically:
inputting the word vector corresponding to the first MFCC feature vector into an LDA model;
calculating the probability P of occurrence of the word vector in each first audio sample:
wherein w is a word vector, θ is a topic vector, and sig is an audio segment of the sound;
and obtaining the distribution p(θ) of the topic vector θ in each first audio sample, wherein the topic vector with the largest probability value gives the topic category of the first audio sample.
2. The method of claim 1, further comprising, prior to extracting a first MFCC feature vector for a first audio sample in a training set:
the first audio sample is segmented.
3. The method for recognizing a sound theme according to claim 2, wherein the obtaining a second MFCC feature vector sequence of a second audio sample in the test set inputs the second MFCC feature vector sequence into an LSTM model, and classifies the second MFCC feature vector sequence by using a Softmax function in the LSTM model to obtain the theme class of the second audio sample, specifically:
obtaining a second MFCC feature vector sequence for a second audio sample in the test set;
inputting the second MFCC feature vector sequence into the LSTM model, where the architecture of the LSTM model is given by:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C);
C_t = f_t * C_{t-1} + i_t * C̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t * tanh(C_t);
where W_f, W_i, W_C and W_o are weight parameters and b_f, b_i, b_C and b_o are bias values; the second MFCC feature vector sequence is the input x_t; h_{t-1} is the hidden state; f_t, i_t and o_t are the forget gate, input gate and output gate, respectively; σ denotes the sigmoid activation function, whose output is a number between 0 and 1; and C_{t-1} and C_t denote the cell states at time t-1 and time t in the LSTM structure, respectively;
normalizing the output result by using a Softmax function, and taking the theme class corresponding to the maximum value of the output result as the theme class of the second audio sample.
CN202011347449.7A 2020-11-26 2020-11-26 Voice theme recognition method Active CN112466299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011347449.7A CN112466299B (en) 2020-11-26 2020-11-26 Voice theme recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011347449.7A CN112466299B (en) 2020-11-26 2020-11-26 Voice theme recognition method

Publications (2)

Publication Number Publication Date
CN112466299A CN112466299A (en) 2021-03-09
CN112466299B true CN112466299B (en) 2023-11-17

Family

ID=74808503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011347449.7A Active CN112466299B (en) 2020-11-26 2020-11-26 Voice theme recognition method

Country Status (1)

Country Link
CN (1) CN112466299B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105702251A (en) * 2016-04-20 2016-06-22 中国科学院自动化研究所 Speech emotion identifying method based on Top-k enhanced audio bag-of-word model
CN108281146A (en) * 2017-12-29 2018-07-13 青岛真时科技有限公司 A kind of phrase sound method for distinguishing speek person and device
CN109448703A (en) * 2018-11-14 2019-03-08 山东师范大学 In conjunction with the audio scene recognition method and system of deep neural network and topic model
CN110120218A (en) * 2019-04-29 2019-08-13 东北大学 Expressway oversize vehicle recognition methods based on GMM-HMM
CN110570871A (en) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 TristouNet-based voiceprint recognition method, device and equipment

Also Published As

Publication number Publication date
CN112466299A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
Dennis et al. Overlapping sound event recognition using local spectrogram features and the generalised hough transform
Ullo et al. Hybrid computerized method for environmental sound classification
Cakir et al. Multi-label vs. combined single-label sound event detection with deep neural networks
Jancovic et al. Bird species recognition using unsupervised modeling of individual vocalization elements
Grzeszick et al. Bag-of-features methods for acoustic event detection and classification
CN110120218A (en) Expressway oversize vehicle recognition methods based on GMM-HMM
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
WO2019086118A1 (en) Segmentation-based feature extraction for acoustic scene classification
CN108806694A (en) A kind of teaching Work attendance method based on voice recognition
Ting Yuan et al. Frog sound identification system for frog species recognition
CN111816185A (en) Method and device for identifying speaker in mixed voice
Colonna et al. Feature evaluation for unsupervised bioacoustic signal segmentation of anuran calls
Yan et al. Birdsong classification based on multi-feature fusion
Podwinska et al. Acoustic event detection from weakly labeled data using auditory salience
Shreyas et al. Trends of sound event recognition in audio surveillance: a recent review and study
US11776532B2 (en) Audio processing apparatus and method for audio scene classification
CN113707175B (en) Acoustic event detection system based on feature decomposition classifier and adaptive post-processing
Rao et al. Exploring the impact of optimal clusters on cluster purity
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
CN112420056A (en) Speaker identity authentication method and system based on variational self-encoder and unmanned aerial vehicle
CN112466299B (en) Voice theme recognition method
Chaves et al. Katydids acoustic classification on verification approach based on MFCC and HMM
Wanare et al. Human Emotion recognition from speech
Luo et al. Polyphonic sound event detection based on CapsNet-RNN and post processing optimization
Nasiri et al. Audiomask: Robust sound event detection using mask r-cnn and frame-level classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant