CN109074822A - Specific sound recognition methods, equipment and storage medium - Google Patents


Info

Publication number
CN109074822A
Authority
CN
China
Prior art keywords
specific sound
sound
characteristic
signal
specific
Prior art date
Legal status
Granted
Application number
CN201780009004.8A
Other languages
Chinese (zh)
Other versions
CN109074822B (en)
Inventor
刘洪涛
王伟
孟亚彬
Current Assignee
Shenzhen H&T Intelligent Control Co Ltd
Original Assignee
Shenzhen H&T Intelligent Control Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen H&T Intelligent Control Co Ltd filed Critical Shenzhen H&T Intelligent Control Co Ltd
Publication of CN109074822A publication Critical patent/CN109074822A/en
Application granted granted Critical
Publication of CN109074822B publication Critical patent/CN109074822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)

Abstract

A specific sound recognition method, device and storage medium. The method comprises: sampling a sound signal and obtaining a mel-frequency cepstrum coefficient (MFCC) characteristic parameter matrix of the sound signal (201); extracting characteristic parameters from the MFCC characteristic parameter matrix of the sound signal (202); and inputting the characteristic parameters into a pre-obtained specific sound characteristic model based on a deep neural network for recognition, so as to determine whether the sound signal is a specific sound (203). The method and device use a recognition algorithm based on MFCC characteristic parameters and a deep neural network model; the algorithm complexity is low and the amount of calculation is small, so the hardware requirements are low and the product manufacturing cost is reduced.

Description

Specific sound recognition method, apparatus and storage medium
Technical Field
Embodiments of the present invention relate to sound processing technologies, and in particular, to a specific sound recognition method, device, and storage medium.
Background
In daily life we hear certain specific sounds that carry no actual semantic content, such as snoring, coughing and sneezing. Although they have no semantic meaning, they accurately reflect a person's physiological needs or condition, or the quality of an object. For example, a doctor can judge a person's state of health from the patient's snoring, coughing, sneezing and the like. The content of such specific sounds is simple and repetitive, yet they are an indispensable part of our lives, and effectively recognizing and judging various specific sound signals is of great significance.
At present, there have been studies on recognizing specific sounds with speech recognition techniques. For example, one recognition method for cough sounds combines the characteristics of cough sounds with speech recognition technology to establish a cough model, and uses a model matching method based on Dynamic Time Warping (DTW) to recognize isolated cough sounds of a specific person.
In the process of implementing the application, the inventor finds that at least the following problems exist in the related art: the existing specific voice recognition algorithm has large calculation amount and high requirement on hardware equipment.
Disclosure of Invention
The application aims to provide a specific sound identification method, equipment and a storage medium, which can identify specific sounds, and have the advantages of simple algorithm, small calculation amount and low requirement on hardware equipment.
To achieve the above object, in a first aspect, an embodiment of the present application provides a specific voice recognition method, including:
sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
inputting the characteristic parameters into a specific sound characteristic model which is obtained in advance and is based on the deep neural network for recognition so as to determine whether the sound signal is a specific sound.
Optionally, the method further includes: and acquiring the specific sound characteristic model based on the deep neural network in advance.
Optionally, the pre-obtaining the specific acoustic feature model based on the deep neural network includes:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
Optionally, the extracting the feature parameter from the mel-frequency cepstrum coefficient feature parameter matrix of the specific sound sample signal includes:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal end to form a characteristic vector;
dividing the feature vector from the head of the feature vector to the tail of the feature vector according to a preset step length to obtain feature parameters of a group of sub-feature vectors with preset lengths, wherein each sub-feature vector has the same label, the preset step length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient, and the preset length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient;
the extracting of the characteristic parameters from the mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal comprises:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal end to form a characteristic vector;
and segmenting the feature vector from the head of the feature vector to the tail of the feature vector according to the preset step length to obtain feature parameters of a group of sub-feature vectors with the preset lengths.
Optionally, the training a deep neural network model with the feature parameters of the specific sound sample signal as input to obtain the specific sound feature model based on the deep neural network includes:
taking the characteristic parameters of the specific sound sample signal as input, and carrying out model training based on a deep confidence network algorithm to obtain each initial parameter of the specific sound characteristic model based on the deep neural network;
and fine-tuning each initial parameter based on a gradient descent and back propagation algorithm of the deep neural network to obtain each parameter of the specific sound characteristic model based on the deep neural network.
Optionally, the inputting the feature parameters into a pre-obtained specific sound feature model based on a deep neural network for recognition to determine whether the sound signal is a specific sound includes:
inputting a group of sub-feature vectors contained in the feature parameters into a pre-acquired specific sound feature model based on a deep neural network to obtain a prediction result corresponding to the group of sub-feature vectors;
and if the positive prediction result is more than the negative prediction result in the prediction results, confirming that the sound signal is the specific sound, otherwise, confirming that the sound signal is not the specific sound.
Optionally, the specific sound includes any one of a cough sound, a snore sound, and a sneeze sound.
In a second aspect, an embodiment of the present application further provides a specific voice recognition apparatus, including:
the system comprises a sampling and characteristic parameter acquisition module, a signal processing module and a signal processing module, wherein the sampling and characteristic parameter acquisition module is used for sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
the characteristic parameter extraction module is used for extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
the characteristic matching module is used for confirming whether the characteristic parameters are matched with a specific sound characteristic model which is obtained in advance and is based on the deep neural network;
and the confirming module is used for confirming that the sound signal is a specific sound if the characteristic parameters are matched with a specific sound characteristic model which is obtained in advance and is based on the deep neural network.
Optionally, the apparatus further comprises:
the characteristic model presetting module is used for acquiring the specific sound characteristic model based on the deep neural network in advance;
the feature model presetting module is specifically configured to:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
In a third aspect, an embodiment of the present application further provides a specific voice recognition apparatus, where the specific voice recognition apparatus includes:
a sound input unit for receiving a sound signal;
a signal processing unit for performing signal processing on the sound signal;
the signal processing unit is connected with an arithmetic processing unit which is internally or externally arranged on a specific voice recognition device, and the arithmetic processing unit comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
In a fourth aspect, the present application further provides a storage medium storing executable instructions, which when executed by a specific sound recognition apparatus, cause the specific sound recognition apparatus to perform the above method.
In a fifth aspect, the present application further provides a program product including a program stored on a storage medium, the program including program instructions that, when executed by a specific sound recognition apparatus, cause the specific sound recognition apparatus to perform the above-mentioned method.
The specific sound recognition method, device and storage medium described above use a recognition algorithm based on mel-frequency cepstrum coefficient characteristic parameters and a deep neural network model; the algorithm complexity is low, the amount of calculation is small, the hardware requirements are low, and the product manufacturing cost is reduced.
Drawings
One or more embodiments are illustrated by way of example with reference to the accompanying drawings, in which like reference numerals refer to similar elements and which are not drawn to scale unless otherwise specified.
FIG. 1 is a schematic diagram of an application environment according to embodiments of the present application;
fig. 2 is a schematic flow chart of pre-obtaining a specific acoustic feature model based on a deep neural network in a specific acoustic recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the Mel frequency filtering process in the MFCC coefficient calculation process;
FIG. 4 is a time-amplitude diagram of a cough sound signal;
FIG. 5 is a schematic diagram of the feature parameter extraction step dividing the feature vector into sub-feature vectors;
FIG. 6 is a schematic diagram of a general deep neural network architecture;
FIG. 7 is a schematic diagram of a general deep belief network structure;
FIG. 8 is a flowchart illustrating a step of extracting feature parameters in a specific voice recognition method according to an embodiment of the present application;
FIG. 9 is a flowchart illustrating a step of training a deep neural network-based specific acoustic feature model in a specific acoustic recognition method according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating a specific voice recognition method provided by an embodiment of the present application;
FIG. 11 is a block diagram of a specific sound recognition apparatus provided by an embodiment of the present application;
FIG. 12 is a block diagram of another specific sound recognition apparatus provided by an embodiment of the present application;
fig. 13 is a schematic structural diagram of a specific voice recognition apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a specific sound recognition scheme based on Mel-Frequency Cepstral Coefficient (MFCC) characteristic parameters and a Deep Neural Network (DNN) algorithm, and the scheme is suitable for the application environment shown in fig. 1. The application environment includes a user 10 and a specific sound recognition device 20, and the specific sound recognition device 20 is configured to receive a sound made by the user 10 and recognize it to determine whether it is a specific sound.
Further, after recognizing that the sound is a specific sound, the specific sound recognition device 20 may also record and process the specific sound to output condition information about the specific sounds made by the user 10. The condition information may include the number of occurrences of the specific sound, the duration of the specific sound, and the decibel level of the specific sound. For example, a counter may be included in the specific sound recognition device to count occurrences when a specific sound is detected; a timer may be included to measure the duration of a specific sound when it is detected; and a decibel detection component may be included to measure the decibel level of a specific sound when it is detected.
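As an illustration only, the following minimal Python sketch shows how such condition information (count, duration, decibel level) might be accumulated; the class and method names are hypothetical and are not part of this application:

```python
class SpecificSoundMonitor:
    """Hypothetical sketch: accumulate count, duration and loudness of detected specific sounds."""

    def __init__(self):
        self.count = 0              # counter: number of specific-sound events
        self.total_duration = 0.0   # timer: accumulated duration in seconds
        self.max_decibel = 0.0      # decibel detection: loudest event so far

    def on_specific_sound(self, start_time, end_time, decibel):
        """Called once per detected specific sound, e.g. per recognized cough."""
        self.count += 1
        self.total_duration += end_time - start_time
        self.max_decibel = max(self.max_decibel, decibel)
```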
The recognition principle of the specific voice is similar to the voice recognition principle, and the specific voice is input into the voice model for recognition after being processed, so that a recognition result is obtained. It can be divided into two stages, a specific voice model training stage and a specific voice recognition stage. The specific sound model training stage mainly comprises the steps of collecting a certain number of specific sound sample signals, calculating an MFCC characteristic parameter matrix of the specific sound sample signals, extracting characteristic parameters from the MFCC characteristic parameter matrix, and performing model training on the characteristic parameters based on a DNN algorithm to obtain a specific sound characteristic model. In the stage of specific sound identification, an MFCC characteristic parameter matrix of a sound signal needing to be judged is calculated, corresponding characteristic parameters are extracted from the MFCC characteristic parameter matrix of the sound signal, and then the characteristic parameters are input into a specific sound characteristic model for identification so as to determine whether the sound signal is a specific sound. The identification process mainly comprises the steps of preprocessing, feature extraction, model training, pattern matching, judgment and the like.
Wherein, in the preprocessing step, sampling a specific sound sample signal and calculating an MFCC characteristic parameter matrix of the specific sound sample signal are included. In the feature extraction step, feature parameters are extracted from the MFCC feature parameter matrix. In the model training step, the characteristic parameters extracted from the MFCC characteristic parameter matrix of the specific sound sample signal are used as input, and a specific sound characteristic model based on the deep neural network is trained. In the pattern matching and determining step, the specific sound feature model is used to identify whether the new sound signal is a specific sound. Wherein, whether the new sound signal is the specific sound is identified, comprising: firstly, an MFCC characteristic parameter matrix of a sound signal is calculated, then characteristic parameters of the sound signal are extracted from the MFCC characteristic parameter matrix, and then the characteristic parameters of the sound signal are input into a specific sound characteristic model for recognition so as to determine whether the sound signal is a specific sound.
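The two stages can be summarized in the following minimal Python sketch; the helper names compute_mfcc_matrix, extract_feature_parameters and train_dnn are hypothetical stand-ins for the steps detailed later in this description:

```python
def train_specific_sound_model(sample_signals, labels):
    """Training stage (sketch): preprocessing -> feature extraction -> model training."""
    features, feature_labels = [], []
    for signal, label in zip(sample_signals, labels):
        mfcc = compute_mfcc_matrix(signal)              # MFCC characteristic parameter matrix
        sub_vectors = extract_feature_parameters(mfcc)  # a group of sub-feature vectors
        features.extend(sub_vectors)
        feature_labels.extend([label] * len(sub_vectors))  # same label for every sub-vector
    return train_dnn(features, feature_labels)

def recognize_specific_sound(signal, model):
    """Recognition stage (sketch): pattern matching followed by a majority-vote decision."""
    sub_vectors = extract_feature_parameters(compute_mfcc_matrix(signal))
    votes = [model.predict(v) for v in sub_vectors]     # 1 = specific sound, 0 = not
    return sum(votes) > len(votes) / 2
```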
Recognizing specific sounds by combining MFCC and DNN in this way reduces the algorithm complexity and the amount of calculation, and significantly improves the accuracy of specific sound recognition.
The embodiment of the present application provides a specific sound recognition method, which may be used in the above-mentioned specific sound recognition device 20. The method requires a DNN-based specific sound feature model to be obtained in advance; this model may be preconfigured, or may be obtained by training through the following steps 101 to 103. Once the DNN-based specific sound feature model has been obtained by training, specific sounds can be recognized with it. Further, if the model's accuracy in recognizing the specific sound becomes inadequate because the scene has changed or for other reasons, the DNN-based specific sound feature model may be reconfigured or retrained.
As shown in fig. 2, the obtaining a DNN-based specific acoustic feature model in advance includes:
step 101: collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
sampling to obtain a specific sound sample signal s (n), and acquiring an MFCC characteristic parameter matrix of the specific sound sample signal according to the specific sound sample signal. The mel frequency cepstrum coefficient is mainly used for sound data feature extraction and operation dimensionality reduction. For example: for data with 512 dimensions (sampling points) in one frame, the most important 40-dimensional data can be extracted after MFCC processing, and the purpose of reducing dimensions is achieved. The mel-frequency cepstrum coefficient calculation generally includes: pre-emphasis, framing, windowing, fast fourier transform, mel filter bank, and discrete cosine transform.
Obtaining the MFCC characteristic parameter matrix of the specific sound sample signal, specifically comprising the following steps:
① Pre-emphasis
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter and remains at a comparable level over the whole band from low to high frequencies, allowing the spectrum to be computed with the same signal-to-noise ratio. It also compensates for the vocal-cord and lip effects introduced during sound production, restoring the high-frequency components of the specific sound sample signal that are suppressed by the articulation system and highlighting the high-frequency formants. Pre-emphasis is implemented by passing the sampled specific sound sample signal s(n) through a first-order Finite Impulse Response (FIR) high-pass digital filter, whose transfer function is:
H(z) = 1 - a·z^(-1)    (1)
where z denotes the z-domain variable of the input signal (whose time-domain representation is the specific sound sample signal s(n)), and a denotes the pre-emphasis coefficient, usually a constant between 0.9 and 1.0.
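A minimal numpy sketch of this first-order FIR pre-emphasis; the coefficient value 0.97 is only one typical choice within the 0.9-1.0 range stated above:

```python
import numpy as np

def pre_emphasis(s, a=0.97):
    """Apply H(z) = 1 - a*z^-1, i.e. y[n] = s[n] - a*s[n-1]."""
    return np.append(s[0], s[1:] - a * s[:-1])
```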
② framing
Every P sampling points of the specific sound sample signal s(n) are grouped into one observation unit, called a frame. P may be 256 or 512, covering a duration of roughly 20-30 ms. To avoid excessive variation between two adjacent frames, an overlap region of G sampling points may be kept between them, where G is typically about 1/2 or 1/3 of P. The sampling frequency of the specific sound sample signal may be 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to a duration of 256/8000 × 1000 = 32 ms.
③ windowing
Each frame is multiplied by a Hamming window to increase the continuity between its left and right ends. Assuming the framed signal is S(n), n = 0, 1, …, P-1, where P is the frame size, then after multiplication by the Hamming window S′(n) = S(n) × W(n). A commonly used form of the Hamming window is W(n) = 0.54 - 0.46·cos(2πn/(l - 1)), 0 ≤ n ≤ l - 1, where l denotes the window length.
④ Fast Fourier Transform (FFT)
Since the characteristics of a signal are usually hard to observe in the time domain, the signal is commonly transformed into an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different sounds. After being multiplied by the Hamming window, each framed and windowed signal is therefore subjected to a fast Fourier transform to obtain the spectrum of each frame, and taking the modulus squared of the spectrum of the specific sound sample signal gives its power spectrum.
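A minimal numpy sketch of framing (step ②), Hamming windowing (step ③) and the FFT power spectrum (step ④); the frame length of 256 samples and the half-frame overlap are example values consistent with the ranges given above:

```python
import numpy as np

def frame_window_power_spectrum(s, frame_len=256, hop=128):
    """Split s into overlapping frames, window each frame, and return per-frame power spectra."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    power_spectra = []
    for i in range(n_frames):
        frame = s[i * hop:i * hop + frame_len] * window  # step 3: windowing
        spectrum = np.fft.rfft(frame)                    # step 4: fast Fourier transform
        power_spectra.append(np.abs(spectrum) ** 2)      # modulus squared gives the power spectrum
    return np.array(power_spectra)
```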
⑤ Triangular band-pass filtering
The energy spectrum is filtered by a bank of mel-scale triangular filters. A filter bank with M triangular filters is defined (the number of filters is close to the number of critical bands), with center frequencies f(m), m = 1, 2, …, M, where M may be 22-26. The interval between adjacent f(m) values narrows as m decreases and widens as m increases; see fig. 3.
The frequency response H_m(k) of the m-th triangular filter is defined piecewise over the interval [f(m-1), f(m+1)] and is zero elsewhere, as sketched below.
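The commonly used piecewise form of this triangular response, written with the center frequencies f(m) defined above (the application's exact expression is not reproduced, so this is the standard textbook form):

```latex
H_m(k) =
\begin{cases}
0, & k < f(m-1) \\[4pt]
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[8pt]
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\[8pt]
0, & k > f(m+1)
\end{cases}
```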
⑥ discrete cosine transform
The logarithmic energy s(m) output by each filter bank is then computed (formula (4)), and the MFCC is obtained by applying a Discrete Cosine Transform (DCT) to the logarithmic energies s(m) (formula (5)); standard forms of both are sketched below.
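Standard forms of formulas (4) and (5), consistent with the M triangular filters H_m(k) and the per-frame power spectrum |X(k)|² defined above (the application's exact expressions are not reproduced, so these are the usual textbook forms):

```latex
s(m) = \ln\!\left( \sum_{k=0}^{P-1} \lvert X(k) \rvert^{2} \, H_m(k) \right), \qquad 1 \le m \le M \tag{4}

C(n) = \sum_{m=1}^{M} s(m)\,\cos\!\left( \frac{\pi n \,(m - 0.5)}{M} \right), \qquad n = 1, 2, \dots, L \tag{5}
```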
step 102: extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
from equation (5), MFCC is a coefficient matrix of N × L, where N is the number of frames of the audio signal and L is the MFCC length. Since the MFCC characteristic parameter matrix has a high dimension, and the number N of matrix rows is different due to the inconsistency of the lengths of the sound signals, the MFCC characteristic parameter matrix cannot be used as a direct input to obtain a DNN-based specific sound characteristic model, and therefore, it is necessary to further extract characteristic parameters from the MFCC characteristic parameter matrix. The purpose of extracting the characteristic parameters is to extract the characteristics of a specific sound sample signal to mark the specific sound sample signal, and train a specific sound characteristic model based on DNN by taking the characteristic parameters as input. Feature parameters may be extracted from the MFCC feature parameter matrix in combination with time domain or frequency domain characteristics of a particular sound signal.
Taking the cough sound signal as an example of a specific sound signal, refer to fig. 4, which is a time-amplitude (time-domain) diagram of a cough sound signal. As can be seen from fig. 4, a cough sound occurs over a very short period and is markedly paroxysmal: the duration of a single cough is usually less than 550 ms, and even for patients with severe throat or bronchial disease it remains around 1000 ms. In terms of energy, the energy of the cough sound signal is concentrated primarily in the first half of the signal. Therefore, after MFCC processing, the main characteristic information of a cough sound sample signal is essentially concentrated in its first half. The characteristic parameters fed into the deep neural network should cover as much of the main information of the cough sound sample signal as possible, so that what is extracted from the MFCC characteristic parameter matrix is useful information rather than redundant information.
The feature parameters of the cough sound sample signals of the front fixed frame number can be selected in the MFCC feature parameter matrix of the cough sound sample signals as the input of the deep neural network, and the cough sound sample signals of the fixed frame number should contain the front half parts of the respective cough sound sample signals as much as possible in view of the fact that the main characteristic information of the cough sound sample signals is basically concentrated in the front half parts of the cough sound sample signals. In order to make full use of data, the remaining feature data in the MFCC feature parameter matrix can also be used as the input of the deep neural network, the MFCC feature parameter matrix can be divided according to the fixed frame number, and then the divided data can be used as the input of the deep neural network together.
Specifically, as shown in fig. 8, the extracting the feature parameters from the mel-frequency cepstrum coefficient feature parameter matrix of the specific sound sample signal includes:
step 1021: sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal end to form a vector;
step 1022: and dividing the vector from the head of the vector to the tail of the vector according to a preset step length (the unit is a frame) to obtain characteristic parameters of a group of sub-vectors with preset lengths (namely a fixed frame number), wherein each sub-vector has the same label.
The frames of the MFCC feature parameter matrix are connected in series to form a vector X. Taking the preset length e as the basic unit, a window is moved from the head to the tail of X with a preset step d, forming a group of data X_i, i = 1, 2, …, that share the same label. The specific processing procedure is shown in fig. 5.
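A minimal numpy sketch of this segmentation; mfcc is the N×L coefficient matrix, e is the preset length and d the preset step, both expressed in frames so that the resulting sub-vector lengths are integral multiples of the per-frame MFCC length L:

```python
import numpy as np

def split_into_sub_vectors(mfcc, e, d):
    """Join the N frames end to end into a vector X, then slide a window of e frames with step d frames."""
    n_frames, frame_len = mfcc.shape
    x = mfcc.reshape(-1)                        # vector X: MFCC frames concatenated head to tail
    win, step = e * frame_len, d * frame_len    # window and step measured in coefficients
    sub_vectors = [x[i:i + win] for i in range(0, len(x) - win + 1, step)]
    return np.array(sub_vectors)                # every row X_i later receives the same label
```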
In practical application, if the specific sound is a cough sound, the frame number of the first half section of the general cough sound signal can be calculated statistically, and then the preset length is taken as the frame number, and the preset step length can be taken in combination with practical application. If the specific sound is other sound, such as snore or sneeze, the value can be set to the preset length and the preset step length according to the time domain and frequency domain characteristics.
The MFCC characteristic parameter matrix of a specific sound sample signal is divided into a plurality of sub-characteristic vectors with fixed lengths, so that the sub-characteristic vectors adapt to the requirement of consistency of input data of the deep neural network and can be directly used as the input of the deep neural network. Moreover, each sub-feature vector in the plurality of sub-feature vectors is set to be the same label, that is, a group of sub-feature vectors are used for expressing the same specific sound sample signal, so that the number of data samples is increased, and the loss of information during feature parameter extraction is avoided. And establishing a specific sound characteristic model based on the deep neural network by using the sub-characteristic vectors and the labels corresponding to the sub-characteristic vectors, and identifying specific sound by using the specific sound characteristic model, so that the error identification rate is reduced, and the accuracy rate of specific sound identification is improved. When the specific sound identification method provided by the embodiment of the application is used for identifying the cough sound, the identification rate of the cough sound can reach more than 95% on the basis of not increasing the calculation amount.
Step 103: and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
A DNN is an extension of the shallow neural network: it exploits the representational power of a multilayer network and has very good feature-extraction, learning and generalization capabilities for nonlinear, high-dimensional data. A DNN model generally comprises an input layer, hidden layers and an output layer. Referring to fig. 6, the first layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer (fig. 6 shows only three hidden layers; in practice more hidden layers may be included). The layers are fully connected, that is, any neuron in the Q-th layer is connected to every neuron in the (Q+1)-th layer.
Each connection between neurons has a linear weight, and every neuron in each layer except the input layer has a bias. The linear weight from the k-th neuron of layer l-1 to the j-th neuron of layer l is denoted w^l_jk, where the superscript l is the layer in which the weight lies and the subscripts are the output index j in layer l and the input index k in layer l-1; for example, the linear weight from the 4th neuron of the second layer to the 2nd neuron of the third layer is w^3_24. The bias of the i-th neuron of layer l is denoted b^l_i, where the superscript l is the layer number and the subscript i is the index of the neuron; for example, the bias of the third neuron of the second layer is b^2_3.
A set of w^l_jk and b^l_i may be randomly initialized, and the characteristic parameters of the specific sound sample signal are then used as the input-layer data of a forward propagation algorithm: the first hidden layer is computed from the input layer, the second hidden layer from the first, and so on until the output layer is reached. A back propagation algorithm is then used to fine-tune w^l_jk and b^l_i, finally yielding the specific sound characteristic model based on the deep neural network.
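A minimal numpy sketch of the random initialization and forward propagation described above; the layer sizes are arbitrary example values, not ones specified by the application:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Propagate one feature vector through the fully connected layers: a_l = sigma(W_l a_{l-1} + b_l)."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

# random initialization, e.g. input layer 440 -> hidden 256 -> hidden 128 -> output 2
sizes = [440, 256, 128, 2]
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]
```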
Alternatively, the initial parameters w^l_jk and b^l_i may be obtained with a Deep Belief Network (DBN) algorithm and then fine-tuned with gradient descent and back propagation to obtain their final values. Referring to fig. 9, training the deep neural network model with the characteristic parameters of the specific sound sample signal as input to obtain the specific sound characteristic model based on the deep neural network includes:
step 1031: taking the characteristic parameters of the specific sound sample signal as input, and carrying out model training based on a deep confidence network algorithm to obtain each initial parameter of the specific sound characteristic model based on the deep neural network;
the DBN is a deep learning model and is used for non-monitoringThe supervised mode preprocesses the model layer by layer, and the unsupervised preprocessing mode is a Restricted Boltzmann Machine (RBM). As shown in fig. 7(b), the DBN is stacked from a series of RBMs. As shown in FIG. 7(a), RBM is a two-layer structure, v denotes a visible layer, h denotes a hidden layer, and the connection between the visible layer and the hidden layer is non-directional (values can be taken from the visible layer->Hidden layer or hidden layer->Visible layer arbitrary transport) and fully connected. The visible layer v and the hidden layer h are connected through linear weight, and the linear weight of the ith neuron of the visible layer and the jth neuron of the hidden layer is defined as wijThe bias corresponding to the ith neuron of the visible layer is biThe bias corresponding to the jth neuron of the hidden layer is ajThe indices i and j represent the index of the neuron.
The RBM performs one-step Gibbs sampling via the contrastive divergence algorithm, optimizing the weights w_ij and the biases b_i and a_j, and thereby yields another state representation h of the input sample data v (i.e. the characteristic parameters of the specific sound sample signal). The output h1 of one RBM can serve as the input of the next RBM, whose hidden state h2 is obtained by continuing the optimization in the same way, and so on. In this layer-by-layer pre-training manner the multi-layer DBN model initializes the weights w_ij and the biases b_i and a_j, each layer being a representation of the first-layer data v, and the initial parameters are obtained after this unsupervised pre-training.
Specifically, the RBM is an energy-based model; the total energy of the RBM is expressed by formula (6), a standard form of which is sketched after this passage.
Wherein E represents the total energy of the RBM model, v represents the visible layer data, h represents the hidden layer data, theta represents the model parameters, m represents the visible layer neuron number, n represents the hidden layer neuron number, b represents the visible layer bias, and a represents the hidden layer bias.
The RBM model samples according to the conditional probability of the visible layer data and the hidden layer data, and for the Bernoulli-Bernoulli RBM model, the conditional probability formulas are respectively formula (7) and formula (8),
where σ denotes the sigmoid activation function, σ(x) = (1 + e^(-x))^(-1).
Gibbs sampling is performed on the RBM with the contrastive divergence algorithm according to the above formulas to obtain samples of the joint distribution of v and h, and the parameters are then optimized by maximizing the log-likelihood function (9) of the observed samples,
Δw_ij ≈ ⟨v_i h_j⟩_0 - ⟨v_i h_j⟩_1    (10)
The parameters are optimized with a one-step contrastive divergence algorithm, in which the sampled states are generated directly in a mean-field approximation, and formula (10) is iterated multiple times to finally obtain the initial parameters, namely the weights between neurons and the biases of the neurons. Here N denotes the number of neurons in the visible layer of the RBM model, i.e. the dimension of the RBM input data.
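For reference, standard Bernoulli-Bernoulli RBM expressions corresponding to formulas (6)-(9), written with the symbols defined above (v, h, θ, w_ij, b_i, a_j, σ); the application's exact expressions are not reproduced, and formula (10) is already given above:

```latex
E(v, h \mid \theta) = -\sum_{i=1}^{m} b_i v_i \;-\; \sum_{j=1}^{n} a_j h_j \;-\; \sum_{i=1}^{m}\sum_{j=1}^{n} v_i \, w_{ij} \, h_j \tag{6}

P(h_j = 1 \mid v) = \sigma\!\Big( a_j + \sum_{i} v_i w_{ij} \Big) \tag{7}

P(v_i = 1 \mid h) = \sigma\!\Big( b_i + \sum_{j} w_{ij} h_j \Big) \tag{8}

\mathcal{L}(\theta) = \sum_{v \,\in\, \text{training samples}} \log P(v \mid \theta) \tag{9}
```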
Step 1032: and fine-tuning each initial parameter based on a gradient descent and back propagation algorithm of the deep neural network to obtain each parameter of the specific sound characteristic model based on the deep neural network.
After the optimization process of the DBN is completed, the weights w between the neurons of all layers (input layer, hidden layers and output layer) of the DNN-based specific sound feature model and the neuron biases b are obtained; the final multi-class logistic regression (softmax) layer is randomly initialized, and the specific sound feature model is then fine-tuned through the DNN with a supervised gradient descent algorithm.
Specifically, the entire DNN-based specific sound feature model is fine-tuned in a supervised manner by minimizing the cost function (formula (11)) and optimizing the parameters accordingly (formula (12)).
where J denotes the cost function, h_{W,b}(x) denotes the output of the DNN, and y denotes the label corresponding to the input data.
where α denotes the learning rate, typically taking a value between 0.01 and 0.5.
The partial derivatives of the nodes of the deep neural network calculated in the above formula (12) may adopt the back propagation algorithm of the formula (13).
where δ denotes the sensitivity and a denotes the output value of each neuron node; δ takes one form when l is the output layer and another when l is any other layer, with σ denoting the activation function (standard forms are sketched below). Formula (13) is then updated through multiple iterations, optimizing the whole DNN model layer by layer until all parameters are obtained, which yields the trained DNN-based specific sound characteristic model.
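For reference, standard forms of the quantities referred to as formulas (11)-(13), namely a squared-error cost, the gradient-descent parameter update with learning rate α, and the back-propagation recursion for the sensitivities δ; the application's exact expressions are not reproduced, so these are the usual textbook forms:

```latex
J(W, b) = \tfrac{1}{2}\,\big\lVert h_{W,b}(x) - y \big\rVert^{2} \tag{11}

W^{l} := W^{l} - \alpha \frac{\partial J}{\partial W^{l}}, \qquad b^{l} := b^{l} - \alpha \frac{\partial J}{\partial b^{l}} \tag{12}

\delta^{L} = -\,(y - a^{L}) \odot \sigma'(z^{L}), \qquad \delta^{l} = \big( (W^{l+1})^{\top} \delta^{l+1} \big) \odot \sigma'(z^{l}) \tag{13}
```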
By combining DBN-based unsupervised learning with supervised learning in this way, the DNN model obtained through supervised learning after unsupervised pre-training performs markedly better than a randomly initialized deep neural network. The DNN-based specific sound characteristic model is built by taking the MFCC characteristic parameters of the specific sound sample signals as the input of the DNN model, and using this model to recognize specific sounds effectively improves the recognition rate.
Fig. 10 is a schematic flowchart of a specific voice recognition method provided in an embodiment of the present application, and as shown in fig. 10, the specific voice recognition method includes:
step 201: sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
in practical applications, a sound input unit (e.g., a microphone) may be disposed on the specific sound recognition device 20 to collect a sound signal, amplify, filter, and convert the sound signal into a digital signal. The digital signal may be sampled and subjected to other calculation processing in an arithmetic processing unit local to the specific voice recognition device 20, or may be uploaded to a cloud server, an intelligent terminal, or another server via a network for processing.
Please refer to step 101 for technical details of obtaining the mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal, which is not described herein again.
Step 202: extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
for a specific calculation method for extracting the feature parameters from the mel-frequency cepstrum coefficient feature parameter matrix of the sound signal, please refer to step 102, which is not described herein again.
Step 203: inputting the characteristic parameters into a specific sound characteristic model which is obtained in advance and is based on the deep neural network for recognition so as to determine whether the sound signal is a specific sound.
Specifically, inputting the feature parameters into a pre-acquired specific sound feature model based on a deep neural network for recognition to determine whether the sound signal is a specific sound, including:
inputting a group of sub-feature vectors contained in the feature parameters into a pre-acquired specific sound feature model based on a deep neural network to obtain a prediction result corresponding to the group of sub-feature vectors;
and if the positive prediction result is more than the negative prediction result in the prediction results, confirming that the sound signal is the specific sound, otherwise, confirming that the sound signal is not the specific sound.
When the characteristic parameters of the sound signal are input into a trained specific sound characteristic model based on DNN, the prediction result of whether the sound signal is a specific sound is obtained. Since the feature parameter of the same sound signal includes a plurality of sub-feature vectors, each sub-feature vector obtains a prediction result, so that each sound signal obtains a plurality of prediction results, and the prediction results represent the possibility of whether the sound signal is a specific sound. The specific sound characteristic model based on DNN votes for all the prediction results of the same sound signal, namely, in the prediction results of all the sub-characteristic vectors, if the positive prediction result is more than the negative prediction result, the sound signal is confirmed to be a specific sound; if the positive prediction result is less than the negative prediction result, it is confirmed that the sound signal is not the specific sound.
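A minimal sketch of this majority vote over the sub-feature-vector predictions; model.predict is a hypothetical stand-in for the trained DNN-based specific sound feature model, returning 1 for a positive prediction and 0 for a negative one:

```python
def is_specific_sound(sub_vectors, model):
    """Vote over all sub-feature vectors extracted from one sound signal."""
    predictions = [model.predict(v) for v in sub_vectors]
    positives = sum(1 for p in predictions if p == 1)
    negatives = len(predictions) - positives
    return positives > negatives   # specific sound only if positive votes outnumber negative ones
```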
The specific sound identification method provided by the embodiment of the application can identify the specific sound, so that the specific sound condition emitted by the user can be monitored by monitoring the sound emitted by the user, and the user does not need to wear any detection component. And because the identification algorithm based on the MFCC characteristic parameters and the DNN model is adopted, the algorithm complexity is low, the calculated amount is small, the requirement on hardware is low, and the product manufacturing cost is reduced.
It should be noted that, the specific sound identification method based on the MFCC characteristic parameters and the DNN model provided in the embodiment of the present application is also applicable to identifying other specific sounds such as snoring, sneezing, breathing, laughing, firecracker and crying, in addition to identifying the cough sound.
Accordingly, as shown in fig. 11, the embodiment of the present application further provides a specific voice recognition apparatus, which is used for a specific voice recognition device 20, and the apparatus includes:
a sampling and characteristic parameter obtaining module 301, configured to sample a sound signal and obtain a mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
a feature parameter extraction module 302, configured to extract feature parameters from a mel-frequency cepstrum coefficient feature parameter matrix of the sound signal;
the identifying module 303 is configured to input the feature parameters into a pre-acquired specific sound feature model based on the deep neural network for identification, so as to determine whether the sound signal is a specific sound.
The specific sound recognition apparatus provided by the embodiment of the application can recognize specific sounds, so the specific sounds made by a user can be monitored simply by monitoring the sounds the user makes, without the user wearing any detection component. And because a recognition algorithm based on MFCC characteristic parameters and a DNN model is adopted, the algorithm complexity is low, the amount of calculation is small, the hardware requirements are low, and the product manufacturing cost is reduced.
Optionally, in another embodiment of the apparatus, as shown in fig. 12, the apparatus further includes:
a feature model presetting module 304, configured to obtain the specific sound feature model based on the deep neural network in advance.
Optionally, in some embodiments of the apparatus, the feature model presetting module 304 is specifically configured to:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
Optionally, in some embodiments of the apparatus, the feature model presetting module 304 is further specifically configured to:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal end to form a characteristic vector;
dividing the feature vector from the head of the feature vector to the tail of the feature vector according to a preset step length to obtain feature parameters of a group of sub-feature vectors with preset lengths, wherein each sub-feature vector has the same label, the preset step length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient, and the preset length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient;
the feature parameter extraction module 302 is further specifically configured to:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal end to form a characteristic vector;
and segmenting the feature vector from the head of the feature vector to the tail of the feature vector according to the preset step length to obtain feature parameters of a group of sub-feature vectors with the preset lengths.
Optionally, in some embodiments of the apparatus, the feature model presetting module 304 is further specifically configured to:
taking the characteristic parameters of the specific sound sample signal as input, and carrying out model training based on a deep confidence network algorithm to obtain each initial parameter of the specific sound characteristic model based on the deep neural network;
and fine-tuning each initial parameter based on a gradient descent and back propagation algorithm of the deep neural network to obtain each parameter of the specific sound characteristic model based on the deep neural network.
Optionally, in some embodiments of the apparatus, the identifying module 303 is specifically configured to:
inputting a group of sub-feature vectors contained in the feature parameters into a pre-acquired specific sound feature model based on a deep neural network to obtain a prediction result corresponding to the group of sub-feature vectors;
and if the positive prediction result is more than the negative prediction result in the prediction results, confirming that the sound signal is the specific sound, otherwise, confirming that the sound signal is not the specific sound.
Optionally, in some embodiments of the device, the specific sound comprises any one of a cough, a snore and a sneeze.
It should be noted that the above-mentioned apparatus can execute the method provided by the embodiments of the present application, and has corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The embodiment of the present application also provides a specific sound recognition device. As shown in fig. 13, the specific sound recognition device 20 includes a sound input unit 21, a signal processing unit 22, and an arithmetic processing unit 23. The sound input unit 21 is used to receive a sound signal and may be, for example, a microphone. The signal processing unit 22 performs signal processing on the sound signal; it may perform analog processing such as amplification and filtering as well as analog-to-digital conversion, and send the resulting digital signal to the arithmetic processing unit 23.
The signal processing unit 22 is connected to the arithmetic processing unit 23, which may be built into the specific sound recognition device 20 (as illustrated in fig. 13) or arranged outside it; the arithmetic processing unit 23 may also be a remotely located server, for example a cloud server, an intelligent terminal or another server communicatively connected to the specific sound recognition device 20 through a network.
The arithmetic processing unit 23 includes:
at least one processor 232 (illustrated as a processor in fig. 13) and a memory 231, the processor 232 and the memory 231 may be connected by a bus or other means, and fig. 13 illustrates an example of a connection by a bus.
The memory 231 is used for storing nonvolatile software programs, nonvolatile computer executable programs, and software modules, such as program instructions/modules corresponding to a specific voice recognition method in the embodiment of the present application (for example, the sampling and feature parameter acquiring module 301 shown in fig. 11). The processor 232 executes various functional applications and data processing, i.e., implementing a specific voice recognition method of the above-described method embodiments, by executing nonvolatile software programs, instructions, and modules stored in the memory 231.
The memory 231 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created according to a particular voice recognition device usage, and the like. Further, the memory 231 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 231 optionally includes memory located remotely from processor 232, which may be connected to a particular voice recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 231 and, when executed by the one or more processors 232, perform the specific voice recognition method in any of the above-described method embodiments, for example, performing the above-described method steps 101-103 in fig. 2, the method steps 1021-1022 in fig. 8, the method steps 1031-1032 in fig. 9, and the step 201-203 in fig. 10; the functions of the modules 301 and 304 in fig. 11 and 12 are realized.
The specific sound recognition device provided by the embodiment of the application can recognize specific sounds, so the specific sounds made by a user can be monitored simply by monitoring the sounds the user makes, without the user wearing any detection component. And because a recognition algorithm based on MFCC characteristic parameters and a DNN model is adopted, the algorithm complexity is low, the amount of calculation is small, the hardware requirements are low, and the product manufacturing cost is reduced.
The specific sound recognition device can execute the method provided by the embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The present embodiment provides a storage medium storing computer-executable instructions which, when executed by one or more processors (for example, one processor 232 in fig. 13), cause the one or more processors to perform the specific sound recognition method in any of the above-described method embodiments, for example performing the above-described method steps 101-103 in fig. 2, method steps 1021-1022 in fig. 8, method steps 1031-1032 in fig. 9 and steps 201-203 in fig. 10, and realizing the functions of modules 301 to 304 in fig. 11 and fig. 12.
The above-described embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform, or by hardware alone. Those skilled in the art will also understand that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the related hardware; the program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Within the concept of the present application, the technical features in the above embodiments or in different embodiments may be combined, the steps may be implemented in any order, and many other variations of the different aspects of the present application exist as described above, which are not provided in detail for the sake of brevity. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of the technical features may be equivalently replaced, and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A specific sound recognition method, the method comprising:
sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
inputting the characteristic parameters into a specific sound characteristic model which is obtained in advance and is based on the deep neural network for recognition so as to determine whether the sound signal is a specific sound.
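As an illustrative, non-claim sketch of the first two steps recited in claim 1 (sampling the sound signal and acquiring its Mel frequency cepstrum coefficient characteristic parameter matrix), the snippet below uses the librosa library; the library choice, sampling rate, frame length and number of coefficients are assumptions and are not part of the claims.

```python
import numpy as np
import librosa  # assumed third-party library; not prescribed by the claims

def mfcc_matrix(wav_path: str, sr: int = 16000, n_mfcc: int = 13,
                frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Return an (n_frames, n_mfcc) Mel frequency cepstrum coefficient
    characteristic parameter matrix, one row of coefficients per signal frame."""
    y, sr = librosa.load(wav_path, sr=sr)        # sample the sound signal
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=frame_len, hop_length=hop)
    return m.T                                   # frames along the first axis
```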
2. The specific sound recognition method according to claim 1, further comprising: acquiring the specific sound characteristic model based on the deep neural network in advance.
3. The specific sound recognition method according to claim 2, wherein the pre-obtaining the specific sound feature model based on the deep neural network comprises:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
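A rough, non-claim illustration of the training step in claim 3: a small feed-forward deep neural network is trained on characteristic parameters (sub-feature vectors) extracted from specific sound samples (label 1) and from other sounds (label 0). PyTorch, the layer sizes, the number of epochs and the plain gradient-descent optimizer are assumptions; no particular framework is prescribed by the claims.

```python
import torch
from torch import nn

def train_specific_sound_model(features: torch.Tensor, labels: torch.Tensor,
                               epochs: int = 50, lr: float = 0.01) -> nn.Module:
    """features: (N, D) float sub-feature vectors; labels: (N,) long, 1 = specific sound."""
    model = nn.Sequential(                       # illustrative DNN; layer sizes are assumptions
        nn.Linear(features.shape[1], 256), nn.ReLU(),
        nn.Linear(256, 64), nn.ReLU(),
        nn.Linear(64, 2),                        # two classes: specific / not specific
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # gradient descent
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()                          # back propagation
        optimizer.step()
    return model
```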
4. The specific sound recognition method according to claim 3, wherein the extracting the characteristic parameters from the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal comprises:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal end to form a characteristic vector;
dividing the feature vector from the head of the feature vector to the tail of the feature vector according to a preset step length to obtain feature parameters of a group of sub-feature vectors with preset lengths, wherein each sub-feature vector has the same label, the preset step length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient, and the preset length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient;
the extracting of the characteristic parameters from the mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal comprises:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal end to form a characteristic vector;
and segmenting the feature vector from the head of the feature vector to the tail of the feature vector according to the preset step length to obtain feature parameters of a group of sub-feature vectors with the preset lengths.
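A minimal sketch of the feature-parameter extraction recited in claim 4: the per-frame Mel frequency cepstrum coefficients are connected end to end into one long feature vector, which is then divided from head to tail into fixed-length sub-feature vectors using a step length that is an integral multiple of the per-frame coefficient length. The concrete multiples below (a window of 10 frames, a step of 5 frames) are assumptions.

```python
import numpy as np

def sub_feature_vectors(mfcc: np.ndarray, frames_per_window: int = 10,
                        frames_per_step: int = 5) -> np.ndarray:
    """mfcc: (n_frames, n_mfcc) matrix; returns (n_windows, frames_per_window * n_mfcc)."""
    n_frames, n_mfcc = mfcc.shape
    flat = mfcc.reshape(-1)                       # connect the frames end to end
    win = frames_per_window * n_mfcc              # preset length, a multiple of the frame length
    step = frames_per_step * n_mfcc               # preset step, a multiple of the frame length
    if flat.size < win:                           # signal too short for a single sub-vector
        return np.empty((0, win))
    starts = range(0, flat.size - win + 1, step)  # divide from head to tail
    return np.stack([flat[s:s + win] for s in starts])
```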
5. The specific sound recognition method according to claim 4, wherein the taking the characteristic parameters of the specific sound sample signal as input and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network comprises:
taking the characteristic parameters of the specific sound sample signal as input, and carrying out model training based on a deep confidence network algorithm to obtain each initial parameter of the specific sound characteristic model based on the deep neural network;
and fine-tuning each initial parameter based on a gradient descent and back propagation algorithm of the deep neural network to obtain each parameter of the specific sound characteristic model based on the deep neural network.
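Claim 5 obtains the initial parameters with a deep confidence (deep belief) network, i.e. a stack of restricted Boltzmann machines, and then fine-tunes them with gradient descent and back propagation. The NumPy snippet below sketches only a single contrastive-divergence (CD-1) update for one binary RBM layer, to indicate where such initial parameters could come from; stacking several layers, handling real-valued MFCC inputs (which would normally use a Gaussian-Bernoulli RBM), and the back-propagation fine-tuning itself are omitted, and every hyper-parameter shown is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_cd1_step(v0, W, b_vis, b_hid, lr=0.01, rng=None):
    """One CD-1 update for a binary RBM layer.
    v0: (B, V) visible batch; W: (V, H); b_vis: (V,); b_hid: (H,)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Positive phase: hidden units driven by the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one step of Gibbs sampling.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Update parameters from the difference of correlations.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```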
6. The specific sound recognition method according to claim 4, wherein the inputting the characteristic parameters into a pre-acquired specific sound characteristic model based on a deep neural network for recognition to determine whether the sound signal is a specific sound comprises:
inputting a group of sub-feature vectors contained in the feature parameters into a pre-acquired specific sound feature model based on a deep neural network to obtain a prediction result corresponding to the group of sub-feature vectors;
and if, among the prediction results, the positive prediction results outnumber the negative prediction results, confirming that the sound signal is the specific sound; otherwise, confirming that the sound signal is not the specific sound.
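The decision rule of claim 6 is a simple majority vote over the per-sub-vector predictions. A sketch, assuming the model emits one positive (1) or negative (0) prediction per sub-feature vector:

```python
def is_specific_sound(predictions) -> bool:
    """predictions: iterable of 1 (positive) / 0 (negative), one per sub-feature vector."""
    preds = list(predictions)
    positive = sum(preds)
    return positive > len(preds) - positive   # more positive results than negative ones
```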
7. The specific sound recognition method according to any one of claims 1 to 6, wherein the specific sound includes any one of a cough sound, a snore sound, and a sneeze sound.
8. A specific sound recognition apparatus, characterized in that the apparatus comprises:
the system comprises a sampling and characteristic parameter acquisition module, a signal processing module and a signal processing module, wherein the sampling and characteristic parameter acquisition module is used for sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
the characteristic parameter extraction module is used for extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
the characteristic matching module is used for confirming whether the characteristic parameters are matched with a specific sound characteristic model which is obtained in advance and is based on the deep neural network;
and the confirming module is used for confirming that the sound signal is a specific sound if the characteristic parameters are matched with a specific sound characteristic model which is obtained in advance and is based on the deep neural network.
9. The specific sound recognition apparatus according to claim 8, wherein the apparatus further comprises:
the characteristic model presetting module is used for acquiring the specific sound characteristic model based on the deep neural network in advance;
the feature model presetting module is specifically configured to:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
10. A specific sound recognition device, characterized by comprising:
a sound input unit for receiving a sound signal;
a signal processing unit for performing analog signal processing on the sound signal;
the signal processing unit is connected with an arithmetic processing unit, the arithmetic processing unit being built into or externally connected to the specific sound recognition device, and the arithmetic processing unit comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
11. A storage medium storing executable instructions which, when executed by a specific sound recognition device, cause the specific sound recognition device to perform the method of any one of claims 1-7.
CN201780009004.8A 2017-10-24 2017-10-24 Specific voice recognition method, apparatus and storage medium Active CN109074822B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/107505 WO2019079972A1 (en) 2017-10-24 2017-10-24 Specific sound recognition method and apparatus, and storage medium

Publications (2)

Publication Number Publication Date
CN109074822A true CN109074822A (en) 2018-12-21
CN109074822B CN109074822B (en) 2023-04-21

Family

ID=64678057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780009004.8A Active CN109074822B (en) 2017-10-24 2017-10-24 Specific voice recognition method, apparatus and storage medium

Country Status (2)

Country Link
CN (1) CN109074822B (en)
WO (1) WO2019079972A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767784A (en) * 2019-01-31 2019-05-17 龙马智芯(珠海横琴)科技有限公司 Method and device, storage medium and the processor of sound of snoring identification
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110338797A (en) * 2019-08-12 2019-10-18 苏州小蓝医疗科技有限公司 A kind of intermediate frequency snore stopper data processing method based on the sound of snoring and blood oxygen
CN110558944A (en) * 2019-09-09 2019-12-13 成都智能迭迦科技合伙企业(有限合伙) Heart sound processing method and device, electronic equipment and computer readable storage medium
CN110933235A (en) * 2019-11-06 2020-03-27 杭州哲信信息技术有限公司 Noise removing method in intelligent calling system based on machine learning
CN111009261A (en) * 2019-12-10 2020-04-14 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111243619A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Training method and device for voice signal segmentation model and computer equipment
WO2020140609A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Voice recognition method and device and computer readable storage medium
CN111488485A (en) * 2020-04-16 2020-08-04 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment
CN112418173A (en) * 2020-12-08 2021-02-26 北京声智科技有限公司 Abnormal sound identification method and device and electronic equipment
WO2021051608A1 (en) * 2019-09-20 2021-03-25 平安科技(深圳)有限公司 Voiceprint recognition method and device employing deep learning, and apparatus
CN113241093A (en) * 2021-04-02 2021-08-10 深圳达实智能股份有限公司 Method and device for recognizing voice in emergency state of subway station and electronic equipment
CN115064244A (en) * 2022-08-16 2022-09-16 深圳市奋达智能技术有限公司 Method and system for reminding medicine taking for needleless injection based on voice recognition

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI728632B (en) * 2019-12-31 2021-05-21 財團法人工業技術研究院 Positioning method for specific sound source
CN112185347A (en) * 2020-09-27 2021-01-05 北京达佳互联信息技术有限公司 Language identification method, language identification device, server and storage medium
CN112668556B (en) * 2021-01-21 2024-06-07 广东白云学院 Breathing sound identification method and system
CN113111786B (en) * 2021-04-15 2024-02-09 西安电子科技大学 Underwater target identification method based on small sample training diagram convolutional network
CN113571092B (en) * 2021-07-14 2024-05-17 东软集团股份有限公司 Engine abnormal sound identification method and related equipment thereof
CN113782048A (en) * 2021-09-24 2021-12-10 科大讯飞股份有限公司 Multi-modal voice separation method, training method and related device
CN114398925A (en) * 2021-12-31 2022-04-26 厦门大学 Multi-feature-based ship radiation noise sample length selection method and system
EP4226883A1 (en) 2022-02-15 2023-08-16 Koninklijke Philips N.V. Apparatuses and methods for use with a treatment device
CN116264620B (en) * 2023-04-21 2023-07-25 深圳市声菲特科技技术有限公司 Live broadcast recorded audio data acquisition and processing method and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016042152A (en) * 2014-08-18 2016-03-31 日本放送協会 Voice recognition device and program
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network
CN105702250A (en) * 2016-01-06 2016-06-22 福建天晴数码有限公司 Voice recognition method and device
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN106710599A (en) * 2016-12-02 2017-05-24 深圳撒哈拉数据科技有限公司 Particular sound source detection method and particular sound source detection system based on deep neural network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976564A (en) * 2010-10-15 2011-02-16 中国林业科学研究院森林生态环境与保护研究所 Method for identifying insect voice
CN103325382A (en) * 2013-06-07 2013-09-25 大连民族学院 Method for automatically identifying Chinese national minority traditional instrument audio data
CN104706321B (en) * 2015-02-06 2017-10-03 四川长虹电器股份有限公司 A kind of heart sound kind identification method based on improved MFCC
US9687208B2 (en) * 2015-06-03 2017-06-27 iMEDI PLUS Inc. Method and system for recognizing physiological sound
US10014003B2 (en) * 2015-10-12 2018-07-03 Gwangju Institute Of Science And Technology Sound detection method for recognizing hazard situation
CN106847293A (en) * 2017-01-19 2017-06-13 内蒙古农业大学 Facility cultivation sheep stress behavior acoustical signal monitoring method
CN107910020B (en) * 2017-10-24 2020-04-14 深圳和而泰智能控制股份有限公司 Snore detection method, device, equipment and storage medium

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140609A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Voice recognition method and device and computer readable storage medium
CN109767784B (en) * 2019-01-31 2020-02-07 龙马智芯(珠海横琴)科技有限公司 Snore identification method and device, storage medium and processor
CN109767784A (en) * 2019-01-31 2019-05-17 龙马智芯(珠海横琴)科技有限公司 Method and device, storage medium and the processor of sound of snoring identification
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110338797A (en) * 2019-08-12 2019-10-18 苏州小蓝医疗科技有限公司 A kind of intermediate frequency snore stopper data processing method based on the sound of snoring and blood oxygen
CN110558944A (en) * 2019-09-09 2019-12-13 成都智能迭迦科技合伙企业(有限合伙) Heart sound processing method and device, electronic equipment and computer readable storage medium
WO2021051608A1 (en) * 2019-09-20 2021-03-25 平安科技(深圳)有限公司 Voiceprint recognition method and device employing deep learning, and apparatus
CN110933235A (en) * 2019-11-06 2020-03-27 杭州哲信信息技术有限公司 Noise removing method in intelligent calling system based on machine learning
CN111009261A (en) * 2019-12-10 2020-04-14 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111009261B (en) * 2019-12-10 2022-11-15 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111243619A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Training method and device for voice signal segmentation model and computer equipment
CN111243619B (en) * 2020-01-06 2023-09-22 平安科技(深圳)有限公司 Training method and device for speech signal segmentation model and computer equipment
CN111488485A (en) * 2020-04-16 2020-08-04 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN111488485B (en) * 2020-04-16 2023-11-17 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment
CN112418173A (en) * 2020-12-08 2021-02-26 北京声智科技有限公司 Abnormal sound identification method and device and electronic equipment
CN113241093A (en) * 2021-04-02 2021-08-10 深圳达实智能股份有限公司 Method and device for recognizing voice in emergency state of subway station and electronic equipment
CN115064244A (en) * 2022-08-16 2022-09-16 深圳市奋达智能技术有限公司 Method and system for reminding medicine taking for needleless injection based on voice recognition

Also Published As

Publication number Publication date
CN109074822B (en) 2023-04-21
WO2019079972A1 (en) 2019-05-02

Similar Documents

Publication Publication Date Title
CN109074822B (en) Specific voice recognition method, apparatus and storage medium
CN108369813B (en) Specific voice recognition method, apparatus and storage medium
Sailor et al. Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification.
Lokesh et al. An automatic tamil speech recognition system by using bidirectional recurrent neural network with self-organizing map
Xie et al. Utterance-level aggregation for speaker recognition in the wild
Ghahremani et al. Acoustic Modelling from the Signal Domain Using CNNs.
CN107146601B (en) Rear-end i-vector enhancement method for speaker recognition system
Sainath et al. Learning filter banks within a deep neural network framework
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
Deshwal et al. A language identification system using hybrid features and back-propagation neural network
CN108701469B (en) Cough sound recognition method, device, and storage medium
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
Bhattacharjee A comparative study of LPCC and MFCC features for the recognition of Assamese phonemes
CN107680582A (en) Acoustic training model method, audio recognition method, device, equipment and medium
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
Ghezaiel et al. Hybrid network for end-to-end text-independent speaker identification
Al Bashit et al. A mel-filterbank and MFCC-based neural network approach to train the Houston toad call detection system design
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Hanchate et al. Vocal digit recognition using artificial neural network
CN112329819A (en) Underwater target identification method based on multi-network fusion
Vecchiotti et al. Convolutional neural networks with 3-d kernels for voice activity detection in a multiroom environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant