CN109074822A - Specific sound recognition methods, equipment and storage medium - Google Patents


Info

Publication number
CN109074822A
Authority
CN
China
Prior art keywords
specific sound
sound
characteristic
signal
specific
Prior art date
Legal status
Granted
Application number
CN201780009004.8A
Other languages
Chinese (zh)
Other versions
CN109074822B (en)
Inventor
刘洪涛
王伟
孟亚彬
Current Assignee
Shenzhen H&T Intelligent Control Co Ltd
Original Assignee
Shenzhen H&T Intelligent Control Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen H&T Intelligent Control Co Ltd filed Critical Shenzhen H&T Intelligent Control Co Ltd
Publication of CN109074822A publication Critical patent/CN109074822A/en
Application granted granted Critical
Publication of CN109074822B publication Critical patent/CN109074822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)

Abstract

A specific sound recognition method, device and storage medium. The method comprises: sampling a sound signal and obtaining a mel-frequency cepstrum coefficient (MFCC) characteristic parameter matrix of the sound signal (201); extracting characteristic parameters from the MFCC characteristic parameter matrix of the sound signal (202); and inputting the characteristic parameters into a pre-obtained specific sound characteristic model based on a deep neural network for recognition, so as to determine whether the sound signal is a specific sound (203). The method and device use a recognition algorithm based on MFCC characteristic parameters and a deep neural network model; the algorithm complexity is low and the amount of calculation is small, so the hardware requirements are low and the product manufacturing cost is reduced.

Description

Specific sound recognition method, apparatus and storage medium
Technical Field
Embodiments of the present invention relate to sound processing technologies, and in particular, to a specific sound recognition method, device, and storage medium.
Background
In daily life we hear certain specific sounds that carry no actual semantic content, such as snoring, coughing and sneezing. Although they have no semantic meaning, they accurately reflect a person's physiological needs or condition, or the quality of an object. For example, a doctor can judge a person's state of health from the patient's snoring, coughing, sneezing and the like. The content of such specific sounds is simple and repetitive, yet they are an indispensable part of our lives, and effectively recognizing and judging various specific sound signals is of great significance.
At present, there have been studies on recognizing specific sounds with speech recognition techniques. For example, one recognition method for cough sounds combines the characteristics of cough sounds with speech recognition technology to establish a cough model, and uses a model matching method based on Dynamic Time Warping (DTW) to recognize isolated cough sounds of a specific person.
In the process of implementing the application, the inventor finds that at least the following problems exist in the related art: the existing specific voice recognition algorithm has large calculation amount and high requirement on hardware equipment.
Disclosure of Invention
The application aims to provide a specific sound identification method, equipment and a storage medium, which can identify specific sounds, and have the advantages of simple algorithm, small calculation amount and low requirement on hardware equipment.
To achieve the above object, in a first aspect, an embodiment of the present application provides a specific voice recognition method, including:
sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
inputting the characteristic parameters into a specific sound characteristic model which is obtained in advance and is based on the deep neural network for recognition so as to determine whether the sound signal is a specific sound.
Optionally, the method further includes: and acquiring the specific sound characteristic model based on the deep neural network in advance.
Optionally, the pre-obtaining the specific acoustic feature model based on the deep neural network includes:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
Optionally, the extracting the feature parameter from the mel-frequency cepstrum coefficient feature parameter matrix of the specific sound sample signal includes:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal end to form a characteristic vector;
dividing the feature vector from the head of the feature vector to the tail of the feature vector according to a preset step length to obtain feature parameters of a group of sub-feature vectors with preset lengths, wherein each sub-feature vector has the same label, the preset step length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient, and the preset length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient;
the extracting of the characteristic parameters from the mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal comprises:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal end to form a characteristic vector;
and segmenting the feature vector from the head of the feature vector to the tail of the feature vector according to the preset step length to obtain feature parameters of a group of sub-feature vectors with the preset lengths.
Optionally, the training a deep neural network model with the feature parameters of the specific sound sample signal as input to obtain the specific sound feature model based on the deep neural network includes:
taking the characteristic parameters of the specific sound sample signal as input, and carrying out model training based on a deep confidence network algorithm to obtain each initial parameter of the specific sound characteristic model based on the deep neural network;
and fine-tuning each initial parameter based on a gradient descent and back propagation algorithm of the deep neural network to obtain each parameter of the specific sound characteristic model based on the deep neural network.
Optionally, the inputting the feature parameters into a pre-obtained specific sound feature model based on a deep neural network for recognition to determine whether the sound signal is a specific sound includes:
inputting a group of sub-feature vectors contained in the feature parameters into a pre-acquired specific sound feature model based on a deep neural network to obtain a prediction result corresponding to the group of sub-feature vectors;
and if the positive prediction result is more than the negative prediction result in the prediction results, confirming that the sound signal is the specific sound, otherwise, confirming that the sound signal is not the specific sound.
Optionally, the specific sound includes any one of a cough sound, a snore sound, and a sneeze sound.
In a second aspect, an embodiment of the present application further provides a specific voice recognition apparatus, including:
the system comprises a sampling and characteristic parameter acquisition module, a signal processing module and a signal processing module, wherein the sampling and characteristic parameter acquisition module is used for sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
the characteristic parameter extraction module is used for extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
the characteristic matching module is used for confirming whether the characteristic parameters are matched with a specific sound characteristic model which is obtained in advance and is based on the deep neural network;
and the confirming module is used for confirming that the sound signal is a specific sound if the characteristic parameters are matched with a specific sound characteristic model which is obtained in advance and is based on the deep neural network.
Optionally, the apparatus further comprises:
the characteristic model presetting module is used for acquiring the specific sound characteristic model based on the deep neural network in advance;
the feature model presetting module is specifically configured to:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
In a third aspect, an embodiment of the present application further provides a specific voice recognition apparatus, where the specific voice recognition apparatus includes:
a sound input unit for receiving a sound signal;
a signal processing unit for performing signal processing on the sound signal;
the signal processing unit is connected with an arithmetic processing unit which is internally or externally arranged on a specific voice recognition device, and the arithmetic processing unit comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
In a fourth aspect, the present application further provides a storage medium storing executable instructions, which when executed by a specific sound recognition apparatus, cause the specific sound recognition apparatus to perform the above method.
In a fifth aspect, the present application further provides a program product including a program stored on a storage medium, the program including program instructions that, when executed by a specific sound recognition apparatus, cause the specific sound recognition apparatus to perform the above-mentioned method.
The specific sound recognition method, device and storage medium described above use a recognition algorithm based on mel-frequency cepstrum coefficient characteristic parameters and a deep neural network model; the algorithm complexity is low, the amount of calculation is small, the hardware requirements are low, and the product manufacturing cost is reduced.
Drawings
One or more embodiments are illustrated by way of example with reference to the accompanying drawings, in which like reference numerals refer to similar elements and which are not drawn to scale unless otherwise specified.
FIG. 1 is a schematic diagram of an application environment according to embodiments of the present application;
fig. 2 is a schematic flow chart of pre-obtaining a specific acoustic feature model based on a deep neural network in a specific acoustic recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the Mel frequency filtering process in the MFCC coefficient calculation process;
FIG. 4 is a time-amplitude diagram of a cough sound signal;
FIG. 5 is a schematic diagram of the feature parameter extraction step dividing the feature vector into sub-feature vectors;
FIG. 6 is a schematic diagram of a general deep neural network architecture;
FIG. 7 is a schematic diagram of a general deep belief network structure;
FIG. 8 is a flowchart illustrating a step of extracting feature parameters in a specific voice recognition method according to an embodiment of the present application;
FIG. 9 is a flowchart illustrating a step of training a deep neural network-based specific acoustic feature model in a specific acoustic recognition method according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating a specific voice recognition method provided by an embodiment of the present application;
FIG. 11 is a block diagram of a specific sound recognition apparatus provided by an embodiment of the present application;
FIG. 12 is a block diagram of another specific sound recognition apparatus provided by an embodiment of the present application;
fig. 13 is a schematic structural diagram of a specific voice recognition apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a specific sound recognition scheme based on Mel-Frequency Cepstral Coefficient (MFCC) characteristic parameters and a Deep Neural Network (DNN) algorithm, and the scheme is suitable for the application environment shown in fig. 1. The application environment includes a user 10 and a specific sound recognition device 20, and the specific sound recognition device 20 is configured to receive a sound made by the user 10 and recognize it to determine whether it is a specific sound.
Further, after recognizing that the sound is a specific sound, the specific sound recognition device 20 may also record and process the specific sound to output condition information about the specific sounds made by the user 10. The condition information may include the number of occurrences of the specific sound, the duration of the specific sound, and the decibel level of the specific sound. For example, a counter may be included in the specific sound recognition device to count occurrences when a specific sound is detected; a timer may be included to measure the duration of a specific sound when it is detected; and a decibel detection component may be included to measure the decibel level of a specific sound when it is detected.
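As an illustration only, the following minimal Python sketch shows how such condition information (count, duration, decibel level) might be accumulated; the class and method names are hypothetical and are not part of this application:

```python
class SpecificSoundMonitor:
    """Hypothetical sketch: accumulate count, duration and loudness of detected specific sounds."""

    def __init__(self):
        self.count = 0              # counter: number of specific-sound events
        self.total_duration = 0.0   # timer: accumulated duration in seconds
        self.max_decibel = 0.0      # decibel detection: loudest event so far

    def on_specific_sound(self, start_time, end_time, decibel):
        """Called once per detected specific sound, e.g. per recognized cough."""
        self.count += 1
        self.total_duration += end_time - start_time
        self.max_decibel = max(self.max_decibel, decibel)
```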
The recognition principle of the specific voice is similar to the voice recognition principle, and the specific voice is input into the voice model for recognition after being processed, so that a recognition result is obtained. It can be divided into two stages, a specific voice model training stage and a specific voice recognition stage. The specific sound model training stage mainly comprises the steps of collecting a certain number of specific sound sample signals, calculating an MFCC characteristic parameter matrix of the specific sound sample signals, extracting characteristic parameters from the MFCC characteristic parameter matrix, and performing model training on the characteristic parameters based on a DNN algorithm to obtain a specific sound characteristic model. In the stage of specific sound identification, an MFCC characteristic parameter matrix of a sound signal needing to be judged is calculated, corresponding characteristic parameters are extracted from the MFCC characteristic parameter matrix of the sound signal, and then the characteristic parameters are input into a specific sound characteristic model for identification so as to determine whether the sound signal is a specific sound. The identification process mainly comprises the steps of preprocessing, feature extraction, model training, pattern matching, judgment and the like.
Wherein, in the preprocessing step, sampling a specific sound sample signal and calculating an MFCC characteristic parameter matrix of the specific sound sample signal are included. In the feature extraction step, feature parameters are extracted from the MFCC feature parameter matrix. In the model training step, the characteristic parameters extracted from the MFCC characteristic parameter matrix of the specific sound sample signal are used as input, and a specific sound characteristic model based on the deep neural network is trained. In the pattern matching and determining step, the specific sound feature model is used to identify whether the new sound signal is a specific sound. Wherein, whether the new sound signal is the specific sound is identified, comprising: firstly, an MFCC characteristic parameter matrix of a sound signal is calculated, then characteristic parameters of the sound signal are extracted from the MFCC characteristic parameter matrix, and then the characteristic parameters of the sound signal are input into a specific sound characteristic model for recognition so as to determine whether the sound signal is a specific sound.
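The two stages can be summarized in the following minimal Python sketch; the helper names compute_mfcc_matrix, extract_feature_parameters and train_dnn are hypothetical stand-ins for the steps detailed later in this description:

```python
def train_specific_sound_model(sample_signals, labels):
    """Training stage (sketch): preprocessing -> feature extraction -> model training."""
    features, feature_labels = [], []
    for signal, label in zip(sample_signals, labels):
        mfcc = compute_mfcc_matrix(signal)              # MFCC characteristic parameter matrix
        sub_vectors = extract_feature_parameters(mfcc)  # a group of sub-feature vectors
        features.extend(sub_vectors)
        feature_labels.extend([label] * len(sub_vectors))  # same label for every sub-vector
    return train_dnn(features, feature_labels)

def recognize_specific_sound(signal, model):
    """Recognition stage (sketch): pattern matching followed by a majority-vote decision."""
    sub_vectors = extract_feature_parameters(compute_mfcc_matrix(signal))
    votes = [model.predict(v) for v in sub_vectors]     # 1 = specific sound, 0 = not
    return sum(votes) > len(votes) / 2
```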
Recognizing specific sounds by combining MFCC and DNN in this way reduces the algorithm complexity and the amount of calculation, and significantly improves the accuracy of specific sound recognition.
The embodiment of the present application provides a specific sound recognition method, which may be used in the above-mentioned specific sound recognition device 20. The method requires a DNN-based specific sound feature model to be obtained in advance; this model may be preconfigured, or may be obtained by training through the following steps 101 to 103. Once the DNN-based specific sound feature model has been obtained by training, specific sounds can be recognized with it. Further, if the model's accuracy in recognizing the specific sound becomes inadequate because the scene has changed or for other reasons, the DNN-based specific sound feature model may be reconfigured or retrained.
As shown in fig. 2, the obtaining a DNN-based specific acoustic feature model in advance includes:
step 101: collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
sampling to obtain a specific sound sample signal s (n), and acquiring an MFCC characteristic parameter matrix of the specific sound sample signal according to the specific sound sample signal. The mel frequency cepstrum coefficient is mainly used for sound data feature extraction and operation dimensionality reduction. For example: for data with 512 dimensions (sampling points) in one frame, the most important 40-dimensional data can be extracted after MFCC processing, and the purpose of reducing dimensions is achieved. The mel-frequency cepstrum coefficient calculation generally includes: pre-emphasis, framing, windowing, fast fourier transform, mel filter bank, and discrete cosine transform.
Obtaining the MFCC characteristic parameter matrix of the specific sound sample signal, specifically comprising the following steps:
① Pre-emphasis
Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter and remains at a comparable level over the whole band from low to high frequencies, allowing the spectrum to be computed with the same signal-to-noise ratio. It also compensates for the vocal-cord and lip effects introduced during sound production, restoring the high-frequency components of the specific sound sample signal that are suppressed by the articulation system and highlighting the high-frequency formants. Pre-emphasis is implemented by passing the sampled specific sound sample signal s(n) through a first-order Finite Impulse Response (FIR) high-pass digital filter, whose transfer function is:
H(z) = 1 - a·z^(-1)    (1)
where z denotes the z-domain variable of the input signal (whose time-domain representation is the specific sound sample signal s(n)), and a denotes the pre-emphasis coefficient, usually a constant between 0.9 and 1.0.
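A minimal numpy sketch of this first-order FIR pre-emphasis; the coefficient value 0.97 is only one typical choice within the 0.9-1.0 range stated above:

```python
import numpy as np

def pre_emphasis(s, a=0.97):
    """Apply H(z) = 1 - a*z^-1, i.e. y[n] = s[n] - a*s[n-1]."""
    return np.append(s[0], s[1:] - a * s[:-1])
```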
② framing
Every P sampling points of the specific sound sample signal s(n) are grouped into one observation unit, called a frame. P may be 256 or 512, covering a duration of roughly 20-30 ms. To avoid excessive variation between two adjacent frames, an overlap region of G sampling points may be kept between them, where G is typically about 1/2 or 1/3 of P. The sampling frequency of the specific sound sample signal may be 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 sampling points corresponds to a duration of 256/8000 × 1000 = 32 ms.
③ windowing
Each frame is multiplied by a Hamming window to increase the continuity between its left and right ends. Assuming the framed signal is S(n), n = 0, 1, …, P-1, where P is the frame size, then after multiplication by the Hamming window S′(n) = S(n) × W(n). A commonly used form of the Hamming window is W(n) = 0.54 - 0.46·cos(2πn/(l - 1)), 0 ≤ n ≤ l - 1, where l denotes the window length.
④ Fast Fourier Transform (FFT)
Since the characteristics of a signal are usually hard to observe in the time domain, the signal is commonly transformed into an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different sounds. After being multiplied by the Hamming window, each framed and windowed signal is therefore subjected to a fast Fourier transform to obtain the spectrum of each frame, and taking the modulus squared of the spectrum of the specific sound sample signal gives its power spectrum.
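A minimal numpy sketch of framing (step ②), Hamming windowing (step ③) and the FFT power spectrum (step ④); the frame length of 256 samples and the half-frame overlap are example values consistent with the ranges given above:

```python
import numpy as np

def frame_window_power_spectrum(s, frame_len=256, hop=128):
    """Split s into overlapping frames, window each frame, and return per-frame power spectra."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    power_spectra = []
    for i in range(n_frames):
        frame = s[i * hop:i * hop + frame_len] * window  # step 3: windowing
        spectrum = np.fft.rfft(frame)                    # step 4: fast Fourier transform
        power_spectra.append(np.abs(spectrum) ** 2)      # modulus squared gives the power spectrum
    return np.array(power_spectra)
```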
⑤ Triangular band-pass filtering
The energy spectrum is filtered by a bank of mel-scale triangular filters. A filter bank with M triangular filters is defined (the number of filters is close to the number of critical bands), with center frequencies f(m), m = 1, 2, …, M, where M may be 22-26. The interval between adjacent f(m) values narrows as m decreases and widens as m increases; see fig. 3.
The frequency response H_m(k) of the m-th triangular filter is defined piecewise over the interval [f(m-1), f(m+1)] and is zero elsewhere, as sketched below.
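The commonly used piecewise form of this triangular response, written with the center frequencies f(m) defined above (the application's exact expression is not reproduced, so this is the standard textbook form):

```latex
H_m(k) =
\begin{cases}
0, & k < f(m-1) \\[4pt]
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[8pt]
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\[8pt]
0, & k > f(m+1)
\end{cases}
```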
⑥ discrete cosine transform
The logarithmic energy s(m) output by each filter bank is then computed (formula (4)), and the MFCC is obtained by applying a Discrete Cosine Transform (DCT) to the logarithmic energies s(m) (formula (5)); standard forms of both are sketched below.
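Standard forms of formulas (4) and (5), consistent with the M triangular filters H_m(k) and the per-frame power spectrum |X(k)|² defined above (the application's exact expressions are not reproduced, so these are the usual textbook forms):

```latex
s(m) = \ln\!\left( \sum_{k=0}^{P-1} \lvert X(k) \rvert^{2} \, H_m(k) \right), \qquad 1 \le m \le M \tag{4}

C(n) = \sum_{m=1}^{M} s(m)\,\cos\!\left( \frac{\pi n \,(m - 0.5)}{M} \right), \qquad n = 1, 2, \dots, L \tag{5}
```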
step 102: extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
from equation (5), MFCC is a coefficient matrix of N × L, where N is the number of frames of the audio signal and L is the MFCC length. Since the MFCC characteristic parameter matrix has a high dimension, and the number N of matrix rows is different due to the inconsistency of the lengths of the sound signals, the MFCC characteristic parameter matrix cannot be used as a direct input to obtain a DNN-based specific sound characteristic model, and therefore, it is necessary to further extract characteristic parameters from the MFCC characteristic parameter matrix. The purpose of extracting the characteristic parameters is to extract the characteristics of a specific sound sample signal to mark the specific sound sample signal, and train a specific sound characteristic model based on DNN by taking the characteristic parameters as input. Feature parameters may be extracted from the MFCC feature parameter matrix in combination with time domain or frequency domain characteristics of a particular sound signal.
Taking the cough sound signal as an example of a specific sound signal, refer to fig. 4, which is a time-amplitude (time-domain) diagram of a cough sound signal. As can be seen from fig. 4, a cough sound occurs over a very short period and is markedly paroxysmal: the duration of a single cough is usually less than 550 ms, and even for patients with severe throat or bronchial disease it remains around 1000 ms. In terms of energy, the energy of the cough sound signal is concentrated primarily in the first half of the signal. Therefore, after MFCC processing, the main characteristic information of a cough sound sample signal is essentially concentrated in its first half. The characteristic parameters fed into the deep neural network should cover as much of the main information of the cough sound sample signal as possible, so that what is extracted from the MFCC characteristic parameter matrix is useful information rather than redundant information.
The feature parameters of the cough sound sample signals of the front fixed frame number can be selected in the MFCC feature parameter matrix of the cough sound sample signals as the input of the deep neural network, and the cough sound sample signals of the fixed frame number should contain the front half parts of the respective cough sound sample signals as much as possible in view of the fact that the main characteristic information of the cough sound sample signals is basically concentrated in the front half parts of the cough sound sample signals. In order to make full use of data, the remaining feature data in the MFCC feature parameter matrix can also be used as the input of the deep neural network, the MFCC feature parameter matrix can be divided according to the fixed frame number, and then the divided data can be used as the input of the deep neural network together.
Specifically, as shown in fig. 8, the extracting the feature parameters from the mel-frequency cepstrum coefficient feature parameter matrix of the specific sound sample signal includes:
step 1021: sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal end to form a vector;
step 1022: and dividing the vector from the head of the vector to the tail of the vector according to a preset step length (the unit is a frame) to obtain characteristic parameters of a group of sub-vectors with preset lengths (namely a fixed frame number), wherein each sub-vector has the same label.
The frames of the MFCC feature parameter matrix are connected in series to form a vector X. Taking the preset length e as the basic unit, a window is moved from the head to the tail of X with a preset step d, forming a group of data X_i, i = 1, 2, …, that share the same label. The specific processing procedure is shown in fig. 5.
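A minimal numpy sketch of this segmentation; mfcc is the N×L coefficient matrix, e is the preset length and d the preset step, both expressed in frames so that the resulting sub-vector lengths are integral multiples of the per-frame MFCC length L:

```python
import numpy as np

def split_into_sub_vectors(mfcc, e, d):
    """Join the N frames end to end into a vector X, then slide a window of e frames with step d frames."""
    n_frames, frame_len = mfcc.shape
    x = mfcc.reshape(-1)                        # vector X: MFCC frames concatenated head to tail
    win, step = e * frame_len, d * frame_len    # window and step measured in coefficients
    sub_vectors = [x[i:i + win] for i in range(0, len(x) - win + 1, step)]
    return np.array(sub_vectors)                # every row X_i later receives the same label
```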
In practical application, if the specific sound is a cough sound, the frame number of the first half section of the general cough sound signal can be calculated statistically, and then the preset length is taken as the frame number, and the preset step length can be taken in combination with practical application. If the specific sound is other sound, such as snore or sneeze, the value can be set to the preset length and the preset step length according to the time domain and frequency domain characteristics.
The MFCC characteristic parameter matrix of a specific sound sample signal is divided into a plurality of sub-characteristic vectors with fixed lengths, so that the sub-characteristic vectors adapt to the requirement of consistency of input data of the deep neural network and can be directly used as the input of the deep neural network. Moreover, each sub-feature vector in the plurality of sub-feature vectors is set to be the same label, that is, a group of sub-feature vectors are used for expressing the same specific sound sample signal, so that the number of data samples is increased, and the loss of information during feature parameter extraction is avoided. And establishing a specific sound characteristic model based on the deep neural network by using the sub-characteristic vectors and the labels corresponding to the sub-characteristic vectors, and identifying specific sound by using the specific sound characteristic model, so that the error identification rate is reduced, and the accuracy rate of specific sound identification is improved. When the specific sound identification method provided by the embodiment of the application is used for identifying the cough sound, the identification rate of the cough sound can reach more than 95% on the basis of not increasing the calculation amount.
Step 103: and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
A DNN is an extension of the shallow neural network: it exploits the representational power of a multilayer network and has very good feature-extraction, learning and generalization capabilities for nonlinear, high-dimensional data. A DNN model generally comprises an input layer, hidden layers and an output layer. Referring to fig. 6, the first layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer (fig. 6 shows only three hidden layers; in practice more hidden layers may be included). The layers are fully connected, that is, any neuron in the Q-th layer is connected to every neuron in the (Q+1)-th layer.
Each connection between neurons has a linear weight, and every neuron in each layer except the input layer has a bias. The linear weight from the k-th neuron of layer l-1 to the j-th neuron of layer l is denoted w^l_jk, where the superscript l is the layer in which the weight lies and the subscripts are the output index j in layer l and the input index k in layer l-1; for example, the linear weight from the 4th neuron of the second layer to the 2nd neuron of the third layer is w^3_24. The bias of the i-th neuron of layer l is denoted b^l_i, where the superscript l is the layer number and the subscript i is the index of the neuron; for example, the bias of the third neuron of the second layer is b^2_3.
A set of w^l_jk and b^l_i may be randomly initialized, and the characteristic parameters of the specific sound sample signal are then used as the input-layer data of a forward propagation algorithm: the first hidden layer is computed from the input layer, the second hidden layer from the first, and so on until the output layer is reached. A back propagation algorithm is then used to fine-tune w^l_jk and b^l_i, finally yielding the specific sound characteristic model based on the deep neural network.
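A minimal numpy sketch of the random initialization and forward propagation described above; the layer sizes are arbitrary example values, not ones specified by the application:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Propagate one feature vector through the fully connected layers: a_l = sigma(W_l a_{l-1} + b_l)."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

# random initialization, e.g. input layer 440 -> hidden 256 -> hidden 128 -> output 2
sizes = [440, 256, 128, 2]
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]
```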
Alternatively, the initial parameters w^l_jk and b^l_i may be obtained with a Deep Belief Network (DBN) algorithm and then fine-tuned with gradient descent and back propagation to obtain their final values. Referring to fig. 9, training the deep neural network model with the characteristic parameters of the specific sound sample signal as input to obtain the specific sound characteristic model based on the deep neural network includes:
step 1031: taking the characteristic parameters of the specific sound sample signal as input, and carrying out model training based on a deep confidence network algorithm to obtain each initial parameter of the specific sound characteristic model based on the deep neural network;
the DBN is a deep learning model and is used for non-monitoringThe supervised mode preprocesses the model layer by layer, and the unsupervised preprocessing mode is a Restricted Boltzmann Machine (RBM). As shown in fig. 7(b), the DBN is stacked from a series of RBMs. As shown in FIG. 7(a), RBM is a two-layer structure, v denotes a visible layer, h denotes a hidden layer, and the connection between the visible layer and the hidden layer is non-directional (values can be taken from the visible layer->Hidden layer or hidden layer->Visible layer arbitrary transport) and fully connected. The visible layer v and the hidden layer h are connected through linear weight, and the linear weight of the ith neuron of the visible layer and the jth neuron of the hidden layer is defined as wijThe bias corresponding to the ith neuron of the visible layer is biThe bias corresponding to the jth neuron of the hidden layer is ajThe indices i and j represent the index of the neuron.
The RBM performs one-step Gibbs sampling via the contrastive divergence algorithm, optimizing the weights w_ij and the biases b_i and a_j, and thereby yields another state representation h of the input sample data v (i.e. the characteristic parameters of the specific sound sample signal). The output h1 of one RBM can serve as the input of the next RBM, whose hidden state h2 is obtained by continuing the optimization in the same way, and so on. In this layer-by-layer pre-training manner the multi-layer DBN model initializes the weights w_ij and the biases b_i and a_j, each layer being a representation of the first-layer data v, and the initial parameters are obtained after this unsupervised pre-training.
Specifically, the RBM is an energy-based model; the total energy of the RBM is expressed by formula (6), a standard form of which is sketched after this passage.
Wherein E represents the total energy of the RBM model, v represents the visible layer data, h represents the hidden layer data, theta represents the model parameters, m represents the visible layer neuron number, n represents the hidden layer neuron number, b represents the visible layer bias, and a represents the hidden layer bias.
The RBM model samples according to the conditional probability of the visible layer data and the hidden layer data, and for the Bernoulli-Bernoulli RBM model, the conditional probability formulas are respectively formula (7) and formula (8),
where σ denotes the sigmoid activation function, σ(x) = (1 + e^(-x))^(-1).
Gibbs sampling is performed on the RBM with the contrastive divergence algorithm according to the above formulas to obtain samples of the joint distribution of v and h, and the parameters are then optimized by maximizing the log-likelihood function (9) of the observed samples,
Δw_ij ≈ ⟨v_i h_j⟩_0 - ⟨v_i h_j⟩_1    (10)
The parameters are optimized with a one-step contrastive divergence algorithm, in which the sampled states are generated directly in a mean-field approximation, and formula (10) is iterated multiple times to finally obtain the initial parameters, namely the weights between neurons and the biases of the neurons. Here N denotes the number of neurons in the visible layer of the RBM model, i.e. the dimension of the RBM input data.
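For reference, standard Bernoulli-Bernoulli RBM expressions corresponding to formulas (6)-(9), written with the symbols defined above (v, h, θ, w_ij, b_i, a_j, σ); the application's exact expressions are not reproduced, and formula (10) is already given above:

```latex
E(v, h \mid \theta) = -\sum_{i=1}^{m} b_i v_i \;-\; \sum_{j=1}^{n} a_j h_j \;-\; \sum_{i=1}^{m}\sum_{j=1}^{n} v_i \, w_{ij} \, h_j \tag{6}

P(h_j = 1 \mid v) = \sigma\!\Big( a_j + \sum_{i} v_i w_{ij} \Big) \tag{7}

P(v_i = 1 \mid h) = \sigma\!\Big( b_i + \sum_{j} w_{ij} h_j \Big) \tag{8}

\mathcal{L}(\theta) = \sum_{v \,\in\, \text{training samples}} \log P(v \mid \theta) \tag{9}
```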
Step 1032: and fine-tuning each initial parameter based on a gradient descent and back propagation algorithm of the deep neural network to obtain each parameter of the specific sound characteristic model based on the deep neural network.
After the optimization process of the DBN is completed, the weights w between the neurons of all layers (input layer, hidden layers and output layer) of the DNN-based specific sound feature model and the neuron biases b are obtained; the final multi-class logistic regression (softmax) layer is randomly initialized, and the specific sound feature model is then fine-tuned through the DNN with a supervised gradient descent algorithm.
Specifically, the entire DNN-based specific sound feature model is fine-tuned in a supervised manner by minimizing the cost function (formula (11)) and optimizing the parameters accordingly (formula (12)).
where J denotes the cost function, h_{W,b}(x) denotes the output of the DNN, and y denotes the label corresponding to the input data.
where α denotes the learning rate, typically taking a value between 0.01 and 0.5.
The partial derivatives of the nodes of the deep neural network calculated in the above formula (12) may adopt the back propagation algorithm of the formula (13).
where δ denotes the sensitivity and a denotes the output value of each neuron node; δ takes one form when l is the output layer and another when l is any other layer, with σ denoting the activation function (standard forms are sketched below). Formula (13) is then updated through multiple iterations, optimizing the whole DNN model layer by layer until all parameters are obtained, which yields the trained DNN-based specific sound characteristic model.
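For reference, standard forms of the quantities referred to as formulas (11)-(13), namely a squared-error cost, the gradient-descent parameter update with learning rate α, and the back-propagation recursion for the sensitivities δ; the application's exact expressions are not reproduced, so these are the usual textbook forms:

```latex
J(W, b) = \tfrac{1}{2}\,\big\lVert h_{W,b}(x) - y \big\rVert^{2} \tag{11}

W^{l} := W^{l} - \alpha \frac{\partial J}{\partial W^{l}}, \qquad b^{l} := b^{l} - \alpha \frac{\partial J}{\partial b^{l}} \tag{12}

\delta^{L} = -\,(y - a^{L}) \odot \sigma'(z^{L}), \qquad \delta^{l} = \big( (W^{l+1})^{\top} \delta^{l+1} \big) \odot \sigma'(z^{l}) \tag{13}
```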
By combining DBN-based unsupervised learning with supervised learning in this way, the DNN model obtained through supervised learning after unsupervised pre-training performs markedly better than a randomly initialized deep neural network. The DNN-based specific sound characteristic model is built by taking the MFCC characteristic parameters of the specific sound sample signals as the input of the DNN model, and using this model to recognize specific sounds effectively improves the recognition rate.
Fig. 10 is a schematic flowchart of a specific voice recognition method provided in an embodiment of the present application, and as shown in fig. 10, the specific voice recognition method includes:
step 201: sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
in practical applications, a sound input unit (e.g., a microphone) may be disposed on the specific sound recognition device 20 to collect a sound signal, amplify, filter, and convert the sound signal into a digital signal. The digital signal may be sampled and subjected to other calculation processing in an arithmetic processing unit local to the specific voice recognition device 20, or may be uploaded to a cloud server, an intelligent terminal, or another server via a network for processing.
Please refer to step 101 for technical details of obtaining the mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal, which is not described herein again.
Step 202: extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
for a specific calculation method for extracting the feature parameters from the mel-frequency cepstrum coefficient feature parameter matrix of the sound signal, please refer to step 102, which is not described herein again.
Step 203: inputting the characteristic parameters into a specific sound characteristic model which is obtained in advance and is based on the deep neural network for recognition so as to determine whether the sound signal is a specific sound.
Specifically, inputting the feature parameters into a pre-acquired specific sound feature model based on a deep neural network for recognition to determine whether the sound signal is a specific sound, including:
inputting a group of sub-feature vectors contained in the feature parameters into a pre-acquired specific sound feature model based on a deep neural network to obtain a prediction result corresponding to the group of sub-feature vectors;
and if the positive prediction result is more than the negative prediction result in the prediction results, confirming that the sound signal is the specific sound, otherwise, confirming that the sound signal is not the specific sound.
When the characteristic parameters of the sound signal are input into a trained specific sound characteristic model based on DNN, the prediction result of whether the sound signal is a specific sound is obtained. Since the feature parameter of the same sound signal includes a plurality of sub-feature vectors, each sub-feature vector obtains a prediction result, so that each sound signal obtains a plurality of prediction results, and the prediction results represent the possibility of whether the sound signal is a specific sound. The specific sound characteristic model based on DNN votes for all the prediction results of the same sound signal, namely, in the prediction results of all the sub-characteristic vectors, if the positive prediction result is more than the negative prediction result, the sound signal is confirmed to be a specific sound; if the positive prediction result is less than the negative prediction result, it is confirmed that the sound signal is not the specific sound.
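A minimal sketch of this majority vote over the sub-feature-vector predictions; model.predict is a hypothetical stand-in for the trained DNN-based specific sound feature model, returning 1 for a positive prediction and 0 for a negative one:

```python
def is_specific_sound(sub_vectors, model):
    """Vote over all sub-feature vectors extracted from one sound signal."""
    predictions = [model.predict(v) for v in sub_vectors]
    positives = sum(1 for p in predictions if p == 1)
    negatives = len(predictions) - positives
    return positives > negatives   # specific sound only if positive votes outnumber negative ones
```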
The specific sound identification method provided by the embodiment of the application can identify the specific sound, so that the specific sound condition emitted by the user can be monitored by monitoring the sound emitted by the user, and the user does not need to wear any detection component. And because the identification algorithm based on the MFCC characteristic parameters and the DNN model is adopted, the algorithm complexity is low, the calculated amount is small, the requirement on hardware is low, and the product manufacturing cost is reduced.
It should be noted that, the specific sound identification method based on the MFCC characteristic parameters and the DNN model provided in the embodiment of the present application is also applicable to identifying other specific sounds such as snoring, sneezing, breathing, laughing, firecracker and crying, in addition to identifying the cough sound.
Accordingly, as shown in fig. 11, the embodiment of the present application further provides a specific voice recognition apparatus, which is used for a specific voice recognition device 20, and the apparatus includes:
a sampling and characteristic parameter obtaining module 301, configured to sample a sound signal and obtain a mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
a feature parameter extraction module 302, configured to extract feature parameters from a mel-frequency cepstrum coefficient feature parameter matrix of the sound signal;
the identifying module 303 is configured to input the feature parameters into a pre-acquired specific sound feature model based on the deep neural network for identification, so as to determine whether the sound signal is a specific sound.
The specific sound recognition apparatus provided by the embodiment of the application can recognize specific sounds, so the specific sounds made by a user can be monitored simply by monitoring the sounds the user makes, without the user wearing any detection component. And because a recognition algorithm based on MFCC characteristic parameters and a DNN model is adopted, the algorithm complexity is low, the amount of calculation is small, the hardware requirements are low, and the product manufacturing cost is reduced.
Optionally, in another embodiment of the apparatus, as shown in fig. 12, the apparatus further includes:
a feature model presetting module 304, configured to obtain the specific sound feature model based on the deep neural network in advance.
Optionally, in some embodiments of the apparatus, the feature model presetting module 304 is specifically configured to:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
Optionally, in some embodiments of the apparatus, the feature model presetting module 304 is further specifically configured to:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal end to form a characteristic vector;
dividing the feature vector from the head of the feature vector to the tail of the feature vector according to a preset step length to obtain feature parameters of a group of sub-feature vectors with preset lengths, wherein each sub-feature vector has the same label, the preset step length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient, and the preset length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient;
the feature parameter extraction module 302 is further specifically configured to:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal end to form a characteristic vector;
and segmenting the feature vector from the head of the feature vector to the tail of the feature vector according to the preset step length to obtain feature parameters of a group of sub-feature vectors with the preset lengths.
Optionally, in some embodiments of the apparatus, the feature model presetting module 304 is further specifically configured to:
taking the characteristic parameters of the specific sound sample signal as input, and carrying out model training based on a deep confidence network algorithm to obtain each initial parameter of the specific sound characteristic model based on the deep neural network;
and fine-tuning each initial parameter based on a gradient descent and back propagation algorithm of the deep neural network to obtain each parameter of the specific sound characteristic model based on the deep neural network.
Optionally, in some embodiments of the apparatus, the identifying module 303 is specifically configured to:
inputting a group of sub-feature vectors contained in the feature parameters into a pre-acquired specific sound feature model based on a deep neural network to obtain a prediction result corresponding to the group of sub-feature vectors;
and if the positive prediction result is more than the negative prediction result in the prediction results, confirming that the sound signal is the specific sound, otherwise, confirming that the sound signal is not the specific sound.
Optionally, in some embodiments of the device, the specific sound comprises any one of a cough, a snore and a sneeze.
It should be noted that the above-mentioned apparatus can execute the method provided by the embodiments of the present application, and has corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The embodiment of the present application also provides a specific sound recognition device. As shown in fig. 13, the specific sound recognition device 20 includes a sound input unit 21, a signal processing unit 22, and an arithmetic processing unit 23. The sound input unit 21 is used to receive a sound signal and may be, for example, a microphone. The signal processing unit 22 performs signal processing on the sound signal; it may perform analog processing such as amplification and filtering as well as analog-to-digital conversion, and send the resulting digital signal to the arithmetic processing unit 23.
The signal processing unit 22 is connected to the arithmetic processing unit 23, which may be built into the specific sound recognition device 20 (as illustrated in fig. 13) or arranged outside it; the arithmetic processing unit 23 may also be a remotely located server, for example a cloud server, an intelligent terminal or another server communicatively connected to the specific sound recognition device 20 through a network.
The arithmetic processing unit 23 includes:
at least one processor 232 (illustrated as a processor in fig. 13) and a memory 231, the processor 232 and the memory 231 may be connected by a bus or other means, and fig. 13 illustrates an example of a connection by a bus.
The memory 231 is used for storing nonvolatile software programs, nonvolatile computer executable programs, and software modules, such as program instructions/modules corresponding to a specific voice recognition method in the embodiment of the present application (for example, the sampling and feature parameter acquiring module 301 shown in fig. 11). The processor 232 executes various functional applications and data processing, i.e., implementing a specific voice recognition method of the above-described method embodiments, by executing nonvolatile software programs, instructions, and modules stored in the memory 231.
The memory 231 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created according to a particular voice recognition device usage, and the like. Further, the memory 231 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 231 optionally includes memory located remotely from processor 232, which may be connected to a particular voice recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 231 and, when executed by the one or more processors 232, perform the specific voice recognition method in any of the above-described method embodiments, for example, performing the above-described method steps 101-103 in fig. 2, the method steps 1021-1022 in fig. 8, the method steps 1031-1032 in fig. 9, and the step 201-203 in fig. 10; the functions of the modules 301 and 304 in fig. 11 and 12 are realized.
The specific sound recognition device provided by the embodiment of the application can recognize specific sounds, so the specific sounds made by a user can be monitored simply by monitoring the sounds the user makes, without the user wearing any detection component. And because a recognition algorithm based on MFCC characteristic parameters and a DNN model is adopted, the algorithm complexity is low, the amount of calculation is small, the hardware requirements are low, and the product manufacturing cost is reduced.
The specific sound recognition device can execute the method provided by the embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The present embodiment provides a storage medium storing computer-executable instructions which, when executed by one or more processors (for example, one processor 232 in fig. 13), cause the one or more processors to perform the specific sound recognition method in any of the above-described method embodiments, for example performing the above-described method steps 101-103 in fig. 2, method steps 1021-1022 in fig. 8, method steps 1031-1032 in fig. 9 and steps 201-203 in fig. 10, and realizing the functions of modules 301 to 304 in fig. 11 and fig. 12.
The above-described embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform, or by hardware alone. Those skilled in the art will also understand that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the related hardware; the program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Within the concept of the present application, the technical features in the above embodiments or in different embodiments may be combined, the steps may be implemented in any order, and many other variations of the different aspects of the present application exist as described above, which are not provided in detail for the sake of brevity. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of the technical features may be equivalently replaced, and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A specific sound recognition method, the method comprising:
sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
inputting the characteristic parameters into a specific sound characteristic model which is obtained in advance and is based on the deep neural network for recognition so as to determine whether the sound signal is a specific sound.
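As an illustrative, non-claim sketch of the first two steps recited in claim 1 (sampling the sound signal and acquiring its Mel frequency cepstrum coefficient characteristic parameter matrix), the snippet below uses the librosa library; the library choice, sampling rate, frame length and number of coefficients are assumptions and are not part of the claims.

```python
import numpy as np
import librosa  # assumed third-party library; not prescribed by the claims

def mfcc_matrix(wav_path: str, sr: int = 16000, n_mfcc: int = 13,
                frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Return an (n_frames, n_mfcc) Mel frequency cepstrum coefficient
    characteristic parameter matrix, one row of coefficients per signal frame."""
    y, sr = librosa.load(wav_path, sr=sr)        # sample the sound signal
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=frame_len, hop_length=hop)
    return m.T                                   # frames along the first axis
```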
2. The specific sound recognition method according to claim 1, further comprising: acquiring the specific sound characteristic model based on the deep neural network in advance.
3. The specific sound recognition method according to claim 2, wherein the pre-obtaining the specific sound feature model based on the deep neural network comprises:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
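A rough, non-claim illustration of the training step in claim 3: a small feed-forward deep neural network is trained on characteristic parameters (sub-feature vectors) extracted from specific sound samples (label 1) and from other sounds (label 0). PyTorch, the layer sizes, the number of epochs and the plain gradient-descent optimizer are assumptions; no particular framework is prescribed by the claims.

```python
import torch
from torch import nn

def train_specific_sound_model(features: torch.Tensor, labels: torch.Tensor,
                               epochs: int = 50, lr: float = 0.01) -> nn.Module:
    """features: (N, D) float sub-feature vectors; labels: (N,) long, 1 = specific sound."""
    model = nn.Sequential(                       # illustrative DNN; layer sizes are assumptions
        nn.Linear(features.shape[1], 256), nn.ReLU(),
        nn.Linear(256, 64), nn.ReLU(),
        nn.Linear(64, 2),                        # two classes: specific / not specific
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # gradient descent
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()                          # back propagation
        optimizer.step()
    return model
```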
4. The specific sound recognition method according to claim 3, wherein the extracting the characteristic parameters from the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal comprises:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal end to form a characteristic vector;
dividing the feature vector from the head of the feature vector to the tail of the feature vector according to a preset step length to obtain feature parameters of a group of sub-feature vectors with preset lengths, wherein each sub-feature vector has the same label, the preset step length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient, and the preset length is an integral multiple of the length of each frame of Mel frequency cepstrum coefficient;
the extracting of the characteristic parameters from the mel-frequency cepstrum coefficient characteristic parameter matrix of the sound signal comprises:
sequentially connecting the Mel frequency cepstrum coefficients of each signal frame in the Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal end to form a characteristic vector;
and segmenting the feature vector from the head of the feature vector to the tail of the feature vector according to the preset step length to obtain feature parameters of a group of sub-feature vectors with the preset lengths.
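A minimal sketch of the feature-parameter extraction recited in claim 4: the per-frame Mel frequency cepstrum coefficients are connected end to end into one long feature vector, which is then divided from head to tail into fixed-length sub-feature vectors using a step length that is an integral multiple of the per-frame coefficient length. The concrete multiples below (a window of 10 frames, a step of 5 frames) are assumptions.

```python
import numpy as np

def sub_feature_vectors(mfcc: np.ndarray, frames_per_window: int = 10,
                        frames_per_step: int = 5) -> np.ndarray:
    """mfcc: (n_frames, n_mfcc) matrix; returns (n_windows, frames_per_window * n_mfcc)."""
    n_frames, n_mfcc = mfcc.shape
    flat = mfcc.reshape(-1)                       # connect the frames end to end
    win = frames_per_window * n_mfcc              # preset length, a multiple of the frame length
    step = frames_per_step * n_mfcc               # preset step, a multiple of the frame length
    if flat.size < win:                           # signal too short for a single sub-vector
        return np.empty((0, win))
    starts = range(0, flat.size - win + 1, step)  # divide from head to tail
    return np.stack([flat[s:s + win] for s in starts])
```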
5. The specific sound recognition method according to claim 4, wherein the taking the characteristic parameters of the specific sound sample signal as input and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network comprises:
taking the characteristic parameters of the specific sound sample signal as input, and carrying out model training based on a deep confidence network algorithm to obtain each initial parameter of the specific sound characteristic model based on the deep neural network;
and fine-tuning each initial parameter based on a gradient descent and back propagation algorithm of the deep neural network to obtain each parameter of the specific sound characteristic model based on the deep neural network.
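Claim 5 obtains the initial parameters with a deep confidence (deep belief) network, i.e. a stack of restricted Boltzmann machines, and then fine-tunes them with gradient descent and back propagation. The NumPy snippet below sketches only a single contrastive-divergence (CD-1) update for one binary RBM layer, to indicate where such initial parameters could come from; stacking several layers, handling real-valued MFCC inputs (which would normally use a Gaussian-Bernoulli RBM), and the back-propagation fine-tuning itself are omitted, and every hyper-parameter shown is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_cd1_step(v0, W, b_vis, b_hid, lr=0.01, rng=None):
    """One CD-1 update for a binary RBM layer.
    v0: (B, V) visible batch; W: (V, H); b_vis: (V,); b_hid: (H,)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Positive phase: hidden units driven by the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one step of Gibbs sampling.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Update parameters from the difference of correlations.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```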
6. The specific sound recognition method according to claim 4, wherein the inputting the characteristic parameters into a pre-acquired specific sound characteristic model based on a deep neural network for recognition to determine whether the sound signal is a specific sound comprises:
inputting a group of sub-feature vectors contained in the feature parameters into a pre-acquired specific sound feature model based on a deep neural network to obtain a prediction result corresponding to the group of sub-feature vectors;
and if, among the prediction results, the positive prediction results outnumber the negative prediction results, confirming that the sound signal is the specific sound; otherwise, confirming that the sound signal is not the specific sound.
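The decision rule of claim 6 is a simple majority vote over the per-sub-vector predictions. A sketch, assuming the model emits one positive (1) or negative (0) prediction per sub-feature vector:

```python
def is_specific_sound(predictions) -> bool:
    """predictions: iterable of 1 (positive) / 0 (negative), one per sub-feature vector."""
    preds = list(predictions)
    positive = sum(preds)
    return positive > len(preds) - positive   # more positive results than negative ones
```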
7. The specific sound recognition method according to any one of claims 1 to 6, wherein the specific sound includes any one of a cough sound, a snore sound, and a sneeze sound.
8. A specific sound recognition apparatus, characterized in that the apparatus comprises:
the system comprises a sampling and characteristic parameter acquisition module, a signal processing module and a signal processing module, wherein the sampling and characteristic parameter acquisition module is used for sampling a sound signal and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
the characteristic parameter extraction module is used for extracting characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the sound signal;
the characteristic matching module is used for confirming whether the characteristic parameters are matched with a specific sound characteristic model which is obtained in advance and is based on the deep neural network;
and the confirming module is used for confirming that the sound signal is a specific sound if the characteristic parameters are matched with a specific sound characteristic model which is obtained in advance and is based on the deep neural network.
9. The specific sound recognition apparatus according to claim 8, wherein the apparatus further comprises:
the characteristic model presetting module is used for acquiring the specific sound characteristic model based on the deep neural network in advance;
the feature model presetting module is specifically configured to:
collecting a preset number of specific sound sample signals and acquiring a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signals;
extracting the characteristic parameters from a Mel frequency cepstrum coefficient characteristic parameter matrix of the specific sound sample signal;
and taking the characteristic parameters of the specific sound sample signal as input, and training a deep neural network model to obtain the specific sound characteristic model based on the deep neural network.
10. A specific sound recognition device, characterized by comprising:
a sound input unit for receiving a sound signal;
a signal processing unit for performing analog signal processing on the sound signal;
the signal processing unit is connected with an arithmetic processing unit, the arithmetic processing unit being built into or externally connected to the specific sound recognition device, and the arithmetic processing unit comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
11. A storage medium storing executable instructions which, when executed by a specific sound recognition device, cause the specific sound recognition device to perform the method of any one of claims 1-7.
CN201780009004.8A 2017-10-24 2017-10-24 Specific voice recognition method, apparatus and storage medium Active CN109074822B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/107505 WO2019079972A1 (en) 2017-10-24 2017-10-24 Specific sound recognition method and apparatus, and storage medium

Publications (2)

Publication Number Publication Date
CN109074822A true CN109074822A (en) 2018-12-21
CN109074822B CN109074822B (en) 2023-04-21

Family

ID=64678057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780009004.8A Active CN109074822B (en) 2017-10-24 2017-10-24 Specific voice recognition method, apparatus and storage medium

Country Status (2)

Country Link
CN (1) CN109074822B (en)
WO (1) WO2019079972A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767784A (en) * 2019-01-31 2019-05-17 龙马智芯(珠海横琴)科技有限公司 Method and device, storage medium and the processor of sound of snoring identification
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110338797A (en) * 2019-08-12 2019-10-18 苏州小蓝医疗科技有限公司 A kind of intermediate frequency snore stopper data processing method based on the sound of snoring and blood oxygen
CN110558944A (en) * 2019-09-09 2019-12-13 成都智能迭迦科技合伙企业(有限合伙) Heart sound processing method and device, electronic equipment and computer readable storage medium
CN110933235A (en) * 2019-11-06 2020-03-27 杭州哲信信息技术有限公司 Noise removing method in intelligent calling system based on machine learning
CN111009261A (en) * 2019-12-10 2020-04-14 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111243619A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Training method and device for voice signal segmentation model and computer equipment
WO2020140609A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Voice recognition method and device and computer readable storage medium
CN111488485A (en) * 2020-04-16 2020-08-04 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment
CN112418173A (en) * 2020-12-08 2021-02-26 北京声智科技有限公司 Abnormal sound identification method and device and electronic equipment
WO2021051608A1 (en) * 2019-09-20 2021-03-25 平安科技(深圳)有限公司 Voiceprint recognition method and device employing deep learning, and apparatus
CN113241093A (en) * 2021-04-02 2021-08-10 深圳达实智能股份有限公司 Method and device for recognizing voice in emergency state of subway station and electronic equipment
CN115064244A (en) * 2022-08-16 2022-09-16 深圳市奋达智能技术有限公司 Method and system for reminding medicine taking for needleless injection based on voice recognition

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI728632B (en) * 2019-12-31 2021-05-21 財團法人工業技術研究院 Positioning method for specific sound source
CN112185347A (en) * 2020-09-27 2021-01-05 北京达佳互联信息技术有限公司 Language identification method, language identification device, server and storage medium
CN112668556B (en) * 2021-01-21 2024-06-07 广东白云学院 Breathing sound identification method and system
CN113111786B (en) * 2021-04-15 2024-02-09 西安电子科技大学 Underwater target identification method based on small sample training diagram convolutional network
CN113571092B (en) * 2021-07-14 2024-05-17 东软集团股份有限公司 Engine abnormal sound identification method and related equipment thereof
CN113782048A (en) * 2021-09-24 2021-12-10 科大讯飞股份有限公司 Multi-modal voice separation method, training method and related device
CN114398925A (en) * 2021-12-31 2022-04-26 厦门大学 Multi-feature-based ship radiation noise sample length selection method and system
EP4226883A1 (en) 2022-02-15 2023-08-16 Koninklijke Philips N.V. Apparatuses and methods for use with a treatment device
CN116264620B (en) * 2023-04-21 2023-07-25 深圳市声菲特科技技术有限公司 Live broadcast recorded audio data acquisition and processing method and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016042152A (en) * 2014-08-18 2016-03-31 日本放送協会 Voice recognition device and program
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network
CN105702250A (en) * 2016-01-06 2016-06-22 福建天晴数码有限公司 Voice recognition method and device
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN106710599A (en) * 2016-12-02 2017-05-24 深圳撒哈拉数据科技有限公司 Particular sound source detection method and particular sound source detection system based on deep neural network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976564A (en) * 2010-10-15 2011-02-16 中国林业科学研究院森林生态环境与保护研究所 Method for identifying insect voice
CN103325382A (en) * 2013-06-07 2013-09-25 大连民族学院 Method for automatically identifying Chinese national minority traditional instrument audio data
CN104706321B (en) * 2015-02-06 2017-10-03 四川长虹电器股份有限公司 A kind of heart sound kind identification method based on improved MFCC
US9687208B2 (en) * 2015-06-03 2017-06-27 iMEDI PLUS Inc. Method and system for recognizing physiological sound
US10014003B2 (en) * 2015-10-12 2018-07-03 Gwangju Institute Of Science And Technology Sound detection method for recognizing hazard situation
CN106847293A (en) * 2017-01-19 2017-06-13 内蒙古农业大学 Facility cultivation sheep stress behavior acoustical signal monitoring method
CN107910020B (en) * 2017-10-24 2020-04-14 深圳和而泰智能控制股份有限公司 Snore detection method, device, equipment and storage medium

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140609A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Voice recognition method and device and computer readable storage medium
CN109767784B (en) * 2019-01-31 2020-02-07 龙马智芯(珠海横琴)科技有限公司 Snore identification method and device, storage medium and processor
CN109767784A (en) * 2019-01-31 2019-05-17 龙马智芯(珠海横琴)科技有限公司 Method and device, storage medium and the processor of sound of snoring identification
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110338797A (en) * 2019-08-12 2019-10-18 苏州小蓝医疗科技有限公司 A kind of intermediate frequency snore stopper data processing method based on the sound of snoring and blood oxygen
CN110558944A (en) * 2019-09-09 2019-12-13 成都智能迭迦科技合伙企业(有限合伙) Heart sound processing method and device, electronic equipment and computer readable storage medium
WO2021051608A1 (en) * 2019-09-20 2021-03-25 平安科技(深圳)有限公司 Voiceprint recognition method and device employing deep learning, and apparatus
CN110933235A (en) * 2019-11-06 2020-03-27 杭州哲信信息技术有限公司 Noise removing method in intelligent calling system based on machine learning
CN111009261A (en) * 2019-12-10 2020-04-14 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111009261B (en) * 2019-12-10 2022-11-15 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111243619A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Training method and device for voice signal segmentation model and computer equipment
CN111243619B (en) * 2020-01-06 2023-09-22 平安科技(深圳)有限公司 Training method and device for speech signal segmentation model and computer equipment
CN111488485A (en) * 2020-04-16 2020-08-04 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN111488485B (en) * 2020-04-16 2023-11-17 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN112382302A (en) * 2020-12-02 2021-02-19 漳州立达信光电子科技有限公司 Baby cry identification method and terminal equipment
CN112418173A (en) * 2020-12-08 2021-02-26 北京声智科技有限公司 Abnormal sound identification method and device and electronic equipment
CN113241093A (en) * 2021-04-02 2021-08-10 深圳达实智能股份有限公司 Method and device for recognizing voice in emergency state of subway station and electronic equipment
CN115064244A (en) * 2022-08-16 2022-09-16 深圳市奋达智能技术有限公司 Method and system for reminding medicine taking for needleless injection based on voice recognition

Also Published As

Publication number Publication date
CN109074822B (en) 2023-04-21
WO2019079972A1 (en) 2019-05-02

Similar Documents

Publication Publication Date Title
CN109074822B (en) Specific voice recognition method, apparatus and storage medium
CN108369813B (en) Specific voice recognition method, apparatus and storage medium
Sailor et al. Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification.
Lokesh et al. An automatic tamil speech recognition system by using bidirectional recurrent neural network with self-organizing map
Xie et al. Utterance-level aggregation for speaker recognition in the wild
Ghahremani et al. Acoustic Modelling from the Signal Domain Using CNNs.
CN107146601B (en) Rear-end i-vector enhancement method for speaker recognition system
Sainath et al. Learning filter banks within a deep neural network framework
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
Deshwal et al. A language identification system using hybrid features and back-propagation neural network
CN108701469B (en) Cough sound recognition method, device, and storage medium
CN109841226A (en) A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
Bhattacharjee A comparative study of LPCC and MFCC features for the recognition of Assamese phonemes
CN107680582A (en) Acoustic training model method, audio recognition method, device, equipment and medium
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
Ghezaiel et al. Hybrid network for end-to-end text-independent speaker identification
Al Bashit et al. A mel-filterbank and MFCC-based neural network approach to train the Houston toad call detection system design
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Hanchate et al. Vocal digit recognition using artificial neural network
CN112329819A (en) Underwater target identification method based on multi-network fusion
Vecchiotti et al. Convolutional neural networks with 3-d kernels for voice activity detection in a multiroom environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant