CN110444202B - Composite voice recognition method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN110444202B
Authority
CN
China
Prior art keywords: frequency, preset, composite, mel, time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910601019.4A
Other languages
Chinese (zh)
Other versions
CN110444202A (en)
Inventor
Wu Jiping (吴冀平)
Peng Junqing (彭俊清)
Wang Jianzong (王健宗)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910601019.4A priority Critical patent/CN110444202B/en
Publication of CN110444202A publication Critical patent/CN110444202A/en
Priority to PCT/CN2019/118458 priority patent/WO2021000498A1/en
Application granted granted Critical
Publication of CN110444202B publication Critical patent/CN110444202B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/18: the extracted parameters being spectral information of each sub-band
    • G10L 25/24: the extracted parameters being the cepstrum
    • G10L 25/27: characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention relates to the field of artificial intelligence and uses deep learning to recognize the sound types in a composite voice signal through a capsule network model. Specifically disclosed are a composite voice recognition method, device, computer device, and computer-readable storage medium, wherein the method comprises the following steps: detecting composite voice within a preset range in real time or at fixed times; when composite voice is detected, acquiring a sound signal of the composite voice; performing a short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice; extracting a plurality of spectrums from the time-frequency diagram based on a preset capsule network model and obtaining the mel frequency cepstrum coefficient of each spectrum; and calculating the vector modulus of each mel frequency cepstrum coefficient through the preset capsule network model and determining the type of the composite voice according to the vector moduli.

Description

Composite voice recognition method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a composite speech recognition method, apparatus, device, and computer-readable storage medium.
Background
The purpose of sound event detection is to automatically detect the onset and end times of specific events from sound and to assign a label to each event. With the aid of this technology, a computer can understand its surrounding environment through sound and respond to it. Sound event detection has wide application prospects in daily life, including audio surveillance, bioacoustic monitoring, smart homes, and the like. Detection is classified as single or composite according to whether multiple sound events are allowed to occur simultaneously. In single sound event detection, each individual sound event has a definite frequency and amplitude in the spectrum; in composite sound event detection, however, these frequencies or amplitudes may overlap. Existing sound detection technologies mainly detect and identify single sounds and cannot identify the types of overlapping composite sounds that occur simultaneously.
Disclosure of Invention
The invention mainly aims to provide a composite voice recognition method, device, equipment, and computer-readable storage medium, which aim to solve the problem that existing sound detection technology cannot recognize the types of overlapping composite sounds that occur simultaneously.
In a first aspect, the present application provides a composite speech recognition method, the composite speech recognition method including:
Detecting composite voice within a preset range in real time or at fixed time;
when the composite voice is detected, acquiring a sound signal of the composite voice;
performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice signal;
based on a preset capsule network model, extracting a plurality of frequency spectrums of the time-frequency diagram, and obtaining a Mel frequency cepstrum coefficient of each frequency spectrum;
and calculating the vector modulus of each mel frequency cepstrum coefficient through the preset capsule network model, and determining the type of the composite voice according to the vector moduli of the mel frequency cepstrum coefficients.
In a second aspect, the present application further provides a composite voice recognition apparatus, the composite voice recognition apparatus including:
the detection module is used for detecting composite voice within a preset range in real time or at fixed times;
the first acquisition module is used for acquiring a sound signal of the composite voice when the composite voice is detected;
the generation module is used for performing a short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice;
the second acquisition module is used for extracting a plurality of spectrums from the time-frequency diagram based on a preset capsule network model and acquiring the mel frequency cepstrum coefficient of each spectrum;
and the third acquisition module is used for calculating the vector modulus of each mel frequency cepstrum coefficient through the preset capsule network model and determining the type of the composite voice according to the vector moduli.
In a third aspect, the present application also provides a computer device comprising: the system comprises a memory, a processor and a composite voice recognition program stored in the memory and capable of running on the processor, wherein the composite voice recognition program realizes the steps of the composite voice recognition method according to the invention when being executed by the processor.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a composite speech recognition program which when executed by a processor implements the steps of the composite speech recognition method according to the invention as described above.
The embodiments of the invention provide a composite voice recognition method, device, equipment, and computer-readable storage medium. Composite voice within a preset range is detected in real time or at fixed times; when composite voice is detected, a sound signal of the composite voice is acquired; a short-time Fourier transform is performed on the sound signal to generate a time-frequency diagram of the composite voice; a plurality of spectrums are extracted from the time-frequency diagram based on a preset capsule network model, and the mel frequency cepstrum coefficient of each spectrum is obtained; and the vector modulus of each mel frequency cepstrum coefficient is calculated through the preset capsule network model, with the type of the composite voice determined according to the vector moduli. In this way, the sound types in the composite voice are identified through the capsule network model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description illustrate only some embodiments of the present application, and that a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a method for recognizing composite speech according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating sub-steps of the composite speech recognition method of FIG. 1;
FIG. 3 is a flow chart illustrating sub-steps of the composite speech recognition method of FIG. 1;
FIG. 4 is a flowchart of another method for composite speech recognition according to an embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating sub-steps of the composite speech recognition method of FIG. 4;
FIG. 6 is a flowchart illustrating another method for composite speech recognition according to an embodiment of the present disclosure;
FIG. 7 is a flow chart illustrating sub-steps of the composite speech recognition method of FIG. 6;
FIG. 8 is a schematic block diagram of a composite speech recognition device according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of a sub-module of the composite speech recognition device of FIG. 8;
FIG. 10 is a schematic block diagram of a sub-module of the composite speech recognition device of FIG. 8;
FIG. 11 is a schematic block diagram of another composite speech recognition device provided in an embodiment of the present application;
FIG. 12 is a schematic block diagram of a sub-module of the composite speech recognition device of FIG. 11;
FIG. 13 is a schematic block diagram of another composite speech recognition device provided in an embodiment of the present application;
FIG. 14 is a schematic block diagram of a sub-module of the composite speech recognition device of FIG. 13;
fig. 15 is a schematic block diagram of a computer device according to an embodiment of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
The embodiment of the application provides a method, a device, equipment and a computer readable storage medium for composite voice recognition. The composite voice recognition method can be applied to terminal equipment, and the terminal equipment can be a mobile phone, a tablet computer, a notebook computer and a desktop computer.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flow chart of a method for recognizing composite speech according to an embodiment of the present application.
As shown in fig. 1, the composite voice recognition method includes steps S10 to S50.
Step S10, detecting the composite voice in a preset range in real time or at fixed time.
The terminal detects composite voice within the preset range in real time or at fixed times. For example, the range that the terminal can monitor, such as an indoor room or an outdoor park, is taken as the preset range of the terminal. The terminal detects the composite voice of the preset room or park at every moment, or checks the preset room or park every hour, where the composite voice comprises at least two different mixed sounds. It should be noted that the preset range may be set based on the practical situation, which is not specifically limited in this application.
Step S20, when the composite voice is detected, a sound signal of the composite voice is acquired.
When the terminal detects composite voice, it collects the detected composite voice and obtains the sound signal of the composite voice by analyzing it, where the sound signal includes the frequency, amplitude, time, and so on of the sound. For example, when the terminal detects two or more mixed sounds, it analyzes the detected composite voice through a preset spectrum-analysis function or a preset oscillometric function to collect the sound frequency of the composite voice, and obtains the sound amplitude of the composite voice through a preset decibel tester. The spectrum-analysis function and the oscillometric function are preset in the terminal: the sound frequency of the composite voice is calculated through the preset spectrum-analysis function, and the sound amplitude of the composite voice is calculated through the preset oscillometric function.
In one embodiment, referring specifically to fig. 2, step S20 includes: substep S21 to substep S23.
In sub-step S21, when composite voice is detected, a preset sampling rate is invoked.
When the terminal detects composite voice, it invokes a preset sampling rate, also called the sample rate or sampling frequency, which defines the number of samples per second extracted from the continuous signal to constitute the discrete signal and is expressed in hertz (Hz); the preset sampling rate may be, for example, 40 kHz or 60 kHz. It should be noted that the preset sampling rate may be set based on the practical situation, which is not specifically limited in this application.
In a substep S22, a sampling time interval of the preset sampling rate is determined by a preset formula and the preset sampling rate.
The terminal calculates the sampling time interval of the preset sampling rate through a preset formula, namely sampling time interval = 1/sampling rate; the sampling time interval is thus obtained from the preset sampling rate. For example, if the sampling rate is 40 kHz, there are 40×1000 sampling points within 1 s, and each sampling period is T = 1/(40×1000) s (the sampling periods are generally identical).
And S23, collecting the composite voice based on the sampling time interval to acquire a discrete signal of the composite voice.
The terminal collects the composite voice at the sampling time interval to obtain the discrete signal of the composite voice, where the number of discrete samples depends on the number of sampling intervals. A discrete signal is a signal sampled from a continuous signal: in contrast to a continuous signal, whose independent variable is continuous, a discrete signal is a sequence whose independent variable is "discrete", and each value of the sequence can be regarded as a sample of the continuous signal. Processing the composite voice at the preset sampling rate means that the higher the sampling rate, the better the quality of the resulting discrete signal of the composite voice.
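As a minimal sketch of sub-steps S21 to S23, the snippet below computes the sampling interval from a preset rate and samples a stand-in continuous signal; the 40 kHz rate and the 440 Hz test tone are illustrative assumptions rather than values fixed by the description.

```python
import numpy as np

# Minimal sketch of sub-steps S21-S23. The 40 kHz rate and the 440 Hz
# test tone are illustrative assumptions, not values fixed by the patent.
preset_sampling_rate = 40_000                    # Hz (S21: preset sampling rate)
sampling_interval = 1.0 / preset_sampling_rate   # S22: interval = 1 / rate

# S23: sample the continuous signal at the computed interval to obtain the
# discrete signal of the composite voice.
t = np.arange(0.0, 1.0, sampling_interval)       # 40,000 sample instants in 1 s
discrete_signal = np.sin(2 * np.pi * 440.0 * t)  # stand-in for the detected sound
print(len(discrete_signal), sampling_interval)   # 40000 samples, 2.5e-05 s
```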
Step S30, performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice signal.
When the terminal acquires the sound signal of the composite voice, it performs a short-time Fourier transform (STFT, also written short-term Fourier transform) on the acquired signal. The STFT is a Fourier-related mathematical transform used to determine the frequency and phase of sine waves in local regions of a time-varying signal. Specifically, the STFT comprises a frame shift, a frame duration, and a Fourier transform: the acquired sound signal is preprocessed into frames according to the frame shift and frame duration, and each preprocessed frame is Fourier-transformed to obtain a two-dimensional graph. Fourier-transforming the sound signal yields the relation between frequency and amplitude in the composite voice, and each two-dimensional graph is a spectrum. Stacking the two-dimensional graphs along the time dimension gives the time-frequency diagram: each frame in the time-frequency diagram is a spectrum, and the change of the spectrum over time is the time-frequency content.
In one embodiment, referring specifically to fig. 3, step S30 includes: substep S31 to substep S33.
Step S31, if the discrete signal is obtained, the preset frame duration information and the frame shift information are read.
If the terminal acquires the discrete signal, it reads the preset frame duration information and frame shift information; the short-time Fourier transform comprises the frame duration, the frame shift, and the Fourier transform. For example, the frame duration may be preset to 40 ms or 50 ms, and the frame shift to 20 ms or 30 ms. It should be noted that the preset frame duration information and frame shift information may be set based on the practical situation, which is not specifically limited in this application.
Step S32, preprocessing the discrete signals through frame duration information and frame shift information to obtain a plurality of short-time analysis signals.
The terminal preprocesses the acquired discrete signal through the preset frame duration information and frame shift information to obtain the short-time analysis signals. For example, the acquired discrete signal is split into frames of 40 ms or 50 ms with a frame shift of 20 ms or 30 ms, thereby obtaining the short-time analysis signals of the discrete signal.
Step S33, fourier transform is carried out on the plurality of short-time analysis signals, and a time-frequency diagram of the composite voice is generated.
When the terminal has acquired a plurality of short-time analysis signals, it Fourier-transforms each short-time analysis signal to obtain the relation between frequency and time, generates a two-dimensional graph for each, and stacks the dimensions of the two-dimensional graphs to generate the time-frequency diagram of the composite voice. Generating the time-frequency diagram from the discrete signal through frame shift, frame duration, and Fourier transform makes it possible to read the spectrum of the composite voice and its variation over time directly from the diagram.
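A minimal sketch of sub-steps S31 to S33 under the example values above (40 ms frame duration, 20 ms frame shift, 40 kHz rate); the Hann window is a common choice and an assumption, since the description does not specify a window function.

```python
import numpy as np

# Minimal sketch of sub-steps S31-S33, assuming the 40 ms frame duration,
# 20 ms frame shift, and 40 kHz rate used as examples in the description.
rate = 40_000
frame_len = int(0.040 * rate)   # 1600 samples per frame (frame duration)
hop = int(0.020 * rate)         # 800-sample hop (frame shift)

def time_frequency_diagram(discrete_signal):
    """Frame the discrete signal, Fourier-transform each short-time frame,
    and stack the resulting spectra into a time-frequency diagram."""
    window = np.hanning(frame_len)
    frames = [discrete_signal[i:i + frame_len] * window
              for i in range(0, len(discrete_signal) - frame_len + 1, hop)]
    # Each row is one frame's magnitude spectrum; rows are stacked over time.
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

tf_diagram = time_frequency_diagram(np.random.randn(rate))  # 1 s of audio
print(tf_diagram.shape)  # (49, 801): 49 frames x 801 frequency bins
```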
And S40, extracting a plurality of frequency spectrums of the time-frequency diagram based on a preset capsule network model, and obtaining the Mel frequency cepstrum coefficient of each frequency spectrum.
When the terminal acquires the time-frequency diagram of the composite voice, it operates on a preset capsule network model. The capsule network is a novel neural network structure comprising a convolution layer, primary capsules, advanced capsules, and the like, where a capsule is a group of nested neural network layers; in a capsule network, more layers may be added within a single network layer. Specifically, the states of the neurons nested inside a capsule characterize the properties of one entity in the input: the capsule outputs a vector whose length indicates the existence of the entity and whose orientation indicates the entity's properties, and this vector is sent to all parent capsules in the network. A capsule computes a prediction vector by multiplying its own output by a weight matrix.
The capsule network model extracts the frame signals in the time-frequency diagram, where each frame of the diagram represents a spectrum. When the plurality of spectrums of the time-frequency diagram are obtained, the set of mel-frequency filter functions in the capsule network is invoked: each spectrum is passed through the mel-frequency filter function set, the logarithm of the filtered output is read, and that logarithm is used as the mel frequency cepstrum coefficient of the spectrum.
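The log-mel step can be sketched as below. The 40 triangular filters and their construction are assumptions (the patent fixes neither the filter count nor the filter shapes), and a textbook MFCC pipeline would normally also apply a DCT, which the description does not mention.

```python
import numpy as np

# Minimal sketch of the log-mel step described above: pass each spectrum
# through a set of mel-frequency filter functions and take the logarithm.
def mel_filterbank(n_filters=40, n_bins=801, rate=40_000):
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)       # Hz -> mel
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)      # mel -> Hz
    edges = to_hz(np.linspace(0.0, to_mel(rate / 2), n_filters + 2))
    bins = np.floor((n_bins - 1) * edges / (rate / 2)).astype(int)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):                                  # triangular filter i
        rise, fall = bins[i + 1] - bins[i], bins[i + 2] - bins[i + 1]
        if rise > 0:
            bank[i, bins[i]:bins[i + 1]] = np.linspace(0, 1, rise, endpoint=False)
        if fall > 0:
            bank[i, bins[i + 1]:bins[i + 2]] = np.linspace(1, 0, fall, endpoint=False)
    return bank

def log_mel(spectrum, bank):
    # Filter the spectrum, then read the logarithm of the filtered output.
    return np.log(bank @ spectrum + 1e-10)
```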
And S50, calculating the vector modulus of each mel frequency cepstrum coefficient through the preset capsule network model, and determining the type of the composite voice according to the vector moduli of the mel frequency cepstrum coefficients.
When the terminal has acquired the mel frequency cepstrum coefficient of each spectrum, it obtains the dynamic routing algorithm and the weight matrix of the preset capsule network model and calculates the vector modulus of each spectrum's mel frequency cepstrum coefficient through them. The vector moduli are then compared and the mel frequency cepstrum coefficient with the largest vector modulus is selected; the voice type corresponding to that coefficient is taken as a voice type of the composite voice. Voice types include dog barking, glass breaking, and the like, and the composite voice comprises at least two voice types.
According to the composite voice recognition method provided by this embodiment, the composite voice is converted into a time-frequency diagram, and the time-frequency diagram is processed based on the capsule network model, so that the voice types contained in the composite voice can be detected.
Referring to fig. 4, fig. 4 is a schematic flowchart of another composite voice recognition method provided by an embodiment of the present application. As shown in fig. 4, the composite voice recognition method includes:
Step S10, detecting the composite voice in a preset range in real time or at fixed time.
The terminal detects composite voice within the preset range in real time or at fixed times. For example, the range that the terminal can monitor, such as an indoor room or an outdoor park, is taken as the preset range of the terminal. The terminal detects the composite voice of the preset room or park at every moment, or checks the preset room or park every hour, where the composite voice comprises at least two different mixed sounds.
Step S20, when the composite voice is detected, a sound signal of the composite voice is acquired.
When the terminal detects composite voice, it collects the detected composite voice and obtains the sound signal of the composite voice by analyzing it, where the sound signal includes the frequency, amplitude, time, and so on of the sound. For example, when the terminal detects two or more mixed sounds, it analyzes the detected composite voice through a preset spectrum analyzer or a preset oscilloscope to acquire the sound frequency of the composite voice, and acquires the sound amplitude of the composite voice through a preset decibel tester.
Step S30, performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice.
When the terminal acquires the sound signal of the composite voice, it performs a short-time Fourier transform (STFT, also written short-term Fourier transform) on the acquired signal. The STFT is a Fourier-related mathematical transform used to determine the frequency and phase of sine waves in local regions of a time-varying signal. Specifically, the STFT comprises a frame shift, a frame duration, and a Fourier transform: the acquired sound signal is preprocessed into frames according to the frame shift and frame duration, and each preprocessed frame is Fourier-transformed to obtain a two-dimensional graph. Fourier-transforming the sound signal yields the relation between frequency and amplitude in the composite voice, and each two-dimensional graph is a spectrum. Stacking the two-dimensional graphs along the time dimension gives the time-frequency diagram: each frame in the time-frequency diagram is a spectrum, and the change of the spectrum over time is the time-frequency content.
Step S41, if the time-frequency diagram of the composite voice signal is obtained, a preset capsule network model is called, wherein the preset capsule network model comprises a convolution layer, a primary capsule, a high-level capsule and an output layer.
If the terminal acquires the time-frequency diagram of the composite voice signal, it invokes the preset capsule network model, which comprises a convolution layer, primary capsules, advanced capsules, and an output layer. It should be noted that the number of convolution kernels in the convolution layer may be set based on the practical situation, which is not specifically limited in this application.
Step S42, inputting the time-frequency diagram into a preset capsule network model, framing the time-frequency diagram through the convolution kernels of the convolution layer, and extracting a plurality of frequency spectrums of the time-frequency diagram.
The terminal inputs the acquired time-frequency diagram into the preset capsule network model. The convolution layer of the model is provided with convolution kernels, which frame the input time-frequency diagram and extract a plurality of spectrums from it. For example, if the terminal inputs a 28×28 time-frequency diagram and the convolution layer contains 256 convolution kernels with a step length of 1, framing the 28×28 diagram according to the number of kernels, the step length, and related information yields 256 spectrums of size 20×20. The spectrum size is calculated as (f - n + 1) × (f - n + 1), where f is the time-frequency diagram size and n is the convolution kernel size. The terminal thus extracts 256 20×20 spectrums through the convolution layer of the preset capsule network model.
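A quick check of the size rule under the example above; the 9×9 kernel is an assumption implied by the 28×28 to 20×20 example, since the text does not state the kernel size.

```python
# Quick check of the output-size rule (f - n + 1) stated above. The 9x9
# kernel is an assumption implied by the 28x28 -> 20x20 example; the text
# does not state the kernel size explicitly.
f, n, stride, kernels = 28, 9, 1, 256
out = (f - n) // stride + 1                     # reduces to f - n + 1 when stride == 1
print(f"{kernels} spectrums of {out}x{out}")    # 256 spectrums of 20x20
```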
And S43, filtering the extracted multiple frequency spectrums through a preset filter function set to obtain Mel frequency cepstrum coefficients of each frequency spectrum.
When the terminal has extracted a plurality of spectrums through the convolution layer, it passes each extracted spectrum through a preset filter function set, reads the logarithm of the filtered output, and uses that logarithm as the mel frequency cepstrum coefficient of the spectrum. Specifically, an acquired spectrum satisfies the spectrum formula X[k] = H[k]·E[k], where X[k] is the spectrum, H[k] is the envelope, and E[k] is the spectral detail; that is, the spectrum is the product of the envelope and the spectral details. The envelope is obtained by connecting the formants in the spectrum; the formants are the main frequency components representing the sound and carry its identification attributes (much like a personal identity card). The coefficients of H[k] are read through the preset filter function set, and the coefficients of H[k] are the mel frequency cepstrum coefficients.
In an embodiment, referring specifically to fig. 5, step S43 includes: substep S431 to substep S432.
And step S431, filtering the multiple frequency spectrums through a preset filter function set in the convolution layer when the multiple frequency spectrums are extracted, and obtaining a Mel frequency cepstrum of each frequency spectrum, wherein the frequency spectrums consist of an envelope and details of the frequency spectrums.
When the terminal detects that the convolution kernels have extracted a plurality of spectrums, the spectrums are filtered through the preset filter function set in the convolution layer, where the preset filter function set comprises a plurality of filter functions, for example a group of 40 or a group of 50 filter functions. The spectrum comprises low-frequency, intermediate-frequency, and high-frequency components, and the preset filter function set can effectively separate the envelope and the details contained in the spectrum, thereby obtaining the mel frequency cepstrum of the envelope in each spectrum.
And step S432, carrying out cepstrum analysis on each mel frequency cepstrum through the primary capsule to obtain the cepstrum coefficients of the plurality of envelopes, and taking the cepstrum coefficients of the envelopes as the mel frequency cepstrum coefficients.
The terminal performs cepstrum analysis on the mel frequency cepstrum of each envelope through the primary capsule to obtain the mel frequency cepstrum coefficient of each envelope, which is also the mel frequency cepstrum coefficient of the corresponding spectrum's envelope.
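A minimal sketch of the envelope/detail separation behind sub-steps S431 and S432, using standard cepstral liftering: taking logs turns the product X[k] = H[k]·E[k] into a sum, and keeping only the low-quefrency part of the cepstrum recovers the envelope term. The lifter cutoff is an illustrative assumption.

```python
import numpy as np

# Minimal sketch of separating the envelope H[k] from the details E[k].
# Taking logs turns the product X[k] = H[k] * E[k] into the sum
# log X = log H + log E; the slowly varying envelope term lives at low
# quefrency, so a low-quefrency lifter recovers it. The cutoff of 30 is
# an illustrative assumption.
def envelope_log_spectrum(spectrum, cutoff=30):
    log_x = np.log(np.abs(spectrum) + 1e-10)    # log X = log H + log E
    cepstrum = np.fft.irfft(log_x)              # to the quefrency domain
    lifter = np.zeros_like(cepstrum)
    lifter[:cutoff] = 1.0                       # keep the low-quefrency part
    lifter[-(cutoff - 1):] = 1.0                # (and its symmetric mirror)
    return np.fft.rfft(cepstrum * lifter).real  # approximately log H[k]
```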
And S50, calculating the vector modulus of each mel frequency cepstrum coefficient through the preset capsule network model, and determining the type of the composite voice according to the vector moduli of the mel frequency cepstrum coefficients.
When the terminal obtains the mel frequency cepstrum coefficient of each spectrum, the preset capsule network model, which comprises a dynamic routing algorithm and a weight matrix, calculates the vector modulus of each obtained mel frequency cepstrum coefficient. The vector moduli are compared to find the coefficient with the largest vector modulus, and the voice type corresponding to that coefficient is taken as a voice type of the composite voice. Voice types include dog barking, glass breaking, and the like, and the composite voice comprises at least two voice types.
According to the composite voice recognition method provided by this embodiment, the spectrums of the time-frequency diagram are extracted through the capsule network model to obtain the mel frequency cepstrum coefficient of each spectrum, so that the characteristics of the composite voice can be obtained rapidly, saving human effort.
Referring to fig. 6, fig. 6 is a schematic flowchart of another composite voice recognition method provided by an embodiment of the present application. As shown in fig. 6, the composite voice recognition method includes:
Step S10, detecting the composite voice in a preset range in real time or at fixed time.
The terminal detects composite voice within the preset range in real time or at fixed times. For example, the range that the terminal can monitor, such as an indoor room or an outdoor park, is taken as the preset range of the terminal. The terminal detects the composite voice of the preset room or park at every moment, or checks the preset room or park every hour, where the composite voice comprises at least two different mixed sounds.
Step S20, when the composite voice is detected, a sound signal of the composite voice is acquired.
When the terminal detects composite voice, it collects the detected composite voice and obtains the sound signal of the composite voice by analyzing it, where the sound signal includes the frequency, amplitude, time, and so on of the sound. For example, when the terminal detects two or more mixed sounds, it analyzes the detected composite voice through a preset spectrum analyzer or a preset oscilloscope to acquire the sound frequency of the composite voice, and acquires the sound amplitude of the composite voice through a preset decibel tester.
Step S30, performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice.
When the terminal acquires the sound signal of the composite voice, it performs a short-time Fourier transform (STFT, also written short-term Fourier transform) on the acquired signal. The STFT is a Fourier-related mathematical transform used to determine the frequency and phase of sine waves in local regions of a time-varying signal. Specifically, the STFT comprises a frame shift, a frame duration, and a Fourier transform: the acquired sound signal is preprocessed into frames according to the frame shift and frame duration, and each preprocessed frame is Fourier-transformed to obtain a two-dimensional graph. Fourier-transforming the sound signal yields the relation between frequency and amplitude in the composite voice, and each two-dimensional graph is a spectrum. Stacking the two-dimensional graphs along the time dimension gives the time-frequency diagram: each frame in the time-frequency diagram is a spectrum, and the change of the spectrum over time is the time-frequency content.
And S40, extracting a plurality of frequency spectrums of the time-frequency diagram based on a preset capsule network model, and obtaining the Mel frequency cepstrum coefficient of each frequency spectrum.
When the terminal acquires the time-frequency diagram of the composite voice, it operates on a preset capsule network model. The capsule network is a novel neural network structure comprising a convolution layer, primary capsules, advanced capsules, and the like; a capsule is a group of nested neural network layers, and in a capsule network more layers may be added within a single network layer.
Specifically, the states of the neurons nested inside a capsule characterize the properties of one entity in the input: the capsule outputs a vector whose length indicates the existence of the entity and whose orientation indicates the entity's properties, and this vector is sent to all parent capsules in the network. A capsule computes a prediction vector by multiplying its own output by a weight matrix. The capsule network model extracts the frame signals in the time-frequency diagram, where each frame of the diagram represents a spectrum. When the plurality of spectrums of the time-frequency diagram are obtained, the set of mel-frequency filter functions in the capsule network is invoked: each spectrum is passed through the mel-frequency filter function set, the logarithm of the filtered output is read, and that logarithm is used as the mel frequency cepstrum coefficient of the spectrum.
And S51, when a plurality of primary capsules respectively propagate the mel frequency cepstrum coefficients forward to the advanced capsule, obtaining the intermediate vectors of the mel frequency cepstrum coefficients through the dynamic routing formula of the preset capsule network model.
When the terminal acquires the Mel frequency cepstrum coefficients output by each primary capsule, each primary capsule respectively transmits the Mel frequency cepstrum coefficients forward to the advanced capsule, and the intermediate vector of the Mel frequency cepstrum coefficients is acquired through a dynamic routing formula of a preset capsule network model.
In an embodiment, referring specifically to fig. 7, step S51 includes: substep S511 to substep S512.
And step S511, when the primary capsule propagates the mel frequency cepstrum coefficient forward to the advanced capsule, acquiring a weight value of the capsule network model.
Specifically, when the primary capsule propagates the mel frequency cepstrum coefficient forward to the advanced capsule, a weight value of the preset capsule network model is obtained, where the weight value was learned when the capsule network model was trained on a data set.
And S512, acquiring a vector of the Mel frequency cepstrum coefficient based on a first preset formula of the capsule network model and the weight value, and acquiring a coupling coefficient of the capsule network model.
The first preset formula of the preset capsule network model is û_{j|i} = W_{ij} · u_i, where û_{j|i} is the vector of the mel frequency cepstrum coefficient, W_{ij} is a weight value of the preset capsule network model, and u_i is the mel frequency cepstrum coefficient output by the i-th primary capsule. The vector of the mel frequency cepstrum coefficient and the coupling coefficient of the preset capsule network model are obtained through the first preset formula.
Step S513, obtaining an intermediate vector of the mel frequency cepstrum coefficient based on a second preset formula of the capsule network model, the vector, and the coupling coefficient, wherein the dynamic routing formula includes the first preset formula and the second preset formula.
The second preset formula is s_j = Σ_i c_{ij} · û_{j|i}, where s_j is the intermediate vector of the mel frequency cepstrum coefficient input to the advanced capsule, c_{ij} is the coupling coefficient, and û_{j|i} is the vector obtained from the first preset formula. The intermediate vector of the mel frequency cepstrum coefficient is obtained through the second preset formula; the first preset formula and the second preset formula together form the dynamic routing formula of the preset capsule network model.
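A minimal sketch of the two preset formulas; the capsule dimensions and the uniform coupling coefficients of the first routing iteration are illustrative assumptions following the standard capsule-network formulation.

```python
import numpy as np

# Minimal sketch of the dynamic routing formulas. The capsule dimensions
# (8-D primary, 16-D advanced) are illustrative assumptions.
n_primary, d_primary, d_advanced = 8, 8, 16
u = np.random.randn(n_primary, d_primary)              # MFCC vectors from primary capsules
W = np.random.randn(n_primary, d_advanced, d_primary)  # weight values of the model

# First preset formula: u_hat_i = W_i . u_i (one prediction vector per capsule).
u_hat = np.einsum('iod,id->io', W, u)

# Coupling coefficients: softmax over routing logits b_i, initially zero,
# so the coefficients are uniform on the first routing iteration.
b = np.zeros(n_primary)
c = np.exp(b) / np.exp(b).sum()

# Second preset formula: s = sum_i c_i * u_hat_i (the intermediate vector).
s = (c[:, None] * u_hat).sum(axis=0)
print(s.shape)  # (16,): intermediate vector fed to the advanced capsule
```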
Step S52, based on the activation function of the advanced capsule and the intermediate vector, obtaining the vector modulus of the mel frequency cepstrum coefficient output by the advanced capsule.
The terminal inputs each acquired intermediate vector of the mel frequency cepstrum coefficients into the advanced capsule, obtains the activation function in the advanced capsule, calculates each intermediate vector through the activation function, and obtains the vector modulus of each mel frequency cepstrum coefficient output by the advanced capsule.
For example, when there are 8 primary capsules and 3 advanced capsules, the 8 primary capsules each input their mel frequency cepstrum coefficients to advanced capsule 1; the intermediate vectors of the coefficients output by the 8 primary capsules are calculated through the dynamic routing formula of the preset capsule network model, the calculated intermediate vectors are input to advanced capsule 1, and the vector modulus values of the 8 mel frequency cepstrum coefficients are calculated through the activation function of advanced capsule 1.
The 8 primary capsules then likewise input their mel frequency cepstrum coefficients to advanced capsule 2: the intermediate vectors are calculated through the dynamic routing formula, input to advanced capsule 2, and the vector modulus values of the 8 coefficients are calculated through the activation function of advanced capsule 2. The same is done for advanced capsule 3, whose activation function again yields the vector modulus values of the 8 mel frequency cepstrum coefficients.
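Continuing the 8-primary / 3-advanced example, a minimal sketch of step S52; the squash non-linearity is the standard capsule-network activation and is an assumption here, since the description does not name the activation function.

```python
import numpy as np

# Minimal sketch of step S52 for the 8-primary / 3-advanced example above.
# The squash non-linearity is the standard capsule-network activation and
# is an assumption; the text does not name the activation function.
def squash(s):
    norm_sq = np.dot(s, s)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-10)

n_primary, n_advanced, dim = 8, 3, 16
u_hat = np.random.randn(n_advanced, n_primary, dim)    # prediction vectors per advanced capsule
c = np.full((n_advanced, n_primary), 1.0 / n_primary)  # coupling coefficients

vector_moduli = []
for j in range(n_advanced):                          # advanced capsules 1..3
    s_j = (c[j][:, None] * u_hat[j]).sum(axis=0)     # intermediate vector (2nd formula)
    v_j = squash(s_j)                                # activation function
    vector_moduli.append(np.linalg.norm(v_j))        # vector modulus of capsule j
print(vector_moduli)
```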
And step S53, when the vector moduli of the mel frequency cepstrum coefficients output by the plurality of advanced capsules are obtained, marking the target advanced capsule that outputs the largest vector modulus by comparing the vector moduli of the mel frequency cepstrum coefficients.
When the vector modulus values of the mel frequency cepstrum coefficients output by each advanced capsule are obtained, the values are compared and the advanced capsule with the largest vector modulus is marked; the marked advanced capsule is taken as the target advanced capsule, and each advanced capsule corresponds to a marked voice type.
And S54, outputting the identification type of the target advanced capsule through the output layer, and acquiring the type of the composite voice.
The identification type of the target advanced capsule is output through the output layer. Each advanced capsule is identified with a voice type; for example, the type identified by advanced capsule 1 may be dog barking and the type identified by advanced capsule 2 glass breaking, or advanced capsule 1 may identify both dog barking and glass breaking. The type identified by an advanced capsule can be a single voice type or multiple voice types.
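A minimal sketch of steps S53 and S54: compare the vector moduli, mark the target advanced capsule, and output its identification type. The first two labels come from the description's examples; the third is a hypothetical stand-in.

```python
import numpy as np

# Minimal sketch of steps S53-S54: compare the vector moduli, mark the
# target advanced capsule, and output its identification type. The first
# two labels come from the description; the third is a hypothetical stand-in.
identification_types = ["dog barking", "glass breaking", "other sound"]
vector_moduli = [0.31, 0.87, 0.12]        # e.g. from the activation sketch above
target = int(np.argmax(vector_moduli))    # advanced capsule with the largest modulus
print(identification_types[target])       # -> "glass breaking"
```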
According to the composite voice recognition method provided by this embodiment, the mel frequency cepstrum coefficients of each spectrum in the acquired time-frequency diagram are obtained within the preset capsule network model, the vector modulus of each mel frequency cepstrum coefficient is calculated, and the identification type of the advanced capsule with the largest vector modulus is obtained on the basis of those vector moduli. By rendering the composite voice as an image and processing that image through the capsule network model, the sound signal and the image are combined in the computation, and the type of the composite voice is obtained rapidly.
Referring to fig. 8, fig. 8 is a schematic block diagram of a composite voice recognition device according to an embodiment of the present application.
As shown in fig. 8, the composite voice recognition apparatus 400 includes: a detection module 401, a first acquisition module 402, a generation module 403, a second acquisition module 404 and a third acquisition module 405.
The detection module 401 is used for detecting composite voice within the preset range in real time or at fixed times.
The first obtaining module 402 is configured to obtain a sound signal of the composite voice when the composite voice is detected.
The generating module 403 is configured to perform short-time fourier transform on the sound signal, and generate a time-frequency chart of the composite voice.
The second obtaining module 404 is configured to extract a plurality of spectrograms of the time-frequency chart based on a preset capsule network model, and obtain mel frequency cepstrum coefficients of each spectrogram.
And the third obtaining module 405 is configured to calculate the vector modulus of each mel frequency cepstrum coefficient through the preset capsule network model, and determine the type of the composite voice according to the vector moduli of the mel frequency cepstrum coefficients.
In one embodiment, as shown in fig. 9, the first acquisition module 402 includes:
The first retrieving submodule 4021 is configured to retrieve a preset sampling rate when the composite speech is detected.
The determining submodule 4022 is configured to determine a sampling time interval of the preset sampling rate according to a preset formula and the preset sampling rate.
The first obtaining submodule 4023 is configured to collect the composite speech based on the sampling time interval, and obtain a discrete signal of the composite speech.
In one embodiment, as shown in fig. 10, the generating module 403 includes:
the reading sub-module 4031 is configured to read preset frame duration information and frame shift information if the discrete signal is acquired.
The obtaining submodule 4032 is used for preprocessing the discrete signals according to the frame duration information and the frame shift information to obtain a plurality of short-time analysis signals.
And the generating submodule 4033 is used for carrying out Fourier transform on a plurality of short-time analysis signals to generate a time-frequency diagram of the composite voice.
Referring to fig. 11, fig. 11 is a schematic block diagram of another composite voice recognition device according to an embodiment of the present application.
As shown in fig. 11, the composite voice recognition apparatus 500 includes: the device comprises a detection module 501, a first acquisition module 502, a generation module 503, a second calling sub-module 504, an extraction sub-module 505, a second acquisition sub-module 506 and a third acquisition module 507.
The detection module 501 is used for detecting composite voice within the preset range in real time or at fixed times.
The first obtaining module 502 is configured to obtain a sound signal of the composite voice when the composite voice is detected.
And the generating module 503 is used for performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice.
The second retrieving submodule 504 is configured to retrieve a preset capsule network model if the time-frequency diagram of the composite voice is acquired, where the preset capsule network model includes a convolution layer, a primary capsule, an advanced capsule, and an output layer.
And the extraction sub-module 505 is configured to, when the time-frequency diagram is input into the preset capsule network model, frame the time-frequency diagram through the convolution kernels of the convolution layer and extract a plurality of spectrums of the time-frequency diagram.
The second obtaining sub-module 506 is configured to filter the extracted multiple frequency spectrums through a preset filter function set, and obtain mel frequency cepstrum coefficients of each frequency spectrum.
And the third obtaining module 507 is configured to calculate the vector modulus of each mel frequency cepstrum coefficient through the preset capsule network model, and determine the type of the composite speech according to the vector moduli of the mel frequency cepstrum coefficients.
In one embodiment, as shown in fig. 12, the second acquisition sub-module 506 includes:
the first obtaining subunit 5061 is configured to, when extracting a plurality of frequency spectrums, filter the plurality of frequency spectrums through a preset filter function set in a convolution layer, and obtain mel-frequency cepstrum of each frequency spectrum, where the frequency spectrums are composed of an envelope and details of the frequency spectrums.
The second obtaining subunit 5062 is configured to perform cepstrum analysis on each mel-frequency cepstrum through the primary capsule, obtain cepstrum coefficients of the plurality of envelopes, and use the cepstrum coefficients of the envelopes as the mel-frequency cepstrum coefficients.
Referring to fig. 13, fig. 13 is a schematic block diagram of another composite voice recognition device according to an embodiment of the present application.
As shown in fig. 13, the composite voice recognition apparatus 600 includes: the device comprises a detection module 601, a first acquisition module 602, a generation module 603, a second acquisition module 604, a third acquisition sub-module 605, a fourth acquisition sub-module 606, a marking sub-module 607 and a fifth acquisition sub-module 608.
The detection module 601 is used for detecting the composite voice in the preset range in real time or at fixed time.
The first obtaining module 602 is configured to obtain a sound signal of the composite voice when the composite voice is detected.
The generating module 603 is configured to perform short-time fourier transform on the sound signal, and generate a time-frequency chart of the composite voice.
The second obtaining module 604 is configured to extract a plurality of spectrograms of the time-frequency chart based on a preset capsule network model, and obtain mel frequency cepstrum coefficients of each spectrogram.
And the third obtaining sub-module 605 is configured to obtain, when the primary capsules forward propagate the mel-frequency cepstrum coefficients to the advanced capsules, respectively, an intermediate vector of the mel-frequency cepstrum coefficients through a dynamic routing formula of the preset capsule network.
A fourth obtaining sub-module 606 is configured to obtain a vector modulus of the mel-frequency cepstrum coefficient output by the advanced capsule based on the activation function of the advanced capsule and the intermediate vector.
The marking submodule 607 is used for marking the target advanced capsule outputting the maximum vector mode by comparing the vector modes of the mel frequency cepstrum coefficients when the vector modes of the mel frequency cepstrum coefficients output by the advanced capsules are obtained.
And a fifth obtaining sub-module 608, configured to output, through the output layer, the identification type of the target advanced capsule, and obtain the type of the composite voice.
In one embodiment, as shown in fig. 14, the third acquisition sub-module 605 includes:
a third obtaining subunit 6051 is configured to obtain a weight value of the capsule network model when the primary capsule propagates the mel-frequency cepstrum coefficient forward to the advanced capsule.
The fourth obtaining subunit 6052 is configured to obtain a vector of the mel-frequency cepstrum coefficient based on the first preset formula of the capsule network model and the weight value, and obtain a coupling coefficient of the capsule network model.
A fifth obtaining subunit 6053, configured to obtain an intermediate vector of the mel-frequency cepstrum coefficient based on a second preset formula of the capsule network model, the vector, and the coupling coefficient, where the dynamic routing formula includes the first preset formula and the second preset formula.
It should be noted that, for convenience and brevity of description, specific working processes of the above-described apparatus and modules and units may refer to corresponding processes in the foregoing embodiment of the composite voice recognition method, and will not be described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program which may be run on a computer device as shown in fig. 15.
Referring to fig. 15, fig. 15 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a terminal.
As shown in fig. 15, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause a processor to perform any of a number of composite speech recognition methods.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by a processor, causes the processor to perform any of the composite speech recognition methods.
The network interface is used for network communication, such as transmitting assigned tasks. It will be appreciated by those skilled in the art that the structure shown in fig. 15 is merely a block diagram of the portion of the structure associated with the present application and does not constitute a limitation on the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown in the drawings, combine certain components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
Composite voice within a preset range is detected in real time or at fixed times.
And when the composite voice is detected, acquiring a sound signal of the composite voice.
And performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice.
And extracting a plurality of frequency spectrums of the time-frequency diagram based on a preset capsule network model, and obtaining the mel frequency cepstrum coefficient of each frequency spectrum.
And calculating vector moduli of the Mel frequency cepstrum coefficients through the preset capsule network model, and determining the type of the composite voice according to the vector moduli of the Mel frequency cepstrum coefficients.
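Read together, these five steps form one pipeline. The Python sketch below shows only how the stages compose; it is a minimal illustration rather than the patent's implementation. The helper functions it calls (stft_time_frequency, mfcc_from_spectrum, route_and_classify) are the stage-by-stage sketches given after the corresponding steps later in this section, and treating each frame's coefficient vector as one primary capsule's output is likewise an assumption made for illustration.

import numpy as np

def recognize_composite_voice(x, W, voice_types):
    """End-to-end sketch of steps 3-5 on an already-acquired discrete
    sound signal x. W and voice_types stand in for the preset capsule
    network's weight values and its recognizable voice types."""
    tf_diagram = stft_time_frequency(x)      # time-frequency diagram via STFT
    mfccs = mfcc_from_spectrum(tf_diagram)   # Mel frequency cepstrum coefficients
    # Each frame's MFCC vector plays the role of one primary capsule's
    # output; routing then compares vector moduli to pick the voice type.
    return route_and_classify(mfccs, W, voice_types)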
In one embodiment, when implementing the step of acquiring the sound signal of the composite voice when the composite voice is detected, the processor is configured to implement: when the composite voice is detected, retrieving a preset sampling rate.
And determining the sampling time interval of the preset sampling rate through a preset formula and the preset sampling rate.
And sampling the composite voice based on the sampling time interval to acquire a discrete signal of the composite voice.
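The preset formula is not written out in the text. The following is a minimal sketch under the common assumption that the sampling time interval is the reciprocal of the sampling rate, T = 1/fs; the 16 kHz rate and the test tone are illustrative values only.

import numpy as np

def sample_composite_voice(analog_signal, sample_rate_hz=16000, duration_s=1.0):
    """Discretize a continuous signal at a preset sampling rate,
    assuming the sampling time interval follows T = 1 / fs."""
    interval_s = 1.0 / sample_rate_hz            # sampling time interval T
    t = np.arange(0.0, duration_s, interval_s)   # sampling instants nT
    return analog_signal(t)                      # discrete signal x[n] = x(nT)

# Usage: a synthetic 440 Hz tone stands in for the detected composite voice.
x = sample_composite_voice(lambda t: np.sin(2 * np.pi * 440.0 * t))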
In one embodiment, when implementing the step of performing a short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice, the processor is configured to implement: if the discrete signal is acquired, reading preset frame duration information and frame shift information.
And preprocessing the discrete signals through the frame duration information and the frame shift information to obtain a plurality of short-time analysis signals.
And carrying out Fourier transformation on the plurality of short-time analysis signals to generate a time-frequency diagram of the composite voice.
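A sketch of the framing-plus-FFT step. The 25 ms frame duration and 10 ms frame shift are common speech-processing defaults standing in for the preset frame duration and frame shift information, and the Hamming window is an assumption; the patent specifies neither.

import numpy as np

def stft_time_frequency(x, sample_rate_hz=16000, frame_ms=25.0, shift_ms=10.0):
    """Preprocess the discrete signal into short-time analysis signals,
    then Fourier-transform each one; the magnitudes indexed by
    (frame, frequency bin) form the time-frequency diagram."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)   # samples per frame
    hop = int(sample_rate_hz * shift_ms / 1000)         # frame shift in samples
    window = np.hamming(frame_len)

    n_frames = 1 + (len(x) - frame_len) // hop          # assumes len(x) >= frame_len
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])       # short-time analysis signals

    return np.abs(np.fft.rfft(frames * window, axis=1))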
In another embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
And detecting composite voice within the preset range in real time or at fixed times.
And when the composite voice is detected, acquiring a sound signal of the composite voice.
And performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice.
And if the time-frequency diagram of the composite voice is obtained, invoking a preset capsule network model, wherein the preset capsule network model comprises a convolution layer, a primary capsule, an advanced capsule and an output layer.
When the time-frequency diagram is input into the preset capsule network model, the time-frequency diagram is framed through the convolution kernel of the convolution layer, and a plurality of frequency spectrums of the time-frequency diagram are extracted.
And filtering the extracted multiple frequency spectrums through a preset filter function set to obtain the Mel frequency cepstrum coefficient of each frequency spectrum.
And calculating vector moduli of the Mel frequency cepstrum coefficients through the preset capsule network model, and determining the type of the composite voice according to the vector moduli of the Mel frequency cepstrum coefficients.
In one embodiment, when implementing the step of calculating the vector moduli of the Mel frequency cepstrum coefficients through the preset capsule network model and determining the type of the composite voice according to the vector moduli, the processor is configured to implement:
When a plurality of frequency spectrums are extracted, filtering the frequency spectrums through the preset filter function set in the convolution layer to obtain the Mel frequency cepstrum of each frequency spectrum, wherein each frequency spectrum consists of the envelope and the details of the frequency spectrum.
And carrying out cepstrum analysis on each Mel frequency cepstrum through the primary capsule to obtain a plurality of enveloped cepstrum coefficients, and taking the enveloped cepstrum coefficients as Mel frequency cepstrum coefficients.
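Neither the filter functions nor the cepstrum analysis are written out in the text. The sketch below assumes the standard triangular mel filter bank and a log-plus-DCT cepstral analysis, whose low-order coefficients describe the spectral envelope, matching the envelope/detail decomposition described above; the filter count and coefficient count are illustrative.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_bins, sample_rate_hz):
    """A preset set of triangular mel filter functions over FFT bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate_hz / 2), n_filters + 2)
    bins = np.floor(2 * (n_bins - 1) * mel_to_hz(mel_pts)
                    / sample_rate_hz).astype(int)

    fbank = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def mfcc_from_spectrum(spectrum, sample_rate_hz=16000,
                       n_filters=40, n_coeffs=13):
    """Filter each frame's spectrum through the mel filter set, then do
    a cepstrum analysis (log + DCT); keeping only the low-order
    coefficients retains the envelope part of each spectrum."""
    fbank = mel_filterbank(n_filters, spectrum.shape[1], sample_rate_hz)
    mel_energies = np.maximum(spectrum @ fbank.T, 1e-10)  # avoid log(0)
    return dct(np.log(mel_energies), axis=1, norm='ortho')[:, :n_coeffs]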
In a further embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
And detecting composite voice within the preset range in real time or at fixed times.
And when the composite voice is detected, acquiring a sound signal of the composite voice.
And performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice.
And extracting a plurality of frequency spectrums of the time-frequency diagram based on a preset capsule network model, and obtaining the Mel frequency cepstrum coefficients of each frequency spectrum.
And when the primary capsules respectively forward propagate the Mel frequency cepstrum coefficients to the advanced capsules, obtaining the intermediate vector of the Mel frequency cepstrum coefficients through a dynamic routing formula of the preset capsule network.
And acquiring the vector modulus of the Mel frequency cepstrum coefficient output by the advanced capsule based on the activation function of the advanced capsule and the intermediate vector.
And when the vector moduli of the Mel frequency cepstrum coefficients output by the advanced capsules are obtained, marking the target advanced capsule outputting the maximum vector modulus by comparing the vector moduli of the Mel frequency cepstrum coefficients.
And outputting the identification type of the target advanced capsule through the output layer to acquire the type of the composite voice signal.
In one embodiment, when implementing the step of acquiring the vector modulus of the Mel frequency cepstrum coefficient based on the activation function of the advanced capsule and the intermediate vector, the processor is configured to implement:
and when the primary capsule forwards propagates the Mel frequency cepstrum coefficient to the advanced capsule, acquiring a weight value of the capsule network model.
And based on a first preset formula of the capsule network model and the weight value, acquiring a vector of the Mel frequency cepstrum coefficient, and acquiring a coupling coefficient of the capsule network model.
And obtaining an intermediate vector of the Mel frequency cepstrum coefficient based on a second preset formula of the capsule network model, the vector and the coupling coefficient, wherein the dynamic routing formula comprises the first preset formula and the second preset formula.
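A sketch tying the two preset formulas to the activation and the modulus comparison. The first formula (û = w·u) and the second formula (s = ∑ C·û) follow the text; the softmax-based routing-by-agreement update that produces the coupling coefficients C, and the squashing activation, are taken from the standard capsule-network literature, since the patent names the formulas but not the update rule. Shapes and the example voice types are assumptions.

import numpy as np

def squash(s, eps=1e-9):
    """Activation: shrink the vector modulus into [0, 1) while
    preserving the vector's direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def route_and_classify(u, W, voice_types, n_iters=3):
    """Forward propagation from primary to advanced capsules.

    u: (n_primary, d_in) MFCC vectors output by the primary capsules.
    W: (n_primary, n_adv, d_in, d_adv) weight values of the model.
    """
    u_hat = np.einsum('ijkd,ik->ijd', W, u)    # first preset formula: û = w·u

    b = np.zeros(u_hat.shape[:2])              # routing logits
    for _ in range(n_iters):
        C = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = np.einsum('ij,ijd->jd', C, u_hat)  # second preset formula: s = sum C·û
        v = squash(s)                          # advanced-capsule outputs
        b += np.einsum('ijd,jd->ij', u_hat, v) # agreement update

    moduli = np.linalg.norm(v, axis=1)         # vector modulus per advanced capsule
    return voice_types[int(np.argmax(moduli))] # target advanced capsule's type

# Usage with random stand-in values: 8 primary capsules of dimension 13,
# three advanced capsules (hypothetical voice types) of dimension 16.
rng = np.random.default_rng(0)
print(route_and_classify(rng.normal(size=(8, 13)),
                         rng.normal(size=(8, 3, 13, 16)),
                         ['speech', 'music', 'noise']))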
Embodiments of the present application also provide a computer readable storage medium having a computer program stored thereon, where the computer program includes program instructions; for the method implemented when the program instructions are executed, reference may be made to the various embodiments of the composite voice recognition method of the present application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, and of course may also be implemented by means of hardware, but in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied, essentially or in the part contributing to the prior art, in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A composite voice recognition method, the method comprising:
detecting composite voice within a preset range in real time or at fixed times;
when the composite voice is detected, acquiring a sound signal of the composite voice;
performing short-time Fourier transform on the sound signal to generate a time-frequency diagram of the composite voice; based on a preset capsule network model, extracting a plurality of frequency spectrums of the time-frequency diagram, and obtaining a Mel frequency cepstrum coefficient of each frequency spectrum;
and calculating vector moduli of the Mel frequency cepstrum coefficients through the preset capsule network model, and determining the type of the composite voice according to the vector moduli of the Mel frequency cepstrum coefficients.
2. The composite voice recognition method according to claim 1, wherein when the composite voice is detected, acquiring the sound signal of the composite voice comprises:
when the composite voice is detected, retrieving a preset sampling rate;
determining a sampling time interval of the preset sampling rate through a preset formula and the preset sampling rate;
and acquiring the composite voice based on the sampling time interval to acquire a discrete signal of the composite voice.
3. The composite voice recognition method according to claim 2, wherein said performing a short-time Fourier transform on said sound signal to generate a time-frequency diagram of said composite voice comprises: if the discrete signal is acquired, reading preset frame duration information and frame shift information;
preprocessing the discrete signals through the frame duration information and the frame shift information to obtain a plurality of short-time analysis signals;
and carrying out Fourier transformation on the plurality of short-time analysis signals to generate a time-frequency diagram of the composite voice.
4. The composite voice recognition method according to claim 1 or 3, wherein extracting a plurality of frequency spectrums of the time-frequency diagram based on a preset capsule network model and obtaining the Mel frequency cepstrum coefficients of each of the frequency spectrums comprises:
if the time-frequency diagram of the composite voice is obtained, invoking a preset capsule network model, wherein the preset capsule network model comprises a convolution layer, a primary capsule, an advanced capsule and an output layer;
when the time-frequency diagram is input into the preset capsule network model, framing the time-frequency diagram through the convolution kernel of the convolution layer, and extracting a plurality of frequency spectrums of the time-frequency diagram;
And filtering the extracted multiple frequency spectrums through a preset filter function set to obtain the Mel frequency cepstrum coefficient of each frequency spectrum.
5. The method of claim 4, wherein filtering the extracted plurality of frequency spectrums through a preset filter function set to obtain mel-frequency cepstrum coefficients of each frequency spectrum comprises:
when a plurality of frequency spectrums are extracted, filtering the frequency spectrums through the preset filter function set in the convolution layer to obtain the Mel frequency cepstrum of each frequency spectrum, wherein each frequency spectrum consists of the envelope and the details of the frequency spectrum;
and carrying out cepstrum analysis on each Mel frequency cepstrum through the primary capsule to obtain a plurality of enveloped cepstrum coefficients, and taking the enveloped cepstrum coefficients as Mel frequency cepstrum coefficients.
6. The method of claim 5, wherein calculating vector moduli of the Mel frequency cepstrum coefficients through the preset capsule network model, and determining the type of the composite voice according to the vector moduli of the Mel frequency cepstrum coefficients comprises:
When a plurality of primary capsules respectively forward propagate the Mel frequency cepstrum coefficients to the advanced capsules, obtaining intermediate vectors of the Mel frequency cepstrum coefficients through a dynamic routing formula of the preset capsule network;
based on the activation function of the advanced capsule and the intermediate vector, obtaining the vector modulus of the Mel frequency cepstrum coefficient output by the advanced capsule;
when the vector moduli of the Mel frequency cepstrum coefficients output by a plurality of advanced capsules are obtained, marking the target advanced capsule outputting the maximum vector modulus by comparing the vector moduli of the Mel frequency cepstrum coefficients;
and outputting the identification type of the target advanced capsule through the output layer to acquire the type of the composite voice.
7. The composite voice recognition method of claim 6, wherein the dynamic routing formula comprises a first preset formula and a second preset formula; when the primary capsule forward propagates the Mel frequency cepstrum coefficient to the advanced capsule, obtaining the intermediate vector of the Mel frequency cepstrum coefficient through the dynamic routing formula of the preset capsule network comprises: when the primary capsule forward propagates the Mel frequency cepstrum coefficient to the advanced capsule, acquiring a weight value of the capsule network model;
Based on a first preset formula of the capsule network model and the weight value, acquiring a vector of the Mel frequency cepstrum coefficient, and acquiring a coupling coefficient of the capsule network model; the expression of the first preset formula is as follows:
û = w·u
wherein û is the vector of the Mel frequency cepstrum coefficient, w is the weight value of the preset capsule network model, and u is the Mel frequency cepstrum coefficient output by the primary capsule;
based on a second preset formula of the capsule network model, the vector and the coupling coefficient, obtaining an intermediate vector of the Mel frequency cepstrum coefficient; the expression of the second preset formula is as follows:
s = ∑ C·û
where s is the intermediate vector of the Mel frequency cepstrum coefficient input to the advanced capsule, C is the coupling coefficient, and û is the vector of the Mel frequency cepstrum coefficient.
8. A composite voice recognition device, the composite voice recognition device comprising:
the detection module is used for detecting composite voice within a preset range in real time or at fixed times;
the first acquisition module is used for acquiring a sound signal of the composite voice when the composite voice is detected;
the generation module is used for carrying out short-time Fourier transform on the sound signals and generating a time-frequency diagram of the composite voice;
The second acquisition module is used for extracting a plurality of frequency spectrums of the time-frequency diagram based on a preset capsule network model and acquiring the mel frequency cepstrum coefficient of each frequency spectrum;
and the third acquisition module is used for calculating vector moduli of the Mel frequency cepstrum coefficients through the preset capsule network model, and determining the type of the composite voice according to the vector moduli of the Mel frequency cepstrum coefficients.
9. A computer device, the computer device comprising: a memory, a processor, and a composite voice recognition program stored on the memory and executable on the processor, wherein the composite voice recognition program, when executed by the processor, implements the steps of the composite voice recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a composite voice recognition program is stored, wherein the composite voice recognition program, when executed by a processor, implements the steps of the composite voice recognition method according to any one of claims 1 to 7.
CN201910601019.4A 2019-07-04 2019-07-04 Composite voice recognition method, device, equipment and computer readable storage medium Active CN110444202B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910601019.4A CN110444202B (en) 2019-07-04 2019-07-04 Composite voice recognition method, device, equipment and computer readable storage medium
PCT/CN2019/118458 WO2021000498A1 (en) 2019-07-04 2019-11-14 Composite speech recognition method, device, equipment, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910601019.4A CN110444202B (en) 2019-07-04 2019-07-04 Composite voice recognition method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110444202A CN110444202A (en) 2019-11-12
CN110444202B true CN110444202B (en) 2023-05-26

Family

ID=68429517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910601019.4A Active CN110444202B (en) 2019-07-04 2019-07-04 Composite voice recognition method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110444202B (en)
WO (1) WO2021000498A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110444202B (en) * 2019-07-04 2023-05-26 平安科技(深圳)有限公司 Composite voice recognition method, device, equipment and computer readable storage medium
CN110910893B (en) * 2019-11-26 2022-07-22 北京梧桐车联科技有限责任公司 Audio processing method, device and storage medium
CN113450775A (en) * 2020-03-10 2021-09-28 富士通株式会社 Model training device, model training method, and storage medium
CN113096649B (en) * 2021-03-31 2023-12-22 平安科技(深圳)有限公司 Voice prediction method, device, electronic equipment and storage medium
CN114173405B (en) * 2022-01-17 2023-11-03 上海道生物联技术有限公司 Rapid wake-up method and system in wireless communication technical field
CN116705055B (en) * 2023-08-01 2023-10-17 国网福建省电力有限公司 Substation noise monitoring method, system, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564530A (en) * 2017-08-18 2018-01-09 浙江大学 A kind of unmanned plane detection method based on vocal print energy feature
CN107993648A (en) * 2017-11-27 2018-05-04 北京邮电大学 A kind of unmanned plane recognition methods, device and electronic equipment
CN108281146A (en) * 2017-12-29 2018-07-13 青岛真时科技有限公司 A kind of phrase sound method for distinguishing speek person and device
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109410917A (en) * 2018-09-26 2019-03-01 河海大学常州校区 Voice data classification method based on modified capsule network
CN109559755A (en) * 2018-12-25 2019-04-02 沈阳品尚科技有限公司 A kind of sound enhancement method based on DNN noise classification

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201416303D0 (en) * 2014-09-16 2014-10-29 Univ Hull Speech synthesis
CN108766419B (en) * 2018-05-04 2020-10-27 华南理工大学 Abnormal voice distinguishing method based on deep learning
CN108922559A (en) * 2018-07-06 2018-11-30 华南理工大学 Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
CN109523993B (en) * 2018-11-02 2022-02-08 深圳市网联安瑞网络科技有限公司 Voice language classification method based on CNN and GRU fusion deep neural network
CN110444202B (en) * 2019-07-04 2023-05-26 平安科技(深圳)有限公司 Composite voice recognition method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
WO2021000498A1 (en) 2021-01-07
CN110444202A (en) 2019-11-12

Similar Documents

Publication Publication Date Title
CN110444202B (en) Composite voice recognition method, device, equipment and computer readable storage medium
US11475907B2 (en) Method and device of denoising voice signal
CN110459241B (en) Method and system for extracting voice features
CN110503969A (en) A kind of audio data processing method, device and storage medium
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
CN111048071B (en) Voice data processing method, device, computer equipment and storage medium
CN110148422A (en) The method, apparatus and electronic equipment of sound source information are determined based on microphone array
CN112967738B (en) Human voice detection method and device, electronic equipment and computer readable storage medium
CN113257283B (en) Audio signal processing method and device, electronic equipment and storage medium
CN109147798B (en) Speech recognition method, device, electronic equipment and readable storage medium
CN111933148A (en) Age identification method and device based on convolutional neural network and terminal
CN111868823A (en) Sound source separation method, device and equipment
CN112382302A (en) Baby cry identification method and terminal equipment
CN109147146B (en) Voice number taking method and terminal equipment
Prashanth et al. A review of deep learning techniques in audio event recognition (AER) applications
CN113077812A (en) Speech signal generation model training method, echo cancellation method, device and equipment
CN117542373A (en) Non-air conduction voice recovery system and method
CN111863021A (en) Method, system and equipment for recognizing breath sound data
CN111863035A (en) Method, system and equipment for recognizing heart sound data
Isyanto et al. Voice biometrics for Indonesian language users using algorithm of deep learning CNN residual and hybrid of DWT-MFCC extraction features
CN113903328A (en) Speaker counting method, device, equipment and storage medium based on deep learning
CN114664325A (en) Abnormal sound identification method, system, terminal equipment and computer readable storage medium
Faridh et al. HiVAD: A Voice Activity Detection Application Based on Deep Learning
Estrebou et al. Voice recognition based on probabilistic SOM
CN111782860A (en) Audio detection method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant