WO2019232845A1 - Voice data processing method and apparatus, computer device, and storage medium - Google Patents

Voice data processing method and apparatus, computer device, and storage medium

Info

Publication number
WO2019232845A1
WO2019232845A1 (application PCT/CN2018/094184)
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
frame
speech
short-term
Prior art date
Application number
PCT/CN2018/094184
Other languages
English (en)
French (fr)
Inventor
涂宏
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2019232845A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/04 - Segmentation; Word boundary detection
    • G10L 15/05 - Word boundary detection
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training

Definitions

  • the present application relates to the technical field of speech recognition, and in particular, to a method, a device, a computer device, and a storage medium for processing voice data.
  • Voice Activity Detection (VAD), also known as speech endpoint detection or speech boundary detection, identifies and eliminates long periods of silence from an audio signal stream so that channel resources can be saved without degrading service quality. At present, training a speech recognition model requires relatively clean speech data, but the available speech data is often mixed with noise or silence, so a model trained on such data has low accuracy, which hinders its popularization and application.
  • a voice data processing method includes:
  • obtaining original voice data;
  • using a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested;
  • using an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested;
  • using a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value;
  • if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  • a voice data processing device includes:
  • an original voice data acquisition module, configured to obtain original voice data;
  • a to-be-tested voice data acquisition module, configured to use a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested;
  • a to-be-tested filter-bank feature acquisition module, configured to use an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested;
  • a recognition probability value acquisition module, configured to use a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value;
  • a target voice data acquisition module, configured to use the voice data to be tested as target voice data if the recognition probability value is greater than a preset probability value.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented:
  • obtaining original voice data;
  • using a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested;
  • using an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested;
  • using a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value;
  • if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  • One or more non-volatile readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:
  • obtaining original voice data;
  • using a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested;
  • using an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested;
  • using a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value;
  • if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  • FIG. 1 is an application environment diagram of a voice data processing method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a voice data processing method according to an embodiment of the present application.
  • FIG. 3 is a specific flowchart of step S20 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S30 in FIG. 2;
  • FIG. 5 is another flowchart of a voice data processing method according to an embodiment of the present application.
  • FIG. 6 is a specific flowchart of step S63 in FIG. 5;
  • FIG. 7 is a schematic diagram of a voice data processing apparatus according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the voice data processing method provided in this application can be applied in the application environment shown in FIG. 1, where a computer device communicates with a server through a network.
  • Computer devices can be, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • the server can be implemented as a stand-alone server.
  • the voice data processing method is applied to a computer device configured by a financial institution such as a bank, a securities firm, or an insurance company, or by another organization, and is used to preprocess the original voice data to obtain training data, so that the training data can be used to train a voiceprint model or another speech model and thereby improve the accuracy of model recognition.
  • a method for processing voice data is provided.
  • the method is applied to the server in FIG. 1 as an example, and includes the following steps:
  • the original voice data is speaker voice data obtained by using a recording device, and the original voice data is unprocessed voice data.
  • the original voice data may be voice data in wav, mp3, or other formats.
  • the original voice data includes target voice data and interference voice data, where the target voice data refers to a voice part in which the voiceprint continuously changes significantly in the original voice data, and the target voice data is generally a speaker voice.
  • the interfering voice data refers to a voice portion other than the target voice data in the original voice data, that is, the interfering voice data is a voice other than the speaker voice.
  • the interfering voice data includes silent segments and noise segments. A silent segment is a portion of the original voice data in which nothing is spoken, for example the portions collected while the speaker pauses to think or breathe and produces no sound.
  • a noise segment is the portion of the original voice data corresponding to environmental noise; sounds such as doors and windows opening and closing or objects colliding can be regarded as noise segments.
  • S20 The VAD algorithm is used to frame and segment the original voice data to obtain at least two frames of voice data to be tested.
  • the voice data to be tested is the original voice data obtained by cutting out the mute section in the interference voice data using the VAD algorithm.
  • the VAD (Voice Activity Detection) algorithm is an algorithm that accurately locates the start and end of target voice data from a noisy environment.
  • the VAD algorithm can be used to identify and eliminate long silent segments from the signal stream of the original speech data, in order to eliminate the interference speech data of the silent segment in the original speech data, and improve the accuracy of speech data processing.
  • A frame is the smallest unit of observation in speech data, and framing is the process of dividing the speech data along its time axis. Although the original speech data is not stationary as a whole, it can be regarded as stationary locally, so the original speech data is framed to obtain relatively stationary single-frame voice data. Speech recognition and voiceprint recognition require a stationary input signal, so the server first performs frame processing on the original speech data.
  • Segmentation is a process of cutting out a single frame of speech data belonging to a mute segment in the original speech data.
  • the VAD algorithm is used to perform segmentation processing on the original voice data after frame processing, and to remove the mute segment to obtain at least two frames of voice data to be tested.
  • step S20 the original voice data is framed and segmented using the VAD algorithm to obtain at least two frames of voice data to be tested, which specifically includes the following steps:
  • S21 Perform frame processing on the original voice data to obtain at least two frames of single-frame voice data.
  • Framing collects N sampling points into one observation unit called a frame. Usually N is 256 or 512, covering about 20-30 ms. To avoid excessive change between adjacent frames, adjacent frames are made to overlap by a region containing M sampling points, where M is usually about 1/2 or 1/3 of N; this process is called framing. Specifically, after the original voice data is framed, at least two frames of single-frame voice data are obtained, and each frame of single-frame voice data contains N sampling points.
  • in the at least two frames of single-frame voice data obtained after framing, discontinuities appear at the beginning and end of each frame, and the more frames there are, the larger the error between the framed data and the original voice data. To make the framed single-frame voice data continuous, so that each frame exhibits the characteristics of a periodic function, windowing and pre-emphasis are applied to each single frame of voice data to obtain better single-frame voice data.
  • windowing multiplies each frame by a Hamming window. Because the amplitude-frequency characteristic of the Hamming window has large sidelobe attenuation, windowing the single-frame voice data increases the continuity between the left and right ends of each frame; that is, windowing the framed single-frame speech data converts a non-stationary speech signal into a short-term stationary one. If the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame size, and the window signal is W(n), the windowed signal is S'(n) = S(n) × W(n), where W(n) = (1 - a) - a·cos(2πn/(N - 1)); different values of a produce different Hamming windows, and a is generally taken as 0.46.
  • to increase the amplitude of the high-frequency components of the speech signal relative to its low-frequency components and to eliminate the effects of glottal excitation and lip and nasal radiation, pre-emphasis is applied to the single-frame voice data, which helps improve the signal-to-noise ratio. The signal-to-noise ratio refers to the ratio of signal to noise in an electronic device or electronic system.
  • pre-emphasis passes the windowed single-frame voice data through a high-pass filter H(Z) = 1 - μz⁻¹, where μ is between 0.9 and 1.0 and Z represents the single-frame voice data.
  • the goal of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes smoother and can be computed with the same signal-to-noise ratio over the whole band from low to high frequencies, highlighting the high-frequency formants.
  • the pre-processed single-frame voice data has the advantages of high resolution, good stability, and small errors from the original voice data.
  • subsequent segmentation processing is performed on at least two frames of single-frame voice data, the efficiency and quality of obtaining at least two frames of voice data to be tested can be improved.
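  • As an illustration of the framing, windowing, and pre-emphasis described above, the following is a minimal sketch in Python/NumPy; the 256-sample frame length, the hop size, and the helper names frame_signal and window_and_preemphasize are assumptions made for illustration rather than details from the patent.

      import numpy as np

      def frame_signal(signal, frame_len=256, hop=128):
          # Collect frame_len (N) samples per frame; adjacent frames overlap by frame_len - hop samples.
          num_frames = 1 + max(0, (len(signal) - frame_len) // hop)
          return np.stack([signal[i * hop: i * hop + frame_len] for i in range(num_frames)])

      def window_and_preemphasize(frames, a=0.46, mu=0.97):
          # Hamming window w(n) = (1 - a) - a*cos(2*pi*n/(N - 1)), followed by the
          # pre-emphasis high-pass filter H(z) = 1 - mu*z^-1 applied within each frame.
          n = np.arange(frames.shape[1])
          window = (1 - a) - a * np.cos(2 * np.pi * n / (frames.shape[1] - 1))
          windowed = frames * window
          return np.concatenate([windowed[:, :1], windowed[:, 1:] - mu * windowed[:, :-1]], axis=1)

  • In this sketch a = 0.46 and mu = 0.97 are typical values consistent with the ranges given above.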
  • S22 The short-term energy calculation formula is used to segment the single-frame voice data: the short-term energy corresponding to each frame of single-frame voice data is obtained, and the single-frame voice data whose short-term energy is greater than a first threshold is retained as first voice data.
  • the short-term energy calculation formula is E(n) = Σ_{m=0}^{N-1} x_n²(m), where N is the frame length of a single frame of voice data, x_n(m) is the n-th frame of single-frame voice data, E(n) is the short-term energy, and m is the time index.
  • short-term energy refers to the energy of one frame of the voice signal.
  • the first threshold is a preset threshold with a relatively low value.
  • the first voice data refers to the single-frame voice data whose corresponding short-term energy is greater than the first threshold.
  • the VAD algorithm can detect the four parts of speech in a single frame of voice data: the mute segment, the transition segment, the speech segment, and the end segment. Specifically, the short-term energy calculation formula is used to calculate each frame of single-frame voice data, and the short-term energy corresponding to each frame of single-frame voice data is obtained, and the single-frame voice data whose short-term energy is greater than the first threshold is retained as First voice data.
  • retaining the single-frame voice data whose short-term energy is greater than the first threshold marks the starting point and indicates that the single-frame voice data after that point has entered the transition segment; that is, the first voice data finally obtained includes the transition segment, the speech segment, and the ending segment.
  • understandably, the first voice data obtained based on short-term energy in step S22 is what remains after cutting out the single-frame voice data whose short-term energy is not greater than the first threshold, which removes the silent-segment portion of the interfering voice data from the single-frame voice data.
  • S23 The zero-crossing rate calculation formula is used to segment the first voice data: the zero-crossing rate corresponding to the first voice data is obtained, and the first voice data whose zero-crossing rate is greater than a second threshold is retained, giving at least two frames of voice data to be tested.
  • the zero-crossing rate calculation formula is Z_n = (1/2) Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|, where sgn[] is the sign function, equal to 1 when its argument is non-negative and -1 otherwise, x_n(m) is the n-th frame of first voice data, Z_n is the zero-crossing rate, and m is the time index.
  • the second threshold is a preset threshold with a relatively high value. Because exceeding the first threshold is not necessarily the beginning of a speech segment and may be caused by brief noise, the zero-crossing rate of each frame of first voice data (that is, the original voice data in and after the transition segment) must be computed. If the zero-crossing rate corresponding to a frame of first voice data is not greater than the second threshold, that frame is considered to be in a silent segment and is cut out; that is, only the first voice data whose zero-crossing rate is greater than the second threshold is retained, giving at least two frames of voice data to be tested and further cutting out the interfering voice data in the transition segment of the first voice data.
  • in this embodiment, the short-term energy calculation formula is first used to segment the original voice data: the corresponding short-term energy is obtained and the single-frame voice data whose short-term energy is greater than the first threshold is retained, which marks the starting point, indicates that the single-frame voice data after that point has entered the transition segment, and initially cuts out the silent segments in the single-frame voice data.
  • then the zero-crossing rate of each frame of first voice data (that is, the original voice data in and after the transition segment) is computed, and the first voice data whose zero-crossing rate is not greater than the second threshold is cut out, so as to obtain at least two frames of voice data to be tested whose zero-crossing rate is greater than the second threshold.
  • in this embodiment, the VAD algorithm uses this dual-threshold method to cut out the interfering voice data corresponding to silent segments in the first voice data, which is simple to implement and improves the efficiency of voice data processing.
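  • The dual-threshold segmentation of steps S22 and S23 can be sketched as follows; the threshold values are illustrative assumptions, and a production VAD would additionally track the transition, speech, and ending segments statefully rather than filtering frames independently.

      import numpy as np

      def short_term_energy(frames):
          # E(n) = sum over m of x_n(m)^2 for each single frame n.
          return np.sum(frames ** 2, axis=1)

      def zero_crossing_rate(frames):
          # Z_n = 1/2 * sum over m of |sgn[x_n(m)] - sgn[x_n(m-1)]|, with sgn[x] = +1 or -1.
          signs = np.where(frames >= 0, 1, -1)
          return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

      def dual_threshold_vad(frames, energy_threshold=1e-3, zcr_threshold=10.0):
          first_voice = frames[short_term_energy(frames) > energy_threshold]   # step S22
          return first_voice[zero_crossing_rate(first_voice) > zcr_threshold]  # step S23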
  • S30 An ASR speech feature extraction algorithm is used to perform feature extraction on each frame of voice data to be tested, obtaining the filter-bank speech features to be tested.
  • the filter-bank speech features to be tested are the filter-bank features obtained by applying the ASR speech feature extraction algorithm to the voice data to be tested.
  • filter-bank (Fbank) features are speech features commonly used in speech recognition. Because the commonly used Mel features undergo dimensionality reduction during model training or recognition, which loses some information, this embodiment uses filter-bank features instead of Mel features, which helps improve the accuracy of subsequent model recognition.
  • ASR (Automatic Speech Recognition) is a technology that converts human speech into text; it generally comprises speech feature extraction, acoustic modeling and pattern matching, and language modeling and language processing. The ASR speech feature extraction algorithm is the algorithm used in ASR technology to extract speech features.
  • because an acoustic model or speech recognition model performs recognition on speech features extracted from the voice data to be tested rather than directly on the voice data itself, feature extraction must first be performed on the voice data to be tested. In this embodiment, the ASR speech feature extraction algorithm is applied to each frame of voice data to be tested to obtain the filter-bank speech features to be tested, providing technical support for subsequent model recognition.
  • in step S30, using the ASR speech feature extraction algorithm to perform feature extraction on the voice data to be tested to obtain the filter-bank speech features to be tested specifically includes the following steps:
  • S31 Perform fast Fourier transform on the voice data to be tested for each frame, and obtain a frequency spectrum corresponding to the voice data to be tested for each frame.
  • the spectrum corresponding to the voice data to be tested is its energy spectrum in the frequency domain. Because the characteristics of a speech signal are usually difficult to see from its time-domain waveform, it is usually converted to an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech.
  • the fast Fourier transform is performed on the voice data to be tested for each frame to obtain the spectrum of the voice data to be tested for each frame, that is, the energy spectrum.
  • the fast Fourier transform (FFT) is the collective name for fast algorithms that compute the discrete Fourier transform (DFT); it converts a time-domain signal into a frequency-domain energy spectrum. Because the voice data to be tested is mainly a time-domain signal whose characteristics are hard to see, a fast Fourier transform is performed on each frame of voice data to be tested to obtain its energy distribution over the spectrum.
  • the fast Fourier transform is expressed as X_i(w) = FFT{x_i(k)}, where x_i(k) is the i-th frame of voice data to be tested in the time domain, X_i(w) is the speech signal spectrum corresponding to the i-th frame of voice data to be tested in the frequency domain, k is the time index, and w is the frequency in the speech signal spectrum.
  • the discrete Fourier transform is calculated as X_i(w) = Σ_{k=0}^{N-1} x_i(k)·e^(-j2πwk/N), where N is the number of sampling points contained in each frame of voice data to be tested. Because the DFT has high algorithmic complexity, involves a large amount of calculation, and is time-consuming when the amount of data is large, the fast Fourier transform is used to speed up the calculation and save time.
  • the fast Fourier transform exploits the properties of the twiddle (rotation) factor W_N = e^(-j2π/N) in the discrete Fourier transform formula, namely its periodicity, symmetry, and reducibility, and converts the formula using butterfly operations to reduce the algorithmic complexity.
  • the DFT of N sampling points is decomposed into butterfly operations, and the FFT consists of several stages of iterated butterfly operations.
  • it is assumed that the number of sampling points per frame of voice data to be tested is 2^L (L a positive integer); if the number of sampling points is less than 2^L, the frame can be zero-padded until the number of in-frame sampling points reaches 2^L.
  • the butterfly operation is calculated as X(k) = X'(k) + W_N^k·X''(k) and X(k + N/2) = X'(k) - W_N^k·X''(k), where X'(k) is the discrete Fourier transform of the even-indexed branch and X''(k) is the discrete Fourier transform of the odd-indexed branch.
  • through butterfly operations, the DFT of N sampling points is converted into an odd-indexed discrete Fourier transform and an even-indexed discrete Fourier transform, reducing the algorithmic complexity and achieving efficient computation.
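  • To make the butterfly decomposition concrete, here is a minimal recursive radix-2 FFT sketch, assuming the frame length is already a power of two; in practice a library routine such as numpy.fft.fft would be used instead.

      import numpy as np

      def fft_radix2(x):
          # X(k) = X'(k) + W^k * X''(k) and X(k + N/2) = X'(k) - W^k * X''(k).
          x = np.asarray(x, dtype=complex)
          n = len(x)
          if n == 1:
              return x
          even = fft_radix2(x[0::2])   # X'(k): DFT of the even-indexed branch
          odd = fft_radix2(x[1::2])    # X''(k): DFT of the odd-indexed branch
          twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)  # twiddle factors W_N^k
          return np.concatenate([even + twiddle * odd, even - twiddle * odd])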
  • S32 The spectrum is passed through a Mel filter bank to obtain the filter-bank speech features to be tested.
  • the Mel filter bank passes the energy spectrum output by the fast Fourier transform (that is, the spectrum of the voice data to be tested) through a set of Mel-scale triangular filters, defining a filter bank with M triangular filters whose center frequencies are f(m), m = 1, 2, ..., M; M is usually 22-26.
  • the Mel filter bank smooths the spectrum and acts to eliminate filtering effects; it can highlight the formant characteristics of speech and reduce the amount of calculation.
  • the logarithmic energy output by each triangular filter in the Mel filter bank is then computed, in the standard filter-bank formulation s(m) = ln(Σ_w |X_i(w)|²·H_m(w)), m = 1, 2, ..., M, where M is the number of triangular filters, m indexes the m-th triangular filter, H_m(w) is the frequency response of the m-th triangular filter, X_i(w) is the speech signal spectrum corresponding to the i-th frame of voice data to be tested, and w is the frequency in the speech signal spectrum; this logarithmic energy is the filter-bank speech feature to be tested.
  • in this embodiment, a fast Fourier transform is first performed on each frame of voice data to be tested to obtain its corresponding spectrum, reducing computational complexity, speeding up calculation, and saving time; then the spectrum is passed through the Mel filter bank and the logarithmic energy output by each triangular filter is computed to obtain the filter-bank speech features to be tested, highlighting the formant characteristics of speech and reducing the amount of calculation.
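  • Putting S31 and S32 together, the sketch below computes log filter-bank (Fbank) features from windowed frames; the 16 kHz sample rate, the 512-point FFT, M = 26 filters, and the helper mel_filterbank are assumptions made for illustration.

      import numpy as np

      def hz_to_mel(hz):
          return 2595.0 * np.log10(1.0 + hz / 700.0)

      def mel_to_hz(mel):
          return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

      def mel_filterbank(num_filters=26, nfft=512, sample_rate=16000):
          # M triangular filters H_m(w) whose center frequencies are evenly spaced on the Mel scale.
          mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
          bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
          fbank = np.zeros((num_filters, nfft // 2 + 1))
          for m in range(1, num_filters + 1):
              left, center, right = bins[m - 1], bins[m], bins[m + 1]
              for k in range(left, center):
                  fbank[m - 1, k] = (k - left) / max(center - left, 1)
              for k in range(center, right):
                  fbank[m - 1, k] = (right - k) / max(right - center, 1)
          return fbank

      def fbank_features(frames, nfft=512, sample_rate=16000, num_filters=26):
          # S31: energy spectrum |X_i(w)|^2 via FFT; S32: log energy of each Mel filter.
          spectrum = np.abs(np.fft.rfft(frames, nfft)) ** 2
          filters = mel_filterbank(num_filters, nfft, sample_rate)
          return np.log(np.dot(spectrum, filters.T) + 1e-10)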
  • S40 Use the trained ASR-LSTM speech recognition model to identify the speech features of the filter under test, and obtain the recognition probability value.
  • the ASR-LSTM speech recognition model is a model that is pre-trained to distinguish between speech and noise in the speech features of the filter under test.
  • the ASR-LSTM speech recognition model is the speech recognition model obtained by using an LSTM (long short-term memory) neural network to train on the training filter-bank speech features extracted with the ASR speech feature extraction algorithm.
  • the recognition probability value is the probability that when the ASR-LSTM speech recognition model is used to recognize the speech features of the filter under test, it is recognized as speech.
  • the recognition probability value may be a real number between 0-1.
  • the filter-bank speech features corresponding to each frame of voice data to be tested are input to the ASR-LSTM speech recognition model for recognition, so as to obtain the recognition probability value corresponding to that frame's filter-bank speech features, that is, the likelihood that it is speech.
  • S50 If the recognition probability value is greater than a preset probability value, the voice data to be tested is used as target voice data.
  • because the voice data to be tested is single-frame voice data from which the silent segments have been removed, interference from silent segments has already been eliminated. Specifically, if the recognition probability value is greater than the preset probability value, the voice data to be tested is considered not to be a noise segment; that is, the voice data to be tested whose recognition probability value is greater than the preset probability value is determined to be target voice data. Understandably, by recognizing the voice data to be tested after the silent segments have been removed, the server can exclude interfering voice data such as silent segments and noise segments from the target voice data, so that the target voice data can be used as training data to train a voiceprint model or another speech model and improve its recognition accuracy.
  • if the recognition probability value is not greater than the preset probability value, the segment of voice data to be tested is likely to be noise, and it is excluded to avoid the problem of low recognition accuracy when a model is subsequently trained on the target voice data.
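  • A minimal sketch of steps S40 and S50: frame-level features are scored by a trained model and only the frames whose speech probability exceeds the preset probability value are kept as target voice data. The predict_speech_probability callable and the 0.5 default are assumptions; the patent does not specify the preset value.

      import numpy as np

      def select_target_frames(frames, features, predict_speech_probability, preset_probability=0.5):
          # Keep the frames of voice data to be tested whose recognition probability exceeds the preset value.
          probabilities = np.asarray([predict_speech_probability(f) for f in features])
          return frames[probabilities > preset_probability]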
  • in this embodiment, the original voice data, which includes target voice data and interfering voice data, is obtained first; the VAD algorithm is used to frame and segment the original voice data so as to initially cut out the interference of the silent segments, which safeguards the subsequent acquisition of purer target voice data.
  • the ASR speech feature extraction algorithm is used to perform feature extraction on each frame of voice data to be tested to obtain the filter-bank speech features to be tested, which effectively avoids the loss of information caused by dimensionality reduction during model training.
  • if the recognition probability value is greater than the preset probability value, the voice data to be tested is considered to be target voice data, so that the acquired target voice data does not contain the cut-out interfering voice data such as silent segments and noise segments; that is, purer target voice data is obtained, which helps when the target voice data is later used as training data to train a voiceprint model or another speech model and improves the recognition accuracy of the model.
  • the speech data processing method further includes: pre-training an ASR-LSTM speech recognition model.
  • pre-training the ASR-LSTM speech recognition model includes the following steps:
  • the training voice data is the voice data that continuously changes with time obtained from the open source voice database and is used for model training.
  • the training voice data includes pure voice data and pure noise data.
  • the open-source speech database has labeled pure speech data and pure noise data for model training.
  • the ratio of pure speech data to pure noise data in the training voice data is 1:1; obtaining equal proportions of pure speech data and pure noise data effectively prevents overfitting during model training, so that the recognition effect of the model obtained by training on the training voice data is more accurate.
  • the training voice data needs to be framed to obtain at least two frames of training voice data in order to perform feature extraction for each frame of training voice data subsequently.
  • the ASR speech feature extraction algorithm is used to extract the training speech data to obtain the training filter speech features.
  • the server uses the ASR speech feature extraction algorithm to perform feature extraction on each frame of training speech data, obtains the training filter speech features carrying the timing state, and provides technical support for subsequent model training.
  • the steps of using the ASR speech feature extraction algorithm to extract features from the training speech data are the same as the feature extraction steps in step S30 and, to avoid repetition, are not described again here.
  • the training filter speech features are input into a long-term and short-term memory neural network model for training, and a trained ASR-LSTM speech recognition model is obtained.
  • a long short-term memory (LSTM) neural network model is a recurrent neural network model suitable for processing and predicting important events in time series whose intervals and delays are relatively long.
  • the LSTM model has the function of temporal memory, so it is used to process the speech features of the training filter that carry the timing state.
  • the LSTM model is one of the neural network models with long-term memory capabilities. It has a three-layer network structure of an input layer, a hidden layer, and an output layer. Among them, the input layer is the first layer of the LSTM model and is used to receive external signals, that is, it is responsible for receiving the voice characteristics of the training filter.
  • the output layer is the last layer of the LSTM model and is used to output signals to the outside world, that is, it is responsible for outputting the calculation results of the LSTM model.
  • Hidden layers are layers other than the input layer and output layer in the LSTM model. They are used to train the filter speech features to adjust the parameters of each layer of the hidden layer in the LSTM model to obtain the ASR-LSTM speech recognition model. Understandably, using the LSTM model for model training increases the temporality of the filter's speech features, thereby improving the accuracy of the ASR-LSTM speech recognition model.
  • the output layer of the LSTM model uses Softmax for regression processing, which is used to classify the output weight matrix. Softmax is a classification function commonly used in neural networks; it maps the outputs of multiple neurons into the interval [0, 1], which can be interpreted as probabilities, is simple and convenient to compute, and is therefore used for multi-class output, making the output results more accurate.
  • an equal proportion of speech data and noise data is first obtained from an open source speech database to prevent over-fitting of the model training, and the recognition effect of the speech recognition model obtained by training the speech data training is more accurate.
  • the ASR speech feature extraction algorithm is used to extract the features of each frame of training speech data to obtain the training filter speech features.
  • the speech features of the training filter are trained by using a long-term and short-term memory neural network model with temporal memory capability to obtain a trained ASR-LSTM speech recognition model, which makes the recognition accuracy of the ASR-LSTM speech recognition model high.
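  • A minimal sketch of the ASR-LSTM classifier described above, written with PyTorch as an assumed framework (the patent does not name one): an LSTM hidden layer over the sequence of training filter-bank features followed by a Softmax output layer that yields per-frame speech/noise probabilities. The layer sizes are illustrative.

      import torch
      import torch.nn as nn

      class AsrLstmClassifier(nn.Module):
          # Input layer -> LSTM hidden layer -> Softmax output layer (speech vs. noise).
          def __init__(self, num_features=26, hidden_size=128, num_classes=2):
              super().__init__()
              self.lstm = nn.LSTM(input_size=num_features, hidden_size=hidden_size, batch_first=True)
              self.output_layer = nn.Linear(hidden_size, num_classes)

          def forward(self, fbank_sequence):
              # fbank_sequence: (batch, num_frames, num_features) training filter-bank features.
              hidden_states, _ = self.lstm(fbank_sequence)
              logits = self.output_layer(hidden_states)       # per-frame scores
              return torch.softmax(logits, dim=-1)            # per-frame class probabilities

  • Training would apply a cross-entropy loss to the per-frame scores against the speech/noise labels of the balanced (1:1) training data and update the weights by back-propagation through time.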
  • step S63 the voice characteristics of the training filter are input to a long-term and short-term memory neural network model for training, and the trained ASR-LSTM speech recognition model is obtained, which specifically includes the following steps:
  • the first activation function is used to calculate the voice characteristics of the training filter to obtain the neurons carrying the identification of the activation state.
  • each neuron in the hidden layer of the long short-term memory neural network model includes three gates: an input gate, a forget gate, and an output gate.
  • the forget gate determines the past information to be discarded in the neuron.
  • the input gate determines the information to be added to the neuron.
  • the output gate determines the information to be output in the neuron.
  • the first activation function is a function for activating a neuron state.
  • the state of the neuron determines the information discarded, added, and output by each gate (ie, input gate, forget gate, and output gate).
  • the activation status flag includes a pass flag and a fail flag.
  • the identifiers corresponding to the input gate, the forget gate, and the output gate in this embodiment are i, f, and o, respectively.
  • in this embodiment, the Sigmoid (S-shaped growth curve) function is specifically selected as the first activation function.
  • the Sigmoid function is an S-shaped function common in biology and is often used as the threshold function of neural networks because it maps variables into the interval (0, 1).
  • the calculation formula of the first activation function is σ(z) = 1/(1 + e^(-z)), where z represents the output value of the forget gate.
  • the activation state of each neuron (training filter voice feature) is calculated to obtain the neuron that carries the activation state identifier as the pass identifier.
  • the forget gate also includes a forgetting threshold, a scalar in the interval 0-1; according to the current state and the past state, this scalar determines the proportion of past information that the neuron receives, so as to reduce the dimensionality of the data, reduce the amount of calculation, and improve training efficiency.
  • a second activation function is used to calculate the neuron carrying the identification of the activation state to obtain the output value of the hidden layer of the long-term and short-term memory neural network model.
  • the output value of the hidden layer of the long short-term memory neural network model includes the output value of the input gate, the output value of the output gate, and the state of the neuron.
  • in this embodiment, a second activation function is used to perform the calculation on the neurons carrying the activation-state identifier to obtain the output value of the hidden layer.
  • a tanh (hyperbolic tangent) function is used as the activation function of the input gate (ie, the second activation function).
  • adding non-linear factors enables the trained ASR-LSTM speech recognition model to solve more complex problems.
  • the activation function tanh has the advantage of fast convergence speed, which can save training time and increase training efficiency.
  • the output value of the input gate is calculated by the input-gate formula, in the standard LSTM formulation i_t = σ(W_i·[h_(t-1), x_t] + b_i); the input gate also includes an input threshold.
  • the state of the neuron is calculated by the cell-state formula, in the standard LSTM formulation C_t = f_t·C_(t-1) + i_t·tanh(W_c·[h_(t-1), x_t] + b_c), where W_c represents the weight matrix of the neuron state, b_c represents the bias term of the neuron state, and C_t represents the state of the neuron at the current moment.
  • the weights are updated according to a weight update formula by error back-propagation, where t is the time step, W is a weight matrix such as W_i, W_c, W_o, or W_f, the corresponding gate output is a value such as i_t, f_t, or o_t, δ denotes the error term, C_(t-1) is the state of the neuron at the previous moment, and h_(t-1) is the output value of the hidden layer at the previous moment.
  • the offsets are updated according to an offset update formula, where b is the bias term of each gate and δ_(a,t) represents the error of each gate at time t.
  • according to the weight update formula the updated weights are obtained, and according to the offset update formula the offsets are updated; applying the updated weights and offsets of each layer to the long short-term memory neural network model yields the trained ASR-LSTM speech recognition model.
  • each weight in the ASR-LSTM speech recognition model implements the functions of the ASR-LSTM speech recognition model to decide which old information to discard, which new information to add, and which information to output.
  • the probability value will eventually be output.
  • the probability value indicates the probability that the training speech data is determined to be speech data after being recognized by the ASR-LSTM speech recognition model, and can be widely used in speech data processing to achieve the purpose of accurately identifying the speech features of the training filter.
  • in this embodiment, the first activation function is used in the hidden layer of the long short-term memory neural network model to calculate the training filter-bank speech features and obtain the neurons carrying the activation-state identifier, so as to reduce the dimensionality of the data, reduce the amount of calculation, and improve training efficiency.
  • then a second activation function is used to calculate the neurons carrying the activation-state identifier to obtain the output value of the hidden layer of the long short-term memory neural network model.
  • finally, based on that output value, the long short-term memory neural network model is updated by error back-propagation to obtain the updated weights and offsets; applying the updated weights and offsets to the long short-term memory neural network model yields the ASR-LSTM speech recognition model, which can be widely used in speech data processing to accurately recognize the training filter-bank speech features.
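  • The gate computations discussed above can be made concrete with a single LSTM cell step in NumPy; this follows the standard LSTM formulation rather than text taken verbatim from the patent, and the stacked weight layout is an illustrative assumption.

      import numpy as np

      def sigmoid(z):
          # First activation function: maps a gate pre-activation into (0, 1).
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_cell_step(x_t, h_prev, c_prev, W, b):
          # W maps [h_prev, x_t] to the stacked pre-activations of the forget gate f,
          # input gate i, candidate state g, and output gate o; b is the stacked bias.
          concat = np.concatenate([h_prev, x_t])
          f, i, g, o = np.split(np.dot(W, concat) + b, 4)
          f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates via the first activation function
          g = np.tanh(g)                                 # candidate state via the second activation function
          c_t = f * c_prev + i * g                       # C_t = f_t * C_(t-1) + i_t * g_t
          h_t = o * np.tanh(c_t)                         # output value of the hidden layer
          return h_t, c_t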
  • in an embodiment, a voice data processing device is provided that corresponds to the voice data processing method in the above embodiment. As shown in FIG. 7, the voice data processing device includes an original voice data acquisition module 10, a to-be-tested voice data acquisition module 20, a to-be-tested filter-bank feature acquisition module 30, a recognition probability value acquisition module 40, and a target voice data acquisition module 50.
  • the detailed description of each function module is as follows:
  • the original voice data acquisition module 10 is configured to acquire original voice data.
  • the to-be-tested voice data acquisition module 20 is configured to use the VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested.
  • the to-be-tested filter-bank feature acquisition module 30 is configured to use the ASR speech feature extraction algorithm to perform feature extraction on each frame of voice data to be tested to obtain the filter-bank speech features to be tested.
  • the recognition probability value acquisition module 40 is configured to use the trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain the recognition probability value.
  • the target voice data acquisition module 50 is configured to use the voice data to be tested as target voice data if the recognition probability value is greater than a preset probability value.
  • the to-be-tested voice data acquisition module 20 includes a single-frame voice data acquisition unit 21, a first voice data acquisition unit 22, and a to-be-tested voice data acquisition unit 23.
  • the single-frame voice data acquisition unit 21 is configured to perform frame processing on the original voice data to obtain at least two frames of single-frame voice data.
  • the first voice data acquisition unit 22 is configured to use the short-term energy calculation formula to segment the single-frame voice data, obtain the corresponding short-term energy, and retain the single-frame voice data whose short-term energy is greater than the first threshold as the first voice data.
  • the to-be-tested voice data acquisition unit 23 is configured to use the zero-crossing rate calculation formula to segment the first voice data, obtain the corresponding zero-crossing rate, and retain the first voice data whose zero-crossing rate is greater than the second threshold to obtain at least two frames of voice data to be tested.
  • the short-term energy calculation formula is E(n) = Σ_{m=0}^{N-1} x_n²(m), where N is the frame length of a single frame of voice data, x_n(m) is the n-th frame of single-frame voice data, E(n) is the short-term energy, and m is the time index.
  • the zero-crossing rate calculation formula is Z_n = (1/2) Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|, where sgn[] is the sign function, x_n(m) is the n-th frame of first voice data, Z_n is the zero-crossing rate, and m is the time index.
  • the voice feature acquisition module 30 of the filter to be tested includes a spectrum acquisition unit 31 and a voice feature acquisition unit 32 of the filter to be tested.
  • the frequency spectrum acquiring unit 31 is configured to perform fast Fourier transform on each frame of voice data to be tested to acquire a frequency spectrum corresponding to the voice data to be tested.
  • the voice feature acquisition unit 32 of the filter under test is configured to pass the frequency spectrum through the Mel filter bank to obtain the voice feature of the filter under test.
  • the speech data processing device further includes an ASR-LSTM speech recognition model training module 60 for pre-training the ASR-LSTM speech recognition model.
  • the ASR-LSTM speech recognition model training module 60 includes a training speech data acquisition unit 61, a training filter speech feature acquisition unit 62, and an ASR-LSTM speech recognition model acquisition unit 63.
  • the training voice data acquiring unit 61 is configured to acquire training voice data.
  • the training filter speech feature obtaining unit 62 is configured to use ASR speech feature extraction algorithm to perform feature extraction on the training speech data to obtain the training filter speech features.
  • the ASR-LSTM speech recognition model acquisition unit 63 is configured to input the training filter speech features into a long-term and short-term memory neural network model for training, and obtain a trained ASR-LSTM speech recognition model.
  • the ASR-LSTM speech recognition model acquisition unit 63 includes an activation state neuron acquisition subunit 631, a model output value acquisition subunit 632, and an ASR-LSTM speech recognition model acquisition subunit 633.
  • the activation state neuron acquisition subunit 631 is configured to calculate a speech filter feature of a training filter by using a first activation function in a hidden layer of a long-term and short-term memory neural network model to obtain a neuron carrying an activation state identifier.
  • the model output value acquisition subunit 632 is configured to calculate a neuron carrying an activation state identifier in a hidden layer of the long-term and short-term memory neural network model by using a second activation function to obtain an output value of the hidden layer of the long-term and short-term memory neural network model.
  • the ASR-LSTM speech recognition model acquisition subunit 633 is configured to perform error back propagation update of the long-term and short-term memory neural network model based on the output value of the hidden layer of the long-term and short-term memory neural network model to obtain a trained ASR-LSTM speech recognition model.
  • Each module in the above-mentioned voice data processing device may be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the hardware in or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • the database of the computer equipment is used to store data generated or obtained during the execution of the voice data processing method, such as target voice data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by a processor to implement a voice data processing method.
  • a computer device including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented: obtaining original voice data; using the VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested; using the ASR speech feature extraction algorithm to perform feature extraction on each frame of voice data to be tested to obtain the filter-bank speech features to be tested; using the trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value; and, if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: performing frame processing on the original voice data to obtain at least two frames of single-frame voice data; using the short-term energy calculation formula to segment the single-frame voice data, obtain the corresponding short-term energy, and retain the single-frame voice data whose short-term energy is greater than the first threshold as the first voice data; and using the zero-crossing rate calculation formula to segment the first voice data, obtain the corresponding zero-crossing rate, and retain the first voice data whose zero-crossing rate is greater than the second threshold to obtain at least two frames of voice data to be tested.
  • the short-term energy calculation formula is E(n) = Σ_{m=0}^{N-1} x_n²(m), where N is the frame length of a single frame of speech data, x_n(m) is the n-th frame of single-frame speech data, E(n) is the short-term energy, and m is the time index.
  • the zero-crossing rate calculation formula is Z_n = (1/2) Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|, where sgn[] is the sign function, x_n(m) is the n-th frame of first voice data, Z_n is the zero-crossing rate, and m is the time index.
  • the processor when the processor executes the computer-readable instructions, the following steps are further implemented: performing fast Fourier transform on each frame of voice data to be tested to obtain a frequency spectrum corresponding to the voice data to be tested; passing the spectrum through a Mel filter Group to obtain the voice characteristics of the filter under test.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: obtaining training voice data; using the ASR speech feature extraction algorithm to perform feature extraction on the training voice data to obtain the training filter-bank speech features; and inputting the training filter-bank speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
  • the processor when the processor executes the computer-readable instructions, the processor further implements the following steps: in the hidden layer of the long-term and short-term memory neural network model, the first activation function is used to calculate the voice characteristics of the training filter to obtain the nerve carrying the activation state identifier. Element; in the hidden layer of the long-term and short-term memory neural network model, the second activation function is used to calculate the neuron carrying the activation status identifier to obtain the output value of the hidden layer of the long-term and short-term memory neural network model; based on the hidden layer of the long-term and short-term memory neural network model The output value is used to update the long-term and short-term memory neural network model by error back propagation to obtain the ASR-LSTM speech recognition model.
  • one or more non-volatile readable storage media storing computer-readable instructions are provided, and when the computer-readable instructions are executed by one or more processors, the one or more Each processor performs the following steps: obtaining the original voice data; using the VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested; and using an ASR voice feature extraction algorithm for each frame of voice data to be tested Perform feature extraction to obtain the voice characteristics of the filter under test; use the trained ASR-LSTM speech recognition model to identify the voice characteristics of the filter under test to obtain the recognition probability value; if the recognition probability value is greater than a preset probability value, the Speech data is used as the target speech data.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further implemented: performing frame processing on the original voice data to obtain at least two frames of single-frame voice data; using the short-term energy calculation formula to segment the single-frame voice data, obtain the corresponding short-term energy, and retain the single-frame voice data whose short-term energy is greater than the first threshold as the first voice data; and using the zero-crossing rate calculation formula to segment the first voice data, obtain the corresponding zero-crossing rate, and retain the first voice data whose zero-crossing rate is greater than the second threshold to obtain at least two frames of voice data to be tested.
  • the short-term energy calculation formula is E(n) = Σ_{m=0}^{N-1} x_n²(m), where N is the frame length of a single frame of speech data, x_n(m) is the n-th frame of single-frame speech data, E(n) is the short-term energy, and m is the time index.
  • the zero-crossing rate calculation formula is Z_n = (1/2) Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|, where sgn[] is the sign function, x_n(m) is the n-th frame of first voice data, Z_n is the zero-crossing rate, and m is the time index.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further implemented: performing a fast Fourier transform on each frame of voice data to be tested to obtain the spectrum corresponding to the voice data to be tested; and passing the spectrum through the Mel filter bank to obtain the filter-bank speech features to be tested.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further implemented: obtaining training voice data; using the ASR speech feature extraction algorithm to perform feature extraction on the training voice data to obtain the training filter-bank speech features; and inputting the training filter-bank speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
  • when the computer-readable instructions are executed by the one or more processors, the following steps are further implemented: in the hidden layer of the long short-term memory neural network model, using the first activation function to calculate the training filter-bank speech features to obtain the neurons carrying the activation-state identifier; in the hidden layer of the long short-term memory neural network model, using the second activation function to calculate the neurons carrying the activation-state identifier to obtain the output value of the hidden layer of the long short-term memory neural network model; and, based on the output value of the hidden layer of the long short-term memory neural network model, updating the long short-term memory neural network model by error back-propagation to obtain the trained ASR-LSTM speech recognition model.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), dual data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice data processing method and apparatus, a computer device, and a storage medium. The voice data processing method includes: obtaining original voice data (S10); using a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested (S20); using an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested (S30); using a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value (S40); and, if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data (S50). The voice data processing method can effectively remove the interference of noise and silence and improve the accuracy of model recognition.

Description

Voice data processing method, apparatus, computer device, and storage medium
This patent application is based on, and claims priority from, Chinese invention patent application No. 201810561725.6, filed on June 4, 2018 and entitled "Voice data processing method, apparatus, computer device and storage medium".
TECHNICAL FIELD
The present application relates to the technical field of speech recognition, and in particular to a voice data processing method, apparatus, computer device, and storage medium.
BACKGROUND
Voice Activity Detection (VAD), also known as speech endpoint detection or speech boundary detection, identifies and eliminates long periods of silence from an audio signal stream so that channel resources can be saved without degrading service quality.
At present, training or using a speech recognition model requires relatively clean speech data, but the available speech data is often mixed with noise or silence. As a result, a speech recognition model trained on noisy speech data has low accuracy, which hinders the popularization and application of speech recognition models.
SUMMARY
Based on this, it is necessary to provide a voice data processing method, apparatus, computer device, and storage medium for the above technical problem, so as to solve the technical problem of the low accuracy of speech recognition models in the prior art.
A voice data processing method includes:
obtaining original voice data;
using a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested;
using an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested;
using a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value;
if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
A voice data processing apparatus includes:
an original voice data acquisition module, configured to obtain original voice data;
a to-be-tested voice data acquisition module, configured to use a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested;
a to-be-tested filter-bank feature acquisition module, configured to use an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested;
a recognition probability value acquisition module, configured to use a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value;
a target voice data acquisition module, configured to use the voice data to be tested as target voice data if the recognition probability value is greater than a preset probability value.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining original voice data;
using a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested;
using an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested;
using a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value;
if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
One or more non-volatile readable storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining original voice data;
using a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested;
using an ASR speech feature extraction algorithm to perform feature extraction on each frame of the voice data to be tested to obtain filter-bank speech features to be tested;
using a trained ASR-LSTM speech recognition model to recognize the filter-bank speech features to be tested to obtain a recognition probability value;
if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is an application environment diagram of a voice data processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of a voice data processing method according to an embodiment of the present application;
FIG. 3 is a specific flowchart of step S20 in FIG. 2;
FIG. 4 is a specific flowchart of step S30 in FIG. 2;
FIG. 5 is another flowchart of a voice data processing method according to an embodiment of the present application;
FIG. 6 is a specific flowchart of step S63 in FIG. 5;
FIG. 7 is a schematic diagram of a voice data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The voice data processing method provided in the present application can be applied in the application environment shown in FIG. 1, in which a computer device communicates with a server through a network. The computer device can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as a stand-alone server.
Specifically, the voice data processing method is applied to a computer device configured by a financial institution such as a bank, a securities firm, or an insurance company, or by another organization, and is used to preprocess original voice data to obtain training data, so that the training data can be used to train a voiceprint model or another speech model and improve the accuracy of model recognition.
In an embodiment, as shown in FIG. 2, a voice data processing method is provided. The method is described using its application to the server in FIG. 1 as an example, and includes the following steps:
S10: Obtain original voice data.
The original voice data is speaker voice data recorded with a recording device, and it is unprocessed voice data. In this embodiment, the original voice data may be voice data in wav, mp3, or another format. The original voice data includes target voice data and interfering voice data. The target voice data refers to the portion of the original voice data in which the voiceprint changes continuously and noticeably, and is generally the speaker's voice. Correspondingly, the interfering voice data refers to the portion of the original voice data other than the target voice data, that is, voice other than the speaker's voice. Specifically, the interfering voice data includes silent segments and noise segments. A silent segment is a portion of the original voice data in which nothing is spoken, for example the portions collected while the speaker pauses to think or breathe and produces no sound. A noise segment is the portion of the original voice data corresponding to environmental noise; sounds such as doors and windows opening and closing or objects colliding can be regarded as noise segments.
S20: Use a VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested.
The voice data to be tested is the original voice data obtained after the silent segments in the interfering voice data are cut out using the VAD algorithm. The VAD (Voice Activity Detection) algorithm accurately locates the start and end of the target voice data in a noisy environment. The VAD algorithm can be used to identify and eliminate long silent segments from the signal stream of the original voice data, so as to remove the silent-segment part of the interfering voice data and improve the precision of voice data processing.
A frame is the smallest unit of observation in speech data, and framing is the process of dividing the speech data along its time axis. Although the original speech data is not stationary as a whole, it can be regarded as stationary locally, so the original speech data is framed to obtain relatively stationary single-frame voice data. Speech recognition and voiceprint recognition require a stationary input signal, so the server first performs frame processing on the original voice data.
Segmentation is the process of cutting out the single-frame voice data that belongs to silent segments in the original voice data. In this embodiment, the VAD algorithm is used to segment the framed original voice data and remove the silent segments to obtain at least two frames of voice data to be tested.
In an embodiment, as shown in FIG. 3, step S20 of using the VAD algorithm to frame and segment the original voice data to obtain at least two frames of voice data to be tested specifically includes the following steps:
S21: Perform frame processing on the original voice data to obtain at least two frames of single-frame voice data.
Framing collects N sampling points into one observation unit called a frame. Usually N is 256 or 512, covering about 20-30 ms. To avoid excessive change between adjacent frames, adjacent frames are made to overlap by a region containing M sampling points, where M is usually about 1/2 or 1/3 of N; this process is called framing. Specifically, after the original voice data is framed, at least two frames of single-frame voice data are obtained, and each frame of single-frame voice data contains N sampling points.
Further, in the at least two frames of single-frame voice data obtained after framing, discontinuities appear at the beginning and end of each frame, and the more frames there are, the larger the error between the framed single-frame voice data and the original voice data. To make the framed single-frame voice data continuous, so that each frame exhibits the characteristics of a periodic function, windowing and pre-emphasis are applied to each single frame of voice data to obtain higher-quality single-frame voice data.
Windowing multiplies each frame by a Hamming window. Because the amplitude-frequency characteristic of the Hamming window has large sidelobe attenuation, windowing the single-frame voice data increases the continuity between the left and right ends of each frame; that is, windowing the framed single-frame speech data converts a non-stationary speech signal into a short-term stationary one. If the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame size, and the window signal is W(n), the windowed signal is S'(n) = S(n) × W(n), where W(n) = (1 - a) - a·cos(2πn/(N - 1)); N is the frame size, different values of a produce different Hamming windows, and a is generally taken as 0.46.
To increase the amplitude of the high-frequency components of the speech signal relative to its low-frequency components and eliminate the effects of glottal excitation and lip and nasal radiation, pre-emphasis is applied to the single-frame voice data, which helps improve the signal-to-noise ratio. The signal-to-noise ratio refers to the ratio of signal to noise in an electronic device or electronic system.
Pre-emphasis passes the windowed single-frame voice data through a high-pass filter H(Z) = 1 - μz⁻¹, where μ is between 0.9 and 1.0 and Z represents the single-frame voice data. The goal of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes smoother and can be computed with the same signal-to-noise ratio over the whole band from low to high frequencies, highlighting the high-frequency formants.
Understandably, preprocessing the original voice data by framing, windowing, and pre-emphasis gives preprocessed single-frame voice data with high resolution, good stationarity, and a small error relative to the original voice data, which improves the efficiency and quality of obtaining at least two frames of voice data to be tested when the at least two frames of single-frame voice data are subsequently segmented.
S22: Use the short-term energy calculation formula to segment the single-frame voice data, obtain the short-term energy corresponding to the single-frame voice data, and retain the single-frame voice data whose short-term energy is greater than a first threshold as first voice data.
The short-term energy calculation formula is E(n) = Σ_{m=0}^{N-1} x_n²(m), where N is the frame length of a single frame of voice data, x_n(m) is the n-th frame of single-frame voice data, E(n) is the short-term energy, and m is the time index.
Short-term energy refers to the energy of one frame of the voice signal. The first threshold is a preset threshold with a relatively low value. The first voice data refers to the single-frame voice data whose corresponding short-term energy is greater than the first threshold. The VAD algorithm can detect four parts of speech in the single-frame voice data: silent segment, transition segment, speech segment, and ending segment. Specifically, the short-term energy calculation formula is applied to each frame of single-frame voice data to obtain the short-term energy corresponding to each frame, and the single-frame voice data whose short-term energy is greater than the first threshold is retained as first voice data. In this embodiment, retaining the single-frame voice data whose short-term energy is greater than the first threshold marks the starting point and indicates that the single-frame voice data after that point has entered the transition segment; that is, the first voice data finally obtained includes the transition segment, the speech segment, and the ending segment. Understandably, the first voice data obtained based on short-term energy is what remains after cutting out the single-frame voice data whose short-term energy is not greater than the first threshold, which removes the silent-segment portion of the interfering voice data from the single-frame voice data.
S23: Use the zero-crossing rate calculation formula to segment the first voice data, obtain the zero-crossing rate corresponding to the first voice data, and retain the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of voice data to be tested.
The zero-crossing rate calculation formula is Z_n = (1/2) Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|, where sgn[] is the sign function (equal to 1 when its argument is non-negative and -1 otherwise), x_n(m) is the n-th frame of first voice data, Z_n is the zero-crossing rate, and m is the time index.
The second threshold is a preset threshold with a relatively high value. Because exceeding the first threshold is not necessarily the beginning of a speech segment and may be caused by brief noise, the zero-crossing rate of each frame of first voice data (that is, the original voice data in and after the transition segment) must be calculated. If the zero-crossing rate corresponding to a frame of first voice data is not greater than the second threshold, that frame of first voice data is considered to be in a silent segment and is cut out; that is, only the first voice data whose zero-crossing rate is greater than the second threshold is retained, giving at least two frames of voice data to be tested and further cutting out the interfering voice data in the transition segment of the first voice data.
In this embodiment, the short-term energy calculation formula is first used to segment the original voice data: the corresponding short-term energy is obtained and the single-frame voice data whose short-term energy is greater than the first threshold is retained, which marks the starting point, indicates that the single-frame voice data after that point has entered the transition segment, and initially cuts out the silent segments in the single-frame voice data. Then the zero-crossing rate of each frame of first voice data (that is, the original voice data in and after the transition segment) is calculated, and the first voice data whose zero-crossing rate is not greater than the second threshold is cut out, so as to obtain at least two frames of voice data to be tested whose zero-crossing rate is greater than the second threshold. In this embodiment, the VAD algorithm uses this dual-threshold method to cut out the interfering voice data corresponding to silent segments in the first voice data, which is simple to implement and improves the efficiency of voice data processing.
S30: Use an ASR speech feature extraction algorithm to perform feature extraction on each frame of voice data to be tested to obtain the filter-bank speech features to be tested.
The filter-bank speech features to be tested are the filter-bank features obtained by applying the ASR speech feature extraction algorithm to the voice data to be tested. Filter-bank (Fbank) features are speech features commonly used in speech recognition. Because the commonly used Mel features undergo dimensionality reduction during model training or recognition, which loses some information, this embodiment uses filter-bank features instead of Mel features, which helps improve the accuracy of subsequent model recognition. ASR (Automatic Speech Recognition) is a technology that converts human speech into text, and generally comprises three parts: speech feature extraction, acoustic modeling and pattern matching, and language modeling and language processing. The ASR speech feature extraction algorithm is the algorithm used in ASR technology to extract speech features.
Because an acoustic model or speech recognition model performs recognition on speech features extracted from the voice data to be tested rather than directly on the voice data itself, feature extraction must first be performed on the voice data to be tested. In this embodiment, the ASR speech feature extraction algorithm is applied to each frame of voice data to be tested to obtain the filter-bank speech features to be tested, providing technical support for subsequent model recognition.
在一实施例中,如图4所示,步骤S30中,即采用ASR语音特征提取算法对待测语音数据进行特征提取,获取待测滤波器语音特征,具体包括如下步骤:
S31:对每一帧待测语音数据进行快速傅里叶变换,获取与每一帧待测语音数据对应的频谱。
其中,待测语音数据对应的频谱是指待测语音数据在频域上的能量谱。由于语音信号在时域上的变换通常很难看出信号的特性,通常需将它转换为频域上的能量分布来观察,不同的能量分布代表不同语音的特性。本实施例中对每一帧待测语音数据进行快速傅里叶变换得到各帧待测语音数据频谱,即能量谱。
快速傅里叶变换(Fast Fourier Transform,以下简称FFT)是离散傅里叶变换(Discrete Fourier Transform,以下简称DFT)的各种快速算法的统称。快速傅里叶变换用于将时域信号转换为频域能量谱。由于待测语音数据是对原始语音数据进行预处理和语音活动检测处理后的信号，主要体现为时域上的信号，很难看出信号的特性，因此，需对每一帧待测语音数据进行快速傅里叶变换以得到其在频谱上的能量分布。
快速傅里叶变换的公式为X_i(w) = FFT{x_i(k)}；其中，x_i(k)为时域上的第i帧待测语音数据，X_i(w)为频域上的第i帧待测语音数据对应的语音信号频谱，k表示时间序列，w表示语音信号频谱中的频率。具体地，离散傅里叶变换的计算公式为
X(k) = Σ_{n=0}^{N-1} x(n)·W_N^{nk}，k = 0,1,…,N−1
其中，旋转因子
W_N = e^(−j2π/N)
N为每一帧待测语音数据所包含的采样点数。由于在数据量较大时，DFT的算法复杂度高，计算量较大，耗费时间，因此采用快速傅里叶变换进行计算，以加快计算速度，节省时间。具体地，快速傅里叶变换是利用离散傅里叶变换公式中的旋转因子W_N^{nk}的特性，即周期性、对称性和可约性，采用蝶形运算对上述公式进行转换，以降低算法复杂度。
具体地，FFT运算由若干级迭代的蝶形运算组成，即将N个采样点的DFT运算逐级分解为蝶形运算。假设每一帧待测语音数据的采样点数为2^L个(L为正整数)，若采样点不足2^L个，可以用0补位，直到帧内采样点数达到2^L个，则蝶形运算的计算公式为
X(k) = X'(k') + W_N^k·X''(k'')，X(k + N/2) = X'(k') − W_N^k·X''(k'')，k = 0,1,…,N/2−1
其中，X'(k')为偶数项分支的离散傅立叶变换，X''(k'')为奇数项分支的离散傅立叶变换。通过蝶形运算将N个采样点的DFT运算转换为奇数项离散傅里叶变换和偶数项离散傅里叶变换进行计算，降低算法复杂度，实现高效运算的目的。
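实践中可借助NumPy的FFT直接得到每一帧待测语音数据的频谱与能量谱，以下为示例代码(NFFT取512仅为示例假设)：

import numpy as np

def frame_power_spectrum(frames, nfft=512):
    # 对每一帧待测语音数据做快速傅里叶变换，返回能量谱 |X_i(w)|^2
    spectrum = np.fft.rfft(frames, n=nfft, axis=1)        # X_i(w) = FFT{x_i(k)}
    return np.abs(spectrum) ** 2                          # 形状为(帧数, nfft//2+1)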
S32:将频谱通过Mel滤波器组,获取待测滤波器语音特征。
其中，将频谱通过Mel滤波器组是指将快速傅里叶变换输出的能量谱(即待测语音数据的频谱)通过一组Mel(梅尔)尺度的三角滤波器组：定义一个有M个滤波器的滤波器组，采用的滤波器为三角滤波器，中心频率为f(m)，m=1,2,...,M，M通常取22-26。梅尔滤波器组用于对频谱进行平滑化，并起到消除谐波的作用，可以突出语音的共振峰特征，可降低运算量。然后计算梅尔滤波器组中每个三角滤波器输出的对数能量
s(m) = ln( Σ_w |X_i(w)|²·H_m(w) )，m = 1,2,…,M
其中，M是三角滤波器的个数，m表示第m个三角滤波器，H_m(w)表示第m个三角滤波器的频率响应，X_i(w)表示第i帧待测语音数据对应的语音信号频谱，w表示语音信号频谱中的频率，该对数能量即为待测滤波器语音特征。
本实施例中，先对每一帧待测语音数据进行快速傅里叶变换，获取与每一帧待测语音数据对应的频谱，以降低运算复杂度、加快计算速度，节省时间。然后，将频谱通过Mel滤波器组并计算梅尔滤波器组中每个三角滤波器输出的对数能量，获取待测滤波器语音特征，以消除谐波、突出语音的共振峰特征，降低运算量。
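下面给出构造Mel三角滤波器组并计算对数能量(即Fbank特征)的一个示意实现(采样率sr=16000、滤波器个数num_filters=26、nfft=512均为示例假设)：

import numpy as np

def mel_filterbank(num_filters=26, nfft=512, sr=16000):
    # 构造M个Mel尺度三角滤波器H_m(w)
    def hz2mel(f): return 2595 * np.log10(1 + f / 700.0)
    def mel2hz(m): return 700 * (10 ** (m / 2595.0) - 1)
    mel_points = np.linspace(hz2mel(0), hz2mel(sr / 2), num_filters + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_points) / sr).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def fbank_features(power_spec, fbank, eps=1e-10):
    # s(m) = ln( sum_w |X_i(w)|^2 * H_m(w) )，即待测滤波器语音特征
    return np.log(np.dot(power_spec, fbank.T) + eps)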
S40:采用训练好的ASR-LSTM语音识别模型对待测滤波器语音特征进行识别,获取识别概率值。
其中,ASR-LSTM语音识别模型是预先训练好的用于区分待测滤波器语音特征中的语音和噪音的模型。具体地,ASR-LSTM语音识别模型是采用LSTM(long-short term memory,长短时记忆神经网络)对采用ASR语音特征提取算法提取出的训练滤波器语音特征进行训练后获得的语音识别模型。识别概率值是采用ASR-LSTM语音识别模型对待测滤波器语音特征进行识别时,识别其为语音的概率。该识别概率值可以为0-1之间的实数。具体地,将每一帧待测语音数据对应的待测滤波器语音特征输入到ASR-LSTM语音识别模型中进行识别,以获取每一帧待测滤波器语音特征对应的识别概率值,即为语音的可能性。
S50:若识别概率值大于预设概率值,则将待测语音数据作为目标语音数据。
由于待测语音数据是去除了静音段的单帧语音数据,因此排除了静音段的干扰。具体地,若识别概率值大于预设概率值,则认为该待测语音数据不为噪音段,即将识别概率值大于预设概率值的待测语音数据确定为目标语音数据。可以理解地,服务器通过对已去除静音段的待测语音数据进行识别,可排除目标语音数据中携带静音段和噪音段等干扰语音数据,以便采用目标语音数据作为训练数据对声纹模型或其他语音模型进行训练,以提高模型的识别准确率。若识别概率值不大于预设概率值,则证明该段待测语音数据很可能为噪音,将该段待测语音数据排除,以避免后续基于目标语音数据训练模型时,导致训练所得的模型识别准确率不高的问题。
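将每一帧待测滤波器语音特征送入训练好的模型并按预设概率值筛选的过程，可用如下Python伪代码示意(其中model.predict、preset_prob均为示例假设的接口与参数，并非特定框架的实现)：

def select_target_frames(model, fbank_feats, frames, preset_prob=0.5):
    # 逐帧识别：识别概率值大于预设概率值的待测语音数据作为目标语音数据
    target = []
    for feat, frame in zip(fbank_feats, frames):
        prob = model.predict(feat)          # 示例接口：返回该帧为语音的概率(0-1)
        if prob > preset_prob:              # 大于预设概率值则保留为目标语音数据
            target.append(frame)
    return target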
本实施例中，先获取原始语音数据，该原始语音数据包括目标语音数据和干扰语音数据，采用VAD算法对原始语音数据进行分帧和切分处理，以便初步地切除静音段的干扰，为后续获取较纯净的目标语音数据提供保障。采用ASR语音特征提取算法对每一帧待测语音数据进行特征提取，获取待测滤波器语音特征，有效解决了模型训练时对数据进行降维处理造成部分信息丢失的问题。若识别概率值大于预设概率值，则认为该待测语音数据为目标语音数据，使得获取的目标语音数据不包含静音段和噪音段等被切除的干扰语音数据，即获取较纯净的目标语音数据，有助于后续利用目标语音数据作为训练数据对声纹模型或其他语音模型进行训练，以提高模型的识别准确率。
在一实施例中,该语音数据处理方法还包括:预先训练ASR-LSTM语音识别模型。
如图5所示,预先训练ASR-LSTM语音识别模型,具体包括如下步骤:
S61:获取训练语音数据。
其中,训练语音数据是从开源语音数据库中获取的随时间连续变化的语音数据,用于进行模型训练。该训练语音数据包括纯净的语音数据和纯净的噪音数据。开源语音数据库中已经将纯净的语音数据和纯净的噪音数据进行标记,以便进行模型训练。该训练语音数据中纯净的语音数据和纯净的噪音数据的比例为1:1,即获取同等比例的纯净的语音数据和纯净的噪音数据,能够有效防止模型训练过拟合的情况,以使通过训练语音数据训练所获得的模型的识别效果更加精准。本实施例中,在服务器获取训练语音数据之后,还需要对训练语音数据进行分帧,获取至少两帧训练语音数据,以便后续对每一帧训练语音数据进行特征提取。
S62:采用ASR语音特征提取算法对训练语音数据进行特征提取,获取训练滤波器语音特征。
由于声学模型训练是基于训练语音数据进行特征提取后的语音特征进行训练，而不是直接基于训练语音数据进行训练，因此，需先对训练语音数据进行特征提取，以获取训练滤波器语音特征。可以理解地，由于训练语音数据是具备时序性的，因此对每一帧训练语音数据进行特征提取所获取的训练滤波器语音特征是具备时序性的。具体地，服务器采用ASR语音特征提取算法对每一帧训练语音数据进行特征提取，获取携带时序状态的训练滤波器语音特征，为后续模型训练提供技术支持。本实施例中，采用ASR语音特征提取算法对训练语音数据进行特征提取的步骤与步骤S30的特征提取的步骤相同，为避免赘述，在此不再重复。
S63:将训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型。
其中,长短时记忆神经网络(long-short term memory,以下简称LSTM)模型是一种时间递归神经网络模型,适合于处理和预测具有时间序列,且时间序列间隔和延迟相对较长的重要事件。LSTM模型具有时间记忆功能,因而用来处理携带时序状态的训练滤波器语音特征。LSTM模型是具有长时记忆能力的神经网络模型中的一种,具有输入层、隐藏层和输出层这三层网络结构。其中,输入层是LSTM模型的第一层,用于接收外界信号,即负责接收训练滤波器语音特征。输出层是LSTM模型的最后一层,用于向外界输出信号,即负责输出LSTM模型的计算结果。隐藏层是LSTM模型中除输入层和输出层之外的各层,用于对滤波器语音特征进行训练,以调整LSTM模型中隐藏层的各层的参数,以获取ASR-LSTM语音识别模型。可以理解地,采用LSTM模型进行模型训练增加了滤波器语音特征的时序性,从而提高了ASR-LSTM语音识别模型的准确率。本实施例中,LSTM模型的输出层采用Softmax(回归模型)进行回归处理,用于分类输出权重矩阵。Softmax(回归模型)是一种常用于神经网络的分类函数,它将多个神经元的输出,映射到[0,1]区间内,可以理解成概率,计算起来简单方便,从而来进行多分类输出,使其输出结果更准确。
本实施例中，先从开源语音数据库中获取同等比例的语音数据和噪音数据，以防止模型训练过拟合的情况，使通过训练语音数据训练获得的语音识别模型的识别效果更加精准。然后，采用ASR语音特征提取算法对每帧训练语音数据进行特征提取，获取训练滤波器语音特征。最后，通过采用具有时间记忆能力的长短时记忆神经网络模型对训练滤波器语音特征进行训练，获取训练好的ASR-LSTM语音识别模型，使得该ASR-LSTM语音识别模型的识别准确率较高。
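作为示意，上述"LSTM隐藏层+Softmax输出层"的结构可用深度学习框架搭建并训练为语音/噪音二分类模型；以下基于tf.keras的草稿仅说明思路，层数、单元数、优化器等超参数均为示例假设，并非本申请限定的实现：

import tensorflow as tf

def build_asr_lstm_model(num_frames_per_sample, feat_dim=26):
    # 输入为携带时序状态的训练滤波器语音特征，输出语音/噪音两类的概率
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_frames_per_sample, feat_dim)),
        tf.keras.layers.LSTM(128),                        # 隐藏层：长短时记忆单元
        tf.keras.layers.Dense(2, activation="softmax"),   # 输出层：Softmax分类
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# 用法示意：X为(样本数, 帧数, 特征维数)的训练滤波器语音特征，y为0/1标签(噪音/语音)
# model = build_asr_lstm_model(num_frames_per_sample=50)
# model.fit(X, y, epochs=10, batch_size=32)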
在一实施例中,如图6所示,步骤S63中,将训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型,具体包括如下步骤:
S631:在长短时记忆神经网络模型的隐藏层采用第一激活函数对训练滤波器语音特征进行计算,获取携带激活状态标识的神经元。
其中,长短时记忆神经网络模型的隐藏层中的每个神经元包括三个门,分别为输入门、遗忘门和输出门。遗忘门决定在神经元中所要丢弃的过去的信息。输入门决定在神经元中所要增加的信息。输出门决定在神经元中所要输出的信息。第一激活函数是用于激活神经元状态的函数。神经元状态决定各个门(即输入门、遗忘门和输出门)的丢弃、增加和输出的信息。激活状态标识包括通过标识和不通过标识。本实施例中的输入门、遗忘门和输出门对应的标识分别为i、f和o。
本实施例中,具体选用Sigmoid(S型生长曲线)函数作为第一激活函数,Sigmoid函数是一个在生物学中常见的S型的函数,在信息科学中,由于其具有单增以及反函数单增等性质,Sigmoid函数常被用作神经网络的阈值函数,可将变量映射到0-1之间。第一激活函数的计算公式为
σ(z) = 1/(1 + e^(−z))
其中，z为输入到Sigmoid函数的加权输入值。
具体地，通过计算每一神经元（训练滤波器语音特征）的激活状态，以获取携带激活状态标识为通过标识的神经元。本实施例中，采用遗忘门的计算公式f_t = σ(z) = σ(W_f·[h_{t-1}, x_t] + b_f)，计算遗忘门哪些信息被接收（即只接收携带激活状态标识为通过标识的神经元），其中，f_t表示遗忘门限（即激活状态），W_f表示遗忘门的权重矩阵，b_f表示遗忘门的权值偏置项，h_{t-1}表示上一时刻神经元的输出，x_t表示当前时刻的输入数据即训练滤波器语音特征，t表示当前时刻，t-1表示上一时刻。遗忘门中还包括遗忘门限，通过遗忘门的计算公式对训练滤波器语音特征进行计算，会得到一个0-1区间的标量（即遗忘门限），此标量决定了神经元根据当前状态和过去状态的综合判断所接收过去信息的比例，以达到数据的降维，减少计算量，提高训练效率。
S632:在长短时记忆神经网络模型的隐藏层采用第二激活函数对携带激活状态标识的神经元进行计算,获取长短时记忆神经网络模型隐藏层的输出值。
其中，长短时记忆神经网络模型隐藏层的输出值包括输入门的输出值、输出门的输出值和神经元状态。具体地，在长短时记忆神经网络模型的隐藏层中的输入门中，采用第二激活函数对携带激活状态标识为通过标识的神经元进行计算，获取隐藏层的输出值。本实施例中，由于线性模型的表达能力不够，因此采用tanh（双曲正切）函数作为输入门的激活函数（即第二激活函数），可加入非线性因素使得训练出的ASR-LSTM语音识别模型能够解决更复杂的问题。并且，激活函数tanh（双曲正切）具有收敛速度快的优点，可以节省训练时间，增加训练效率。
具体地，通过输入门的计算公式计算输入门的输出值。其中，输入门中还包括输入门限，输入门的计算公式为i_t = σ(W_i·[h_{t-1}, x_t] + b_i)，其中，W_i为输入门的权值矩阵，i_t表示输入门限，b_i表示输入门的偏置项，通过输入门的计算公式对训练滤波器语音特征进行计算会得到一个0-1区间的标量（即输入门限），此标量控制了神经元根据当前状态和过去状态的综合判断所接收当前信息的比例，即接收新输入的信息的比例，以减少计算量，提高训练效率。
然后，采用神经元状态的计算公式
C̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c)
C_t = f_t*C_{t-1} + i_t*C̃_t
计算当前神经元状态；其中，W_c表示神经元状态的权重矩阵，b_c表示神经元状态的偏置项，C̃_t表示候选神经元状态，C_{t-1}表示上一时刻的神经元状态，C_t表示当前时刻神经元状态。通过将神经元状态和遗忘门限（输入门限）进行点乘操作，以便模型只输出所需的信息，提高模型学习的效率。
最后，采用输出门的计算公式o_t = σ(W_o·[h_{t-1}, x_t] + b_o)计算输出门中哪些信息被输出，再采用公式h_t = o_t*tanh(C_t)计算当前时刻神经元的输出值，其中，o_t表示输出门限，W_o表示输出门的权重矩阵，b_o表示输出门的偏置项，h_t表示当前时刻神经元的输出值。
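为便于理解上述各门的计算，下面用NumPy写出单个时刻LSTM神经元前向计算的示意代码（其中权重W_f、W_i、W_c、W_o及偏置的取值均为随机示例，仅演示各公式的数据流，并非本申请的原始实现）：

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # 第一激活函数σ(z)

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    # 按 f_t、i_t、候选状态、C_t、o_t、h_t 的顺序完成一次前向计算；
    # W、b为包含'f'、'i'、'c'、'o'四组参数的字典
    concat = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])                   # 遗忘门限
    i_t = sigmoid(W["i"] @ concat + b["i"])                   # 输入门限
    c_hat = np.tanh(W["c"] @ concat + b["c"])                 # 候选神经元状态(第二激活函数)
    c_t = f_t * c_prev + i_t * c_hat                          # 当前神经元状态C_t
    o_t = sigmoid(W["o"] @ concat + b["o"])                   # 输出门限
    h_t = o_t * np.tanh(c_t)                                  # 当前时刻神经元输出h_t
    return h_t, c_t

# 用法示意(隐藏单元数H=4、输入维数D=26均为示例)：
# H, D = 4, 26
# rng = np.random.default_rng(0)
# W = {k: rng.standard_normal((H, H + D)) * 0.1 for k in "fico"}
# b = {k: np.zeros(H) for k in "fico"}
# h, c = lstm_cell_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)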
S633:基于长短时记忆神经网络模型隐藏层的输出值对长短时记忆神经网络模型进行误差反传更新,获取训练好的ASR-LSTM语音识别模型。
首先，根据长短时记忆神经网络的误差反向传播公式，计算任意t时刻的输出门的误差项δ_{o,t}、输入门的误差项δ_{i,t}、遗忘门的误差项δ_{f,t}和神经元状态的误差项δ_{c,t}。
然后，根据权值更新公式进行误差反传更新：对任一权值W（如W_i、W_c、W_o或W_f），将各时刻对应门的误差项δ与上一时刻隐藏层的输出值b_h^{t-1}相乘并沿时刻T累加，即可得到该权值的梯度，用于更新权值；其中，δ表示误差项，b_h^{t-1}为上一时刻隐藏层的输出值。根据偏置更新公式更新偏置：将各时刻各门的误差项δ_{a,t}沿时间累加即可得到偏置的梯度，用于更新偏置；其中，b为各门的偏置项，δ_{a,t}表示t时刻各门的误差。
最后，根据该权值更新公式进行运算即可获取更新后的权值，根据偏置更新公式更新偏置，将获取的更新后的各层的权值和偏置，应用到长短时记忆神经网络模型中即可获取训练好的ASR-LSTM语音识别模型。进一步地，该ASR-LSTM语音识别模型中的各权值实现了ASR-LSTM语音识别模型决定丢弃哪些旧信息、增加哪些新信息以及输出哪些信息的功能。在ASR-LSTM语音识别模型的输出层最终会输出概率值。该概率值表示训练语音数据在通过ASR-LSTM语音识别模型识别后确定其为语音数据的概率，可广泛应用于语音数据处理方面，以达到准确识别训练滤波器语音特征的目的。
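权值与偏置的误差反传更新，在概念上等价于对累积梯度做一次梯度下降；以下为极简示意（学习率lr及梯度grads的来源均为示例假设，实际工程中通常由深度学习框架自动求导并完成更新）：

def apply_updates(params, grads, lr=0.01):
    # params/grads为形如{'W_f':..., 'b_f':..., ...}的字典，按 W <- W - lr*dW 更新
    for name in params:
        params[name] = params[name] - lr * grads[name]
    return params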
本实施例中,通过在长短时记忆神经网络模型的隐藏层采用第一激活函数对训练滤波器语音特征进行计算,获取携带激活状态标识的神经元,以达到数据的降维,减少计算量,提高训练效率。在长短时记忆神经网络模型的隐藏层采用第二激活函数对携带激活状态标识的神经元进行计算,获取长短时记忆神经网络模型隐藏层的输出值,以便基于长短时记忆神经网络模型隐藏层的输出值对长短时记忆神经网络模型进行误差反传更新,获取更新后的各权值和偏置,将更新后的各权值和偏置应用到长短时记忆神经网络模型中即可获取ASR-LSTM语音识别模型,可广泛应用于语音数据处理方面,以达到准确识别训练滤波器语音特征的目的。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
在一实施例中,提供一种语音数据处理装置,该语音数据处理装置与上述实施例中语音数据处理方法一一对应。如图7所示,该语音数据处理装置包括原始语音数据获取模块10、待测语音数据获取模块20、待测滤波器语音特征获取模块30、识别概率值获取模块40和目标语音数据获取模块50。各功能模块详细说明如下:
原始语音数据获取模块10,用于获取原始语音数据。
待测语音数据获取模块20,用于采用VAD算法对原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据。
待测滤波器语音特征获取模块30,用于采用ASR语音特征提取算法对每一帧待测语音数据进行特征提取,获取待测滤波器语音特征。
识别概率值获取模块40,用于采用训练好的ASR-LSTM语音识别模型对待测滤波器语音特征进行识别,获取识别概率值。
目标语音数据获取模块50,用于若识别概率值大于预设概率值,则将待测语音数据作为目标语音数据。
具体地,待测语音数据获取模块20包括单帧语音数据获取单元21、第一语音数据获取单元22和待测语音数据获取单元23。
单帧语音数据获取单元21,用于对原始语音数据进行分帧处理,获取至少两帧单帧语音数据。
第一语音数据获取单元22,用于采用短时能量计算公式对单帧语音数据进行切分处理,获取对应的短时能量,保留短时能量大于第一门限阈值的单帧语音数据,作为第一语音数据。
待测语音数据获取单元23,采用过零率计算公式对第一语音数据进行切分处理,获取对应的过零率,保留过零率大于第二门限阈值的第一语音数据,获取至少两帧待测语音数据。
具体地,短时能量计算公式为
E(n) = Σ_{m=0}^{N-1} x_n(m)²
其中，N为单帧语音数据的帧长，x_n(m)为第n帧单帧语音数据，E(n)为短时能量，m为时间序列。
过零率计算公式为
Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|
其中，sgn[]为符号函数，x_n(m)为第n帧第一语音数据，Z_n为过零率，m为时间序列。
具体地,待测滤波器语音特征获取模块30包括频谱获取单元31和待测滤波器语音特征获取单元32。
频谱获取单元31,用于对每一帧待测语音数据进行快速傅里叶变换,获取与待测语音数据对应的频谱。
待测滤波器语音特征获取单元32,用于将频谱通过Mel滤波器组,获取待测滤波器语音特征。
具体地,语音数据处理装置还包括ASR-LSTM语音识别模型训练模块60,用于预先训练ASR-LSTM语音识别模型。
ASR-LSTM语音识别模型训练模块60包括训练语音数据获取单元61、训练滤波器语音特征获取单元62和ASR-LSTM语音识别模型获取单元63。
训练语音数据获取单元61,用于获取训练语音数据。
训练滤波器语音特征获取单元62,用于采用ASR语音特征提取算法对训练语音数据进行特征提取,获取训练滤波器语音特征。
ASR-LSTM语音识别模型获取单元63,用于将训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型。
具体地,ASR-LSTM语音识别模型获取单元63包括激活状态神经元获取子单元631、模型输出值获取子单元632和ASR-LSTM语音识别模型获取子单元633。
激活状态神经元获取子单元631,用于在长短时记忆神经网络模型的隐藏层采用第一激活函数对训练滤波器语音特征进行计算,获取携带激活状态标识的神经元。
模型输出值获取子单元632,用于在长短时记忆神经网络模型的隐藏层采用第二激活函数对携带激活状态标识的神经元进行计算,获取长短时记忆神经网络模型隐藏层的输出值。
ASR-LSTM语音识别模型获取子单元633,用于基于长短时记忆神经网络模型隐藏层的输出值对长短时记忆神经网络模型进行误差反传更新,获取训练好的ASR-LSTM语音识别模型。
关于语音数据处理装置的具体限定可以参见上文中对于语音数据处理方法的限定,在此不再赘述。上述语音数据处理装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中，提供了一种计算机设备，该计算机设备可以是服务器，其内部结构图可以如图8所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储执行语音数据处理方法过程中生成或获取的数据，如目标语音数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种语音数据处理方法。
在一个实施例中,提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机可读指令,处理器执行计算机可读指令时实现以下步骤:获取原始语音数据;采用VAD算法对原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据;采用ASR语音特征提取算法对每一帧待测语音数据进行特征提取,获取待测滤波器语音特征;采用训练好的ASR-LSTM语音识别模型对待测滤波器语音特征进行识别,获取识别概率值;若识别概率值大于预设概率值,则将待测语音数据作为目标语音数据。
在一个实施例中，处理器执行计算机可读指令时还实现以下步骤：对原始语音数据进行分帧处理，获取至少两帧单帧语音数据；采用短时能量计算公式对单帧语音数据进行切分处理，获取对应的短时能量，保留短时能量大于第一门限阈值的单帧语音数据，作为第一语音数据；采用过零率计算公式对第一语音数据进行切分处理，获取对应的过零率，保留过零率大于第二门限阈值的第一语音数据，获取至少两帧待测语音数据。
具体地,短时能量计算公式为
E(n) = Σ_{m=0}^{N-1} x_n(m)²
其中，N为单帧语音数据的帧长，x_n(m)为第n帧单帧语音数据，E(n)为短时能量，m为时间序列；过零率计算公式为
Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|
其中，sgn[]为符号函数，x_n(m)为第n帧第一语音数据，Z_n为过零率，m为时间序列。
在一个实施例中,处理器执行计算机可读指令时还实现以下步骤:对每一帧待测语音数据进行快速傅里叶变换,获取与待测语音数据对应的频谱;将频谱通过Mel滤波器组,获取待测滤波器语音特征。
在一个实施例中,处理器执行计算机可读指令时还实现以下步骤:获取训练语音数据;采用ASR语音特征提取算法对训练语音数据进行特征提取,获取训练滤波器语音特征;将训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型。
在一个实施例中,处理器执行计算机可读指令时还实现以下步骤:在长短时记忆神经网络模型的隐藏层采用第一激活函数对训练滤波器语音特征进行计算,获取携带激活状态标识的神经元;在长短时记忆神经网络模型的隐藏层采用第二激活函数对携带激活状态标识的神经元进行计算,获取长短时记忆神经网络模型隐藏层的输出值;基于长短时记忆神经网络模型隐藏层的输出值对长短时记忆神经网络模型进行误差反传更新,获取ASR-LSTM语音识别模型。
在一个实施例中,提供了一个或多个存储有计算机可读指令的非易失性可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:获取原始语音数据;采用VAD算法对原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据;采用ASR语音特征提取算法对每一帧待测语音数据进行特征提取,获取待测滤波器语音特征;采用训练好的ASR-LSTM语音识别模型对待测滤波器语音特征进行识别,获取识别概率值;若识别概率值大于预设概率值,则将待测语音数据作为目标语音数据。
在一个实施例中,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时还实现以下步骤:对原始语音数据进行分帧处理,获取至少两帧单帧语音数据;采用短时能量计算公式对单帧语音数据进行切分处理,获取对应的短时能量,保留短时能量大于第一门限阈值的单帧语音数据,作为第一语音数据;采用过零率计算公式对第一语音数据进行切分处理,获取对应的过零率,保留过零率大于第二门限阈值的第一语音数据,获取至少两帧待测语音数据。
具体地,短时能量计算公式为
E(n) = Σ_{m=0}^{N-1} x_n(m)²
其中，N为单帧语音数据的帧长，x_n(m)为第n帧单帧语音数据，E(n)为短时能量，m为时间序列；过零率计算公式为
Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|
其中，sgn[]为符号函数，x_n(m)为第n帧第一语音数据，Z_n为过零率，m为时间序列。
在一个实施例中,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时还实现以下步骤:对每一帧待测语音数据进行快速傅里叶变换,获取与待测语音数据对应的频谱;将频谱通过Mel滤波器组,获取待测滤波器语音特征。
在一个实施例中,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时还实现以下步骤:获取训练语音数据;采用ASR语音特征提取算法对训练语音数据进行特征提取,获取训练滤波器语音特征;将训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型。
在一个实施例中,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时还实现以下步骤:在长短时记忆神经网络模型的隐藏层采用第一激活函数对训练滤波器语音特征进行计算,获取携带激活状态标识的神经元;在长短时记忆神经网络模型的隐藏层采用第二激活函数对携带激活状态标识的神经元进行计算,获取长短时记忆神经网络模型隐藏层的输出值;基于长短时记忆神经网络模型隐藏层的输出值对长短时记忆神经网络模型进行误差反传更新,获取训练好的ASR-LSTM语音识别模型。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种语音数据处理方法,其特征在于,包括:
    获取原始语音数据;
    采用VAD算法对所述原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据;
    采用ASR语音特征提取算法对每一帧所述待测语音数据进行特征提取,获取待测滤波器语音特征;
    采用训练好的ASR-LSTM语音识别模型对所述待测滤波器语音特征进行识别,获取识别概率值;
    若所述识别概率值大于预设概率值,则将所述待测语音数据作为目标语音数据。
  2. 如权利要求1所述的语音数据处理方法,其特征在于,所述采用VAD算法对所述原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据,包括:
    对所述原始语音数据进行分帧处理,获取至少两帧单帧语音数据;
    采用短时能量计算公式对所述单帧语音数据进行切分处理,获取对应的短时能量,保留所述短时能量大于第一门限阈值的单帧语音数据,作为第一语音数据;
    采用过零率计算公式对所述第一语音数据进行切分处理,获取对应的过零率,保留所述过零率大于第二门限阈值的第一语音数据,获取至少两帧所述待测语音数据。
  3. 如权利要求2所述的语音数据处理方法,其特征在于,所述短时能量计算公式为
    E(n) = Σ_{m=0}^{N-1} x_n(m)²
    其中，N为单帧语音数据的帧长，x_n(m)为第n帧所述单帧语音数据，E(n)为所述短时能量，m为时间序列；
    所述过零率计算公式为
    Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|
    其中，sgn[]为符号函数，x_n(m)为第n帧所述第一语音数据，Z_n为所述过零率，m为时间序列。
  4. 如权利要求1所述的语音数据处理方法,其特征在于,所述采用ASR语音特征提取算法对每一帧所述待测语音数据进行特征提取,获取待测滤波器语音特征,包括:
    对每一帧所述待测语音数据进行快速傅里叶变换,获取与待测语音数据对应的频谱;
    将所述频谱通过Mel滤波器组,获取所述待测滤波器语音特征。
  5. 如权利要求1所述的语音数据处理方法,其特征在于,所述语音数据处理方法还包括:预先训练所述ASR-LSTM语音识别模型;
    所述预先训练所述ASR-LSTM语音识别模型,包括:
    获取训练语音数据;
    采用ASR语音特征提取算法对训练语音数据进行特征提取,获取训练滤波器语音特征;
    将所述训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型。
  6. 如权利要求5所述的语音数据处理方法,其特征在于,所述将所述训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型,包括:
    在所述长短时记忆神经网络模型的隐藏层采用第一激活函数对所述训练滤波器语音特征进行计算,获取携带激活状态标识的神经元;
    在所述长短时记忆神经网络模型的隐藏层采用第二激活函数对所述携带激活状态标识的神经元进行计算,获取所述长短时记忆神经网络模型隐藏层的输出值;
    基于所述长短时记忆神经网络模型隐藏层的输出值对所述长短时记忆神经网络模型进行误差反传更新,获取训练好的所述ASR-LSTM语音识别模型。
  7. 一种语音数据处理装置,其特征在于,包括:
    原始语音数据获取模块,用于获取原始语音数据;
    待测语音数据获取模块,用于采用VAD算法对所述原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据;
    待测滤波器语音特征获取模块,用于采用ASR语音特征提取算法对每一帧所述待测语音数据进行特征提取,获取待测滤波器语音特征;
    识别概率值获取模块,用于采用训练好的ASR-LSTM语音识别模型对所述待测滤波器语音特征进行识别,获取识别概率值;
    目标语音数据获取模块,用于若所述识别概率值大于预设概率值,则将所述待测语音数据作为目标语音数据。
  8. 如权利要求7所述的语音数据处理装置,其特征在于,所述待测语音数据获取模块包括:
    单帧语音数据获取单元,用于对所述原始语音数据进行分帧处理,获取至少两帧单帧语音数据;
    第一语音数据获取单元，用于采用短时能量计算公式对所述单帧语音数据进行切分处理，获取对应的短时能量，保留所述短时能量大于第一门限阈值的单帧语音数据，作为第一语音数据；
    待测语音数据获取单元，用于采用过零率计算公式对所述第一语音数据进行切分处理，获取对应的过零率，保留所述过零率大于第二门限阈值的第一语音数据，获取至少两帧所述待测语音数据。
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:
    获取原始语音数据;
    采用VAD算法对所述原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据;
    采用ASR语音特征提取算法对每一帧所述待测语音数据进行特征提取,获取待测滤波器语音特征;
    采用训练好的ASR-LSTM语音识别模型对所述待测滤波器语音特征进行识别,获取识别概率值;
    若所述识别概率值大于预设概率值,则将所述待测语音数据作为目标语音数据。
  10. 如权利要求9所述的计算机设备,其特征在于,所述采用VAD算法对所述原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据,包括:
    对所述原始语音数据进行分帧处理,获取至少两帧单帧语音数据;
    采用短时能量计算公式对所述单帧语音数据进行切分处理,获取对应的短时能量,保留所述短时能量大于第一门限阈值的单帧语音数据,作为第一语音数据;
    采用过零率计算公式对所述第一语音数据进行切分处理,获取对应的过零率,保留所述过零率大于第二门限阈值的第一语音数据,获取至少两帧所述待测语音数据。
  11. 如权利要求10所述的计算机设备,其特征在于,所述短时能量计算公式为
    E(n) = Σ_{m=0}^{N-1} x_n(m)²
    其中，N为单帧语音数据的帧长，x_n(m)为第n帧所述单帧语音数据，E(n)为所述短时能量，m为时间序列；
    所述过零率计算公式为
    Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|
    其中，sgn[]为符号函数，x_n(m)为第n帧所述第一语音数据，Z_n为所述过零率，m为时间序列。
  12. 如权利要求9所述的计算机设备,其特征在于,所述采用ASR语音特征提取算法对每一帧所述待测语音数据进行特征提取,获取待测滤波器语音特征,包括:
    对每一帧所述待测语音数据进行快速傅里叶变换,获取与待测语音数据对应的频谱;
    将所述频谱通过Mel滤波器组,获取所述待测滤波器语音特征。
  13. 如权利要求9所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:预先训练所述ASR-LSTM语音识别模型;
    所述预先训练所述ASR-LSTM语音识别模型,包括:
    获取训练语音数据;
    采用ASR语音特征提取算法对训练语音数据进行特征提取,获取训练滤波器语音特征;
    将所述训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型。
  14. 如权利要求13所述的计算机设备,其特征在于,所述将所述训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型,包括:
    在所述长短时记忆神经网络模型的隐藏层采用第一激活函数对所述训练滤波器语音特征进行计算,获取携带激活状态标识的神经元;
    在所述长短时记忆神经网络模型的隐藏层采用第二激活函数对所述携带激活状态标识的神经元进行计算,获取所述长短时记忆神经网络模型隐藏层的输出值;
    基于所述长短时记忆神经网络模型隐藏层的输出值对所述长短时记忆神经网络模型进行误差反传更新,获取训练好的所述ASR-LSTM语音识别模型。
  15. 一个或多个存储有计算机可读指令的非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:
    获取原始语音数据;
    采用VAD算法对所述原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据;
    采用ASR语音特征提取算法对每一帧所述待测语音数据进行特征提取,获取待测滤波器语音特征;
    采用训练好的ASR-LSTM语音识别模型对所述待测滤波器语音特征进行识别,获取识别概率值;
    若所述识别概率值大于预设概率值,则将所述待测语音数据作为目标语音数据。
  16. 如权利要求15所述的非易失性可读存储介质,其特征在于,所述采用VAD算法对所述原始语音数据进行分帧和切分处理,获取至少两帧待测语音数据,包括:
    对所述原始语音数据进行分帧处理,获取至少两帧单帧语音数据;
    采用短时能量计算公式对所述单帧语音数据进行切分处理,获取对应的短时能量,保留所述短时能量大于第一门限阈值的单帧语音数据,作为第一语音数据;
    采用过零率计算公式对所述第一语音数据进行切分处理,获取对应的过零率,保留所述过零率大于第二门限阈值的第一语音数据,获取至少两帧所述待测语音数据。
  17. 如权利要求16所述的非易失性可读存储介质，其特征在于，所述短时能量计算公式为
    E(n) = Σ_{m=0}^{N-1} x_n(m)²
    其中，N为单帧语音数据的帧长，x_n(m)为第n帧所述单帧语音数据，E(n)为所述短时能量，m为时间序列；
    所述过零率计算公式为
    Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] − sgn[x_n(m−1)]|
    其中，sgn[]为符号函数，x_n(m)为第n帧所述第一语音数据，Z_n为所述过零率，m为时间序列。
  18. 如权利要求15所述的非易失性可读存储介质,其特征在于,所述采用ASR语音特征提取算法对每一帧所述待测语音数据进行特征提取,获取待测滤波器语音特征,包括:
    对每一帧所述待测语音数据进行快速傅里叶变换,获取与待测语音数据对应的频谱;
    将所述频谱通过Mel滤波器组,获取所述待测滤波器语音特征。
  19. 如权利要求15所述的非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还执行如下步骤:预先训练所述ASR-LSTM语音识别模型;
    所述预先训练所述ASR-LSTM语音识别模型,包括:
    获取训练语音数据;
    采用ASR语音特征提取算法对训练语音数据进行特征提取,获取训练滤波器语音特征;
    将所述训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型。
  20. 如权利要求15所述的非易失性可读存储介质,其特征在于,所述将所述训练滤波器语音特征输入到长短时记忆神经网络模型中进行训练,获取训练好的ASR-LSTM语音识别模型,包括:
    在所述长短时记忆神经网络模型的隐藏层采用第一激活函数对所述训练滤波器语音特征进行计算,获取携带激活状态标识的神经元;
    在所述长短时记忆神经网络模型的隐藏层采用第二激活函数对所述携带激活状态标识的神经元进行计算,获取所述长短时记忆神经网络模型隐藏层的输出值;
    基于所述长短时记忆神经网络模型隐藏层的输出值对所述长短时记忆神经网络模型进行误差反传更新,获取训练好的所述ASR-LSTM语音识别模型。
PCT/CN2018/094184 2018-06-04 2018-07-03 语音数据处理方法、装置、计算机设备及存储介质 WO2019232845A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810561725.6 2018-06-04
CN201810561725.6A CN108877775B (zh) 2018-06-04 2018-06-04 语音数据处理方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2019232845A1 true WO2019232845A1 (zh) 2019-12-12

Family

ID=64336394

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094184 WO2019232845A1 (zh) 2018-06-04 2018-07-03 语音数据处理方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN108877775B (zh)
WO (1) WO2019232845A1 (zh)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106935247A (zh) * 2017-03-08 2017-07-07 珠海中安科技有限公司 一种用于正压式空气呼吸器和狭小密闭空间的语音识别控制装置及方法
CN109584887B (zh) * 2018-12-24 2022-12-02 科大讯飞股份有限公司 一种声纹信息提取模型生成、声纹信息提取的方法和装置
CN109658943B (zh) * 2019-01-23 2023-04-14 平安科技(深圳)有限公司 一种音频噪声的检测方法、装置、存储介质和移动终端
CN110060667B (zh) * 2019-03-15 2023-05-30 平安科技(深圳)有限公司 语音信息的批量处理方法、装置、计算机设备及存储介质
CN110111797A (zh) * 2019-04-04 2019-08-09 湖北工业大学 基于高斯超矢量和深度神经网络的说话人识别方法
CN112017676A (zh) * 2019-05-31 2020-12-01 京东数字科技控股有限公司 音频处理方法、装置和计算机可读存储介质
CN110473552A (zh) * 2019-09-04 2019-11-19 平安科技(深圳)有限公司 语音识别认证方法及***
CN110600018B (zh) * 2019-09-05 2022-04-26 腾讯科技(深圳)有限公司 语音识别方法及装置、神经网络训练方法及装置
CN111048071B (zh) * 2019-11-11 2023-05-30 京东科技信息技术有限公司 语音数据处理方法、装置、计算机设备和存储介质
CN110856064B (zh) * 2019-11-27 2021-06-04 内蒙古农业大学 一种家畜牧食声音信号采集装置及使用该装置的采集方法
CN111582020B (zh) * 2020-03-25 2024-06-18 平安科技(深圳)有限公司 信号处理方法、装置、计算机设备及存储介质
CN112116912B (zh) * 2020-09-23 2024-05-24 平安国际智慧城市科技股份有限公司 基于人工智能的数据处理方法、装置、设备及介质
CN112349277B (zh) * 2020-09-28 2023-07-04 紫光展锐(重庆)科技有限公司 结合ai模型的特征域语音增强方法及相关产品
CN112242147B (zh) * 2020-10-14 2023-12-19 福建星网智慧科技有限公司 一种语音增益控制方法及计算机存储介质
CN112259114A (zh) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 语音处理方法及装置、计算机存储介质、电子设备
CN112908309A (zh) * 2021-02-06 2021-06-04 漳州立达信光电子科技有限公司 语音识别方法、装置、设备及按摩沙发

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103854661A (zh) * 2014-03-20 2014-06-11 北京百度网讯科技有限公司 一种提取音乐特征的方法及装置
WO2017003903A1 (en) * 2015-06-29 2017-01-05 Amazon Technologies, Inc. Language model speech endpointing
CN107527620A (zh) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 电子装置、身份验证的方法及计算机可读存储介质
CN107680597A (zh) * 2017-10-23 2018-02-09 平安科技(深圳)有限公司 语音识别方法、装置、设备以及计算机可读存储介质
CN107705802A (zh) * 2017-09-11 2018-02-16 厦门美图之家科技有限公司 语音转换方法、装置、电子设备及可读存储介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9786270B2 (en) * 2015-07-09 2017-10-10 Google Inc. Generating acoustic models
CN105118502B (zh) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 语音识别***的端点检测方法及***
US9842106B2 (en) * 2015-12-04 2017-12-12 Mitsubishi Electric Research Laboratories, Inc Method and system for role dependent context sensitive spoken and textual language understanding with neural networks
US9972310B2 (en) * 2015-12-31 2018-05-15 Interactive Intelligence Group, Inc. System and method for neural network based feature extraction for acoustic model development
CN105825871B (zh) * 2016-03-16 2019-07-30 大连理工大学 一种无前导静音段语音的端点检测方法
US10373612B2 (en) * 2016-03-21 2019-08-06 Amazon Technologies, Inc. Anchored speech detection and speech recognition
US11069335B2 (en) * 2016-10-04 2021-07-20 Cerence Operating Company Speech synthesis using one or more recurrent neural networks
CN107704918B (zh) * 2017-09-19 2019-07-12 平安科技(深圳)有限公司 驾驶模型训练方法、驾驶人识别方法、装置、设备及介质
CN107832400B (zh) * 2017-11-01 2019-04-16 山东大学 一种基于位置的lstm和cnn联合模型进行关系分类的方法


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750461A (zh) * 2020-02-26 2021-05-04 腾讯科技(深圳)有限公司 语音通信优化方法、装置、电子设备及可读存储介质
CN112750461B (zh) * 2020-02-26 2023-08-01 腾讯科技(深圳)有限公司 语音通信优化方法、装置、电子设备及可读存储介质
CN111667817A (zh) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 一种语音识别方法、装置、计算机***及可读存储介质
CN111862973A (zh) * 2020-07-14 2020-10-30 杭州芯声智能科技有限公司 一种基于多命令词的语音唤醒方法及其***
CN112001482A (zh) * 2020-08-14 2020-11-27 佳都新太科技股份有限公司 振动预测及模型训练方法、装置、计算机设备和存储介质
CN112001482B (zh) * 2020-08-14 2024-05-24 佳都科技集团股份有限公司 振动预测及模型训练方法、装置、计算机设备和存储介质
CN113140222A (zh) * 2021-05-10 2021-07-20 科大讯飞股份有限公司 一种声纹向量提取方法、装置、设备及存储介质
CN113140222B (zh) * 2021-05-10 2023-08-01 科大讯飞股份有限公司 一种声纹向量提取方法、装置、设备及存储介质
CN115862636A (zh) * 2022-11-19 2023-03-28 杭州珍林网络技术有限公司 一种基于语音识别技术的互联网人机验证方法

Also Published As

Publication number Publication date
CN108877775A (zh) 2018-11-23
CN108877775B (zh) 2023-03-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921688

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18921688

Country of ref document: EP

Kind code of ref document: A1