CN109545226B - Voice recognition method, device and computer readable storage medium

Voice recognition method, device and computer readable storage medium

Info

Publication number
CN109545226B
CN109545226B
Authority
CN
China
Prior art keywords
digital voice
voice signal
digital
target
signal
Prior art date
Legal status
Active
Application number
CN201910014557.3A
Other languages
Chinese (zh)
Other versions
CN109545226A (en)
Inventor
贾雪丽
程宁
王健宗
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910014557.3A priority Critical patent/CN109545226B/en
Publication of CN109545226A publication Critical patent/CN109545226A/en
Priority to PCT/CN2019/116979 priority patent/WO2020140609A1/en
Application granted granted Critical
Publication of CN109545226B publication Critical patent/CN109545226B/en

Classifications

    • G10L 15/26 - Speech to text systems
    • G10L 15/063 - Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/142 - Hidden Markov Models [HMMs]
    • G10L 15/144 - Training of HMMs
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 25/24 - Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention discloses a voice recognition method, a device and a computer readable storage medium, wherein the method comprises the following steps: acquiring a first digital voice signal to be detected, wherein the first digital voice signal consists of a digital password, and the digital password consists of a plurality of digits; performing preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals; processing each second digital voice signal according to a preset signal processing method, determining a logarithmic Mel power frequency spectrum corresponding to each second digital voice signal, and extracting target characteristic information of each second digital voice signal from the logarithmic Mel power frequency spectrum; identifying the target characteristic information of each second digital voice signal to obtain a target number corresponding to each second digital voice signal; and determining a target digital password corresponding to the first digital voice signal according to the target number so as to improve the performance and effectiveness of voice recognition.

Description

Voice recognition method, device and computer readable storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, a speech recognition device, and a computer-readable storage medium.
Background
The Identity-Vector (I-Vector) based speaker recognition system is a classic approach to text-independent speaker recognition; in recent years, however, the field has drawn increasing attention from deep learning. Deep learning techniques for this acoustic problem fall into two categories: (1) appending a Deep Neural Network (DNN) to a Hidden Markov Model (HMM) to estimate the Baum-Welch statistics; (2) training methods that combine bottleneck features with Mel Frequency Cepstral Coefficient (MFCC) features. Since text-dependent recognition largely builds on text-independent recognition, a DNN can also be used to solve text-dependent speaker recognition problems. However, when a DNN is used to extract speaker-discriminative features, how to improve the performance and effectiveness of the speaker recognition system becomes the key point of research.
Disclosure of Invention
Embodiments of the present invention provide a speech recognition method, a speech recognition device, and a computer-readable storage medium, which can improve performance and effectiveness of a speech recognition system.
In a first aspect, an embodiment of the present invention provides a speech recognition method, where the method includes:
acquiring a first digital voice signal to be detected, wherein the first digital voice signal consists of a digital password, and the digital password consists of a plurality of digits;
performing preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, wherein each second digital voice signal is determined by one digit;
processing each second digital voice signal according to a preset signal processing method, determining a logarithmic Mel power frequency spectrum corresponding to each second digital voice signal, and extracting target characteristic information of each second digital voice signal from the logarithmic Mel power frequency spectrum;
identifying target characteristic information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal;
and determining a target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.
Further, before the acquiring the first digital voice signal to be detected, the method further includes:
acquiring a training sample set, wherein the training sample set comprises target characteristic information of sample digital voice signals, and each sample digital voice signal is determined by one number;
generating an initial neural network model according to a preset neural network algorithm;
and training and optimizing the initial neural network model based on the target characteristic information of each sample digital voice signal in the training sample set to obtain the neural network model.
Further, the identifying, based on the neural network model, the target feature information of each second digital voice signal to obtain a target number corresponding to each second digital voice signal includes:
calculating the similarity between the target characteristic information of the second digital voice signal and the target characteristic information of each sample digital voice signal in the training sample set;
acquiring target characteristic information of at least one sample digital voice signal with the similarity larger than a preset similarity threshold;
determining target characteristic information of the target sample digital voice signal with the maximum similarity from the target characteristic information of the at least one sample digital voice signal;
and determining a target digit corresponding to the target characteristic information of the target sample digital voice signal according to a preset correspondence between target characteristic information of sample digital voice signals and digits.
Further, after determining the target number corresponding to the target characteristic information of the target sample digital voice signal, the method further includes:
acquiring the number of target digits corresponding to the target characteristic information of the target sample digital voice signal and acquiring the number of digits corresponding to the first digital voice signal to be detected;
calculating a quantity ratio between the quantity of the target digits and the quantity of digits corresponding to the first digital voice signal;
determining the probability of successful detection of the first digital voice signal to be detected according to the quantity ratio;
judging whether the probability is smaller than a preset threshold value or not;
and if so, selecting a sample training set similar to the first digital voice signal to train and adjust the neural network model.
Further, the processing each second digital voice signal according to a preset signal processing method to determine a logarithmic mel power spectrum corresponding to each second digital voice signal includes:
performing frame division and windowing processing on each second digital voice signal to obtain a voice frame corresponding to each second digital voice signal;
performing fast Fourier transform on the voice frame corresponding to each second digital voice signal to obtain a frequency spectrum signal of the voice frame corresponding to each second digital voice signal;
and converting the spectrum signal of the voice frame corresponding to each second digital voice signal into logarithmic mel spectrum power so as to obtain the logarithmic mel power spectrum corresponding to each second digital voice signal.
Further, the converting the spectrum signal of the speech frame corresponding to each second digital speech signal into a logarithmic mel spectrum power includes:
taking an absolute value of a frequency spectrum signal of a voice frame corresponding to each second digital voice signal to obtain a power frequency spectrum of the voice frame corresponding to each second digital voice signal;
and converting the power spectrum of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum.
Further, the converting the spectrum signal of the speech frame corresponding to each second digital speech signal into a logarithmic mel spectrum power includes:
taking a square value of a frequency spectrum signal of a voice frame corresponding to each second digital voice signal to obtain a power frequency spectrum of the voice frame corresponding to each second digital voice signal;
and converting the power spectrum of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum.
In a second aspect, an embodiment of the present invention provides a speech recognition apparatus, which includes means for performing the method of the first aspect.
In a third aspect, an embodiment of the present invention provides another speech recognition device, which includes a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, where the memory is used to store a computer program that supports the speech recognition device to execute the foregoing method, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the foregoing method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the method of the first aspect.
In the embodiment of the present invention, the voice recognition device may perform segmentation processing on the acquired first digital voice signal to obtain a plurality of second digital voice signals, process each second digital voice signal to obtain a logarithmic mel power spectrum corresponding to each second digital voice signal, extract target feature information from the logarithmic mel power spectrum, and recognize each target feature information to obtain a target number, so as to determine a target digital password corresponding to the first digital voice signal according to the target number. In this way, the voice signal of text-independent type can be effectively recognized, and the performance and the effectiveness of the voice recognition system are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show some embodiments of the present invention, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
FIG. 1 is a schematic flow chart of a speech recognition method provided by an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another speech recognition method provided by the embodiments of the present invention;
FIG. 3 is a schematic block diagram of a speech recognition device provided by an embodiment of the present invention;
fig. 4 is a schematic block diagram of another speech recognition apparatus provided in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The voice recognition method provided by the embodiment of the invention can be executed by a voice recognition device. In some embodiments, the voice recognition device can be arranged on an intelligent terminal such as a mobile phone, a computer, a tablet or a smart watch. In some embodiments, the voice recognition device may be installed on the smart terminal; in some embodiments, it may be spatially independent from the smart terminal; and in some embodiments, it may be a component of the smart terminal, i.e., the smart terminal includes the voice recognition device.
In one embodiment, the voice recognition device may obtain a first digital voice signal to be detected. In some embodiments, the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of digits. After acquiring the first digital voice signal, the voice recognition device can perform preset segmentation processing on it to obtain a plurality of second digital voice signals; in some embodiments, each second digital voice signal is determined by one digit. After obtaining the second digital voice signals, the voice recognition device may process each second digital voice signal obtained by the segmentation using a preset signal processing method, obtain a logarithmic mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the logarithmic mel power spectrum. The voice recognition device may recognize the target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal, and determine a target digital password corresponding to the first digital voice signal according to those target numbers. The following schematically illustrates a speech recognition method according to an embodiment of the present invention.
Referring to fig. 1 specifically, fig. 1 is a schematic flowchart of a speech recognition method provided in an embodiment of the present invention, and as shown in fig. 1, the method may be executed by a speech recognition device, and a specific explanation of the speech recognition device is as described above and is not described herein again. Specifically, the method of the embodiment of the present invention includes the following steps.
S101: the method comprises the steps of obtaining a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of numbers.
In the embodiment of the invention, the voice recognition device can acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of numbers. In some embodiments, the digital password is composed of any one or more digits from 0 to 9; for example, the digital password may be a string of digits from 0 to 9 spoken by a speaker as a voice signal.
S102: and carrying out preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, wherein each second digital voice signal is determined by one number.
In this embodiment of the present invention, the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, and in some embodiments, each of the second digital voice signals is determined by one number.
In one embodiment, the speech recognition device may perform the preset segmentation processing on the first digital speech signal through an HMM, so as to segment the first digital speech signal into a plurality of second digital speech signals, each corresponding to a mutually independent digit. In some embodiments, by using this HMM-based segmentation method, the speech recognition device can segment a first digital speech signal composed of a digital password into second digital speech signals each composed of a single independent digit, so that the neural network model can recognize each second digital speech signal, as sketched below.
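As an illustration only, the following Python sketch splits a digit-string recording into per-digit segments. It deliberately substitutes a simple short-time-energy splitter for the HMM-based segmentation described above (a full HMM segmenter needs trained per-digit models); the sample rate, frame sizes and thresholds are assumptions, not values given by the patent.

```python
import numpy as np

def split_digits_by_energy(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
                           energy_ratio=0.1, min_gap_frames=8):
    """Illustrative stand-in for the HMM-based segmentation: split a digit-string
    utterance into per-digit segments wherever the short-time energy stays below
    a threshold for a while (i.e., at the pauses between spoken digits)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    energy = np.array([np.sum(signal[i*hop_len : i*hop_len+frame_len] ** 2)
                       for i in range(n_frames)])
    voiced = energy > energy_ratio * energy.max()

    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap_frames:              # long pause => digit boundary
                last_voiced = i - silence
                segments.append(signal[start*hop_len : last_voiced*hop_len + frame_len])
                start, silence = None, 0
    if start is not None:                               # trailing digit
        segments.append(signal[start*hop_len:])
    return segments                                     # one array per spoken digit, in order
```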
In an embodiment, when the voice recognition device performs the preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, the voice recognition device may record the position of each split second digital voice signal within the first digital voice signal. After the target number corresponding to each second digital voice signal is subsequently recognized, the order of the target numbers can then be determined from the recorded positions, and the target digital password corresponding to the first digital voice signal can be formed according to that order.
S103: and processing each second digital voice signal according to a preset signal processing method, determining a logarithmic Mel power frequency spectrum corresponding to each second digital voice signal, and extracting target characteristic information of each second digital voice signal from the logarithmic Mel power frequency spectrum.
In this embodiment of the present invention, the voice recognition device may process each of the second digital voice signals obtained by the segmentation according to a preset signal processing method to obtain a logarithmic mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each of the second digital voice signals from the logarithmic mel power spectrum.
In an embodiment, when the speech recognition device processes each second digital speech signal obtained by segmentation according to a preset signal processing method to obtain the logarithmic mel power spectrum corresponding to each second digital speech signal, the speech recognition device may perform frame division and windowing on each second digital speech signal to obtain the speech frame corresponding to each second digital speech signal; perform fast Fourier transform on the speech frame corresponding to each second digital speech signal to obtain the spectrum signal of the speech frame corresponding to each second digital speech signal; and convert the spectrum signal of the speech frame corresponding to each second digital speech signal into logarithmic mel spectrum power to obtain the logarithmic mel power spectrum corresponding to each second digital speech signal.
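The following Python/numpy sketch illustrates the framing, windowing, FFT and Mel-filterbank steps described above. The frame length, hop length, FFT size and Hamming window are illustrative assumptions; only the 64 Mel bands correspond to the input dimension mentioned below.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_power_spectrum(segment, sample_rate=16000, frame_len=400, hop_len=160,
                           n_fft=512, n_mels=64):
    """Frame + window + FFT + Mel filterbank + log, for one per-digit segment."""
    segment = np.pad(segment, (0, max(0, frame_len - len(segment))))
    # 1) framing and windowing
    n_frames = 1 + (len(segment) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([segment[i*hop_len : i*hop_len+frame_len] * window
                       for i in range(n_frames)])
    # 2) FFT -> power spectrum (squared magnitude; the absolute value could be used instead)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # 3) triangular Mel filterbank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        if c > l:
            fbank[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    # 4) logarithmic Mel power spectrum: (frames x 64) matrix
    return np.log(power @ fbank.T + 1e-10)
```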
In some embodiments, the logarithmic mel power spectrum refers to power values on the mel scale. In some embodiments, the mel scale is a non-linear frequency scale based on the human ear's perception of equal pitch intervals; a frequency f in hertz can be converted to mel by a formula, commonly m = 2595 * log10(1 + f/700).
In an embodiment, when extracting the target feature information of each second digital voice signal from the logarithmic mel power spectrum, the voice recognition device may normalize the logarithmic mel power spectrum corresponding to the second digital voice signal of each digit to obtain the feature information of the second digital voice signal corresponding to that digit. The normalization refers to normalizing the logarithmic mel power spectrum features of the second digital voice signal corresponding to each digit, which facilitates subsequent processing by the neural network model and can accelerate convergence. In some embodiments, the speech recognition device may take the corresponding logarithmic mel power spectrum of the second digital speech signal as input; in some embodiments, the input features have a frequency-domain length of 64 bands and a time-domain length of 96 frames (equal to the longest digit pronunciation time), as in the sketch below.
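A minimal sketch of shaping one per-digit log-Mel matrix into the fixed 64-band x 96-frame network input mentioned above; mean-variance normalization and zero-padding/cropping are illustrative choices, not prescribed by the patent.

```python
import numpy as np

def to_network_input(log_mel, n_mels=64, n_frames=96):
    """Normalize a (frames x 64) log-Mel matrix and fix its length to 96 frames,
    giving the 64-band x 96-frame input mentioned above."""
    # mean-variance normalization over the whole segment (one illustrative choice)
    feat = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    # crop long segments, zero-pad short ones, to exactly n_frames frames
    if feat.shape[0] >= n_frames:
        feat = feat[:n_frames]
    else:
        feat = np.vstack([feat, np.zeros((n_frames - feat.shape[0], n_mels))])
    return feat.T          # shape (64, 96): frequency x time
```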
In an embodiment, when the speech recognition device converts the spectrum signal of the speech frame corresponding to each second digital speech signal into the power of the log mel spectrum, the speech recognition device may obtain an absolute value of the spectrum signal of the speech frame corresponding to each second digital speech signal to obtain the power spectrum of the speech frame corresponding to each second digital speech signal, so as to convert the power spectrum of the speech frame corresponding to each second digital speech signal into the power of the log mel spectrum.
In an embodiment, when the speech recognition device converts the spectrum signal of the speech frame corresponding to each second digital speech signal into the power of the logarithmic mel spectrum, the speech recognition device may obtain the power spectrum of the speech frame corresponding to each second digital speech signal by taking a square value of the spectrum signal of the speech frame corresponding to each second digital speech signal, so as to convert the power spectrum of the speech frame corresponding to each second digital speech signal into the power of the logarithmic mel spectrum.
S104: and identifying the target characteristic information of each second digital voice signal based on the neural network model to obtain a target number corresponding to each second digital voice signal.
In this embodiment of the present invention, the speech recognition device may recognize the target feature information of each second digital speech signal based on a neural network model, and obtain a target number corresponding to each second digital speech signal.
In some embodiments, the neural network model may be composed of a predetermined convolutional neural network, where the predetermined convolutional neural network has an MFM-CNN (Max-Feature-Map convolutional neural network) structure, and the MFM activation function is applied to the feature maps obtained from the convolutional layers. The MFM activation function is as follows:
y(i, j, k) = max( x(i, j, k), x(i, j, k + N/2) ),  1 ≤ k ≤ N/2
where x is the MFM layer input tensor of size W x H x N, y is the output tensor of size W x H x N/2, i and j index the time and frequency dimensions of the feature map, and k is the channel index.
The convolutional layers are used to extract features, the MFM layer serves as the activation function layer to perform a nonlinear transformation, and the pooling layer provides translation invariance while reducing the number of parameters. The input to the network is the logarithmic mel power spectrum, which is equivalent to inputting a matrix; this matrix is fed into the neural network for training.
In one embodiment, the MFM-CNN structure is composed of groups of layers, each group consisting of a convolutional layer followed by an MFM layer and a pooling layer; the first group accepts the log-Mel power spectrum as input. The groups are stacked together and then connected to a fully connected layer to create the embedding layer, as in the sketch below.
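A minimal PyTorch sketch of such a structure, with convolution -> MFM -> pooling groups followed by a fully connected embedding layer and a 10-way digit classifier. The number of groups, channel counts, kernel sizes and embedding dimension are assumptions for illustration; the patent does not fix them.

```python
import torch
import torch.nn as nn

class MFM(nn.Module):
    """Max-Feature-Map activation: split the channels into two halves and keep
    the element-wise maximum, halving the channel count."""
    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)
        return torch.max(a, b)

class MFMCNN(nn.Module):
    def __init__(self, n_digits=10, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            # group 1: conv -> MFM -> pool  (input: 1 x 64 x 96 log-Mel "image")
            nn.Conv2d(1, 32, kernel_size=5, padding=2), MFM(), nn.MaxPool2d(2),
            # group 2
            nn.Conv2d(16, 64, kernel_size=3, padding=1), MFM(), nn.MaxPool2d(2),
            # group 3
            nn.Conv2d(32, 128, kernel_size=3, padding=1), MFM(), nn.MaxPool2d(2),
        )
        self.embedding = nn.Linear(64 * 8 * 12, embed_dim)   # fully connected "embedding layer"
        self.classifier = nn.Linear(embed_dim, n_digits)     # one output per digit 0-9

    def forward(self, x):                  # x: (batch, 1, 64, 96)
        h = self.features(x).flatten(1)
        emb = self.embedding(h)            # per-digit feature embedding
        return self.classifier(emb), emb
```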
S105: and determining a target digital password corresponding to the first digital voice signal according to the target digital corresponding to each second digital voice signal.
In this embodiment of the present invention, the voice recognition device may determine, according to the target number corresponding to each second digital voice signal, the target digital password corresponding to the first digital voice signal.
In an embodiment, the voice recognition device may determine the sequence of the target digit arrangement according to the sequence of each recorded second digital voice signal arranged in the first digital voice signal, and form the target digital password corresponding to the first digital voice signal according to the sequence of the target digit arrangement.
In the embodiment of the present invention, the voice recognition device may obtain a first digital voice signal to be detected, segment the first digital voice signal to obtain a plurality of second digital voice signals corresponding to the first digital voice signal, pre-process each of the second digital voice signals obtained by the segmentation to obtain a logarithmic mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each of the second digital voice signals from the logarithmic mel power spectrum, so that the voice recognition device may recognize the target feature information of each of the second digital voice signals based on a neural network model to obtain a target number corresponding to each of the second digital voice signals, and determine a target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals. In this way, the speech signal of text-independent type can be effectively recognized, and the performance and effectiveness of the speech recognition system can be improved.
Referring to fig. 2, fig. 2 is a schematic flowchart of another speech recognition method according to an embodiment of the present invention. As shown in fig. 2, the method may be executed by a speech recognition device, and a specific explanation of the speech recognition device is as described above and is not repeated here. This embodiment differs from the embodiment described in fig. 1 in that it mainly illustrates the detailed implementation process schematically. Specifically, the method of the embodiment of the present invention includes the following steps.
S201: a training sample set is obtained, wherein the training sample set comprises target characteristic information of sample digital voice signals, and each sample digital voice signal is determined by a number.
In an embodiment of the invention, the speech recognition device may obtain a training sample set comprising target feature information of sample digital speech signals, each sample digital speech signal being determined by a number.
S202: and training and optimizing an initial neural network model based on the target characteristic information of each sample digital voice signal in the training sample set to obtain the neural network model.
In the embodiment of the present invention, the speech recognition device may generate an initial neural network model according to a preset neural network algorithm, use the target feature information of the sample digital speech signals as input data, perform training optimization on the initial neural network model based on the target feature information of each sample digital speech signal in the training sample set, and output a number corresponding to the target feature information, thereby obtaining the neural network model. The neural network model is explained as above, and is not described herein again.
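A minimal training-loop sketch for this step, assuming the MFMCNN model sketched earlier and a data loader yielding (64 x 96 feature, digit label) batches; the optimizer, loss function, learning rate and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_digit_model(model, loader, epochs=20, lr=1e-3, device="cpu"):
    """Train/optimize the initial network on per-digit target features,
    with the digit (0-9) of each sample as the supervision signal."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        total, correct = 0, 0
        for feats, digits in loader:           # feats: (B, 1, 64, 96) float, digits: (B,) long
            feats, digits = feats.to(device), digits.to(device)
            logits, _ = model(feats)
            loss = criterion(logits, digits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (logits.argmax(dim=1) == digits).sum().item()
            total += digits.numel()
        print(f"epoch {epoch + 1}: accuracy {correct / total:.3f}")
    return model
```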
S203: the method comprises the steps of obtaining a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of numbers.
In the embodiment of the present invention, a voice recognition device may obtain a first digital voice signal to be detected, where the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of numbers. The explanation of the digital password is as described above and is not repeated here.
S204: and carrying out preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, wherein each second digital voice signal is determined by one number.
In this embodiment of the present invention, the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, where each of the second digital voice signals is determined by one number. The specific embodiments are as described above, and are not described herein again.
S205: and determining a logarithmic Mel power spectrum corresponding to each second digital voice signal according to a preset signal processing method, and extracting target characteristic information of each second digital voice signal from the logarithmic Mel power spectrum.
In this embodiment of the present invention, the voice recognition device may process each of the second digital voice signals obtained by segmentation by using a preset signal processing method, obtain a logarithmic mel-power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each of the second digital voice signals from the logarithmic mel-power spectrum. The specific embodiments are as described above and will not be described herein.
S206: and calculating the similarity between the target characteristic information of the second digital voice signal and the target characteristic information of each sample digital voice signal in the training sample set.
In the embodiment of the present invention, when the speech recognition device recognizes the target feature information of each second digital speech signal based on a neural network model, the speech recognition device may calculate a similarity between the target feature information of the second digital speech signal and the target feature information of each sample digital speech signal in the training sample set, so as to determine the target digital speech signal according to the similarity.
S207: acquiring target characteristic information of at least one sample digital voice signal with the similarity larger than a preset similarity threshold, and determining the target characteristic information of the target sample digital voice signal with the maximum similarity from the target characteristic information of the at least one sample digital voice signal.
In an embodiment of the present invention, after calculating the similarity between the second digital voice signal and each sample digital voice signal in the training sample set, the voice recognition device may obtain target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold, and determine, from the target feature information of the at least one sample digital voice signal, the target feature information of the target sample digital voice signal with the largest similarity.
S208: and determining a target digit corresponding to the target characteristic information of the target sample digital voice signal according to the corresponding relation between the target characteristic information and the digits of a preset sample digital voice signal.
In the embodiment of the present invention, after the voice recognition device determines the target digital voice signal with the maximum similarity, the voice recognition device may further determine a target number corresponding to the target characteristic information of the target sample digital voice signal according to a corresponding relationship between the target characteristic information of a preset sample digital voice signal and a number.
In an embodiment, the voice recognition device may further determine the similarity between the first digital voice signal to be detected and the target digital voice signal by calculating the cosine similarity between the first digital voice signal to be detected and the target digital voice signal, and determine the number with the cosine similarity larger than a preset threshold as the target number corresponding to the target digital voice signal.
In one embodiment, after determining the target number corresponding to the target digital voice signal, the voice recognition device may obtain the number of target numbers corresponding to the target digital voice signal and the number of digits corresponding to the first digital voice signal to be detected, and calculate a number ratio between the number of target numbers and the number of digits corresponding to the first digital voice signal, so as to determine, according to the number ratio, the probability that the first digital voice signal to be detected is successfully detected. The voice recognition device may detect whether the probability is smaller than a preset threshold, and if the probability is smaller than the preset threshold, may select a sample training set similar to the first digital voice signal to train and adjust the neural network model, so as to train and optimize the neural network model in real time, thereby further improving the performance and effectiveness of recognizing the voice signal.
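A sketch of the similarity matching and the detection-probability check described above, using cosine similarity over per-digit embeddings; the similarity threshold, probability threshold and data layout are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def recognize_digits(segment_embeddings, sample_embeddings, sample_digits,
                     sim_threshold=0.7, prob_threshold=0.8):
    """For each per-digit embedding, keep the most similar training sample whose
    similarity exceeds sim_threshold; then compare the fraction of recognized
    digits against prob_threshold to decide whether the model needs re-training."""
    recognized = []
    for emb in segment_embeddings:
        sims = [cosine_similarity(emb, s) for s in sample_embeddings]
        best = int(np.argmax(sims))
        if sims[best] > sim_threshold:
            recognized.append(sample_digits[best])      # target digit for this segment
    # detection probability: recognized digits / digits in the first digital voice signal
    probability = len(recognized) / max(1, len(segment_embeddings))
    needs_retraining = probability < prob_threshold
    password = "".join(str(d) for d in recognized)       # digits kept in utterance order
    return password, probability, needs_retraining
```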
S209: and determining a target digital password corresponding to the first digital voice signal according to the target digital corresponding to each second digital voice signal.
In this embodiment of the present invention, the voice recognition device may determine, according to the target number corresponding to each second digital voice signal, the target digital password corresponding to the first digital voice signal.
In the embodiment of the present invention, the voice recognition device may obtain a first digital voice signal to be detected, perform preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, process each of the second digital voice signals according to a preset signal processing method to obtain a logarithmic mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each of the second digital voice signals from the logarithmic mel power spectrum, so that the voice recognition device may recognize the target feature information of each of the second digital voice signals based on a neural network model to obtain a target number corresponding to each of the second digital voice signals, and determine a target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals. In this way, the speech signal of text-independent type can be effectively recognized, and the performance and effectiveness of the speech recognition system can be improved.
The embodiment of the invention also provides a voice recognition apparatus, which includes units for performing the method described in any of the foregoing embodiments. Specifically, referring to fig. 3, fig. 3 is a schematic block diagram of a speech recognition device according to an embodiment of the present invention. The speech recognition apparatus of the present embodiment includes: an acquisition unit 301, a segmentation processing unit 302, a preprocessing unit 303, an identification unit 304, and a determination unit 305.
An obtaining unit 301, configured to obtain a first digital voice signal to be detected, where the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of digits;
a segmentation processing unit 302, configured to perform preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, where each second digital voice signal is determined by a number;
a preprocessing unit 303, configured to process each second digital voice signal according to a preset signal processing method, determine a logarithmic mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the logarithmic mel power spectrum;
the identifying unit 304 is configured to identify target feature information of each second digital voice signal based on a neural network model, so as to obtain a target number corresponding to each second digital voice signal;
a determining unit 305, configured to determine a target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals.
Further, before the obtaining unit 301 obtains the first digital voice signal to be detected, it is further configured to:
acquiring a training sample set, wherein the training sample set comprises target characteristic information of sample digital voice signals, and each sample digital voice signal is determined by one number;
generating an initial neural network model according to a preset neural network algorithm;
and training and optimizing the initial neural network model based on the target characteristic information of each sample digital voice signal in the training sample set to obtain the neural network model.
Further, when the identifying unit 304 identifies the target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal, the identifying unit is specifically configured to:
calculating the similarity between the target characteristic information of the second digital voice signal and the target characteristic information of each sample digital voice signal in the training sample set;
acquiring target characteristic information of at least one sample digital voice signal with the similarity larger than a preset similarity threshold;
determining target characteristic information of the target sample digital voice signal with the maximum similarity from the target characteristic information of the at least one sample digital voice signal;
and determining a target digit corresponding to the target characteristic information of the target sample digital voice signal according to a preset correspondence between the target characteristic information of sample digital voice signals and digits.
Further, after the determining unit 305 determines the target number corresponding to the target characteristic information of the target sample digital voice signal, it is further configured to:
acquiring the number of target digits corresponding to the target characteristic information of the target sample digital voice signal and acquiring the number of digits corresponding to the first digital voice signal to be detected;
calculating a quantity ratio between the quantity of the target digits and the quantity of digits corresponding to the first digital voice signal;
determining the probability of successful detection of the first digital voice signal to be detected according to the quantity ratio;
judging whether the probability is smaller than a preset threshold value or not;
and if so, selecting a sample training set similar to the first digital voice signal to train and adjust the neural network model.
Further, the preprocessing unit 303 is configured to, when processing each second digital voice signal according to a preset signal processing method and determining a logarithmic mel-power spectrum corresponding to each second digital voice signal, specifically:
performing fast Fourier transform on the voice frame corresponding to each second digital voice signal to obtain a frequency spectrum signal of the voice frame corresponding to each second digital voice signal;
and converting the spectrum signal of the voice frame corresponding to each second digital voice signal into logarithmic mel spectrum power so as to obtain the logarithmic mel power spectrum corresponding to each second digital voice signal.
Further, when the preprocessing unit 303 converts the spectrum signal of the speech frame corresponding to each second digital speech signal into the logarithmic mel spectrum power, it is specifically configured to:
taking an absolute value of a frequency spectrum signal of a voice frame corresponding to each second digital voice signal to obtain a power frequency spectrum of the voice frame corresponding to each second digital voice signal;
and converting the power spectrum of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum.
Further, when the preprocessing unit 303 converts the spectrum signal of the speech frame corresponding to each second digital speech signal into the power of the logarithmic mel spectrum, it is specifically configured to:
taking a square value of a frequency spectrum signal of a voice frame corresponding to each second digital voice signal to obtain a power frequency spectrum of the voice frame corresponding to each second digital voice signal;
and converting the power spectrum of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum.
In this embodiment of the present invention, the obtaining unit 301 of the voice recognition device may obtain a first digital voice signal to be detected, the segmentation processing unit 302 performs preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, the preprocessing unit 303 processes each second digital voice signal according to a preset signal processing method to obtain a logarithmic mel power spectrum corresponding to each second digital voice signal, and extracts target feature information of each second digital voice signal from the logarithmic mel power spectrum, so that the recognition unit 304 may recognize the target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal, and the determining unit 305 determines a target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal. In this way, the speech signal of text-independent type can be effectively recognized, and the performance and effectiveness of the speech recognition system can be improved.
Referring to fig. 4, fig. 4 is a schematic block diagram of another speech recognition apparatus provided in the embodiment of the present invention. The speech recognition device in the present embodiment as shown in the figure may include: one or more processors 401; one or more input devices 402, one or more output devices 403, and memory 404. The processor 401, input device 402, output device 403, and memory 404 are connected by a bus 405. The memory 404 is used to store a computer program comprising program instructions and the processor 401 is used to execute the program instructions stored by the memory 404. Wherein the processor 401 is configured to call the program instruction to perform:
acquiring a first digital voice signal to be detected, wherein the first digital voice signal consists of a digital password, and the digital password consists of a plurality of digits;
performing preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, wherein each second digital voice signal is determined by a number;
processing each second digital voice signal according to a preset signal processing method, determining a logarithmic Mel power frequency spectrum corresponding to each second digital voice signal, and extracting target characteristic information of each second digital voice signal from the logarithmic Mel power frequency spectrum;
identifying target characteristic information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal;
and determining a target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.
Further, before the processor 401 acquires the first digital voice signal to be detected, it is further configured to:
acquiring a training sample set, wherein the training sample set comprises target characteristic information of sample digital voice signals, and each sample digital voice signal is determined by one number;
generating an initial neural network model according to a preset neural network algorithm;
and training and optimizing the initial neural network model based on the target characteristic information of each sample digital voice signal in the training sample set to obtain the neural network model.
Further, the processor 401, when recognizing the target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal, is specifically configured to:
calculating the similarity between the target characteristic information of the second digital voice signal and the target characteristic information of each sample digital voice signal in the training sample set;
acquiring target characteristic information of at least one sample digital voice signal with the similarity larger than a preset similarity threshold;
determining target characteristic information of the target sample digital voice signal with the maximum similarity from the target characteristic information of the at least one sample digital voice signal;
and determining a target digit corresponding to the target characteristic information of the target sample digital voice signal according to a preset correspondence between the target characteristic information of sample digital voice signals and digits.
Further, after the processor 401 determines the target number corresponding to the target characteristic information of the target sample digital voice signal, it is further configured to:
acquiring the number of target digits corresponding to the target characteristic information of the target sample digital voice signal and acquiring the number of digits corresponding to the first digital voice signal to be detected;
calculating a quantity ratio between the quantity of the target digits and the quantity of digits corresponding to the first digital voice signal;
determining the probability of successful detection of the first digital voice signal to be detected according to the quantity ratio;
judging whether the probability is smaller than a preset threshold value or not;
and if so, selecting a sample training set similar to the first digital voice signal to train and adjust the neural network model.
Further, when the processor 401 processes each second digital voice signal according to a preset signal processing method and determines a logarithmic mel-power spectrum corresponding to each second digital voice signal, it is specifically configured to:
performing fast Fourier transform on the voice frame corresponding to each second digital voice signal to obtain a frequency spectrum signal of the voice frame corresponding to each second digital voice signal;
and converting the spectrum signal of the voice frame corresponding to each second digital voice signal into logarithmic mel spectrum power so as to obtain the logarithmic mel power spectrum corresponding to each second digital voice signal.
Further, when the processor 401 converts the spectrum signal of the speech frame corresponding to each second digital speech signal into the log mel spectrum power, it is specifically configured to:
taking an absolute value of a frequency spectrum signal of a voice frame corresponding to each second digital voice signal to obtain a power frequency spectrum of the voice frame corresponding to each second digital voice signal;
and converting the power spectrum of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum.
Further, when the processor 401 converts the spectrum signal of the speech frame corresponding to each second digital speech signal into the power of the logarithmic mel spectrum, it is specifically configured to:
taking a square value of a frequency spectrum signal of a voice frame corresponding to each second digital voice signal to obtain a power frequency spectrum of the voice frame corresponding to each second digital voice signal;
and converting the power spectrum of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum.
In the embodiment of the present invention, the voice recognition device may obtain a first digital voice signal to be detected, perform preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, and process each second digital voice signal according to a preset signal processing method to obtain a logarithmic mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the logarithmic mel power spectrum, so that the voice recognition device may recognize the target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal, and determine a target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal. In this way, the voice signal of text-independent type can be effectively recognized, and the performance and the effectiveness of the voice recognition system are improved.
It should be understood that, in the embodiment of the present invention, the processor 401 may be a Central Processing Unit (CPU), and the processor may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Input devices 402 may include a touch pad, microphone, etc., and output devices 403 may include a display (LCD, etc.), speakers, etc.
The memory 404 may include a read-only memory and a random access memory, and provides instructions and data to the processor 401. A portion of the memory 404 may also include non-volatile random access memory. For example, the memory 404 may also store device type information.
In a specific implementation, the processor 401, the input device 402, and the output device 403 described in this embodiment of the present invention may execute the implementation manner described in the method embodiment described in fig. 1 or fig. 2 of the speech recognition method provided in this embodiment of the present invention, and may also execute the implementation manner of the speech recognition device described in fig. 3 or fig. 4 in this embodiment of the present invention, which is not described herein again.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for speech recognition described in the embodiment corresponding to fig. 1 or fig. 2 is implemented, and a speech recognition device according to the embodiment corresponding to fig. 3 or fig. 4 of the present invention may also be implemented, which are not described herein again.
The computer readable storage medium may be an internal storage unit of the speech recognition device according to any of the foregoing embodiments, for example, a hard disk or a memory of the speech recognition device. The computer readable storage medium may also be an external storage device of the speech recognition device, such as a plug-in hard disk provided on the speech recognition device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the voice recognition device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the speech recognition device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the components and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of the two. To clearly illustrate this interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above description is only a part of the embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present invention, and these modifications or substitutions should be covered within the scope of the present invention.

Claims (9)

1. A speech recognition method, comprising:
acquiring a first digital voice signal to be detected, wherein the first digital voice signal consists of a digital password, and the digital password consists of a plurality of digits;
performing preset segmentation processing on the first digital voice signal by using an HMM (Hidden Markov Model) segmentation method to obtain a plurality of second digital voice signals, and recording the order of each second digital voice signal in the first digital voice signal, wherein each second digital voice signal is determined by one digit;
processing each second digital voice signal according to a preset signal processing method, determining a logarithmic Mel power spectrum corresponding to each second digital voice signal, and performing normalization processing on the logarithmic Mel power spectrum corresponding to each second digital voice signal, so as to extract target characteristic information of each second digital voice signal from the logarithmic Mel power spectrum;
identifying the target characteristic information of each second digital voice signal based on a neural network model to obtain a target digit corresponding to each second digital voice signal;
determining the arrangement order of the target digits according to the target digit corresponding to each second digital voice signal and the recorded order of each second digital voice signal in the first digital voice signal, and determining the target digital password corresponding to the first digital voice signal according to the arrangement order of the target digits;
acquiring the number of target digits corresponding to target characteristic information of target sample digital voice signals, and acquiring the number of digits corresponding to the first digital voice signal to be detected, wherein a target sample digital voice signal is the sample digital voice signal in the training sample set whose target characteristic information has a similarity to the target characteristic information of the second digital voice signal that is greater than a preset similarity threshold and is the largest among all such similarities;
calculating a quantity ratio between the number of target digits and the number of digits corresponding to the first digital voice signal, determining, according to the quantity ratio, a probability of successful detection of the first digital voice signal to be detected, and judging whether the probability is smaller than a preset threshold;
and if so, selecting a sample training set similar to the first digital voice signal to train and adjust the neural network model.
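Purely as an illustration of the final steps of claim 1, the sketch below computes the quantity ratio and the retraining decision; treating the ratio itself as the detection probability, as well as the threshold value, are assumptions, since the claim does not fix either.

    # Assumed interpretation: the detection probability equals the quantity ratio.
    def needs_retraining(num_target_digits, num_digits_in_signal, prob_threshold=0.8):
        ratio = num_target_digits / num_digits_in_signal  # quantity ratio
        probability = ratio                               # probability of successful detection
        # If the probability is below the preset threshold, the neural network model
        # should be trained and adjusted on a similar sample training set.
        return probability < prob_threshold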
2. The method of claim 1, wherein before the acquiring the first digital voice signal to be detected, the method further comprises:
acquiring a training sample set, wherein the training sample set comprises target characteristic information of sample digital voice signals, and each sample digital voice signal is determined by one digit;
generating an initial neural network model according to a preset neural network algorithm;
and training and optimizing the initial neural network model based on the target characteristic information of each sample digital voice signal in the training sample set to obtain the neural network model.
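As a hedged illustration of claim 2 only, the following sketch builds and trains a small feed-forward classifier over the target characteristic information; the architecture, feature dimension and training hyper-parameters are assumptions, since the claim specifies none of them.

    import torch
    import torch.nn as nn

    def build_initial_model(feature_dim=4000, num_digits=10):
        # A small feed-forward network standing in for the "initial neural network model".
        return nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, num_digits))

    def train_model(model, features, labels, epochs=10, lr=1e-3):
        # features: (N, feature_dim) float tensor of target characteristic information,
        # labels:   (N,) long tensor of digit labels from the training sample set.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
        return model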
3. The method of claim 2, wherein the identifying the target characteristic information of each second digital voice signal based on the neural network model to obtain the target digit corresponding to each second digital voice signal comprises:
calculating the similarity between the target characteristic information of the second digital voice signal and the target characteristic information of each sample digital voice signal in the training sample set;
acquiring target characteristic information of at least one sample digital voice signal with the similarity larger than a preset similarity threshold;
determining target characteristic information of the target sample digital voice signal with the maximum similarity from the target characteristic information of the at least one sample digital voice signal;
and determining the target digit corresponding to the target characteristic information of the target sample digital voice signal according to a preset correspondence between target characteristic information of sample digital voice signals and digits.
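By way of illustration of claim 3, the sketch below matches the characteristic information of one second digital voice signal against the training samples; the use of cosine similarity and the threshold value are assumptions, as the claim does not name a particular similarity measure.

    import numpy as np

    def identify_digit(query_feat, sample_feats, sample_digits, sim_threshold=0.8):
        # query_feat:    (D,) feature vector of one second digital voice signal
        # sample_feats:  (N, D) feature vectors of the sample digital voice signals
        # sample_digits: digit associated with each sample (preset correspondence)
        sims = sample_feats @ query_feat / (
            np.linalg.norm(sample_feats, axis=1) * np.linalg.norm(query_feat) + 1e-12)
        candidates = np.where(sims > sim_threshold)[0]   # similarity above the preset threshold
        if candidates.size == 0:
            return None                                  # no sufficiently similar sample
        best = candidates[np.argmax(sims[candidates])]   # maximum similarity
        return sample_digits[best]                       # target digit via the preset mapping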
4. The method of claim 1, wherein the processing each second digital voice signal according to the preset signal processing method to determine the logarithmic Mel power spectrum corresponding to each second digital voice signal comprises:
performing frame division and windowing processing on each second digital voice signal to obtain a voice frame corresponding to each second digital voice signal;
performing fast Fourier transform on the voice frame corresponding to each second digital voice signal to obtain a frequency spectrum signal of the voice frame corresponding to each second digital voice signal;
and converting the spectrum signal of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum, to obtain the logarithmic Mel power spectrum corresponding to each second digital voice signal.
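A possible realization of claim 4 in Python is sketched below; the sampling rate, frame length, hop size, window and number of Mel bands are illustrative assumptions rather than values fixed by the claim.

    import numpy as np
    import librosa

    def log_mel_power_spectrum(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
        # signal: 1-D float array holding one second digital voice signal.
        # Frame division and windowing (Hamming window assumed).
        frames = librosa.util.frame(signal, frame_length=n_fft, hop_length=hop)
        frames = frames * np.hamming(n_fft)[:, None]
        # Fast Fourier transform of each frame -> spectrum signal of the voice frame.
        spectrum = np.fft.rfft(frames, n=n_fft, axis=0)
        # Power spectrum, Mel filtering and logarithm -> logarithmic Mel power spectrum.
        power = np.abs(spectrum) ** 2
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        return np.log(mel_fb @ power + 1e-10)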
5. The method of claim 4, wherein the converting the spectrum signal of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum comprises:
taking an absolute value of a frequency spectrum signal of a voice frame corresponding to each second digital voice signal to obtain a power frequency spectrum of the voice frame corresponding to each second digital voice signal;
and converting the power spectrum of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum.
6. The method of claim 4, wherein the converting the spectrum signal of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum comprises:
taking a square value of a frequency spectrum signal of a voice frame corresponding to each second digital voice signal to obtain a power frequency spectrum of the voice frame corresponding to each second digital voice signal;
and converting the power spectrum of the voice frame corresponding to each second digital voice signal into a logarithmic Mel power spectrum.
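Claims 5 and 6 differ only in whether the power spectrum is taken as the magnitude or the squared magnitude of the spectrum signal; the short sketch below contrasts the two variants under the same assumed Mel-filter settings as above.

    import numpy as np
    import librosa

    def spectrum_to_log_mel(spectrum, sr=16000, n_fft=512, n_mels=40, squared=True):
        # spectrum: (1 + n_fft // 2, n_frames) complex spectrum of the voice frames.
        power = np.abs(spectrum) ** 2 if squared else np.abs(spectrum)  # claim 6 vs claim 5
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        return np.log(mel_fb @ power + 1e-10)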
7. A speech recognition device comprising means for performing the method of any one of claims 1-6.
8. A speech recognition device comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-6.
CN201910014557.3A 2019-01-04 2019-01-04 Voice recognition method, device and computer readable storage medium Active CN109545226B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910014557.3A CN109545226B (en) 2019-01-04 2019-01-04 Voice recognition method, device and computer readable storage medium
PCT/CN2019/116979 WO2020140609A1 (en) 2019-01-04 2019-11-11 Voice recognition method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910014557.3A CN109545226B (en) 2019-01-04 2019-01-04 Voice recognition method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109545226A CN109545226A (en) 2019-03-29
CN109545226B (en) 2022-11-22

Family

ID=65834261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910014557.3A Active CN109545226B (en) 2019-01-04 2019-01-04 Voice recognition method, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109545226B (en)
WO (1) WO2020140609A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545226B (en) * 2019-01-04 2022-11-22 平安科技(深圳)有限公司 Voice recognition method, device and computer readable storage medium
CN110415685A (en) * 2019-08-20 2019-11-05 河海大学 A kind of audio recognition method
CN112802484B (en) * 2021-04-12 2021-06-18 四川大学 Panda sound event detection method and system under mixed audio frequency

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07261792A (en) * 1994-03-18 1995-10-13 Fujitsu Ltd Voice recognition device by numerical value with digit
JPH09311693A (en) * 1996-05-21 1997-12-02 Matsushita Electric Ind Co Ltd Speech recognition apparatus
CN103903612B (en) * 2014-03-26 2017-02-22 浙江工业大学 Method for performing real-time digital speech recognition
CN107221333B (en) * 2016-03-21 2019-11-08 中兴通讯股份有限公司 A kind of identity authentication method and device
CN107492382B (en) * 2016-06-13 2020-12-18 阿里巴巴集团控股有限公司 Voiceprint information extraction method and device based on neural network
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition
CN109074822B (en) * 2017-10-24 2023-04-21 深圳和而泰智能控制股份有限公司 Specific voice recognition method, apparatus and storage medium
CN108922561A (en) * 2018-06-04 2018-11-30 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN109545226B (en) * 2019-01-04 2022-11-22 平安科技(深圳)有限公司 Voice recognition method, device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A light cnn for deep face representation with noisy labels";Xiang Wu等;《arXiv》;20161114;1-10 *

Also Published As

Publication number Publication date
CN109545226A (en) 2019-03-29
WO2020140609A1 (en) 2020-07-09

Similar Documents

Publication Publication Date Title
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
Liu et al. An MFCC‐based text‐independent speaker identification system for access control
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization based
CN110718228B (en) Voice separation method and device, electronic equipment and computer readable storage medium
CN111276131A (en) Multi-class acoustic feature integration method and system based on deep neural network
CN109545226B (en) Voice recognition method, device and computer readable storage medium
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
Baloul et al. Challenge-based speaker recognition for mobile authentication
CN103794207A (en) Dual-mode voice identity recognition method
EP3989217A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN110634492B (en) Login verification method, login verification device, electronic equipment and computer readable storage medium
CN110111798B (en) Method, terminal and computer readable storage medium for identifying speaker
CN113823293B (en) Speaker recognition method and system based on voice enhancement
CN113053410B (en) Voice recognition method, voice recognition device, computer equipment and storage medium
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
CN113223536A (en) Voiceprint recognition method and device and terminal equipment
CN113851136A (en) Clustering-based speaker recognition method, device, equipment and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
US20200082830A1 (en) Speaker recognition
CN112614493B (en) Voiceprint recognition method, system, storage medium and electronic device
CN117037796A (en) AIGC voice fraud wind control method, medium and equipment based on multiple characteristics
CN113035230B (en) Authentication model training method and device and electronic equipment
CN116312559A (en) Training method of cross-channel voiceprint recognition model, voiceprint recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant