WO2019232829A1 - Voiceprint recognition method and apparatus, computer device and storage medium


Info

Publication number: WO2019232829A1
Authority: WIPO (PCT)
Prior art keywords: speech, voice, recognized, truncated, feature
Application number: PCT/CN2018/092598
Other languages: French (fr), Chinese (zh)
Inventor: 涂宏 (Tu Hong)
Original Assignee: 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司
Publication of WO2019232829A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building

Definitions

  • the present application relates to the technical field of biometrics, and in particular, to a voiceprint recognition method, device, computer equipment, and storage medium.
  • communication equipment manufacturers configure communication devices with a voice gain control module to keep the call volume comfortable.
  • the automatic gain control module works by adjusting the saturation value of the voice volume, that is, by truncating the voice: it applies a larger gain to speech with a lower volume and a smaller gain to speech with a higher volume.
  • as a result, truncation of the voice in the communication device occurs frequently, and voiceprint recognition based on voice collected by such a device suffers reduced accuracy.
  • a voiceprint recognition method includes:
  • a truncated speech detection algorithm is used to detect the speech to be recognized; if the speech to be recognized is a truncated speech segment, a truncated speech repair model is used to repair the to-be-recognized features to obtain the target speech features.
  • a preset voiceprint recognition model is used to perform voiceprint recognition on the target voice feature and the standard voice feature, to obtain a voiceprint recognition result indicating whether the target voice feature and the standard voice feature correspond to the same speaker.
  • a voiceprint recognition device includes:
  • a to-be-recognized voice module, configured to obtain the to-be-recognized voice, where the to-be-recognized voice carries a speaker identifier;
  • a to-be-recognized feature module, configured to obtain the corresponding to-be-recognized features based on the voice to be recognized;
  • a target voice feature module, configured to detect the voice to be recognized using a truncated voice detection algorithm and, if the voice to be recognized is a truncated voice segment, to repair the to-be-recognized features using a truncated voice repair model to obtain the target voice feature;
  • a voiceprint recognition result module, configured to perform voiceprint recognition on the target voice feature and the standard voice feature based on the standard voice feature corresponding to the speaker identifier, and to obtain a voiceprint recognition result indicating whether the target voice feature and the standard voice feature correspond to the same speaker.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented:
  • a truncated speech detection algorithm is used to detect the speech to be recognized; if the speech to be recognized is a truncated speech segment, a truncated speech repair model is used to repair the to-be-recognized features to obtain the target speech features.
  • a preset voiceprint recognition model is used to perform voiceprint recognition on the target voice feature and the standard voice feature, to obtain a voiceprint recognition result indicating whether the target voice feature and the standard voice feature correspond to the same speaker.
  • One or more non-volatile readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the following steps:
  • a truncated speech detection algorithm is used to detect the speech to be recognized; if the speech to be recognized is a truncated speech segment, a truncated speech repair model is used to repair the to-be-recognized features to obtain the target speech features.
  • a preset voiceprint recognition model is used to perform voiceprint recognition on the target voice feature and the standard voice feature, to obtain a voiceprint recognition result indicating whether the target voice feature and the standard voice feature correspond to the same speaker.
  • FIG. 1 is a schematic diagram of an application environment of a voiceprint recognition method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a voiceprint recognition method according to an embodiment of the present application.
  • FIG. 3 is another specific flowchart of a voiceprint recognition method in an embodiment of the present application.
  • FIG. 4 is another specific flowchart of a voiceprint recognition method according to an embodiment of the present application.
  • FIG. 5 is another specific flowchart of a voiceprint recognition method in an embodiment of the present application.
  • FIG. 6 is another specific flowchart of a voiceprint recognition method in an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a voiceprint recognition device in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the voiceprint recognition method provided in the embodiment of the present application can be applied in the application environment shown in FIG. 1, where the voice collection terminal communicates with the recognition server through a network.
  • the voice collection terminal includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
  • Voiceprint information is the spectrum of the sound waves carrying speech information, as displayed by electroacoustic instruments.
  • Human vocal control organs include vocal cords, soft palate, tongue, teeth and lips, etc.
  • Human vocal resonators include pharyngeal cavity, oral cavity and nasal cavity. These organs have differences in size, shape, and function. These differences cause changes in vocal airflow, resulting in differences in sound quality and timbre.
  • people's vocal habits can vary from fast to slow, and the amount of force exerted can vary, which also results in differences in sound intensity and length.
  • Pitch, intensity, length, and timbre are called "four elements" of speech in linguistics. These factors can be decomposed into more than ninety features. These characteristics are expressed as the wavelength, frequency, intensity and rhythm of different sounds, which can be drawn into a time-based power spectrum through acoustic tools, that is, forming the voiceprint information of the speaker.
  • Voiceprint recognition, a type of biometric technology also known as speaker recognition, has two types: speaker identification and speaker confirmation (verification). Different tasks and applications use different voiceprint recognition technologies: for example, identification techniques may be needed to narrow the scope of a criminal investigation, while a bank transaction requires confirmation techniques. This embodiment is described based on speaker confirmation technology.
  • a voiceprint recognition method is provided.
  • the voiceprint recognition method is applied to the recognition server in FIG. 1 as an example for description, and includes the following steps:
  • the to-be-recognized voice is a voice that is directly collected by a voice collection terminal and needs to be identified.
  • the to-be-recognized voice carries a speaker identifier, which is used to identify the speaker corresponding to the to-be-recognized voice.
  • the speaker identifier is provided by the speaker to be identified for identity verification and includes, but is not limited to, a name, a registered name, or an identification number that can uniquely represent the speaker's identity.
  • a voice gain control module is configured on the voice collection terminal to keep the collected speaker's voice within a proper volume range, so that the speech to be recognized directly collected by the voice collection terminal may include truncated speech segments and normal speech segments. Specifically, when the voice collection terminal records the speech to be recognized, if the speaker's volume is too high or too low, the voice gain control module adaptively adjusts the amplitude threshold corresponding to the highest or lowest sound threshold, truncates the amplitude portions of the speech above the highest sound threshold or below the lowest sound threshold, and records them at the amplitude threshold, forming a truncated speech segment. Correspondingly, the portion of the recorded speech whose volume lies between the lowest and highest sound thresholds needs no gain processing and is therefore a normal speech segment.
  • for example, if the amplitude threshold of the voice collection terminal is Eq and the maximum amplitude Em of the signal exceeds Eq, the sampling-point values are truncated at the amplitude threshold Eq: the portion of the waveform greater than Eq is clipped, forming the truncated speech segment in this embodiment.
  • moreover, because the voice collection terminal may adjust the gain automatically, a received sample may be recorded at a value Ec below the amplitude threshold Eq; in that case Ec is adaptively adjusted to serve as the amplitude threshold.
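  • As a concrete illustration (a minimal sketch, not part of the original disclosure; the 1 kHz test tone and the specific values of Eq and Em are assumptions), the hard truncation described above behaves as follows:

```python
import numpy as np

def truncate_speech(samples: np.ndarray, eq: float) -> np.ndarray:
    """Clip every sample whose magnitude exceeds the amplitude
    threshold Eq, mimicking the gain module's saturation."""
    return np.clip(samples, -eq, eq)

# A 1 kHz tone whose peak amplitude Em = 1.5 exceeds Eq = 1.0, so the
# waveform is flattened at +/-Eq, forming a truncated speech segment.
t = np.linspace(0.0, 0.02, 320, endpoint=False)
original = 1.5 * np.sin(2.0 * np.pi * 1000.0 * t)
clipped = truncate_speech(original, eq=1.0)
```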
  • the to-be-recognized feature is a feature that distinguishes the to-be-recognized voice from other people's voices; in this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter, MFCC features) may be used as the to-be-recognized feature.
  • MFCC: Mel-Frequency Cepstral Coefficients. Speech features in the field of voiceprint recognition are often expressed as MFCC features.
  • the speech signal from 200Hz to 5000Hz has a great impact on speech intelligibility.
  • when two sounds of different loudness act on the human ear, the louder frequency components affect the perception of the quieter frequency components, making them difficult to perceive; this phenomenon is called the masking effect.
  • low-frequency sounds mask high-frequency sounds easily, whereas high-frequency sounds mask low-frequency sounds with more difficulty.
  • the critical bandwidth of sound masking is smaller at low frequencies than at high frequencies. Therefore, over the band from low frequency to high frequency, a set of Mel-scale band-pass filters can be arranged from dense to sparse according to the critical bandwidth, and the input signal is filtered so that each frequency band corresponds to one value.
  • the resolution of the Mel-scale filter bank is high in the low-frequency part, which is consistent with the auditory characteristics of the human ear. This is also the physical meaning of the Mel-scale.
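  • For reference, the Mel scale is commonly defined by the mapping below (a standard textbook formula, not quoted from this application); a short sketch of it and of the resulting dense-to-sparse filter spacing:

```python
import numpy as np

def hz_to_mel(f_hz: float) -> float:
    """Common Mel-scale mapping: equal Mel steps are dense at low
    frequencies and sparse at high frequencies."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Edges for a 22-filter bank between 0 Hz and 5000 Hz (22 filters
# matches the example given later in this application): the edges are
# equally spaced in Mel but crowd toward the low end in Hz.
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(5000.0), 22 + 2)
edges_hz = np.array([mel_to_hz(m) for m in edges_mel])
```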
  • a truncated speech detection algorithm is used to detect the speech to be recognized. If the speech to be recognized is a truncated speech segment, then a truncated speech repair model is used to repair the features of the speech to be identified to obtain the target speech features.
  • the target speech feature includes the speech feature corresponding to the normal speech segment, and also includes the speech feature formed by repairing the speech feature corresponding to the truncated speech segment using a truncated speech repair model. That is, the target speech feature is a speech feature formed by performing speech repair on the speech feature to be recognized.
  • the truncated speech repair model is a model that can restore the input speech features to be recognized and output them as the target speech features.
  • the truncated speech repair model has been trained in advance and stored on the recognition server, so that the recognition server can call the model in real time to repair the truncated speech segment.
  • the truncated voice detection algorithm is an algorithm that detects the type of the speech to be recognized collected by the voice collection terminal.
  • there are two types of speech to be recognized: truncated speech segments that have undergone truncation processing, and normal speech segments that have not. Understandably, since the signal of a normal speech segment is not processed, it retains the speaker's voiceprint characteristics, while a truncated speech segment is obtained after the portions of the speech to be recognized below the lowest sound threshold or above the highest sound threshold are cut off, so its signal is distorted. If speech recognition were performed directly on speech that includes truncated segments, the recognition could be inaccurate. Therefore, a truncated speech detection algorithm must first determine the type of the speech to be recognized, providing a technical basis for subsequent speech recognition.
  • a truncated speech repair model is used to repair the speech features to be identified to obtain the target speech features.
  • the truncated speech repair model is a model formed by training the initial training model and used to repair the speech features corresponding to truncated speech segments.
  • DNN: Deep Neural Networks
  • DBN: Deep Belief Nets
  • CDBN: Convolutional Deep Belief Networks
  • DNN model is widely used in many important Internet applications, such as speech recognition, image recognition, natural language processing, etc.
  • the DNN model is widely used in the speech recognition products of many companies because, despite its high computational complexity, it can greatly improve the accuracy of speech recognition.
  • the structure of the current DNN model includes an input layer, several intermediate layers, and an output layer.
  • the input layer is responsible for receiving input information from the outside world and passing it to the middle layer;
  • the middle layer is an internal information processing layer that is responsible for information transformation.
  • the middle layer can be designed as a single layer or as a multi-layer structure; after the information transmitted from the last middle layer to the output layer is further processed, one forward propagation of learning is completed, and the output layer outputs the information processing result to the outside world.
  • the number of neurons in each layer generally ranges from several hundred to tens of thousands, and the layers and layers are fully connected networks.
  • in the training computation of the DNN model, one layer is computed after another; the layers cannot be computed in parallel.
  • a DNN training can be expressed in the following stages: forward calculation, reverse error calculation, and finally the weight of each layer is updated according to the results of the forward calculation and reverse error calculation.
  • the forward calculation process is calculated from the input layer to the output layer, and the calculation is serial.
  • the reverse calculation process is calculated from the output layer to the first layer, and the calculation is also serial.
  • each time a small piece of training data is input, it is called a batch.
  • one batch completes one training step. That is to say, after new weights are obtained, they are used together with the next input batch to train and obtain updated weights. When all inputs have been processed once, one round is completed. A complete training generally requires 10-20 rounds.
  • the DNN training process is a process of forward information propagation and error back propagation, in which the weights of each layer are continuously adjusted; it is also the process by which the neural network learns and is trained. This process continues until the error of the network output is reduced to an acceptable level, or until a preset number of learning iterations is reached.
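  • To make the batch/round terminology concrete, here is a minimal numpy sketch (illustrative only; the layer sizes, learning rate, and random data are assumptions) of a one-hidden-layer network trained batch by batch over several rounds, with forward computation, backward error computation, and per-layer weight updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer network; shapes and data are illustrative only.
w1, b1 = rng.normal(size=(13, 64)) * 0.1, np.zeros(64)
w2, b2 = rng.normal(size=(64, 13)) * 0.1, np.zeros(13)
x, y = rng.normal(size=(1000, 13)), rng.normal(size=(1000, 13))
lr, batch = 0.01, 50

for rnd in range(15):                      # "rounds" over the whole input
    for i in range(0, len(x), batch):      # one batch = one weight update
        xb, yb = x[i:i + batch], y[i:i + batch]
        h = np.maximum(xb @ w1 + b1, 0.0)  # forward: ReLU hidden layer
        out = h @ w2 + b2                  # forward: linear output layer
        err = out - yb                     # backward: output error
        g_w2 = h.T @ err / len(xb)
        g_h = (err @ w2.T) * (h > 0)       # backward through the ReLU
        g_w1 = xb.T @ g_h / len(xb)
        w2 -= lr * g_w2; b2 -= lr * err.mean(axis=0)
        w1 -= lr * g_w1; b1 -= lr * g_h.mean(axis=0)
```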
  • step S30 the truncated speech repair model is used to repair the speech features to be identified, and the target speech features are obtained, which specifically include the following steps:
  • a truncated speech repair model based on the DNN model is used to repair the features of the speech to be recognized and obtain the target speech features.
  • the truncated speech repair model is a model formed by training a DNN model and used to repair the speech features corresponding to truncated speech segments and output the target speech features.
  • the truncated speech repair model generated in step S30 may be used to repair the speech features to be identified.
  • the DNN model can be trained to obtain the target speech feature output by the DNN model, that is, the original MFCC feature.
  • the recognition server uses the DNN-based truncated speech repair model to repair the to-be-recognized speech features and obtain the target speech features: it inputs the to-be-recognized features of the truncated speech segments (the truncated MFCC features) into the truncated speech repair model.
  • truncated MFCC features: the to-be-recognized MFCC features of truncated speech segments.
  • the target speech feature (MFCC feature) of the repaired speech segment, obtained after repair by the truncated speech repair model, serves as the technical basis for speech recognition. Because the MFCC feature makes no assumptions about or restrictions on the input speech signal and is derived from an auditory model, it has good robustness, better matches the auditory characteristics of the human ear, and retains good speech recognition performance even when the signal-to-noise ratio drops.
  • a truncated speech repair model based on a DNN model is used to repair truncated speech segments, which can greatly improve the accuracy of speech repair.
  • the gain module of the voice collection terminal adaptively adjusts the amplitude threshold, it is difficult for the recognition server to determine the truncated voice segment by specifying a fixed amplitude threshold.
  • instead, judging by the percentage of sampling points that fall in the treble volume sub-interval, as the truncated speech detection algorithm proposed in this step does, can effectively improve the accuracy of the judgment result.
  • the voiceprint recognition method proposed in this embodiment uses a truncated speech detection algorithm to detect whether the speech to be recognized is a truncated speech segment; if so, the truncated speech repair model repairs the to-be-recognized speech features of the truncated speech segment into target speech features, which are compared with the speaker's standard speech features to identify the speaker's true identity. By repairing the to-be-recognized speech features and obtaining target speech features close to the speaker's original speech, this embodiment can effectively improve the reliability and accuracy of speech recognition.
  • a preset voiceprint recognition model is used to perform voiceprint recognition on the target voice features and the standard voice features to obtain whether the target voice features and the standard voice features correspond to the voiceprint recognition result of the same speaker.
  • the standard voice is the voice corresponding to the speaker identifier; it is stored in a pre-stored standard voice database and records the speaker's voice without truncation processing.
  • the standard voice feature is the MFCC feature corresponding to the standard voice.
  • the preset voiceprint recognition model is a model for scoring similarity between standard voice features and target voice features.
  • GMM-UBM: Gaussian mixture model-universal background model
  • i-vector: identity vector (identity authentication vector)
  • the recognition server may store a standard i-vector corresponding to the standard MFCC feature in a database, so that the standard i-vector can be directly used as a comparison standard when performing speech recognition based on the i-vector.
  • Cosine distance also called cosine similarity
  • a vector is a directional line segment in a multi-dimensional space. If the directions of two vectors are the same, that is, the included angle is close to zero, and the cosine value of the included angle, that is, the cosine distance approaches 1, the two vectors are similar.
  • a distance threshold can be set according to the actual situation. When the cosine distance between the standard i-vector and the original i-vector is greater than the distance threshold, it can be determined that the target speech feature and the standard speech feature correspond to the same speaker.
  • step S20 that is, acquiring the corresponding feature of the voice to be recognized based on the voice to be recognized, specifically including the following steps:
  • S21 Pre-process the speech to be recognized, and obtain pre-processed speech data.
  • the speech to be recognized is pre-processed, and corresponding pre-processed voice data is obtained.
  • Pre-processing the speech to be recognized can better extract the speech features to be recognized, so that the extracted speech features can better represent the speech to be recognized, and use the speech features to distinguish the speech.
  • pre-processing the speech to be recognized to obtain pre-processed speech data includes the following steps:
  • pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end.
  • the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission, so that the receiving end can obtain a better signal waveform.
  • Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
  • the use of pre-emphasis can eliminate interference caused by the vocal cords and lips during utterance, effectively compensate the suppressed high-frequency part of the speech to be recognized, highlight the high-frequency formants of the speech to be recognized, and enhance its signal amplitude, which helps extract the to-be-recognized speech features.
  • after pre-emphasis, framing should also be performed.
  • Framing refers to the speech processing technology that cuts the entire voice signal into several segments.
  • the size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length.
  • Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames.
  • after framing, the speech to be recognized is divided into several pieces of speech data; this subdivision facilitates extraction of the to-be-recognized speech features.
  • S213 Perform windowing on the framed to-be-recognized speech to obtain pre-processed speech data.
  • windowing can solve this problem, making the framed speech to be recognized continuous and letting each frame exhibit the characteristics of a periodic function.
  • the windowing process specifically refers to processing the speech to be recognized with a window function; the window function may be a Hamming window.
  • the formula for windowing is s'_n = s_n × [0.54 − 0.46·cos(2πn/(N−1))], 0 ≤ n ≤ N−1, where N is the Hamming window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the windowed signal amplitude in the time domain.
  • the pre-processing operations on the speech to be recognized in steps S211 to S213 provide a basis for extracting the to-be-recognized speech features, make the extracted features more representative of the speech to be recognized, and allow the speech to be distinguished according to those features; a sketch of these pre-processing steps follows.
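  • A compact sketch of steps S211-S213 (illustrative only; the 0.97 pre-emphasis coefficient and the 20 ms frame length are common choices assumed here, with the frame shift set to half the frame length as described above):

```python
import numpy as np

def preprocess(signal: np.ndarray, sr: int = 8000) -> np.ndarray:
    """S211-S213: pre-emphasis, framing, Hamming windowing.
    Assumes len(signal) is at least one frame long."""
    # S211: pre-emphasis boosts high frequencies: y[n] = x[n] - 0.97*x[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # S212: 20 ms frames, frame shift of half a frame (frames overlap)
    flen = int(0.020 * sr)
    fshift = flen // 2
    n_frames = 1 + (len(emphasized) - flen) // fshift
    frames = np.stack([emphasized[i * fshift: i * fshift + flen]
                       for i in range(n_frames)])
    # S213: Hamming window s'_n = s_n * (0.54 - 0.46*cos(2*pi*n/(N-1)))
    return frames * np.hamming(flen)
```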
  • S22 Perform a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the speech to be identified, and obtain the power spectrum of the speech to be identified according to the frequency spectrum.
  • FFT Fast Fourier Transform
  • FFT is a collective term for efficient and fast methods of computing the discrete Fourier transform on a computer.
  • the use of this algorithm can greatly reduce the number of multiplications required by the computer to calculate the discrete Fourier transform. In particular, the more the number of transformed sampling points, the more significant the FFT algorithm's computational savings will be.
  • fast Fourier transform is performed on the pre-processed voice data to convert the pre-processed voice data from the signal amplitude in the time domain to the signal amplitude (spectrum) in the frequency domain.
  • the formula for calculating the spectrum is s(k) = Σ_{n=1}^{N} s(n)·e^(−2πi·nk/N), 1 ≤ k ≤ N, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is time, and i is the imaginary unit.
  • the power spectrum of the pre-processed voice data can be obtained directly from the frequency spectrum (for example, as the squared magnitude of the spectrum) and is hereinafter referred to as the power spectrum of the speech to be recognized.
  • the pre-processed speech data is thus converted from signal amplitude in the time domain to signal amplitude in the frequency domain, and the power spectrum of the speech to be recognized is then obtained from the signal amplitude in the frequency domain, which provides an important technical basis for extracting the to-be-recognized speech features.
  • a Mel scale filter bank is used to process the power spectrum of the speech to be recognized, and a Mel power spectrum of the speech to be recognized is obtained.
  • processing the power spectrum of the speech to be recognized with the Mel-scale filter bank amounts to performing a Mel frequency analysis of the power spectrum.
  • the Mel frequency analysis is an analysis based on human auditory perception.
  • the human ear acts like a filter bank, focusing only on certain specific frequency components (human hearing is selective to frequency): it lets signals of certain frequencies pass through and directly ignores certain frequency signals it does not wish to perceive.
  • these filters are not uniformly distributed on the frequency axis. There are many filters in the low frequency region, and they are densely distributed. However, in the high frequency region, the number of filters becomes relatively small and the distribution is sparse. Understandably, the resolution of the Mel scale filter bank in the low frequency part is high, which is consistent with the hearing characteristics of the human ear, which is also the physical meaning of the Mel scale.
  • a Mel-scale filter bank is used to process the power spectrum of the speech to be recognized and obtain its Mel power spectrum; the frequency-domain signal is segmented by the Mel-scale filter bank so that each frequency segment corresponds to one value. If the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the speech to be recognized are obtained.
  • the Mel frequency analysis is performed on the power spectrum of the speech to be recognized, so that the Mel power spectrum obtained after the analysis retains a frequency portion closely related to the characteristics of the human ear, and this frequency portion can well reflect the characteristics of the speech to be recognized.
  • S24 Perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the speech to be recognized.
  • cepstrum refers to the inverse Fourier transform of the Fourier-transform spectrum of a signal after a logarithmic operation; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
  • a cepstrum analysis is performed on the Mel power spectrum, and based on the cepstrum result, the Mel frequency cepstrum coefficient of the speech to be recognized is analyzed and obtained.
  • the features contained in the Mel power spectrum of the speech to be recognized that are too high in original feature dimension and difficult to use directly can be converted into easy-to-use features by performing cepstrum analysis on the Mel power spectrum.
  • the Mel frequency cepstrum coefficient can be used as a coefficient for distinguishing different voices from the features of the to-be-recognized voice.
  • the features of the to-be-recognized voice can reflect the difference between the voices and can be used to identify and distinguish the to-be-recognized voices.
  • step S24 cepstrum analysis is performed on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the speech to be recognized, including the following steps:
  • S241: the logarithm of the Mel power spectrum is taken to obtain the Mel power spectrum m to be transformed.
  • S242 Perform discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the speech to be recognized.
  • a discrete cosine transform is performed on the Mel power spectrum m to be transformed to obtain the corresponding Mel frequency cepstrum coefficients of the speech to be recognized.
  • the second to thirteenth coefficients are taken as the to-be-recognized voice features, which can reflect the differences between voice data.
  • the formula for the discrete cosine transform of the Mel power spectrum m to be transformed is C(j) = Σ_{n=1}^{N} m(n)·cos(πj(n − 0.5)/N), where N is the frame length, m is the Mel power spectrum to be transformed, and j is the independent variable of the transform. Because the Mel filters overlap, the energy values obtained with the Mel-scale filters are correlated.
  • the discrete cosine transform performs dimensionality reduction and abstraction on the Mel power spectrum m to be transformed; compared with the Fourier transform, its results have no imaginary part, which is a clear computational advantage.
  • through steps S21-S24, the to-be-recognized features obtained by feature extraction can accurately reflect the characteristics of the speech to be recognized, which benefits voice recognition based on those features; the sketch below continues the earlier pre-processing example through these steps.
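  • Continuing the sketch above, steps S22-S24 can be illustrated as follows (the 22-filter bank matches the example in this application, keeping coefficients 2-13 follows the text, and `preprocess` is the illustrative helper defined earlier):

```python
import numpy as np

def mfcc(frames: np.ndarray, sr: int = 8000, n_filters: int = 22) -> np.ndarray:
    """S22-S24: FFT power spectrum -> Mel filter bank -> log -> DCT."""
    n_fft = frames.shape[1]
    # S22: spectrum via FFT, then power spectrum |s(k)|^2 / N
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # S23: triangular Mel filters, dense at low and sparse at high frequency
    mel_edges = np.linspace(0.0, 2595.0 * np.log10(1.0 + (sr / 2) / 700.0),
                            n_filters + 2)
    hz_edges = 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_edges / sr).astype(int)
    fbank = np.zeros((n_filters, power.shape[1]))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_power = power @ fbank.T                       # Mel power spectrum
    # S24: cepstral analysis: take the log, then the DCT (S241, S242)
    log_mel = np.log(mel_power + 1e-10)
    j = np.arange(n_filters)
    n = np.arange(n_filters) + 0.5
    dct = np.cos(np.pi * np.outer(j, n) / n_filters)  # DCT-II basis
    return (log_mel @ dct.T)[:, 1:13]                 # coefficients 2-13

# usage with the earlier helper: feats = mfcc(preprocess(some_signal))
```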
  • step S30 the truncated voice detection algorithm is used to process the speech to be recognized to obtain the truncated voice segment, which specifically includes the following steps:
  • the speech to be recognized is evenly divided into at least two speech sub-segments according to the time sequence.
  • the to-be-recognized voice is a voice recorded by the voice collection terminal after truncating the original voice of the speaker to be recognized.
  • the speech to be recognized is divided in time order into at least two non-overlapping speech sub-segments; each period can be set to 0.5 s, forming the minimum basic unit for truncated speech detection.
  • the speech to be recognized is divided into at least two speech sub-segments for detection in order to improve the accuracy of detecting truncated speech segments. Understandably, the more speech sub-segments are segmented, the higher the accuracy of detecting truncated speech segments is.
  • the voice sub-segment is evenly divided into at least two volume sub-intervals according to the volume change, and the number of treble sampling points of the volume sub-interval where the highest sound is located is obtained.
  • the number of treble sampling points is the number of voice sampling points obtained in the volume sub-interval where the highest pitch is located.
  • the recognition server first obtains the maximum amplitude (volume) Vm of each voice sub-segment, and divides the interval [0, Vm] into at least two volume sub-sections that do not overlap. Because the truncated voice processing may occur in the volume sub-interval where the highest sound in each voice sub-segment is located, if the truncated voice processing occurs, the volume sub-interval where the highest sound is located is the interval where the amplitude threshold is located.
  • the number of treble sampling points in the volume sub-interval where the highest note is located can be used to determine whether the voice sub-segment is a truncated voice segment as a technical basis.
  • the total number of sampling points is the number of all sampling points that sample the voice volume in each voice sub-segment.
  • the treble sampling percentage is the percentage of the number of treble sampling points to the total number of sampling points.
  • the treble sampling percentage exceeds a preset threshold, the corresponding speech subsegment is a truncated speech segment.
  • the preset threshold is a percentage of the number of treble sampling points relative to the total number of sampling points set according to actual experience. If the treble sampling percentage exceeds a preset threshold, it indicates that the voice sub-segment in which the volume sub-interval is located is a truncated voice segment.
  • if the treble sampling percentage of the volume sub-interval where the highest sound is located does not exceed the preset threshold, the number of voice sampling points of the speech sub-segment to which the volume sub-interval belongs is in the normal range, and that speech sub-segment is a normal speech segment.
  • the gain module of the voice collection terminal adaptively adjusts the amplitude threshold, it is difficult for the recognition server to determine a truncated speech segment by specifying a fixed amplitude threshold.
  • instead, the treble sampling percentage, i.e., the number of treble sampling points relative to the total number of sampling points, can be compared with a preset threshold; judging in this way effectively improves the flexibility of the judgment method and the accuracy of the judgment result. A sketch of this detection logic follows.
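  • A sketch of the detection logic (illustrative; the 0.5 s sub-segment length follows the text, while the 10 volume sub-intervals and the 0.3 preset threshold are assumed values chosen for the example):

```python
import numpy as np

def is_truncated(sub_seg: np.ndarray, n_bins: int = 10,
                 threshold: float = 0.3) -> bool:
    """Treble-sampling-percentage test for one 0.5 s speech sub-segment."""
    amp = np.abs(sub_seg)
    vm = amp.max()                        # highest volume Vm of the sub-segment
    if vm == 0.0:
        return False                      # silent sub-segment: nothing clipped
    # divide [0, Vm] into n_bins equal volume sub-intervals; the top bin
    # is where the amplitude threshold would sit if clipping occurred
    treble = np.count_nonzero(amp >= vm * (n_bins - 1) / n_bins)
    return treble / amp.size > threshold  # percentage vs preset threshold

def detect_truncated_segments(speech: np.ndarray, sr: int = 8000):
    """Split the speech into 0.5 s sub-segments and flag each one."""
    step = sr // 2
    return [is_truncated(speech[i:i + step])
            for i in range(0, len(speech) - step + 1, step)]
```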
  • before step S30, that is, before the step of using the truncated speech repair model to repair the to-be-recognized speech features, the voiceprint recognition method further includes:
  • the original training speech is the original speech from the speaker without truncated speech processing.
  • truncated training speech is the speech sent by the speaker after truncated speech processing.
  • the recognition server performs truncation processing on the original training speech, that is, it retains only the speech signals of the original training speech between the highest and lowest sound thresholds and records the signals outside that range at the amplitude threshold, to obtain the corresponding truncated training speech.
  • the truncated training features corresponding to the truncated training speech are used as the input layer of the DNN model, the original training features corresponding to the original training speech are used as the output layer of the DNN model, and the feature parameters of the DNN model are calibrated to generate a truncated speech repair model based on the DNN model.
  • MFCC features are often used to represent voice features in the field of voiceprint recognition. Because the MFCC feature makes no assumptions about or restrictions on the input speech signal and is derived from an auditory model, it has good robustness, better matches the auditory characteristics of the human ear, and retains good speech recognition performance even when the signal-to-noise ratio drops. Therefore, the truncated training feature is the truncated MFCC feature corresponding to the truncated training speech, and the original training feature is the original MFCC feature corresponding to the original training speech.
  • the process of extracting the truncated MFCC features of the truncated training speech includes: converting the truncated training speech from a time-domain signal to a frequency-domain signal based on a Fourier transform; filtering the frequency-domain signal to obtain a Mel power spectrum; and performing cepstrum analysis on the Mel power spectrum to obtain the Mel frequency cepstrum coefficients of the speech, which are the MFCC features.
  • cepstrum refers to the inverse Fourier transform of the logarithm of a signal's Fourier-transform spectrum; through cepstrum analysis the speech is converted into easy-to-use speech features (the Mel frequency cepstrum coefficient feature vectors used for training or recognition).
  • the above process is also applicable to extract the original MFCC features corresponding to the original training speech.
  • the structure of the DNN model includes an input layer, several intermediate layers, and an output layer.
  • the input layer is responsible for receiving input information from the outside world and passing it to the middle layer;
  • the middle layer is an internal information processing layer that is responsible for information transformation.
  • the middle layer can be designed as a single layer or as a multi-layer structure; after the information transmitted from the last middle layer to the output layer is further processed, one forward propagation of learning is completed, and the output layer outputs the information processing result to the outside world.
  • the recognition server uses the truncated MFCC feature and the original MFCC feature as the input layer and the output layer of the DNN model to obtain the feature parameters of the truncated speech repair model of the DNN model.
  • the output layer of the DNN model includes n neurons, which output the original MFCC feature values corresponding to the truncated MFCC features presented at the input.
  • each layer has multiple neurons, and adjacent layers are fully connected; each layer has its own excitation (activation) function f, the function of the input-output relationship between neurons by which each neuron accepts input values and passes its output to the next layer.
  • if the input is the feature vector v, the transition matrix from the i-th layer to the (i+1)-th layer is w_{i(i+1)}, the bias vector of the (i+1)-th layer is b_{i+1}, the output of the i-th layer is out_i, and the input of the (i+1)-th layer is in_{i+1}, then the calculation process is: in_{i+1} = out_i·w_{i(i+1)} + b_{i+1} and out_{i+1} = f(in_{i+1}), with out_0 = v.
  • the parameters of the DNN model include the transfer matrix w between layers and the offset vector b of each layer.
  • the main task of training the DNN model is to determine the above-mentioned feature parameters, and finally generate a truncated speech repair model based on the DNN model.
  • the recognition server uses the truncated speech repair model generated based on the DNN model to repair truncated speech segments, which can greatly improve the accuracy of speech repair.
  • a DBN (Deep Belief Nets) model or a CDBN (Convolutional Deep Belief Networks) model can also be used as the initial training model; the DBN model's network architecture trains faster than the DNN model's and is more suitable for the training data of a large speech database, and the CDBN model is likewise suitable for the training data of a large speech database. A sketch of a DNN-based repair model follows.
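  • A sketch of such a repair model (a generic regression network mapping truncated MFCC frames to original MFCC frames; the layer sizes, optimizer, and use of PyTorch are assumptions, not details given by the application):

```python
import torch
from torch import nn

# DNN repair model: truncated MFCC features in, original MFCC features out.
repair_model = nn.Sequential(
    nn.Linear(12, 256), nn.ReLU(),   # input layer: truncated MFCC frame
    nn.Linear(256, 256), nn.ReLU(),  # middle layers (fully connected)
    nn.Linear(256, 12),              # output layer: repaired MFCC frame
)

def train(model, truncated_mfcc, original_mfcc, rounds: int = 15):
    """Calibrate the feature parameters (weights w, biases b)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    data = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(truncated_mfcc, original_mfcc),
        batch_size=64, shuffle=True)
    for _ in range(rounds):          # "rounds" over all batches
        for x, y in data:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()          # backward error computation
            opt.step()               # weight update
    return model
```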
  • step S30 the truncated speech repair model is used to repair the truncated speech segment to obtain the repaired speech segment, which specifically includes the following steps:
  • a truncated speech repair model based on the DNN model is used to repair the features of the speech to be identified, and the target speech features of the repaired speech segment are obtained.
  • the recognition server first obtains the to-be-recognized speech features of the truncated speech segment; it then uses those speech features as the input layer of the DNN model, and after repair by the DNN-based truncated speech repair model obtained in step S32, the corresponding target speech features are obtained at the output layer of the DNN model.
  • the recognition server uses the truncated speech repair model based on the DNN model to repair the truncated speech segment, which can effectively improve the accuracy of the target speech feature obtained in the output layer.
  • in step S40, based on the standard voice feature corresponding to the speaker identifier, a preset voiceprint recognition model is used to perform voiceprint recognition on the target voice feature and the standard voice feature, which specifically includes the following steps:
  • a preset voiceprint recognition model is used to process the target speech feature and the standard speech feature, respectively, to obtain the original speech vector and the standard speech vector.
  • the preset voiceprint recognition model is a model for scoring similarity between standard voice features and voice features to be recognized.
  • an i-vector model may be used as a preset voiceprint recognition model to obtain an i-vector vector corresponding to each speaker.
  • a vector is a directional line segment in a multi-dimensional space. If the directions of two vectors are the same, that is, the included angle is close to zero, and the cosine value of the included angle, that is, the cosine distance approaches 1, the two vectors are similar.
  • the original speech vector is the repaired i-vector, and the standard speech vector is the standard i-vector.
  • the recognition server uses the i-vector model to obtain the original speech vector as follows:
  • the recognition server trains a GMM-UBM representing the speech space on the target speech features; it uses the trained UBM to calculate sufficient statistics for the speech features of each frame and maps the sufficient statistics to the total variability space to obtain the original i-vector.
  • the original i-vector may also be processed by channel compensation using LDA (Linear Discriminant Analysis) to minimize the distance between same-speaker samples and maximize the distance between different-speaker samples. The standard speech vector is obtained in the same way.
  • LDA Linear Discriminant Analysis
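  • For illustration of the channel-compensation step, scikit-learn's LDA can serve as a stand-in (a sketch under assumed dimensions and labels, not the application's own implementation):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# ivectors: one i-vector per utterance; speakers: matching speaker labels.
# LDA projects i-vectors so same-speaker samples move closer together
# and different-speaker samples move apart (channel compensation).
rng = np.random.default_rng(0)
ivectors = rng.normal(size=(200, 400))        # assumed 400-dim i-vectors
speakers = rng.integers(0, 20, size=200)      # assumed 20 speakers

lda = LinearDiscriminantAnalysis(n_components=19)  # at most n_classes - 1
compensated = lda.fit_transform(ivectors, speakers)
```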
  • the spatial distance applied to this embodiment may refer to a cosine distance between two vectors.
  • Cosine distance also called cosine similarity, is a measure of the difference between two individuals by using the cosine of the angle between two vectors in a vector space.
  • the distance threshold is a value expressed by a cosine value set according to actual experience.
  • the recognition server compares the spatial distance between the standard i-vector and the original i-vector obtained in step S42, that is, the cosine distance. If the cosine distance between the two is greater than the distance threshold preset according to the actual situation, it can be determined that the target speech feature and the standard speech feature originate from the same speaker.
  • step S42 obtaining the spatial distance between the original speech vector and the standard speech vector, specifically includes the following steps:
  • the spatial distance between the original speech vector and the standard speech vector can be determined by the following cosine similarity formula: cos θ = (Σ_i A_i·B_i) / (√(Σ_i A_i²)·√(Σ_i B_i²)), where A_i and B_i are the components of the original speech vector and the standard speech vector, respectively.
  • the similarity ranges from -1 to 1: -1 indicates that the two vectors point in opposite directions, 1 indicates that they point in the same direction, and 0 indicates that they are independent; values in between represent degrees of similarity or dissimilarity. Understandably, the closer the similarity is to 1, the more similar the two vectors are.
  • the distance threshold of cos ⁇ can be set in advance according to actual experience.
  • the similarity between the original speech vector and the standard speech vector is greater than the distance threshold, it is considered that the original speech vector and the standard speech vector are similar, that is, it can be determined that the target speech feature and the standard speech feature correspond to the voiceprint recognition result of the same speaker.
  • the cosine similarity algorithm can be used to determine the similarity between the original speech vector and the standard speech vector, which is simple and fast, which is helpful for quickly confirming the recognition result.
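  • A sketch of the scoring step (the 0.6 distance threshold is an assumed value, standing in for a threshold "set according to actual experience" as the text puts it):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = sum(A_i * B_i) / (||A|| * ||B||), in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(original_ivec, standard_ivec, threshold: float = 0.6) -> bool:
    """Voiceprint recognition result: True if the repaired (original)
    i-vector and the stored standard i-vector exceed the distance threshold."""
    return cosine_similarity(original_ivec, standard_ivec) > threshold
```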
  • in summary, this embodiment proposes a voiceprint recognition method in which a truncated speech detection algorithm detects whether the speech to be recognized is a truncated speech segment; if so, the truncated speech repair model restores the to-be-recognized speech features of the truncated speech segment into target speech features, and the speaker's true identity is identified by comparing them with the speaker's standard speech features. By repairing the to-be-recognized speech features and obtaining target speech features close to the speaker's original speech, this embodiment can effectively improve the reliability and accuracy of speech recognition.
  • furthermore, the target speech features obtained after feature extraction accurately reflect the characteristics of the speech to be recognized, which benefits recognition; comparing the treble sampling percentage (the number of treble sampling points relative to the total number of sampling points) against a preset threshold effectively improves the flexibility of the judgment method and the accuracy of the judgment result; the DNN-based truncated speech repair model greatly improves the accuracy of speech repair; and the cosine similarity algorithm determines the similarity between the original speech vector and the standard speech vector simply and quickly, helping to confirm the recognition result rapidly.
  • FIG. 7 shows a principle block diagram of a voiceprint recognition device corresponding to the voiceprint recognition method in the above embodiment.
  • the voiceprint recognition device includes a voice module 10 to be recognized, a feature module 20 to be recognized, a target voice feature module 30, and a voiceprint recognition result module 40.
  • the functions of the to-be-recognized voice module 10, the to-be-recognized feature module 20, the target voice feature module 30, and the voiceprint recognition result module 40 correspond one-to-one to the steps of the voiceprint recognition method in the above embodiment; to avoid redundant description, this embodiment does not detail them one by one.
  • the to-be-recognized voice module 10 is configured to obtain the to-be-recognized voice, and the to-be-recognized voice carries a speaker identifier.
  • the to-be-recognized feature module 20 is configured to obtain the corresponding to-be-recognized features based on the to-be-recognized voice.
  • a target voice feature module 30 is used to detect a voice to be recognized using a truncated voice detection algorithm. If the voice to be recognized is a truncated voice segment, a truncated voice repair model is used to repair the feature of the voice to be identified to obtain the target voice feature.
  • the voiceprint recognition result module 40 is configured to perform voiceprint recognition on the target voice feature and the standard voice feature based on the standard voice feature corresponding to the speaker identifier, and to obtain a voiceprint recognition result indicating whether the target voice feature and the standard voice feature correspond to the same speaker.
  • obtaining the feature to be identified module 20 includes obtaining a voice data unit 21, obtaining a power spectrum unit 22, obtaining a Mel power spectrum unit 23, and obtaining a Mel coefficient unit 24.
  • the voice data acquisition unit 21 is configured to preprocess the voice to be recognized and acquire preprocessed voice data.
  • the power spectrum obtaining unit 22 is configured to perform a fast Fourier transform on the pre-processed speech data, obtain a frequency spectrum of the speech to be identified, and obtain a power spectrum of the speech to be identified according to the frequency spectrum.
  • a Mel power spectrum obtaining unit 23 is configured to process a power spectrum of a speech to be recognized by using a Mel scale filter bank, and obtain a Mel power spectrum of the speech to be recognized.
  • a Mel coefficient obtaining unit 24 is configured to perform cepstrum analysis on a Mel power spectrum to obtain a Mel frequency cepstrum coefficient of a speech to be recognized.
  • the target voice feature acquisition module 30 includes a segmented voice sub-segment unit 31, a sampling unit number obtaining unit 32, a sampling percentage obtaining unit 33, and a preset threshold exceeding unit 34.
  • the speech sub-segment unit 31 is configured to divide the speech to be recognized into at least two speech sub-segments in an average time sequence.
  • the number of sampling points obtaining unit 32 is configured to divide the voice sub-segment evenly into at least two volume sub-intervals according to the volume change, and obtain the number of treble sampling points of the volume sub-interval where the highest note is located.
  • An acquisition sampling percentage unit 33 is configured to count the total number of sampling points of all volume sub-intervals, so as to acquire the percentage of treble sampling points relative to the total number of sampling points.
  • the exceeding-threshold unit 34 is configured to determine that the corresponding speech sub-segment is a truncated speech segment if the treble sampling percentage exceeds the preset threshold.
  • the voiceprint recognition device further includes an original speech acquisition unit 35 and a repair model generation unit 36.
  • An original speech acquisition unit 35 is configured to obtain original training features corresponding to the original training speech, perform truncated speech processing on the original training speech to obtain corresponding truncated training speech, and then extract truncated training features of the truncated training speech.
  • Generate a repair model unit 36 which is used to take the truncated training features corresponding to the truncated training speech as the input layer of the DNN model, use the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrate the feature parameters of the DNN model to Generate truncated speech repair model based on DNN model.
  • the voiceprint recognition device further includes an original feature obtaining unit 37.
  • the original feature unit 37 is used for repairing the speech features to be identified using a truncated speech repair model based on the DNN model, and acquiring the target speech features of the repaired speech segment.
  • the voiceprint recognition result acquisition module 40 includes an adoption recognition model unit 41, an acquisition space distance unit 42, and an acquisition recognition result unit 43.
  • the recognition model unit 41 is used to process the target voice feature and the standard voice feature by using a preset voiceprint recognition model to obtain the original voice vector and the standard voice vector, respectively.
  • An obtaining space distance unit 42 is configured to obtain a space distance between an original speech vector and a standard speech vector.
  • the recognition result obtaining unit 43 is configured to obtain whether the target voice feature and the standard voice feature correspond to the voiceprint recognition result of the same speaker according to the spatial distance and a preset distance threshold.
  • Each module in the voiceprint recognition device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in hardware within, or independent of, the processor of the computer device, or may be stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
  • A computer device is provided.
  • The computer device may be a server, and its internal structure may be as shown in FIG. 8.
  • The computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • The processor of the computer device is configured to provide computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium.
  • The database of the computer device is used to store data related to the voiceprint recognition method.
  • The network interface of the computer device is used to communicate with an external terminal through a network connection.
  • The computer-readable instructions, when executed by the processor, implement a voiceprint recognition method.
  • A computer device is provided, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
  • When the processor executes the computer-readable instructions, the following steps are implemented: obtaining to-be-recognized speech, where the to-be-recognized speech carries a speaker identifier; obtaining the corresponding to-be-recognized speech features based on the to-be-recognized speech; detecting the to-be-recognized speech with a truncated speech detection algorithm, and, if the to-be-recognized speech is a truncated speech segment, repairing the to-be-recognized speech features with a truncated speech repair model to obtain target speech features; and, based on the standard speech features corresponding to the speaker identifier, performing voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
  • To extract the training speech features corresponding to the training speech data, the processor, when executing the computer-readable instructions, implements the following steps: pre-processing the training speech data to obtain pre-processed speech data; performing a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the training speech data, and obtaining the power spectrum of the training speech data from the frequency spectrum; processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data; and performing cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
  • When the truncated speech detection algorithm is used to detect whether the to-be-recognized speech is a truncated speech segment, the processor, when executing the computer-readable instructions, implements the following steps: dividing the to-be-recognized speech evenly into at least two speech sub-segments in time order; dividing each speech sub-segment into at least two volume sub-intervals according to volume variation, and obtaining the number of treble sampling points in the volume sub-interval containing the highest notes; counting the total number of sampling points across all volume sub-intervals to obtain the treble sampling percentage of treble sampling points relative to the total; and, if the treble sampling percentage exceeds a preset threshold, determining that the corresponding speech sub-segment is a truncated speech segment.
  • Before the step of repairing the to-be-recognized speech features with the truncated speech repair model, the processor, when executing the computer-readable instructions, also implements the following steps: obtaining original training speech and truncating it to obtain corresponding truncated training speech; and using the truncated training features corresponding to the truncated training speech as the input layer of a DNN model, using the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrating the feature parameters of the DNN model to generate a truncated speech repair model based on the DNN model.
  • When the truncated speech repair model is used to repair the to-be-recognized speech features and obtain the target speech features, the processor, when executing the computer-readable instructions, implements the following step: repairing the to-be-recognized speech features with the truncated speech repair model based on the DNN model to obtain the target speech features of the repaired speech segment.
  • When the preset voiceprint recognition model is used to perform voiceprint recognition on the target speech features and the standard speech features, the processor, when executing the computer-readable instructions, implements the following steps: processing the target speech features and the standard speech features with the preset voiceprint recognition model to obtain an original speech vector and a standard speech vector, respectively; obtaining the spatial distance between the original speech vector and the standard speech vector; and obtaining, based on the spatial distance and a preset distance threshold, the voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
  • When the spatial distance between the original speech vector and the standard speech vector is obtained, the processor, when executing the computer-readable instructions, implements the following step: obtaining the spatial distance between the original speech vector and the standard speech vector with a cosine similarity algorithm.
  • A computer-readable storage medium is provided, on which computer-readable instructions are stored.
  • When the computer-readable instructions are executed by a processor, the following steps are performed: obtaining to-be-recognized speech, where the to-be-recognized speech carries a speaker identifier; obtaining the corresponding to-be-recognized speech features based on the to-be-recognized speech; detecting the to-be-recognized speech with a truncated speech detection algorithm, and, if the to-be-recognized speech is a truncated speech segment, repairing the to-be-recognized speech features with a truncated speech repair model to obtain target speech features; and, based on the standard speech features corresponding to the speaker identifier, performing voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
  • To extract the training speech features corresponding to the training speech data, the computer-readable instructions, when executed by the processor, implement the following steps: pre-processing the training speech data to obtain pre-processed speech data; performing a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the training speech data, and obtaining the power spectrum of the training speech data from the frequency spectrum; processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data; and performing cepstrum analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
  • When the truncated speech detection algorithm is used to detect whether the to-be-recognized speech is a truncated speech segment, the computer-readable instructions, when executed by the processor, implement the following steps: dividing the to-be-recognized speech evenly into at least two speech sub-segments in time order; dividing each speech sub-segment into at least two volume sub-intervals according to volume variation, and obtaining the number of treble sampling points in the volume sub-interval containing the highest notes; counting the total number of sampling points across all volume sub-intervals to obtain the treble sampling percentage of treble sampling points relative to the total; and, if the treble sampling percentage exceeds a preset threshold, determining that the corresponding speech sub-segment is a truncated speech segment.
  • Before the step of repairing the to-be-recognized speech features with the truncated speech repair model, the computer-readable instructions, when executed by the processor, also implement the following steps: obtaining original training speech and truncating it to obtain corresponding truncated training speech; and using the truncated training features corresponding to the truncated training speech as the input layer of a DNN model, using the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrating the feature parameters of the DNN model to generate a truncated speech repair model based on the DNN model.
  • When the truncated speech repair model is used to repair the to-be-recognized speech features and obtain the target speech features, the computer-readable instructions, when executed by the processor, implement the following step: repairing the to-be-recognized speech features with the truncated speech repair model based on the DNN model to obtain the target speech features of the repaired speech segment.
  • When the preset voiceprint recognition model is used to perform voiceprint recognition on the target speech features and the standard speech features, the computer-readable instructions, when executed by the processor, implement the following steps: processing the target speech features and the standard speech features with the preset voiceprint recognition model to obtain an original speech vector and a standard speech vector, respectively; obtaining the spatial distance between the original speech vector and the standard speech vector; and obtaining, based on the spatial distance and a preset distance threshold, the voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
  • When the spatial distance between the original speech vector and the standard speech vector is obtained, the computer-readable instructions, when executed by the processor, implement the following step: obtaining the spatial distance between the original speech vector and the standard speech vector with a cosine similarity algorithm.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A voiceprint recognition method and apparatus, a computer device and a storage medium, the method comprising: obtaining a voice to be recognized, the voice to be recognized carrying an identifier of a speaker (S10); on the basis of the voice to be recognized, obtaining a corresponding voice feature to be recognized (S20); using a truncated voice detection algorithm to detect the voice to be recognized, and if the voice to be recognized is a truncated voice segment, using a truncated voice repair model to repair the voice feature to be recognized, and obtaining a target voice feature (S30); on the basis of a standard voice feature corresponding to the identifier of the speaker, using a preset voiceprint recognition model to perform voiceprint recognition on the target voice feature and the standard voice feature, and obtaining a voiceprint recognition result (S40). By means of repairing a voice feature to be recognized in a voice to be recognized and then obtaining a target voice feature close to an original voice of a speaker, the described method effectively improves the reliability and accuracy of recognizing the speaker.

Description

Voiceprint recognition method, device, computer equipment and storage medium
This application is based on, and claims priority to, Chinese invention application No. 201810573715.4, filed on June 6, 2018 and entitled "Voiceprint Recognition Method, Device, Computer Equipment and Storage Medium".
Technical Field
The present application relates to the technical field of biometrics, and in particular to a voiceprint recognition method, device, computer equipment, and storage medium.
Background
In order to keep the call volume within a proper range, communication equipment manufacturers configure communication devices with a voice gain control module to make voice calls more friendly. The automatic gain control module works by adjusting the saturation value of the voice volume, that is, by truncating the voice: a larger gain is added to low-volume voice and a smaller gain is assigned to high-volume voice. However, this setting also brings a problem: truncation of the voice in the communication device occurs frequently, so that voiceprint recognition based on voice collected by the communication device is less accurate.
Summary
Based on this, it is necessary to provide, in response to the above technical problems, a voiceprint recognition method, device, computer equipment, and storage medium that can improve the accuracy of voiceprint recognition.
A voiceprint recognition method includes:
obtaining to-be-recognized speech, where the to-be-recognized speech carries a speaker identifier;
obtaining corresponding to-be-recognized speech features based on the to-be-recognized speech;
detecting the to-be-recognized speech with a truncated speech detection algorithm, and, if the to-be-recognized speech is a truncated speech segment, repairing the to-be-recognized speech features with a truncated speech repair model to obtain target speech features; and
based on the standard speech features corresponding to the speaker identifier, performing voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
A voiceprint recognition device includes:
a to-be-recognized speech acquisition module, configured to obtain to-be-recognized speech, where the to-be-recognized speech carries a speaker identifier;
a to-be-recognized feature acquisition module, configured to obtain corresponding to-be-recognized speech features based on the to-be-recognized speech;
a target speech feature acquisition module, configured to detect the to-be-recognized speech with a truncated speech detection algorithm and, if the to-be-recognized speech is a truncated speech segment, repair the to-be-recognized speech features with a truncated speech repair model to obtain target speech features; and
a voiceprint recognition result acquisition module, configured to perform, based on the standard speech features corresponding to the speaker identifier, voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model, and obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:
obtaining to-be-recognized speech, where the to-be-recognized speech carries a speaker identifier;
obtaining corresponding to-be-recognized speech features based on the to-be-recognized speech;
detecting the to-be-recognized speech with a truncated speech detection algorithm, and, if the to-be-recognized speech is a truncated speech segment, repairing the to-be-recognized speech features with a truncated speech repair model to obtain target speech features; and
based on the standard speech features corresponding to the speaker identifier, performing voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
One or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:
obtaining to-be-recognized speech, where the to-be-recognized speech carries a speaker identifier;
obtaining corresponding to-be-recognized speech features based on the to-be-recognized speech;
detecting the to-be-recognized speech with a truncated speech detection algorithm, and, if the to-be-recognized speech is a truncated speech segment, repairing the to-be-recognized speech features with a truncated speech repair model to obtain target speech features; and
based on the standard speech features corresponding to the speaker identifier, performing voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of a voiceprint recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of a voiceprint recognition method according to an embodiment of the present application;
FIG. 3 is another flowchart of a voiceprint recognition method according to an embodiment of the present application;
FIG. 4 is another flowchart of a voiceprint recognition method according to an embodiment of the present application;
FIG. 5 is another flowchart of a voiceprint recognition method according to an embodiment of the present application;
FIG. 6 is another flowchart of a voiceprint recognition method according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of a voiceprint recognition device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
The voiceprint recognition method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, where a voice collection terminal communicates with a recognition server through a network. The voice collection terminal includes, but is not limited to, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The recognition server can be implemented as an independent server or as a server cluster composed of multiple servers.
Voiceprint information is the spectrum of sound waves carrying speech information displayed by electroacoustic instruments. The human vocal control organs include the vocal cords, soft palate, tongue, teeth, and lips; the human vocal resonators include the pharyngeal cavity, oral cavity, and nasal cavity. These organs differ in size, shape, and function, and these differences change the vocal airflow, resulting in differences in sound quality and timbre. In addition, people's vocal habits vary from fast to slow and from forceful to gentle, which also produces differences in sound intensity and duration. Pitch, intensity, duration, and timbre are called the "four elements" of speech in linguistics, and these factors can be decomposed into more than ninety features. These features are expressed as the wavelength, frequency, intensity, and rhythm of different sounds, and can be drawn into a time-domain power spectrum with acoustic tools, forming the speaker's voiceprint information.
Voiceprint recognition, a type of biometric technology also known as speaker recognition, has two forms: speaker identification and speaker verification. Different tasks and applications use different voiceprint recognition technologies; for example, identification may be needed to narrow the scope of a criminal investigation, while verification is needed for bank transactions. This embodiment is described based on speaker verification technology.
In an embodiment, as shown in FIG. 2, a voiceprint recognition method is provided. The method is described using its application to the recognition server in FIG. 1 as an example, and includes the following steps:
S10. Obtain to-be-recognized speech, where the to-be-recognized speech carries a speaker identifier.
The to-be-recognized speech is speech directly collected by the voice collection terminal that needs to be recognized; it carries a speaker identifier used to identify the speaker corresponding to the speech. The speaker identifier is provided by the speaker to be identified for identity verification and includes, but is not limited to, a name, a registered name, or an identity card number that can uniquely represent the speaker's identity.
Because the communication equipment manufacturer configures a voice gain control module in the voice collection terminal during manufacture, so that the collected speaker's voice is kept within a proper volume range, the to-be-recognized speech directly collected by the voice collection terminal includes truncated speech segments and normal speech segments. Specifically, when the voice collection terminal records the speech to be recognized, if the speaker's volume is too high or too low, the voice gain control module adaptively adjusts the amplitude threshold corresponding to the highest-volume threshold or the lowest-volume threshold, then cuts off the portion of the speech whose amplitude is above the highest threshold or below the lowest threshold and records it at the amplitude threshold, thereby forming a truncated speech segment. Correspondingly, the portion of the recorded speech whose volume lies between the lowest and highest thresholds needs no gain processing and is therefore a normal speech segment.
Taking a typical sinusoidal waveform of the to-be-recognized speech as an example, if the maximum amplitude of the speech is Em and the amplitude threshold of the voice collection terminal is Eq, then when clipping occurs, the maximum amplitude Em exceeds the amplitude threshold Eq, the affected sampling points are recorded at the amplitude threshold Eq, and the portion of the waveform above Eq is cut off, forming the truncated speech segment described in this embodiment. In practice, when the voice collection terminal collects a large amount of speech, it may automatically adjust the gain, so that received samples may be recorded at a value Ec below the amplitude threshold Eq; in this case Ec is adaptively taken as the amplitude threshold.
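To make this clipping behaviour concrete, the following is a minimal Python sketch (not part of the original application); the sampling rate, tone frequency, and the values of Em and Eq are illustrative assumptions:

```python
import numpy as np

# Simulate how an automatic gain stage truncates a sine wave whose
# peak amplitude Em exceeds the adaptive amplitude threshold Eq.
fs = 8000                       # assumed sampling rate in Hz
t = np.arange(0, 0.02, 1 / fs)  # 20 ms of signal
Em, Eq = 1.0, 0.6               # peak amplitude and amplitude threshold
clean = Em * np.sin(2 * np.pi * 440 * t)

# Samples above +Eq or below -Eq are recorded at the threshold value,
# producing the flat-topped ("truncated") waveform described above.
clipped = np.clip(clean, -Eq, Eq)

ratio = np.mean(np.abs(clipped) >= Eq)
print(f"fraction of samples stuck at the threshold: {ratio:.2%}")
```

The fraction printed at the end foreshadows the detection idea used later: clipping concentrates many samples at one amplitude value.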
S20. Obtain corresponding to-be-recognized speech features based on the to-be-recognized speech.
The to-be-recognized speech features are features used to distinguish the to-be-recognized speech from other people's speech. In this embodiment, Mel-frequency cepstral coefficients (hereinafter, MFCC features) are used as the to-be-recognized speech features.
In the field of voiceprint recognition, speech features are commonly represented by MFCC (Mel-scale frequency cepstral coefficient) features. Studies of the human auditory mechanism show that the human ear has different sensitivities to sound waves of different frequencies; speech signals from 200 Hz to 5000 Hz have the greatest impact on intelligibility. When two sounds of unequal loudness act on the human ear, the louder frequency components affect the perception of the quieter ones and make them hard to perceive, a phenomenon called the masking effect. Because a lower-frequency sound travels farther along the basilar membrane of the cochlea than a higher-frequency sound, bass tends to mask treble easily, while treble masks bass with difficulty, and the critical bandwidth for masking is smaller at low frequencies than at high frequencies. Therefore, over the band from low to high frequency, a set of Mel-scale band-pass filters can be arranged from dense to sparse according to the critical bandwidth, and the input signal is filtered so that each frequency band corresponds to one value. The Mel-scale filter bank has high resolution in the low-frequency part, which matches the hearing characteristics of the human ear; this is the physical meaning of the Mel scale.
S30. Detect the to-be-recognized speech with a truncated speech detection algorithm; if the to-be-recognized speech is a truncated speech segment, repair the to-be-recognized speech features with a truncated speech repair model to obtain target speech features.
The target speech features include the speech features corresponding to normal speech segments as well as the speech features formed by repairing, with the truncated speech repair model, the speech features corresponding to truncated speech segments. That is, the target speech features are the speech features formed after speech repair of the to-be-recognized speech features.
The truncated speech repair model is a model that can restore the input to-be-recognized speech features and output them as target speech features. In this embodiment, the truncated speech repair model has been trained in advance and stored on the recognition server, so that the server can call it in real time to repair truncated speech segments.
In this embodiment, the truncated speech detection algorithm detects the kind of to-be-recognized speech collected by the voice collection terminal. There are two kinds: truncated speech segments that have undergone truncation, and normal speech segments that have not. Understandably, a normal speech segment is unprocessed and retains the speaker's voiceprint characteristics, while a truncated speech segment is obtained after the portions below the lowest volume threshold or above the highest volume threshold have been cut off, so its signal is distorted. If speech recognition were performed directly on speech that includes truncated segments, recognition might be inaccurate. It is therefore necessary to use the truncated speech detection algorithm first to determine the kind of speech, providing a technical basis for subsequent recognition.
In an embodiment, the truncated speech repair model is used to repair the to-be-recognized speech features to obtain the target speech features.
The truncated speech repair model is formed by training an initial training model and is used to repair the speech features corresponding to truncated speech segments. In this embodiment, a DNN (deep neural network), DBN (deep belief network), or CDBN (convolutional deep belief network) model may be used as the initial training model. The following takes the DNN model as an example to explain the process of repairing truncated speech segments.
DNN models are widely used in many important Internet applications, such as speech recognition, image recognition, and natural language processing. Despite their high computational complexity, DNN models can greatly improve the accuracy of speech recognition and are therefore widely used in many companies' speech recognition products.
The structure of a DNN model includes an input layer, several intermediate layers, and an output layer. The input layer receives input information from the outside and passes it to the intermediate layers; the intermediate layers are internal information-processing layers responsible for information transformation and, depending on the required capacity, can be designed with a single intermediate layer or multiple intermediate layers; after the information passed from the intermediate layers to the output layer is further processed, one forward-propagation pass of learning is completed, and the output layer outputs the processing result.
The number of neurons in each layer generally ranges from several hundred to tens of thousands, and adjacent layers are fully connected. In the training computation of a DNN model, one layer is computed before the next; layers cannot be computed in parallel. A DNN training step generally comprises the following stages: forward computation, backward error computation, and finally updating the weights of each layer according to the results of the forward and backward computations. The forward computation proceeds serially from the input layer to the output layer, and the backward computation proceeds serially from the output layer back to the first layer.
When the actual output does not match the expected output, the error back-propagation stage begins. The error passes back from the output layer, layer by layer, to the intermediate layers and the input layer, and the weights of each layer are corrected by gradient descent on the error. The DNN training process is a repeated cycle of forward information propagation and backward error propagation in which the layer weights are continuously adjusted; this is the learning process of the neural network, and it continues until the output error is reduced to an acceptable level or a preset number of learning iterations is reached.
Each small piece of input training data is called a batch, and one batch completes one training step; that is, after a new set of weights is obtained, it is used together with the next input batch for training to obtain updated weights. One pass over all the inputs is called a round; a complete training generally requires 10 to 20 rounds.
In an embodiment, in step S30, repairing the to-be-recognized speech features with the truncated speech repair model to obtain the target speech features specifically includes the following step:
S31. Repair the to-be-recognized speech features with the truncated speech repair model based on the DNN model to obtain the target speech features.
The truncated speech repair model is formed by training the DNN model and is used to repair the speech features corresponding to truncated speech segments and output the target speech features.
Specifically, this embodiment may use the truncated speech repair model generated in step S30 to repair the to-be-recognized speech features. The to-be-recognized speech features, i.e. the to-be-recognized MFCC features, are used as the input of the DNN model; after processing by the trained DNN model, the target speech features output by the model, i.e. the original MFCC features, are obtained.
In step S30, the recognition server repairs the to-be-recognized speech features with the truncated speech repair model based on the DNN model to obtain the target speech features: the to-be-recognized speech features of the truncated speech segment (truncated MFCC features) are input into the truncated speech repair model, which outputs the target speech features (MFCC features) of the repaired speech segment as the technical basis for speech recognition. Because MFCC features make no assumptions about and impose no restrictions on the input speech signal and are derived from an auditory model, they are robust, better match the hearing characteristics of the human ear, and retain good recognition performance even at reduced signal-to-noise ratios. Using the truncated speech repair model generated from the DNN model to repair truncated speech segments in this step can greatly improve the accuracy of speech repair.
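Once such a network is trained, applying it at recognition time is a single forward pass. A hypothetical inference helper, reusing weights shaped like those in the training sketch above:

```python
import numpy as np

def repair_features(mfcc, W1, b1, W2, b2):
    """Hypothetical inference pass: map truncated MFCC frames to
    repaired (target) MFCC frames using trained weights, e.g. the
    W1/b1/W2/b2 arrays produced by the training sketch above."""
    h = np.tanh(mfcc @ W1 + b1)
    return h @ W2 + b2

# Usage: target = repair_features(truncated_mfcc, W1, b1, W2, b2)
```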
Further, because the gain module of the voice collection terminal adaptively adjusts the amplitude threshold, it is difficult for the recognition server to identify truncated speech segments by specifying a fixed amplitude threshold. Instead, the percentage of sampling points falling in the treble (loudest) volume sub-interval can be used for the determination, i.e. the truncated speech detection algorithm proposed in this step, which effectively improves the accuracy of the determination result.
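A minimal sketch of this detection idea follows; the number of sub-segments, the number of volume sub-intervals, and the percentage threshold are illustrative values only, since the application leaves them as presets:

```python
import numpy as np

def is_truncated(segment, n_bins=10, threshold=0.2):
    """Bucket the segment's samples into volume sub-intervals and flag
    the segment when the loudest interval holds an unusually large
    share of the samples (clipping piles samples at one amplitude)."""
    mags = np.abs(segment)
    counts, _ = np.histogram(mags, bins=n_bins)
    treble_ratio = counts[-1] / counts.sum()   # share in the loudest bin
    return treble_ratio > threshold

def detect(speech, n_subsegments=4):
    # Divide the speech evenly in time and test each sub-segment.
    return [is_truncated(s) for s in np.array_split(speech, n_subsegments)]
```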
With the voiceprint recognition method proposed in this embodiment, the truncated speech detection algorithm can detect whether the to-be-recognized speech is a truncated speech segment; if so, the truncated speech repair model can repair the to-be-recognized speech features of the truncated segment into target speech features, which are compared against the speaker's standard speech features to identify the speaker's true identity. By repairing the to-be-recognized speech features and thereby obtaining target speech features close to the speaker's original speech, this embodiment can effectively improve the reliability and accuracy of speaker recognition.
S40. Based on the standard speech features corresponding to the speaker identifier, perform voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model, and obtain the voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
The standard speech is speech of the speaker, corresponding to the speaker identifier, that is stored in a pre-stored standard speech library and has not undergone truncation. Likewise, the standard speech features are the MFCC features corresponding to the standard speech.
The preset voiceprint recognition model is a model for scoring the similarity between the standard speech features and the target speech features. There are many mature voiceprint recognition models, such as the GMM-UBM (Gaussian mixture model-universal background model) model or the i-vector (identity vector) model. In this embodiment, the i-vector model is used as the preset voiceprint recognition model.
Preferably, in order to speed up recognition, the recognition server may store in its database the standard i-vector corresponding to the standard MFCC features, so that this standard i-vector can be called directly as the comparison standard when performing recognition based on i-vectors.
Using the preset voiceprint recognition model to perform voiceprint recognition on the target speech features and the standard speech features is achieved by comparing the cosine distance between them. The cosine distance, also called cosine similarity, uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. A vector is a directed line segment in a multi-dimensional space; if two vectors have the same direction, i.e. the angle between them is close to zero and its cosine (the cosine distance) approaches 1, the two vectors are similar. In this embodiment, a distance threshold can be set according to the actual situation; when the cosine distance between the standard i-vector and the original i-vector is greater than the distance threshold, it can be determined that the target speech features and the standard speech features correspond to the same speaker.
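The scoring step can be summarized in a few lines; the threshold value below is an illustrative assumption, since the application only states that it is preset according to the actual situation:

```python
import numpy as np

def cosine_score(vec_a, vec_b):
    """Cosine similarity between two i-vectors; values near 1 mean the
    vectors point the same way in the identity space."""
    return float(vec_a @ vec_b /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

THRESHOLD = 0.7  # illustrative preset distance threshold

def same_speaker(target_ivec, standard_ivec):
    # Accept the identity claim when the score exceeds the threshold.
    return cosine_score(target_ivec, standard_ivec) > THRESHOLD
```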
In an embodiment, as shown in FIG. 3, step S20 of obtaining the corresponding to-be-recognized speech features based on the to-be-recognized speech specifically includes the following steps:
S21. Pre-process the to-be-recognized speech to obtain pre-processed speech data.
In this embodiment, the to-be-recognized speech is pre-processed and the corresponding pre-processed speech data is obtained. Pre-processing makes it possible to better extract the to-be-recognized speech features, so that the extracted features are more representative of the speech and can be used to distinguish it from other speech.
In an embodiment, step S21 of pre-processing the to-be-recognized speech to obtain the pre-processed speech data includes the following steps:
S211. Pre-emphasize the to-be-recognized speech. The pre-emphasis is computed as s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the previous moment, s'_n is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient with 0.9 < a < 1.0.
Pre-emphasis is a signal-processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is heavily degraded during transmission; to obtain a good waveform at the receiving end, the degraded signal must be compensated. The idea of pre-emphasis is to strengthen the high-frequency components of the signal at the transmitting end of the line to compensate for their excessive attenuation in transit, so that the receiving end obtains a better waveform. Pre-emphasis has no effect on noise and therefore effectively improves the output signal-to-noise ratio.
In this embodiment, the to-be-recognized speech is pre-emphasized with the formula s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain, i.e. the magnitude (amplitude) of the speech expressed in the time domain, s_{n−1} is the signal amplitude at the previous moment, s'_n is the pre-emphasized signal amplitude, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; a value of 0.97 gives good pre-emphasis results. This pre-emphasis removes interference caused by the vocal cords and lips during utterance, effectively compensates the suppressed high-frequency part of the speech, highlights the high-frequency formants, and strengthens the signal amplitude, which helps in extracting the to-be-recognized speech features.
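A direct NumPy rendering of this formula, with a = 0.97 as suggested above:

```python
import numpy as np

def pre_emphasize(signal, a=0.97):
    """Apply s'_n = s_n - a * s_{n-1}; the first sample is unchanged."""
    emphasized = np.copy(signal).astype(float)
    emphasized[1:] -= a * signal[:-1]
    return emphasized
```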
S212. Divide the pre-emphasized to-be-recognized speech into frames.
In this embodiment, after pre-emphasis, the speech is divided into frames. Framing is a speech-processing technique that cuts the whole speech signal into segments; each frame is 10-30 ms long, with a frame shift of about half the frame length. The frame shift is the overlap between adjacent frames, which avoids excessive variation between adjacent frames. Framing divides the to-be-recognized speech into several segments of speech data, subdividing the speech and facilitating the extraction of the to-be-recognized speech features.
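A minimal framing sketch under the guidance above; the exact frame length and shift chosen here (25 ms frames, half-frame shift) are illustrative:

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=25, shift_ms=12.5):
    """Cut the signal into overlapping frames: frame length within the
    10-30 ms range, frame shift about half the frame length."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])
```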
S213. Apply a window to the framed to-be-recognized speech to obtain the pre-processed speech data. The windowing is computed as s'_n = s_n × [0.54 − 0.46·cos(2πn/(N−1))], where N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the windowed signal amplitude in the time domain.
In this embodiment, after the to-be-recognized speech is divided into frames, discontinuities appear at the beginning and end of each frame, so the more frames there are, the greater the error relative to the original speech. Windowing solves this problem: it makes the framed speech continuous and lets each frame exhibit the characteristics of a periodic function. Windowing means processing the speech with a window function; a Hamming window may be chosen, in which case the windowing formula is s'_n = s_n × [0.54 − 0.46·cos(2πn/(N−1))], 0 ≤ n ≤ N−1, where N is the Hamming window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the windowed signal amplitude in the time domain. Windowing the to-be-recognized speech to obtain the pre-processed speech data makes the framed speech signal continuous in the time domain, which helps in extracting the to-be-recognized speech features.
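Applied to the frame matrix produced in the framing sketch above, the Hamming window can be rendered as:

```python
import numpy as np

def window_frames(frames):
    """Multiply each frame by a Hamming window,
    w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), as in the formula above."""
    N = frames.shape[1]
    n = np.arange(N)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
    return frames * hamming
```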
The pre-processing operations of steps S211-S213 on the to-be-recognized speech provide the basis for extracting the to-be-recognized speech features, making the extracted features more representative of the speech so that they can be used to distinguish it.
S22. Perform a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the to-be-recognized speech, and obtain the power spectrum of the to-be-recognized speech from the frequency spectrum.
The fast Fourier transform (FFT) is the collective name for efficient, fast methods of computing the discrete Fourier transform with a computer. This algorithm greatly reduces the number of multiplications a computer needs to compute the discrete Fourier transform; the more sampling points are transformed, the more significant the computational savings of the FFT algorithm.
In this embodiment, a fast Fourier transform is performed on the pre-processed speech data to convert it from signal amplitudes in the time domain to signal amplitudes in the frequency domain (the spectrum). The formula for computing the spectrum is

s(k) = Σ_{n=0}^{N−1} s(n)·e^(−2πikn/N), 1 ≤ k ≤ N,

where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is time, and i is the imaginary unit. After the spectrum of the pre-processed speech data is obtained, its power spectrum can be derived directly from the spectrum; below, the power spectrum of the pre-processed speech data is referred to as the power spectrum of the target speech data to be distinguished. The formula for computing this power spectrum is

P(k) = |s(k)|² / N,

where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the pre-processed speech data from time-domain amplitudes to frequency-domain amplitudes, and then obtaining the power spectrum of the to-be-recognized speech from those frequency-domain amplitudes, provides an important technical basis for extracting the to-be-recognized speech features from the power spectrum.
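A sketch of these two formulas with NumPy's FFT; the FFT size is an assumed parameter:

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """|FFT|^2 / N per frame, matching the spectrum and power-spectrum
    formulas above (n_fft = 512 is an assumed FFT size)."""
    spectrum = np.fft.rfft(frames, n=n_fft)   # frequency-domain amplitudes
    return (np.abs(spectrum) ** 2) / n_fft
```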
S23. Process the power spectrum of the to-be-recognized speech with a Mel-scale filter bank to obtain the Mel power spectrum of the to-be-recognized speech.
Processing the power spectrum with a Mel-scale filter bank is a Mel-frequency analysis of the power spectrum, and Mel-frequency analysis is based on human auditory perception. Observation shows that the human ear acts like a filter bank and attends only to certain frequency components (human hearing is selective with respect to frequency): it lets signals of certain frequencies through and simply ignores certain frequencies it does not wish to perceive. These filters, however, are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. Understandably, the Mel-scale filter bank has high resolution in the low-frequency part, matching the hearing characteristics of the human ear, which is the physical meaning of the Mel scale.
In this embodiment, the power spectrum of the to-be-recognized speech is processed with a Mel-scale filter bank to obtain the Mel power spectrum; the frequency-domain signal is segmented by the filter bank so that each frequency band corresponds to one value. If the number of filters is 22, then 22 energy values corresponding to the Mel power spectrum of the speech are obtained. Through this Mel-frequency analysis, the resulting Mel power spectrum retains the frequency portion closely related to the characteristics of the human ear, and this portion reflects the characteristics of the to-be-recognized speech well.
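For illustration, here is a common triangular Mel filter bank construction (the widely used HTK-style recipe, which the application does not spell out; the sampling rate and FFT size are assumptions):

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=512, fs=8000):
    """Triangular filters spaced evenly on the Mel scale; 22 filters as
    in the example above, fs and n_fft assumed."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

# mel_energies = power_spec @ mel_filterbank().T  -> 22 values per frame
```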
S24. Perform cepstrum analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the to-be-recognized speech.
The cepstrum is the inverse Fourier transform of the logarithm of a signal's Fourier transform spectrum; since the Fourier spectrum is in general complex-valued, the cepstrum is also called the complex cepstrum.
In this embodiment, cepstrum analysis is performed on the Mel power spectrum, and the Mel-frequency cepstral coefficients of the to-be-recognized speech are obtained from the result of the analysis. Through this cepstrum analysis, the features contained in the Mel power spectrum, whose original dimensionality is too high to use directly, are converted into easy-to-use features (the Mel-frequency cepstral coefficient feature vectors used for training or recognition). The Mel-frequency cepstral coefficients can serve as the to-be-recognized speech features for distinguishing different speech: they reflect the differences between speech samples and can be used to identify and distinguish the to-be-recognized speech.
在一实施例中,步骤S24中,在梅尔功率谱上进行倒谱分析,获取待识别语音的梅尔频率倒谱系数,包括如下步骤:In an embodiment, in step S24, cepstrum analysis is performed on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the speech to be recognized, including the following steps:
S241:取梅尔功率谱的对数值,获取待变换梅尔功率谱。S241: Take the log value of the Mel power spectrum, and obtain the Mel power spectrum to be transformed.
本实施例中,根据倒谱的定义,对梅尔功率谱取对数值log,获取待变换梅尔功率谱m。In this embodiment, according to the definition of the cepstrum, a log value log of the Mel power spectrum is taken to obtain the Mel power spectrum m to be transformed.
S242:对待变换梅尔功率谱作离散余弦变换,获取待识别语音的梅尔频率倒谱系数。S242: Perform discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the speech to be recognized.
In this embodiment, a discrete cosine transform (DCT) is applied to the Mel power spectrum m to be transformed to obtain the corresponding Mel-frequency cepstral coefficients of the speech to be recognized; generally the 2nd through 13th coefficients are taken as the to-be-recognized speech features, which reflect the differences between voice data. The discrete cosine transform of the Mel power spectrum m to be transformed is

C(n) = Σ_{j=1..N} m(j) · cos(πn(j - 0.5) / N)

where N is the frame length, m is the Mel power spectrum to be transformed, and j is the independent variable of the Mel power spectrum to be transformed. Because the Mel filters overlap, the energy values obtained with the Mel-scale filters are correlated with one another; the discrete cosine transform can compress and abstract the Mel power spectrum m to be transformed into a lower dimension and yield the indirect to-be-recognized speech features. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear computational advantage.
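Continuing the running sketch, steps S241 and S242 reduce to a logarithm followed by a discrete cosine transform; the small epsilon and the orthonormal DCT variant are assumptions made for numerical convenience:

```python
import numpy as np
from scipy.fftpack import dct

# mel_power: the 22 filter-bank energies from the previous sketch
log_mel = np.log(mel_power + 1e-10)            # S241: log of the Mel power spectrum
cepstrum = dct(log_mel, type=2, norm='ortho')  # S242: discrete cosine transform
mfcc = cepstrum[1:13]                          # keep the 2nd through 13th coefficients
```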
The to-be-recognized speech features obtained by extracting features from the speech to be recognized in steps S21-S24 reflect the characteristics of the speech to be recognized very accurately, which facilitates speech recognition based on those features.
In an embodiment, as shown in FIG. 4, in step S30, processing the speech to be recognized with the truncated-speech detection algorithm to obtain truncated speech segments specifically includes the following steps:
S31. The speech to be recognized is divided evenly in time into at least two speech sub-segments.
Here, the speech to be recognized is the speech recorded by the voice collection terminal after the original speech of the speaker to be recognized has undergone truncation processing.
Specifically, the speech to be recognized is divided in time into several non-overlapping speech sub-segments, at least two in number; each period may be set to 0.5 s, forming the minimum basic unit for truncated-speech detection.
By dividing the speech to be recognized evenly in time into at least two speech sub-segments for detection, this step can improve the accuracy of detecting truncated speech segments. Understandably, the more speech sub-segments are divided out, the higher the accuracy of detecting truncated speech segments.
S32. Each speech sub-segment is divided evenly by volume into at least two volume sub-intervals, and the number of peak sampling points in the volume sub-interval containing the highest volume is obtained.
Here, the number of peak sampling points is the number of speech sampling points falling in the volume sub-interval that contains the highest volume.
Specifically, the recognition server first obtains the maximum amplitude (volume) Vm of each speech sub-segment and divides the interval [0, Vm] evenly into at least two non-overlapping volume sub-intervals. Truncation may have occurred in the volume sub-interval containing the highest volume of each speech sub-segment; if truncation has occurred, the sub-interval containing the highest volume is the interval containing the amplitude threshold.
This step obtains the number of peak sampling points in the volume sub-interval containing the highest volume, which serves as the technical basis for determining whether the speech sub-segment is a truncated speech segment.
S33. The total number of sampling points over all volume sub-intervals is counted to obtain the peak sampling percentage, i.e., the number of peak sampling points relative to the total number of sampling points.
Here, the total number of sampling points is the number of all points at which the speech volume is sampled in each speech sub-segment. The peak sampling percentage is the number of peak sampling points as a percentage of the total number of sampling points.
S34. If the peak sampling percentage exceeds a preset threshold, the corresponding speech sub-segment is a truncated speech segment.
Here, the preset threshold is a percentage of the number of peak sampling points relative to the total number of sampling points, set according to practical experience. If the peak sampling percentage exceeds the preset threshold, the speech sub-segment containing that volume sub-interval is a truncated speech segment.
Understandably, if the peak sampling percentage of the volume sub-interval containing the highest volume does not exceed the preset threshold, the number of speech sampling points in that speech sub-segment is within the normal range, and the speech sub-segment containing that volume sub-interval is a normal speech segment.
In this embodiment, because the gain module of the voice collection terminal adaptively adjusts the amplitude threshold, it is difficult for the recognition server to identify a truncated speech segment by specifying a fixed amplitude threshold. Comparing the peak sampling percentage, i.e., the number of peak sampling points relative to the total number of sampling points, against a preset threshold effectively improves the flexibility of the determination method and helps improve the accuracy of the determination result.
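A minimal sketch of steps S31-S34 might look as follows; the 0.5 s segment duration comes from this embodiment, while the number of volume sub-intervals and the preset threshold are illustrative assumptions:

```python
import numpy as np

def is_truncated(segment, n_intervals=10, threshold=0.05):
    """S32-S34: split [0, Vm] into volume sub-intervals and test the share
    of sampling points falling in the topmost (highest-volume) one."""
    amplitude = np.abs(segment)
    vm = amplitude.max()
    if vm == 0.0:
        return False
    edges = np.linspace(0.0, vm, n_intervals + 1)
    peak_count = np.sum(amplitude >= edges[-2])     # points in the top sub-interval
    return peak_count / amplitude.size > threshold  # S34: compare with the threshold

def truncated_segments(speech, sr=16000, seg_dur=0.5):
    """S31: divide the speech evenly in time, then test each sub-segment."""
    seg_len = int(sr * seg_dur)
    return [speech[i:i + seg_len]
            for i in range(0, len(speech) - seg_len + 1, seg_len)
            if is_truncated(speech[i:i + seg_len])]
```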
In an embodiment, as shown in FIG. 5, before step S30, i.e., before the step of repairing the to-be-recognized speech features with the truncated-speech repair model, the voiceprint recognition method further includes:
S35. Obtain the original training features corresponding to original training speech, apply truncation processing to the original training speech to obtain the corresponding truncated training speech, and then extract the truncated training features of the truncated training speech.
Here, the original training speech is the original speech uttered by a speaker without truncation processing. Understandably, the truncated training speech is the speech uttered by the speaker after truncation processing.
In this step, the recognition server truncates the original training speech: it retains only the portion of the speech signal between the highest-volume threshold and the lowest-volume threshold and records signal values outside this range as the amplitude threshold, thereby obtaining the corresponding truncated training speech.
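This truncation processing can be sketched as a simple clipping operation; the threshold value below assumes a signal normalized to [-1, 1] and is an illustrative choice:

```python
import numpy as np

def truncate_speech(signal, amp_threshold=0.5):
    """S35: keep samples between the thresholds; record everything
    outside that range as the amplitude threshold itself."""
    return np.clip(signal, -amp_threshold, amp_threshold)
```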
S36. Use the truncated training features corresponding to the truncated training speech as the input layer of a DNN model and the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrate the feature parameters of the DNN model to generate a DNN-based truncated-speech repair model.
Here, in the field of voiceprint recognition, speech features are commonly represented by MFCC features. Because MFCC features make no assumptions about and impose no restrictions on the input speech signal, and because they are derived from an auditory model, they are robust, better match the auditory characteristics of the human ear, and retain good recognition performance even when the signal-to-noise ratio decreases. Therefore, the truncated training features are the truncated MFCC features corresponding to the truncated training speech, and the original training features are the original MFCC features corresponding to the original training speech.
Specifically, extracting the truncated MFCC features of the truncated training speech includes: converting the truncated training speech from a time-domain signal to a frequency-domain signal by a Fourier transform; filtering the frequency-domain signal to obtain the Mel power spectrum; and performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients, i.e., the MFCC features. Here, the cepstrum refers to the inverse Fourier transform of the logarithm of a signal's Fourier transform spectrum, which converts the signal into easy-to-use speech features (the Mel-frequency cepstral coefficient feature vectors used for training or recognition). The same procedure applies to extracting the original MFCC features corresponding to the original training speech.
The structure of the DNN model includes an input layer, several intermediate layers, and an output layer. The input layer receives input information from the outside and passes it to the intermediate layers; the intermediate layers perform internal information processing and are responsible for transforming the information, and depending on the required information-transformation capacity they can be designed as a single intermediate layer or as a multi-layer structure. After the information passed from the last intermediate layer to the output layer is further processed, one forward-propagation pass of learning is completed, and the output layer outputs the processing result to the outside.
Specifically, the recognition server uses the truncated MFCC features and the original MFCC features as the input layer and the output layer of the DNN model, respectively, to obtain the feature parameters of the DNN-based truncated-speech repair model.
If the number of features in the original MFCC features is n, the output layer of the DNN model includes n neurons, each outputting the original MFCC feature value corresponding to the truncated MFCC feature presented at the input.
Suppose the DNN has n layers in total, each layer has multiple neurons, and adjacent layers are fully connected. Each layer has its own activation function f (the activation function expresses the input-output relationship between neurons: each neuron in the neural network accepts input values and passes its output to the next layer). Let the input be the feature vector v, the transition matrix from layer i to layer i+1 be w_{i(i+1)}, the bias vector of layer i+1 be b_{i+1}, the output of layer i be out_i, and the input of layer i+1 be in_{i+1}. The computation is:

in_{i+1} = out_i · w_{i(i+1)} + b_{i+1}

out_{i+1} = f(in_{i+1})

It follows that the parameters of the DNN model include the inter-layer transition matrices w and the bias vector b of each layer; the main task of training the DNN model is to determine these feature parameters and finally generate the DNN-based truncated-speech repair model.
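The two equations above amount to the forward pass sketched below; the layer sizes, the ReLU activation, and the linear output layer are assumptions, since the embodiment does not fix the activation function f:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(v, weights, biases, f=relu):
    """in_{i+1} = out_i @ w_{i(i+1)} + b_{i+1};  out_{i+1} = f(in_{i+1})."""
    out = v
    for i, (w, b) in enumerate(zip(weights, biases)):
        pre = out @ w + b
        out = f(pre) if i < len(weights) - 1 else pre  # linear output layer
    return out

# Toy repair network: 12 truncated MFCCs in, 12 repaired MFCCs out
rng = np.random.default_rng(0)
weights = [rng.normal(size=(12, 64)) * 0.1, rng.normal(size=(64, 12)) * 0.1]
biases = [np.zeros(64), np.zeros(12)]
repaired_mfcc = forward(rng.normal(size=12), weights, biases)
```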
In this embodiment, the recognition server repairs truncated speech segments with the truncated-speech repair model generated from the DNN model, which can substantially improve the accuracy of speech repair. Preferably, a DBN (Deep Belief Nets) model or a CDBN (Convolutional Deep Belief Networks) model can also be used as the initial training model: the DBN model trains faster than the DNN model and is better suited to data from large speech databases, while the CDBN model is suited to data from extra-large speech databases.
In an embodiment, in step S30, repairing the truncated speech segment with the truncated-speech repair model to obtain a repaired speech segment specifically includes the following step:
S37. The DNN-based truncated-speech repair model is used to repair the to-be-recognized speech features and obtain the target speech features of the repaired speech segment.
Specifically, the recognition server first obtains the to-be-recognized speech features of the truncated speech segment; it then feeds those features to the input layer of the DNN model, and after repair by the DNN-based truncated-speech repair model obtained in step S36, the corresponding target speech features are obtained at the output layer of the DNN model.
In this embodiment, the recognition server repairs the truncated speech segment with the DNN-based truncated-speech repair model, which effectively improves the accuracy of the target speech features obtained at the output layer.
In an embodiment, as shown in FIG. 6, in step S40, performing voiceprint recognition on the target speech features and the standard speech features with the preset voiceprint recognition model, based on the standard speech features corresponding to the speaker identifier, specifically includes the following steps:
S41. The preset voiceprint recognition model processes the target speech features and the standard speech features separately to obtain an original speech vector and a standard speech vector, respectively.
Here, the preset voiceprint recognition model is a model for scoring the similarity between the standard speech features and the to-be-recognized speech features. Several mature voiceprint recognition models exist, such as the GMM-UBM (Gaussian mixture model-universal background model) or the i-vector model.
As applied in this embodiment, the i-vector model can be used as the preset voiceprint recognition model to obtain the i-vector corresponding to each speaker. A vector is a directed line segment in a multidimensional space: if two vectors point in the same direction, i.e., the angle between them approaches zero and the cosine of the angle (the cosine distance) approaches 1, the two vectors are similar. In this embodiment, the original speech vector is the repaired i-vector, and the standard speech vector is the standard i-vector.
Specifically, the recognition server uses the i-vector model to obtain the original speech vector as follows:
The recognition server trains a GMM-UBM representing the speech space on the target speech features; the trained UBM is then used to compute sufficient statistics for the speech features of each frame, and these sufficient statistics are mapped into the total-variability space to obtain the original i-vector of the target speech features. Preferably, LDA (Linear Discriminant Analysis) can also be applied to the original i-vector for channel compensation, using a projection-matrix algorithm to minimize the distance between same-class samples and maximize the distance between different-class samples. The standard speech vector is obtained in the same way.
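As one hedged illustration of computing per-frame sufficient statistics with the trained UBM, the sketch below gathers the zeroth- and first-order Baum-Welch statistics using scikit-learn's GaussianMixture as a stand-in UBM; the subsequent mapping into the total-variability space and the LDA channel compensation are omitted, and all data shown are random stand-ins:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def baum_welch_stats(ubm: GaussianMixture, frames: np.ndarray):
    """frames: (T, D) matrix of MFCC vectors for one utterance."""
    posteriors = ubm.predict_proba(frames)  # (T, C) component responsibilities
    n = posteriors.sum(axis=0)              # zeroth-order statistics, shape (C,)
    f = posteriors.T @ frames               # first-order statistics, shape (C, D)
    return n, f

rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 12))    # stand-in for pooled background MFCCs
ubm = GaussianMixture(n_components=8, covariance_type='diag').fit(background)
n_stats, f_stats = baum_welch_stats(ubm, rng.normal(size=(300, 12)))
```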
S42. Obtain the spatial distance between the original speech vector and the standard speech vector.
Here, as applied in this embodiment, the spatial distance may refer to the cosine distance between two vectors. The cosine distance, also called cosine similarity, uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals.
S43. According to the spatial distance and a preset distance threshold, obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
Here, the distance threshold is a value, expressed as a cosine, set according to practical experience.
Specifically, the recognition server compares the spatial distance, i.e., the cosine distance, between the standard i-vector and the original i-vector obtained in step S42. If their cosine distance is greater than the distance threshold preset according to the actual situation, it can be determined that the target speech features and the standard speech features originate from the same speaker.
In an embodiment, in step S42, obtaining the spatial distance between the original speech vector and the standard speech vector specifically includes the following step:
S424. A cosine similarity algorithm is used to obtain the spatial distance between the original speech vector and the standard speech vector.
Specifically, the spatial distance between the original speech vector and the standard speech vector can be determined by the following formula:

cos θ = (Σ_{i=1..n} A_i · B_i) / (√(Σ_{i=1..n} A_i^2) · √(Σ_{i=1..n} B_i^2))

where A_i and B_i are the components of the original speech vector and the standard speech vector, respectively. From this formula, the similarity ranges from -1 to 1: -1 means the two vectors point in opposite directions, 1 means they point in the same direction, and 0 means the two vectors are independent. Values between -1 and 1 express the degree of similarity or dissimilarity between the two vectors; understandably, the closer the similarity is to 1, the closer the two vectors are. As applied in this embodiment, the distance threshold for cos θ can be preset according to practical experience. If the similarity between the original speech vector and the standard speech vector is greater than the distance threshold, the two vectors are considered similar, and it can be determined that the target speech features and the standard speech features correspond to the voiceprint recognition result of the same speaker.
In this embodiment, the similarity between the original speech vector and the standard speech vector can be determined with the cosine similarity algorithm, which is simple and fast and facilitates quick confirmation of the recognition result.
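The scoring step then reduces to a few lines; the threshold value below is an illustrative assumption, since the embodiment sets the distance threshold from practical experience:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between the original and the standard speech vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(original_ivector, standard_ivector, distance_threshold=0.7):
    """S43: same speaker if the similarity exceeds the preset threshold."""
    return cosine_similarity(original_ivector, standard_ivector) > distance_threshold
```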
In summary, this embodiment provides a voiceprint recognition method: the truncated-speech detection algorithm detects whether the speech to be recognized is a truncated speech segment, and if so, the truncated-speech repair model repairs the to-be-recognized speech features of the truncated speech segment into target speech features, which are compared against the speaker's corresponding standard speech features to identify the speaker's true identity. By repairing the to-be-recognized speech features in the speech to be recognized and obtaining target speech features close to the speaker's original speech, this embodiment effectively improves the reliability and accuracy of speech recognition.
Further, the target speech features obtained by the recognition server through feature extraction of the speech to be recognized reflect the characteristics of the speech very accurately, which facilitates speech recognition based on those features; comparing the peak sampling percentage, i.e., the number of peak sampling points relative to the total number of sampling points, against a preset threshold effectively improves the flexibility of the determination method and helps improve the accuracy of the determination result; repairing truncated speech segments with the truncated-speech repair model generated from the DNN model substantially improves the accuracy of speech repair; and the similarity between the original speech vector and the standard speech vector can be determined with the cosine similarity algorithm, which is simple and fast and facilitates quick confirmation of the recognition result.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic and does not constitute any limitation on the implementation of the embodiments of this application.
FIG. 7 shows a schematic block diagram of a voiceprint recognition apparatus corresponding one-to-one to the voiceprint recognition method of the above embodiments. As shown in FIG. 7, the voiceprint recognition apparatus includes a to-be-recognized speech acquisition module 10, a to-be-recognized feature acquisition module 20, a target speech feature acquisition module 30, and a voiceprint recognition result acquisition module 40. The functions implemented by these modules correspond one-to-one to the steps of the voiceprint recognition method of the above embodiments; to avoid repetition, this embodiment does not detail them one by one.
The to-be-recognized speech acquisition module 10 is configured to obtain the speech to be recognized, which carries a speaker identifier.
The to-be-recognized feature acquisition module 20 is configured to obtain the corresponding to-be-recognized speech features based on the speech to be recognized.
The target speech feature acquisition module 30 is configured to detect the speech to be recognized with a truncated-speech detection algorithm and, if the speech to be recognized is a truncated speech segment, repair the to-be-recognized speech features with the truncated-speech repair model to obtain the target speech features.
The voiceprint recognition result acquisition module 40 is configured to perform voiceprint recognition on the target speech features and the standard speech features, based on the standard speech features corresponding to the speaker identifier, with the preset voiceprint recognition model, and to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
Preferably, the to-be-recognized feature acquisition module 20 includes a speech data acquisition unit 21, a power spectrum acquisition unit 22, a Mel power spectrum acquisition unit 23, and a Mel coefficient acquisition unit 24.
The speech data acquisition unit 21 is configured to pre-process the speech to be recognized and obtain pre-processed speech data.
The power spectrum acquisition unit 22 is configured to perform a fast Fourier transform on the pre-processed speech data, obtain the frequency spectrum of the speech to be recognized, and obtain the power spectrum of the speech to be recognized from the spectrum.
The Mel power spectrum acquisition unit 23 is configured to process the power spectrum of the speech to be recognized with a Mel-scale filter bank and obtain the Mel power spectrum of the speech to be recognized.
The Mel coefficient acquisition unit 24 is configured to perform cepstral analysis on the Mel power spectrum and obtain the Mel-frequency cepstral coefficients of the speech to be recognized.
Preferably, the target speech feature acquisition module 30 includes a speech sub-segment division unit 31, a sampling point count acquisition unit 32, a sampling percentage acquisition unit 33, and a preset-threshold comparison unit 34.
The speech sub-segment division unit 31 is configured to divide the speech to be recognized evenly in time into at least two speech sub-segments.
The sampling point count acquisition unit 32 is configured to divide each speech sub-segment evenly by volume into at least two volume sub-intervals and obtain the number of peak sampling points in the volume sub-interval containing the highest volume.
The sampling percentage acquisition unit 33 is configured to count the total number of sampling points over all volume sub-intervals to obtain the peak sampling percentage, i.e., the number of peak sampling points relative to the total number of sampling points.
The preset-threshold comparison unit 34 is configured to determine that the corresponding speech sub-segment is a truncated speech segment if the peak sampling percentage exceeds the preset threshold.
Preferably, the voiceprint recognition apparatus further includes an original speech acquisition unit 35 and a repair model generation unit 36.
The original speech acquisition unit 35 is configured to obtain the original training features corresponding to original training speech, apply truncation processing to the original training speech to obtain the corresponding truncated training speech, and extract the truncated training features of the truncated training speech.
The repair model generation unit 36 is configured to use the truncated training features corresponding to the truncated training speech as the input layer of the DNN model and the original training features corresponding to the original training speech as the output layer of the DNN model, and to calibrate the feature parameters of the DNN model to generate the DNN-based truncated-speech repair model.
Preferably, the voiceprint recognition apparatus further includes an original feature acquisition unit 37.
The original feature acquisition unit 37 is configured to repair the to-be-recognized speech features with the DNN-based truncated-speech repair model and obtain the target speech features of the repaired speech segment.
Preferably, the voiceprint recognition result acquisition module 40 includes a recognition model application unit 41, a spatial distance acquisition unit 42, and a recognition result acquisition unit 43.
The recognition model application unit 41 is configured to process the target speech features and the standard speech features separately with the preset voiceprint recognition model to obtain the original speech vector and the standard speech vector, respectively.
The spatial distance acquisition unit 42 is configured to obtain the spatial distance between the original speech vector and the standard speech vector.
The recognition result acquisition unit 43 is configured to obtain, according to the spatial distance and a preset distance threshold, a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
For the specific limitations of the voiceprint recognition apparatus, reference may be made to the limitations of the voiceprint recognition method above, which are not repeated here. Each module of the above voiceprint recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded, in hardware form, in or independent of a processor in a computer device, or stored, in software form, in a memory of the computer device, so that the processor can invoke and perform the operations corresponding to the modules.
In an embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device stores data related to the voiceprint recognition method. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a voiceprint recognition method.
In an embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: obtaining speech to be recognized, the speech to be recognized carrying a speaker identifier; obtaining the corresponding to-be-recognized speech features based on the speech to be recognized; detecting the speech to be recognized with a truncated-speech detection algorithm and, if the speech to be recognized is a truncated speech segment, repairing the to-be-recognized speech features with a truncated-speech repair model to obtain target speech features; and, based on the standard speech features corresponding to the speaker identifier, performing voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
In an embodiment, training speech features corresponding to training speech data are extracted, and when the processor executes the computer-readable instructions, the following steps are implemented: pre-processing the training speech data to obtain pre-processed speech data; performing a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the training speech data, and obtaining the power spectrum of the training speech data from the spectrum; processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data; and performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
In an embodiment, a truncated-speech detection algorithm is used to detect the speech to be recognized and determine whether the speech to be recognized is a truncated speech segment, and when the processor executes the computer-readable instructions, the following steps are implemented: dividing the speech to be recognized evenly in time into at least two speech sub-segments; dividing each speech sub-segment evenly by volume into at least two volume sub-intervals, and obtaining the number of peak sampling points in the volume sub-interval containing the highest volume; counting the total number of sampling points over all volume sub-intervals to obtain the peak sampling percentage, i.e., the number of peak sampling points relative to the total number of sampling points; and, if the peak sampling percentage exceeds a preset threshold, determining that the corresponding speech sub-segment is a truncated speech segment.
In an embodiment, before the step of repairing the to-be-recognized speech features with the truncated-speech repair model, the processor, when executing the computer-readable instructions, further implements the following steps: obtaining original training speech and truncating it to obtain the corresponding truncated training speech; and using the truncated training features corresponding to the truncated training speech as the input layer of the DNN model and the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrating the feature parameters of the DNN model to generate the DNN-based truncated-speech repair model.
In an embodiment, the truncated-speech repair model is used to repair the to-be-recognized speech features and obtain the target speech features, and when the processor executes the computer-readable instructions, the following step is implemented: repairing the to-be-recognized speech features with the DNN-based truncated-speech repair model to obtain the target speech features of the repaired speech segment.
In an embodiment, based on the standard speech features corresponding to the speaker identifier, the preset voiceprint recognition model performs voiceprint recognition on the target speech features and the standard speech features, and when the processor executes the computer-readable instructions, the following steps are implemented: processing the target speech features and the standard speech features separately with the preset voiceprint recognition model to obtain the original speech vector and the standard speech vector, respectively; obtaining the spatial distance between the original speech vector and the standard speech vector; and obtaining, according to the spatial distance and a preset distance threshold, a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
In an embodiment, the spatial distance between the original speech vector and the standard speech vector is obtained, and when the processor executes the computer-readable instructions, the following step is implemented: obtaining the spatial distance between the original speech vector and the standard speech vector with a cosine similarity algorithm.
In an embodiment, a computer-readable storage medium is provided, storing computer-readable instructions. When the computer-readable instructions are executed by a processor, the following steps are implemented: obtaining speech to be recognized, the speech to be recognized carrying a speaker identifier; obtaining the corresponding to-be-recognized speech features based on the speech to be recognized; detecting the speech to be recognized with a truncated-speech detection algorithm and, if the speech to be recognized is a truncated speech segment, repairing the to-be-recognized speech features with a truncated-speech repair model to obtain target speech features; and, based on the standard speech features corresponding to the speaker identifier, performing voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
In an embodiment, training speech features corresponding to training speech data are extracted, and when the computer-readable instructions are executed by the processor, the following steps are implemented: pre-processing the training speech data to obtain pre-processed speech data; performing a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the training speech data, and obtaining the power spectrum of the training speech data from the spectrum; processing the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data; and performing cepstral analysis on the Mel power spectrum to obtain the MFCC features of the training speech data.
In an embodiment, a truncated-speech detection algorithm is used to detect the speech to be recognized and determine whether the speech to be recognized is a truncated speech segment, and when the computer-readable instructions are executed by the processor, the following steps are implemented: dividing the speech to be recognized evenly in time into at least two speech sub-segments; dividing each speech sub-segment evenly by volume into at least two volume sub-intervals, and obtaining the number of peak sampling points in the volume sub-interval containing the highest volume; counting the total number of sampling points over all volume sub-intervals to obtain the peak sampling percentage, i.e., the number of peak sampling points relative to the total number of sampling points; and, if the peak sampling percentage exceeds a preset threshold, determining that the corresponding speech sub-segment is a truncated speech segment.
In an embodiment, before the step of repairing the to-be-recognized speech features with the truncated-speech repair model, the computer-readable instructions, when executed by the processor, further implement the following steps: obtaining original training speech and truncating it to obtain the corresponding truncated training speech; and using the truncated training features corresponding to the truncated training speech as the input layer of the DNN model and the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrating the feature parameters of the DNN model to generate the DNN-based truncated-speech repair model.
In an embodiment, the truncated-speech repair model is used to repair the to-be-recognized speech features and obtain the target speech features, and when the computer-readable instructions are executed by the processor, the following step is implemented: repairing the to-be-recognized speech features with the DNN-based truncated-speech repair model to obtain the target speech features of the repaired speech segment.
In an embodiment, based on the standard speech features corresponding to the speaker identifier, the preset voiceprint recognition model performs voiceprint recognition on the target speech features and the standard speech features, and when the computer-readable instructions are executed by the processor, the following steps are implemented: processing the target speech features and the standard speech features separately with the preset voiceprint recognition model to obtain the original speech vector and the standard speech vector, respectively; obtaining the spatial distance between the original speech vector and the standard speech vector; and obtaining, according to the spatial distance and a preset distance threshold, a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
In an embodiment, the spatial distance between the original speech vector and the standard speech vector is obtained, and when the computer-readable instructions are executed by the processor, the following step is implemented: obtaining the spatial distance between the original speech vector and the standard speech vector with a cosine similarity algorithm.
A person of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed, they can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional units and modules above is given only as an example; in practical applications, the above functions can be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are intended only to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features; these modifications or replacements do not remove the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of this application, and they shall all fall within the protection scope of this application.

Claims (20)

  1. A voiceprint recognition method, characterized by comprising:
    obtaining speech to be recognized, the speech to be recognized carrying a speaker identifier;
    obtaining corresponding to-be-recognized speech features based on the speech to be recognized;
    detecting the speech to be recognized with a truncated-speech detection algorithm, and if the speech to be recognized is a truncated speech segment, repairing the to-be-recognized speech features with a truncated-speech repair model to obtain target speech features; and
    based on standard speech features corresponding to the speaker identifier, performing voiceprint recognition on the target speech features and the standard speech features with a preset voiceprint recognition model to obtain a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
  2. The voiceprint recognition method according to claim 1, characterized in that obtaining the corresponding to-be-recognized speech features based on the speech to be recognized comprises:
    pre-processing the speech to be recognized to obtain pre-processed speech data;
    performing a fast Fourier transform on the pre-processed speech data to obtain a frequency spectrum of the speech to be recognized, and obtaining a power spectrum of the speech to be recognized according to the frequency spectrum;
    processing the power spectrum of the speech to be recognized with a Mel-scale filter bank to obtain a Mel power spectrum of the speech to be recognized; and
    performing cepstral analysis on the Mel power spectrum to obtain Mel-frequency cepstral coefficients of the speech to be recognized.
  3. The voiceprint recognition method according to claim 1, characterized in that detecting the speech to be recognized with the truncated-speech detection algorithm and determining whether the speech to be recognized is a truncated speech segment comprises:
    dividing the speech to be recognized evenly in time into at least two speech sub-segments;
    dividing each speech sub-segment evenly by volume into at least two volume sub-intervals, and obtaining the number of peak sampling points in the volume sub-interval containing the highest volume;
    counting the total number of sampling points over all the volume sub-intervals to obtain a peak sampling percentage of the number of peak sampling points relative to the total number of sampling points; and
    if the peak sampling percentage exceeds a preset threshold, determining that the corresponding speech sub-segment is a truncated speech segment.
  4. The voiceprint recognition method according to claim 1, characterized in that, before the step of repairing the to-be-recognized speech features with the truncated-speech repair model, the voiceprint recognition method further comprises:
    obtaining original training features corresponding to original training speech, applying truncation processing to the original training speech to obtain corresponding truncated training speech, and extracting truncated training features of the truncated training speech; and
    using the truncated training features corresponding to the truncated training speech as the input layer of a DNN model and the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrating the feature parameters of the DNN model to generate a DNN-based truncated-speech repair model.
  5. The voiceprint recognition method according to claim 4, characterized in that repairing the to-be-recognized speech features with the truncated-speech repair model to obtain the target speech features comprises:
    repairing the to-be-recognized speech features with the DNN-based truncated-speech repair model to obtain the target speech features.
  6. The voiceprint recognition method according to claim 1, characterized in that performing voiceprint recognition on the target speech features and the standard speech features with the preset voiceprint recognition model, based on the standard speech features corresponding to the speaker identifier, comprises:
    processing the target speech features and the standard speech features separately with the preset voiceprint recognition model to obtain an original speech vector and a standard speech vector, respectively;
    obtaining a spatial distance between the original speech vector and the standard speech vector; and
    obtaining, according to the spatial distance and a preset distance threshold, a voiceprint recognition result indicating whether the target speech features and the standard speech features correspond to the same speaker.
  7. The voiceprint recognition method according to claim 6, wherein obtaining the spatial distance between the original speech vector and the standard speech vector comprises:
    Using a cosine similarity algorithm to obtain the spatial distance between the original speech vector and the standard speech vector.
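Claims 6–7 (and 14 and 20) compare the two voiceprint vectors by a cosine-similarity-based spatial distance and then threshold it. One common reading, sketched below, takes the distance as 1 − cosine similarity; the 0.35 threshold is an illustrative assumption:

```python
import numpy as np

def same_speaker(original_vec, standard_vec, distance_threshold=0.35):
    """Cosine-similarity-based spatial distance, then a threshold decision."""
    cos_sim = float(np.dot(original_vec, standard_vec) /
                    (np.linalg.norm(original_vec) * np.linalg.norm(standard_vec)))
    return (1.0 - cos_sim) <= distance_threshold   # smaller distance -> same speaker
```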
  8. A voiceprint recognition apparatus, comprising:
    A speech acquisition module, configured to obtain speech to be recognized, the speech to be recognized carrying a speaker identifier;
    A feature acquisition module, configured to obtain, based on the speech to be recognized, the corresponding speech feature to be recognized;
    A target speech feature module, configured to detect the speech to be recognized by using a truncated speech detection algorithm and, if the speech to be recognized is a truncated speech segment, repair the speech feature to be recognized by using a truncated speech repair model to obtain a target speech feature;
    A voiceprint recognition result module, configured to perform voiceprint recognition on the target speech feature and the standard speech feature by using a preset voiceprint recognition model, based on the standard speech feature corresponding to the speaker identifier, and to obtain a voiceprint recognition result indicating whether the target speech feature and the standard speech feature correspond to the same speaker.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    Obtaining speech to be recognized, the speech to be recognized carrying a speaker identifier;
    Obtaining, based on the speech to be recognized, the corresponding speech feature to be recognized;
    Detecting the speech to be recognized by using a truncated speech detection algorithm and, if the speech to be recognized is a truncated speech segment, repairing the speech feature to be recognized by using a truncated speech repair model to obtain a target speech feature;
    Performing voiceprint recognition on the target speech feature and the standard speech feature by using a preset voiceprint recognition model, based on the standard speech feature corresponding to the speaker identifier, to obtain a voiceprint recognition result indicating whether the target speech feature and the standard speech feature correspond to the same speaker.
  10. The computer device according to claim 9, wherein obtaining the corresponding speech feature to be recognized based on the speech to be recognized comprises:
    Preprocessing the speech to be recognized to obtain preprocessed speech data;
    Performing a fast Fourier transform on the preprocessed speech data to obtain the frequency spectrum of the speech to be recognized, and obtaining the power spectrum of the speech to be recognized from the frequency spectrum;
    Processing the power spectrum of the speech to be recognized with a Mel-scale filter bank to obtain the Mel power spectrum of the speech to be recognized;
    Performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the speech to be recognized.
  11. The computer device according to claim 9, wherein detecting the speech to be recognized by using the truncated speech detection algorithm, and determining that the speech to be recognized is a truncated speech segment, comprises:
    Dividing the speech to be recognized evenly in time order into at least two speech sub-segments;
    Dividing each speech sub-segment evenly by volume into at least two volume sub-intervals, and obtaining the number of treble sampling points in the volume sub-interval containing the highest volume;
    Counting the total number of sampling points across all the volume sub-intervals to obtain the treble sampling percentage, i.e. the number of treble sampling points relative to the total number of sampling points;
    If the treble sampling percentage exceeds a preset threshold, determining that the corresponding speech sub-segment is a truncated speech segment.
  12. The computer device according to claim 9, wherein, before the step of repairing the speech feature to be recognized by using the truncated speech repair model, the processor further implements the following steps when executing the computer-readable instructions:
    Obtaining original training features corresponding to original training speech, performing truncation processing on the original training speech to obtain corresponding truncated training speech, and then extracting truncated training features from the truncated training speech;
    Using the truncated training features corresponding to the truncated training speech as the input layer of a DNN model and the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrating the feature parameters of the DNN model to generate a DNN-based truncated speech repair model.
  13. The computer device according to claim 12, wherein repairing the speech feature to be recognized by using the truncated speech repair model to obtain the target speech feature comprises:
    Repairing the speech feature to be recognized by using the DNN-based truncated speech repair model to obtain the target speech feature.
  14. The computer device according to claim 9, wherein performing voiceprint recognition on the target speech feature and the standard speech feature by using the preset voiceprint recognition model, based on the standard speech feature corresponding to the speaker identifier, comprises:
    Processing the target speech feature and the standard speech feature separately with the preset voiceprint recognition model to obtain an original speech vector and a standard speech vector, respectively;
    Obtaining the spatial distance between the original speech vector and the standard speech vector;
    Obtaining, from the spatial distance and a preset distance threshold, a voiceprint recognition result indicating whether the target speech feature and the standard speech feature correspond to the same speaker.
  15. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    Obtaining speech to be recognized, the speech to be recognized carrying a speaker identifier;
    Obtaining, based on the speech to be recognized, the corresponding speech feature to be recognized;
    Detecting the speech to be recognized by using a truncated speech detection algorithm and, if the speech to be recognized is a truncated speech segment, repairing the speech feature to be recognized by using a truncated speech repair model to obtain a target speech feature;
    Performing voiceprint recognition on the target speech feature and the standard speech feature by using a preset voiceprint recognition model, based on the standard speech feature corresponding to the speaker identifier, to obtain a voiceprint recognition result indicating whether the target speech feature and the standard speech feature correspond to the same speaker.
  16. The non-volatile readable storage medium according to claim 15, wherein obtaining the corresponding speech feature to be recognized based on the speech to be recognized comprises:
    Preprocessing the speech to be recognized to obtain preprocessed speech data;
    Performing a fast Fourier transform on the preprocessed speech data to obtain the frequency spectrum of the speech to be recognized, and obtaining the power spectrum of the speech to be recognized from the frequency spectrum;
    Processing the power spectrum of the speech to be recognized with a Mel-scale filter bank to obtain the Mel power spectrum of the speech to be recognized;
    Performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the speech to be recognized.
  17. The non-volatile readable storage medium according to claim 15, wherein detecting the speech to be recognized by using the truncated speech detection algorithm, and determining that the speech to be recognized is a truncated speech segment, comprises:
    Dividing the speech to be recognized evenly in time order into at least two speech sub-segments;
    Dividing each speech sub-segment evenly by volume into at least two volume sub-intervals, and obtaining the number of treble sampling points in the volume sub-interval containing the highest volume;
    Counting the total number of sampling points across all the volume sub-intervals to obtain the treble sampling percentage, i.e. the number of treble sampling points relative to the total number of sampling points;
    If the treble sampling percentage exceeds a preset threshold, determining that the corresponding speech sub-segment is a truncated speech segment.
  18. The non-volatile readable storage medium according to claim 15, wherein, before the step of repairing the speech feature to be recognized by using the truncated speech repair model, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
    Obtaining original training features corresponding to original training speech, performing truncation processing on the original training speech to obtain corresponding truncated training speech, and then extracting truncated training features from the truncated training speech;
    Using the truncated training features corresponding to the truncated training speech as the input layer of a DNN model and the original training features corresponding to the original training speech as the output layer of the DNN model, and calibrating the feature parameters of the DNN model to generate a DNN-based truncated speech repair model.
  19. The non-volatile readable storage medium according to claim 18, wherein repairing the speech feature to be recognized by using the truncated speech repair model to obtain the target speech feature comprises:
    Repairing the speech feature to be recognized by using the DNN-based truncated speech repair model to obtain the target speech feature.
  20. The non-volatile readable storage medium according to claim 15, wherein performing voiceprint recognition on the target speech feature and the standard speech feature by using the preset voiceprint recognition model, based on the standard speech feature corresponding to the speaker identifier, comprises:
    Processing the target speech feature and the standard speech feature separately with the preset voiceprint recognition model to obtain an original speech vector and a standard speech vector, respectively;
    Obtaining the spatial distance between the original speech vector and the standard speech vector;
    Obtaining, from the spatial distance and a preset distance threshold, a voiceprint recognition result indicating whether the target speech feature and the standard speech feature correspond to the same speaker.
PCT/CN2018/092598 2018-06-06 2018-06-25 Voiceprint recognition method and apparatus, computer device and storage medium WO2019232829A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810573715.4A CN108899032A (en) 2018-06-06 2018-06-06 Method for recognizing sound-groove, device, computer equipment and storage medium
CN201810573715.4 2018-06-06

Publications (1)

Publication Number Publication Date
WO2019232829A1 2019-12-12

Family

ID=64343940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/092598 WO2019232829A1 (en) 2018-06-06 2018-06-25 Voiceprint recognition method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN108899032A (en)
WO (1) WO2019232829A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112542156A (en) * 2020-12-08 2021-03-23 山东航空股份有限公司 Civil aviation maintenance worker card system based on voiceprint recognition and voice instruction control

Families Citing this family (18)

Publication number Priority date Publication date Assignee Title
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN109584887B (en) * 2018-12-24 2022-12-02 科大讯飞股份有限公司 Method and device for generating voiceprint information extraction model and extracting voiceprint information
CN109473091B (en) * 2018-12-25 2021-08-10 四川虹微技术有限公司 Voice sample generation method and device
CN110556126B (en) * 2019-09-16 2024-01-05 平安科技(深圳)有限公司 Speech recognition method and device and computer equipment
CN110610709A (en) * 2019-09-26 2019-12-24 浙江百应科技有限公司 Identity distinguishing method based on voiceprint recognition
CN110827853A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Voice feature information extraction method, terminal and readable storage medium
CN113223511B (en) * 2020-01-21 2024-04-16 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN111402889A (en) * 2020-03-16 2020-07-10 南京奥拓电子科技有限公司 Volume threshold determination method and device, voice recognition system and queuing machine
CN111613244A (en) * 2020-05-20 2020-09-01 北京搜狗科技发展有限公司 Scanning and reading-following processing method and related device
CN111883175B (en) * 2020-06-09 2022-06-07 河北悦舒诚信息科技有限公司 Voiceprint library-based oil station service quality improving method
CN112071331B (en) * 2020-09-18 2023-05-30 平安科技(深圳)有限公司 Voice file restoration method and device, computer equipment and storage medium
CN112767949B (en) * 2021-01-18 2022-04-26 东南大学 Voiceprint recognition system based on binary weight convolutional neural network
CN112767950A (en) * 2021-02-24 2021-05-07 嘉楠明芯(北京)科技有限公司 Voiceprint recognition method and device and computer readable storage medium
CN113129899B (en) * 2021-04-16 2023-01-20 广东电网有限责任公司 Safety operation supervision method, equipment and storage medium
CN114010202B (en) * 2021-09-18 2024-05-14 苏州无双医疗设备有限公司 Method for classifying and distinguishing ventricular rate and supraventricular rate by implantable cardiac rhythm management device
CN113823261A (en) * 2021-10-28 2021-12-21 广州宏途教育网络科技有限公司 Learning interaction system and method based on voice interaction
CN114242044B (en) * 2022-02-25 2022-10-11 腾讯科技(深圳)有限公司 Voice quality evaluation method, voice quality evaluation model training method and device
CN115641852A (en) * 2022-10-18 2023-01-24 中国电信股份有限公司 Voiceprint recognition method and device, electronic equipment and computer readable storage medium

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US20080010065A1 (en) * 2006-06-05 2008-01-10 Harry Bratt Method and apparatus for speaker recognition
CN101315771A (en) * 2008-06-04 2008-12-03 哈尔滨工业大学 Compensation method for different speech coding influence in speaker recognition
US9502038B2 (en) * 2013-01-28 2016-11-22 Tencent Technology (Shenzhen) Company Limited Method and device for voiceprint recognition
US9978065B2 (en) * 2013-06-25 2018-05-22 Visa International Service Association Voice filter system
CN104008751A (en) * 2014-06-18 2014-08-27 周婷婷 Speaker recognition method based on BP neural network
CN105989843A (en) * 2015-01-28 2016-10-05 中兴通讯股份有限公司 Method and device of realizing missing feature reconstruction
CN107610707B (en) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN107039036B (en) * 2017-02-17 2020-06-16 南京邮电大学 High-quality speaker recognition method based on automatic coding depth confidence network

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN101605111A (en) * 2009-06-25 2009-12-16 华为技术有限公司 A kind of method and apparatus of clipping control
CN106847292A (en) * 2017-02-16 2017-06-13 平安科技(深圳)有限公司 Method for recognizing sound-groove and device
CN108091352A (en) * 2017-12-27 2018-05-29 腾讯音乐娱乐科技(深圳)有限公司 A kind of audio file processing method, device and storage medium
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium

Non-Patent Citations (2)

Title
FANHU BIE ET AL.: "Detection and Reconstruction of Clipped Speech for Speaker Recognition", SPEECH COMMUNICATION, 2 July 2015 (2015-07-02), pages 218 - 231, XP055664948 *
LI CHUN-ZHI ET AL.: "Restoration of Clipped Vibration Signal Based on BP Neural Network", 2010 INTERNATIONAL CONFERENCE ON MEASURING TECHNOLOGY AND MECHATRONICS AUTOMATION, 14 March 2010 (2010-03-14), pages 251 - 253, XP031671991 *

Also Published As

Publication number Publication date
CN108899032A (en) 2018-11-27

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 18921788

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN EP: public notification in the EP bulletin as the address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 EP: PCT application non-entry in the European phase

Ref document number: 18921788

Country of ref document: EP

Kind code of ref document: A1