WO2021042537A1 - Voice recognition authentication method and system - Google Patents

Voice recognition authentication method and system Download PDF

Info

Publication number
WO2021042537A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
short
information
speaker
audio information
Prior art date
Application number
PCT/CN2019/117554
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
苏雪琦
彭话易
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021042537A1 publication Critical patent/WO2021042537A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Definitions

  • the embodiments of the present application relate to the field of speech recognition, and in particular, to a speech recognition authentication method, a speech recognition authentication system, a computer device, and a readable storage medium.
  • For example, a home intelligent voice robot carries out received voice commands by recognizing the voices of family members, and a conference recording system records the speeches of participants at a meeting by recognizing their voices.
  • The inventor found that most existing speech recognition systems suffer from problems such as unclear speech recognition and speaker recognition errors. For example, the sound of typing on a keyboard may be treated as valid human speech, so the speech recognition system gives an invalid response, or speaker A's speech may be recorded as speaker B's speech.
  • This application aims to solve the problems of unclear speech recognition, speaker recognition errors, and other causes of low recognition accuracy.
  • To achieve the foregoing objective, an embodiment of the present application provides a voice recognition authentication method, and the method includes:
  • acquiring audio information; preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information; performing voice feature extraction on the voice information; processing the voice features to obtain target voice features that are closer to the speaker; matching the target voice features with the speaker voice features stored in a database; and, according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
  • an embodiment of the present application also provides a voice recognition authentication system, including:
  • a preprocessing module configured to preprocess the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information;
  • the feature extraction module is used to perform voice feature extraction on the voice information
  • the processing module is used to process the voice features to obtain target voice features that are closer to the speaker;
  • the matching module is used to match the target voice feature with the speaker's voice feature stored in the database.
  • the output module is configured to output the matched identity information of the speaker corresponding to the voice feature of the speaker according to the matching result, so as to obtain the speaker corresponding to the voice information.
  • To achieve the foregoing objective, an embodiment of the present application also provides a computer device, which includes a memory, a processor, and computer-readable instructions that are stored on the memory and can run on the processor. When the computer-readable instructions are executed by the processor, the following steps are implemented:
  • acquiring audio information; preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information; performing voice feature extraction on the voice information; processing the voice features to obtain target voice features that are closer to the speaker; matching the target voice features with the speaker voice features stored in a database; and, according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
  • To achieve the foregoing objective, the embodiments of the present application also provide a non-volatile computer-readable storage medium that stores computer-readable instructions, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to execute the following steps:
  • acquiring audio information; preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information; performing voice feature extraction on the voice information; processing the voice features to obtain target voice features that are closer to the speaker; matching the target voice features with the speaker voice features stored in a database; and, according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
  • The speech recognition authentication method, speech recognition authentication system, computer device, and readable storage medium provided by the embodiments of the present application preprocess the acquired audio information to obtain voice information from the audio information according to its short-term energy and spectrum center, extract voice features from the voice information, process the voice features to obtain target voice features closer to the speaker, match the target voice features with the speaker voice features stored in a database, and, according to the matching result, output the matched speaker identity information to obtain the speaker corresponding to the voice information.
  • Through the embodiments of the present application, the accuracy of speech recognition technology can be improved, and the user experience is greatly enhanced.
  • FIG. 1 is a flowchart of the steps of the voice recognition authentication method according to the first embodiment of the application.
  • Figure 2 is a normalized audio feature splicing diagram in Example 1 of this application.
  • FIG. 3 is a diagram of a specific splicing method in Embodiment 1 of the application.
  • FIG. 4 is a schematic diagram of the hardware architecture of the computer device according to the second embodiment of the application.
  • Fig. 5 is a schematic diagram of program modules of a speech recognition authentication system according to the third embodiment of the application.
  • Referring to FIG. 1, there is shown a flowchart of the steps of the voice recognition authentication method according to the first embodiment of the present application. It can be understood that the flowchart in this method embodiment is not used to limit the order in which the steps are executed. It should be noted that this embodiment uses the computer device 2 as the execution subject for the exemplary description. The details are as follows:
  • Step S100: Acquire audio information.
  • In a preferred embodiment, when a meeting is being recorded, the environment contains the speaker's voice, silence, environmental noise, and non-environmental noise, and the voice recognition authentication system acquires all of these sounds, that is, the audio information.
  • It should be noted that non-environmental noise and the speech spoken by the speaker have different short-term energies and spectrum centers.
  • Step S102: Preprocess the audio information to obtain voice information from the audio information according to its short-term energy and spectrum center.
  • Illustratively, after the audio information is acquired, because it includes the speaker's voice information, silence, environmental noise, and non-environmental noise, the audio information needs to be processed to extract the voice information from it.
  • Silence refers to the parts with no vocalization: for example, a speaker thinks and breathes while speaking, and makes no sound while thinking or breathing.
  • The environmental noise includes, but is not limited to, sounds produced by the opening and closing of doors and windows and by objects colliding.
  • The non-environmental noise includes, but is not limited to, coughing, mouse clicks, or keyboard typing. Short-term energy and the spectrum center are two important indicators of audio information in silence detection technology.
  • The short-term energy reflects the strength of the signal energy and can distinguish silence and environmental noise in a segment of audio.
  • The spectrum center can distinguish the non-environmental-noise parts.
  • The short-term energy and the spectrum center are combined to filter the effective audio, that is, the voice information, out of the audio information.
  • In a preferred embodiment, when the audio information is preprocessed to obtain the voice information according to its short-term energy and spectrum center, multi-frame short-term signals are first extracted from the audio information according to a preset rule, where the preset rule includes a preset signal-extraction time interval.
  • Then, the short-term energy and the spectrum center of the multi-frame short-term signals are calculated according to the silence detection algorithm.
  • Then, the short-term energy is compared with a first preset value stored in the database, and the spectrum center is compared with a second preset value stored in the database. When the short-term energy is higher than the first preset value and the spectrum center is higher than the second preset value, the audio information is judged to be voice information, and the voice information is acquired.
  • The short-term energy is calculated as E = Σ_{n=1}^{N} s(n)^2, where E represents the short-term energy, N represents the number of frames of the short-term signal, N ≥ 2, and s(n) represents the signal amplitude of the nth frame of the short-term signal in the time domain.
  • The short-term energy is the sum of the squares of each frame's signal and reflects the strength of the signal energy. When the signal energy is too weak, the signal is judged to be silence or environmental noise.
  • The spectrum center of the audio information is calculated according to the silence detection algorithm as C = (Σ_{k=1}^{K} k·S(k)) / (Σ_{k=1}^{K} S(k)), where C represents the spectrum center, K represents the number of frequencies corresponding to the N frames s(n), K ≥ 2 and is an integer, and S(k) represents the spectral energy distribution obtained by the discrete Fourier transform of s(n) in the frequency domain.
  • It should be noted that the spectrum center is also called the first-order moment of the spectrum. The smaller the value of the spectrum center, the more the spectrum energy is concentrated in the low-frequency range. The spectrum center can be used to remove the non-environmental-noise parts, for example, the sound of coughing, clicking a mouse, or typing on a keyboard.
  • When the short-term energy and the spectrum center are both above their set thresholds, the audio information is effective audio, that is, the speaker's voice information.
  • Most of the environmental noise and non-environmental noise is removed, so the retained voice information is purer and of higher quality, which removes many interference factors from the voice recognition process.
  • In the embodiment of the present application, high-quality voice information is obtained by setting the first preset value and the second preset value higher than conventional values.
  • When the short-term energy is lower than the first preset value, and/or the spectrum center is lower than the second preset value, the audio information is judged to be invalid audio information and is deleted.
  • The invalid audio information includes at least silence, environmental noise, and non-environmental noise.
  • Illustratively, if the short-term energy is lower than the first preset value, this indicates a quiet environment, and the audio information is silence or environmental noise. If the spectrum center is lower than the second preset value, this indicates a non-quiet environment, and the audio information is non-environmental noise.
  • Step S104: Perform voice feature extraction on the voice information.
  • In a preferred embodiment, the voice information is windowed using a Hamming window with a window length of 10 frames (100 milliseconds) and a frame-skip distance of 3 frames (30 milliseconds), and the corresponding voice features are then extracted.
  • the voice features include, but are not limited to, spectrum features, sound quality features, and voiceprint features.
  • The spectrum feature distinguishes different voice data, such as target voice and interfering voice, according to the frequency of sound vibration.
  • The voice quality feature and the voiceprint feature identify the speaker corresponding to the voice data to be tested according to the voiceprint and the timbre of the voice. Since voice distinction serves to separate the target voice from the interfering voice in the voice data, only the spectrum features of the voice information need to be obtained to complete the voice distinction.
  • The frequency spectrum is short for frequency spectral density, and a spectrum feature is a parameter that reflects the frequency spectral density.
  • In a preferred embodiment, the voice information includes multiple single frames of voice data.
  • When voice features are extracted from the voice information, the single-frame voice data is first subjected to a fast Fourier transform to obtain the power spectrum of the voice information, a mel filter bank is then used to reduce the dimensionality of the power spectrum to obtain a mel spectrum, and finally a cepstrum analysis is performed on the mel spectrum to obtain the voice features.
  • Because the human auditory perception system is a complex nonlinear system, the acquired power spectrum cannot represent the nonlinear characteristics of the voice data well, so a mel filter bank is needed to reduce the dimensionality of the spectrum.
  • In this way, the frequency spectrum of the acquired voice data to be tested is closer to the frequencies perceived by the human ear.
  • the Mel filter bank is composed of multiple overlapping triangular bandpass filters, and the triangular bandpass filter carries three frequencies: the lower limit frequency, the cutoff frequency and the center frequency.
  • the center frequencies of these triangular bandpass filters are equidistant on the mel scale.
  • The mel scale increases linearly below 1000 Hz and logarithmically above 1000 Hz.
  • The cepstrum refers to the inverse Fourier transform of the logarithm of a signal's Fourier spectrum. Since the general Fourier spectrum is a complex-valued spectrum, the cepstrum is also called the complex cepstrum.
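As a concrete illustration of this extraction pipeline (FFT, power spectrum, mel filter bank, cepstrum analysis), the following Python sketch may help. It is not taken from the application: the sampling rate, FFT size, filter count, and cepstral dimension are assumed values, and the mel formula mel(f) = 2595·log10(1 + f/700) is the common convention for a scale that is roughly linear below 1000 Hz and logarithmic above it.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sr):
    # Overlapping triangular bandpass filters whose center frequencies
    # are equidistant on the mel scale, as the text describes.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, center, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, center):                 # rising edge
            fbank[i - 1, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):                 # falling edge
            fbank[i - 1, k] = (hi - k) / max(hi - center, 1)
    return fbank

def extract_features(frame, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    # FFT -> power spectrum of the single-frame voice data.
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    # Mel filter bank reduces dimensionality toward human-ear perception.
    mel_spec = mel_filter_bank(n_filters, n_fft, sr) @ power
    # Cepstrum analysis: log, then an inverse transform (DCT-II in practice).
    return dct(np.log(mel_spec + 1e-12), norm='ortho')[:n_ceps]
```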
  • Step S106: Process the voice features to obtain target voice features that are closer to the speaker.
  • In a preferred embodiment, the step of processing the voice features to obtain target voice features closer to the speaker specifically includes: normalizing the voice features using the Z-score standardization method to unify the voice features, where the normalization formula is x* = (x - μ)/σ, in which μ is the mean of the multiple voice information, σ is the standard deviation of the multiple voice information, x is the multiple single-frame voice data, and x* is the voice feature after normalization. Then, the normalized features are spliced to form spliced frames with long overlapping parts. Finally, the spliced frames are input into a neural network for training to obtain the target voice features, so as to reduce the loss of the voice information.
  • As shown in FIG. 2, a Hamming window with a window length of 10 frames and a jump distance of 3 frames is used to splice the normalized features into a 390-dimensional feature, with every 10 frames serving as one splicing unit. Please refer to FIG. 3 for the specific splicing method.
  • Since each frame has 39 dimensions, 10 frames spliced together have 390 dimensions. Because the jump distance is 3 frames, the splicing starts from the first frame and then skips 3 frames, so the next group of frames to be spliced runs from the 4th frame to the 13th frame, and so on.
  • The embodiment of the present application unifies the voice features, makes the data indicators comparable, reduces the distortions caused by singular sample data, helps to comprehensively compare and evaluate the voice features, and improves the voice training effect.
  • The features are spliced to form frames with longer overlapping parts, so as to capture transitional information and reduce the loss of information over a longer duration.
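A minimal sketch of this normalization-and-splicing step follows, assuming 39-dimensional per-frame features and the 10-frame window with 3-frame hop described above; the code is illustrative, not the application's implementation.

```python
import numpy as np

def normalize_and_splice(features, window=10, hop=3):
    # features: (num_frames, 39) array of per-frame voice features
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12
    x_star = (features - mu) / sigma      # Z-score: x* = (x - mu) / sigma
    # Splice overlapping windows: frames 1-10, then 4-13, then 7-16, ...
    # (hop of 3), each flattened to a 10 x 39 = 390-dimensional frame.
    spliced = [x_star[i:i + window].reshape(-1)
               for i in range(0, len(x_star) - window + 1, hop)]
    return (np.stack(spliced) if spliced
            else np.empty((0, window * features.shape[1])))
```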
  • In a preferred embodiment, the target voice features are also input into a pre-trained speaker detection model and an intruder detection model. Then, according to the output results, it is verified whether the voice information is the voice of one of the multiple preset speakers stored in the speaker detection model, and when the voice information is the voice of a preset speaker, the voice information is accepted.
  • That is, after the speaker's voice features are extracted, it is verified whether the voice features belong to one of the preset speakers in the pre-trained speaker detection model, and the speaker is accepted or rejected according to the verification result. If it is recognized that the voice features have been impersonated by an intruder, the speaker's voice information is rejected.
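The application does not define the speaker detection model or the intruder detection model, so the sketch below treats them as hypothetical scoring functions; the accept/reject rule shown is one plausible reading of this verification step, not the application's stated method.

```python
def verify(target_feature, speaker_model, intruder_model, margin=0.0):
    """speaker_model and intruder_model are assumed callables that score
    how well the feature matches a preset speaker vs. an intruder."""
    speaker_score = speaker_model(target_feature)
    intruder_score = intruder_model(target_feature)
    # Accept the utterance only when the speaker model clearly outscores
    # the intruder model; otherwise reject the voice information.
    return speaker_score - intruder_score > margin
```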
  • Step S108: Match the target voice features with the speaker voice features stored in the database.
  • Illustratively, the processed voice features are compared with the speaker voice features stored in the database to obtain the speaker voice features that match them.
  • It should be noted that the speakers' voice features are also collected in advance, and the voice features and the corresponding speakers' identity information are stored in the database.
  • Because the environment is quiet while the speakers' voice features are collected, it is easy to obtain each speaker's voice features and to save them, together with the corresponding speaker's identity information, in the database.
  • Step S110: According to the matching result, output the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
  • Illustratively, if the target voice features match the stored voice features of speaker A, whose identity information is identity information 1, then identity information 1 is output, and speaker A, represented by identity information 1, is obtained.
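A minimal sketch of the matching and output steps (S108-S110) may look as follows. The application does not name a matching metric, so cosine similarity and the acceptance threshold here are assumptions for illustration.

```python
import numpy as np

def match_speaker(target_feature, database, threshold=0.8):
    """database: mapping of identity information -> stored speaker voice feature."""
    best_id, best_score = None, -1.0
    for identity, stored in database.items():
        # Cosine similarity between the target and a stored voice feature.
        score = float(np.dot(target_feature, stored) /
                      (np.linalg.norm(target_feature)
                       * np.linalg.norm(stored) + 1e-12))
        if score > best_score:
            best_id, best_score = identity, score
    # Output the matched identity information only if the match is good enough.
    return best_id if best_score >= threshold else None

# Example: if the target matches speaker A's stored feature under
# "identity information 1", that identity is returned.
```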
  • the accuracy of the speech recognition technology can be improved, and the user experience can be greatly improved.
  • FIG. 4 shows a schematic diagram of the hardware architecture of the computer device according to the second embodiment of the present application.
  • The computer device 2 includes, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus.
  • It should be noted that FIG. 4 only shows the computer device 2 with components 21-23, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead.
  • The memory 21 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and so on.
  • the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2.
  • In other embodiments, the memory may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 2.
  • the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
  • the memory 21 is generally used to store the operating system and various application software installed in the computer device 2, such as the program code of the voice recognition authentication system 20.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 22 is generally used to control the overall operation of the computer device 2.
  • the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the voice recognition authentication system 20 and so on.
  • the network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other electronic devices.
  • the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
  • The network may be the Intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
  • FIG. 5 shows a schematic diagram of the program modules of the voice recognition authentication system according to the third embodiment of the present application.
  • In this embodiment, the speech recognition authentication system 20 may include or be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to complete this application and implement the above-mentioned voice recognition authentication method.
  • The program modules referred to in the embodiments of the present application are a series of computer-readable instruction segments capable of completing specific functions, and are more suitable than the program itself for describing the execution process of the voice recognition authentication system 20 in the storage medium. The following description specifically introduces the functions of each program module in this embodiment:
  • The acquisition module 201 is used to acquire audio information.
  • In a preferred embodiment, when a meeting is being recorded, the environment contains the speaker's voice, silence, environmental noise, and non-environmental noise, and the acquisition module 201 acquires all of these sounds, that is, the audio information.
  • It should be noted that non-environmental noise and the speech spoken by the speaker have different short-term energies and spectrum centers.
  • the preprocessing module 202 is configured to preprocess the audio information, so as to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information.
  • Illustratively, after the audio information is acquired, because it includes the speaker's voice information, silence, environmental noise, and non-environmental noise, the preprocessing module 202 needs to process the audio information to extract the voice information from it.
  • Silence refers to the parts with no vocalization: for example, a speaker thinks and breathes while speaking, and makes no sound while thinking or breathing.
  • the environmental noise includes, but is not limited to, the sound generated by the opening and closing of doors and windows, and the collision of objects.
  • The non-environmental noise includes, but is not limited to, coughing, mouse clicks, or keyboard typing. Short-term energy and the spectrum center are two important indicators of audio information in silence detection technology.
  • The short-term energy reflects the strength of the signal energy and can distinguish silence and environmental noise in a segment of audio, while the spectrum center can distinguish the non-environmental-noise parts.
  • the short-term energy and the center of the spectrum are combined to filter out effective audio, that is, voice information, from the audio information.
  • the preprocessing module 202 is further configured to extract a multi-frame short-term signal from the audio information according to a preset rule, wherein the preset rule includes a preset signal extraction time interval. Then, the short-term energy and the center of the spectrum are calculated according to the silence detection algorithm for the multi-frame short-term signal. Then, the short-term energy is compared with the first preset value stored in the database, and the center of the spectrum is compared with the second preset value stored in the database. When the short-term energy is higher than the first preset value and the frequency spectrum center is higher than the second preset value, determine that the audio information is voice information, and obtain the voice information.
  • The short-term energy is calculated as E = Σ_{n=1}^{N} s(n)^2, where E represents the short-term energy, N represents the number of frames of the short-term signal, N ≥ 2, and s(n) represents the signal amplitude of the nth frame of the short-term signal in the time domain.
  • Illustratively, the preprocessing module 202 extracts multiple frames of short-term signals s(1), s(2), s(3), s(4), ..., s(N) from the audio information at a preset time interval (for example, 0.2 ms), and then calculates the short-term energy of the extracted multi-frame short-term signals to determine the energy intensity of the audio information.
  • the short-term energy is the sum of the squares of the signal of each frame, which reflects the strength of the signal energy. When the signal energy is too weak, it is determined that the signal is silent or environmental noise.
  • The preprocessing module 202 is further configured to obtain the frequencies corresponding to the multi-frame short-term signals and to calculate the spectrum center of the audio information from the frequencies and the multi-frame short-term signals according to the silence detection algorithm, where the spectrum center is calculated as C = (Σ_{k=1}^{K} k·S(k)) / (Σ_{k=1}^{K} S(k)), in which C represents the spectrum center, K represents the number of frequencies corresponding to the N frames s(n), K ≥ 2 and is an integer, and S(k) represents the spectral energy distribution obtained by the discrete Fourier transform of s(n) in the frequency domain.
  • It should be noted that the spectrum center is also called the first-order moment of the spectrum. The smaller the value of the spectrum center, the more the spectrum energy is concentrated in the low-frequency range. The spectrum center can be used to remove the non-environmental-noise parts, for example, the sound of coughing, clicking a mouse, or typing on a keyboard.
  • When the short-term energy and the spectrum center are both above their set thresholds, the audio information is effective audio, that is, the speaker's voice information.
  • Most of the environmental noise and non-environmental noise is removed, so the retained voice information is purer and of higher quality, which removes many interference factors from the voice recognition process.
  • In the embodiment of the present application, high-quality voice information is obtained by setting the first preset value and the second preset value higher than conventional values.
  • The preprocessing module 202 is further configured to judge the audio information to be invalid audio information and delete it when the short-term energy is lower than the first preset value and/or the spectrum center is lower than the second preset value.
  • The invalid audio information includes at least silence, environmental noise, and non-environmental noise.
  • Illustratively, if the short-term energy is lower than the first preset value, this indicates a quiet environment, and the audio information is silence or environmental noise. If the spectrum center is lower than the second preset value, this indicates a non-quiet environment, and the audio information is non-environmental noise.
  • the feature extraction module 203 is configured to perform voice feature extraction on the voice information.
  • In a preferred embodiment, the feature extraction module 203 windows the voice information using a Hamming window with a window length of 10 frames (100 milliseconds) and a frame-skip distance of 3 frames (30 milliseconds), and then extracts the corresponding voice features.
  • the voice features include, but are not limited to, spectrum features, sound quality features, and voiceprint features.
  • The spectrum feature distinguishes different voice data, such as target voice and interfering voice, according to the frequency of sound vibration.
  • The voice quality feature and the voiceprint feature identify the speaker corresponding to the voice data to be tested according to the voiceprint and the timbre of the voice. Since voice distinction serves to separate the target voice from the interfering voice in the voice data, only the spectrum features of the voice information need to be obtained to complete the voice distinction.
  • The frequency spectrum is short for frequency spectral density, and a spectrum feature is a parameter that reflects the frequency spectral density.
  • the voice information includes multiple single frames of voice data
  • The feature extraction module 203 is further configured to first perform a fast Fourier transform on the single-frame voice data to obtain the power spectrum of the voice information,
  • then use a mel filter bank to reduce the dimensionality of the power spectrum to obtain a mel spectrum,
  • and finally perform a cepstrum analysis on the mel spectrum to obtain the voice features.
  • Because the human auditory perception system is a complex nonlinear system, the acquired power spectrum cannot represent the nonlinear characteristics of the voice data well, so a mel filter bank is needed to reduce the dimensionality of the spectrum.
  • In this way, the frequency spectrum of the acquired voice data to be tested is closer to the frequencies perceived by the human ear.
  • the Mel filter bank is composed of multiple overlapping triangular bandpass filters, and the triangular bandpass filter carries three frequencies: the lower limit frequency, the cutoff frequency and the center frequency.
  • the center frequencies of these triangular bandpass filters are equidistant on the mel scale.
  • The mel scale increases linearly below 1000 Hz and logarithmically above 1000 Hz.
  • The cepstrum refers to the inverse Fourier transform of the logarithm of a signal's Fourier spectrum. Since the general Fourier spectrum is a complex-valued spectrum, the cepstrum is also called the complex cepstrum.
  • the processing module 204 is configured to process the voice features to obtain target voice features that are closer to the speaker.
  • the speech recognition authentication system further includes a normalization module 207, a splicing module 208, and a training module 209.
  • The normalization module 207 is configured to use the Z-score standardization method to normalize the voice features in order to unify them, where the normalization formula is x* = (x - μ)/σ, in which μ is the mean of the multiple voice information, σ is the standard deviation of the multiple voice information, x is the multiple single-frame voice data, and x* is the voice feature after normalization.
  • the splicing module 208 is used for splicing the normalized processing result features to form a spliced frame with a long overlapping part.
  • the training module 209 is configured to input the spliced frame into a neural network to train the spliced frame to obtain the target voice feature, so as to reduce the loss of the voice information.
  • As shown in FIG. 2, a Hamming window with a window length of 10 frames and a jump distance of 3 frames is used to splice the normalized features into a 390-dimensional feature, with every 10 frames serving as one splicing unit. Please refer to FIG. 3 for the specific splicing method.
  • Since each frame has 39 dimensions, 10 frames spliced together have 390 dimensions. Because the jump distance is 3 frames, the splicing starts from the first frame and then skips 3 frames, so the next group of frames to be spliced runs from the 4th frame to the 13th frame, and so on.
  • The embodiment of the present application unifies the voice features, makes the data indicators comparable, reduces the distortions caused by singular sample data, helps to comprehensively compare and evaluate the voice features, and improves the voice training effect.
  • The features are spliced to form frames with longer overlapping parts, so as to capture transitional information and reduce the loss of information over a longer duration.
  • In a preferred embodiment, the voice recognition authentication system further includes a voice verification module 210, which is used to input the voice features into a pre-trained speaker detection model and an intruder detection model and, according to the output results, verify whether the voice information is the voice of one of the multiple preset speakers stored in the speaker detection model; when the voice information is the voice of a preset speaker, the voice information is accepted.
  • Illustratively, the voice verification module 210 verifies whether the voice features belong to one of the preset speakers in the pre-trained speaker detection model, and accepts or rejects the speaker according to the verification result. If it is recognized that the voice features have been impersonated by an intruder, the speaker's voice information is rejected.
  • the matching module 205 is configured to match the target voice feature with the speaker's voice feature stored in the database.
  • Illustratively, the matching module 205 compares the processed voice features with the speaker voice features stored in the database to obtain the speaker voice features that match them.
  • It should be noted that the voice recognition authentication system 20 also collects the speakers' voice features in advance and saves the voice features and the corresponding speakers' identity information in a database.
  • Because the environment is quiet while the speakers' voice features are collected, it is easy to obtain each speaker's voice features and to save them, together with the corresponding speaker's identity information, in the database.
  • the output module 206 is configured to output the matched identity information of the speaker corresponding to the voice feature of the speaker according to the matching result, so as to obtain the speaker corresponding to the voice information.
  • Illustratively, if the target voice features match the stored voice features of speaker A, whose identity information is identity information 1, the output module 206 outputs identity information 1 to obtain speaker A, the speaker represented by identity information 1.
  • the accuracy of the speech recognition technology can be improved, and the user experience can be greatly improved.
  • This application also provides a computer device that can execute programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers), and so on.
  • the computer device in this embodiment at least includes, but is not limited to, a memory, a processor, etc. that can be communicatively connected to each other through a system bus.
  • This embodiment also provides a non-volatile computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, servers, App application stores, and so on, on which computer-readable instructions are stored, and the corresponding functions are realized when the instructions are executed by a processor.
  • the non-volatile computer-readable storage medium of this embodiment is used to store the voice recognition authentication system 20, and when executed by a processor, the following steps are implemented:
  • the matched identity information of the speaker corresponding to the voice feature of the speaker is output to obtain the speaker corresponding to the voice information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition authentication method, a voice recognition authentication system, a computer device, and a readable storage medium. The method comprises: acquiring audio information (S100); preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information (S102); performing voice feature extraction on the voice information (S104); processing the voice features to obtain target voice features closer to the speaker (S106); matching the target voice features with the speaker voice features stored in a database (S108); and outputting, according to the matching result, the identity information of the matched speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information (S110). By means of the above method, the accuracy of voice recognition technology can be improved, and the user experience is greatly enhanced.

Description

Speech recognition authentication method and system
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 4, 2019, with application number 201910832042.4 and entitled "Voice Recognition Authentication Method and System", the entire content of which is incorporated in this application by reference.
Technical field
The embodiments of the present application relate to the field of speech recognition, and in particular to a speech recognition authentication method, a speech recognition authentication system, a computer device, and a readable storage medium.
Background
With the maturing of speech recognition technology, speech recognition is applied extremely widely in daily life. For example, a home intelligent voice robot carries out received voice commands by recognizing the voices of family members, and a conference recording system records the speeches of participants at a meeting by recognizing their voices. The inventor found that most existing speech recognition systems suffer from problems such as unclear speech recognition and speaker recognition errors: for example, the sound of typing on a keyboard may be treated as valid human speech, so the speech recognition system gives an invalid response, or speaker A's speech may be recorded as speaker B's speech.
This application aims to solve the problems of unclear speech recognition, speaker recognition errors, and other causes of low recognition accuracy.
Summary of the invention
In view of this, it is necessary to provide a voice recognition authentication method, a voice recognition authentication system, a computer device, and a readable storage medium that can improve the accuracy of voice recognition technology and greatly enhance the user experience.
To achieve the foregoing objective, an embodiment of the present application provides a voice recognition authentication method, and the method includes:
acquiring audio information;
preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information;
performing voice feature extraction on the voice information;
processing the voice features to obtain target voice features that are closer to the speaker;
matching the target voice features with the speaker voice features stored in a database; and
according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
To achieve the foregoing objective, an embodiment of the present application also provides a voice recognition authentication system, including:
an acquisition module for acquiring audio information;
a preprocessing module for preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information;
a feature extraction module for performing voice feature extraction on the voice information;
a processing module for processing the voice features to obtain target voice features that are closer to the speaker;
a matching module for matching the target voice features with the speaker voice features stored in a database; and
an output module for outputting, according to the matching result, the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
To achieve the foregoing objective, an embodiment of the present application also provides a computer device, which includes a memory, a processor, and computer-readable instructions that are stored on the memory and can run on the processor. When the computer-readable instructions are executed by the processor, the following steps are implemented:
acquiring audio information;
preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information;
performing voice feature extraction on the voice information;
processing the voice features to obtain target voice features that are closer to the speaker;
matching the target voice features with the speaker voice features stored in a database; and
according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
To achieve the foregoing objective, the embodiments of the present application also provide a non-volatile computer-readable storage medium that stores computer-readable instructions, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to execute the following steps:
acquiring audio information;
preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectrum center of the audio information;
performing voice feature extraction on the voice information;
processing the voice features to obtain target voice features that are closer to the speaker;
matching the target voice features with the speaker voice features stored in a database; and
according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
The speech recognition authentication method, speech recognition authentication system, computer device, and readable storage medium provided by the embodiments of the present application preprocess the acquired audio information to obtain voice information from the audio information according to its short-term energy and spectrum center, extract voice features from the voice information, process the voice features to obtain target voice features closer to the speaker, match the target voice features with the speaker voice features stored in a database, and, according to the matching result, output the matched speaker identity information to obtain the speaker corresponding to the voice information. Through the embodiments of the present application, the accuracy of speech recognition technology can be improved, and the user experience is greatly enhanced.
Description of the drawings
FIG. 1 is a flowchart of the steps of the voice recognition authentication method according to the first embodiment of the application.
FIG. 2 is a splicing diagram of normalized audio features in the first embodiment of the application.
FIG. 3 is a diagram of the specific splicing method in the first embodiment of the application.
FIG. 4 is a schematic diagram of the hardware architecture of the computer device according to the second embodiment of the application.
FIG. 5 is a schematic diagram of the program modules of the speech recognition authentication system according to the third embodiment of the application.
Detailed description
In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit it. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
It should be noted that the descriptions involving "first", "second", and so on in this application are only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with each other, but only on the basis that they can be realized by a person of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination does not exist and is not within the protection scope claimed by this application.
Example one
Referring to FIG. 1, there is shown a flowchart of the steps of the voice recognition authentication method according to the first embodiment of the present application. It can be understood that the flowchart in this method embodiment is not used to limit the order in which the steps are executed. It should be noted that this embodiment uses the computer device 2 as the execution subject for the exemplary description. The details are as follows:
Step S100: Acquire audio information.
In a preferred embodiment, when a meeting is being recorded, the environment contains the speaker's voice, silence, environmental noise, and non-environmental noise, and the voice recognition authentication system acquires all of these sounds, that is, the audio information.
It should be noted that non-environmental noise and the speech spoken by the speaker have different short-term energies and spectrum centers.
Step S102: Preprocess the audio information to obtain voice information from the audio information according to its short-term energy and spectrum center.
Illustratively, after the audio information is acquired, because it includes the speaker's voice information, silence, environmental noise, and non-environmental noise, the audio information needs to be processed to extract the voice information from it. Silence refers to the parts with no vocalization: for example, a speaker thinks and breathes while speaking, and makes no sound while thinking or breathing. The environmental noise includes, but is not limited to, sounds produced by the opening and closing of doors and windows and by objects colliding. The non-environmental noise includes, but is not limited to, coughing, mouse clicks, or keyboard typing. Short-term energy and the spectrum center are two important indicators of audio information in silence detection technology: the short-term energy reflects the strength of the signal energy and can distinguish silence and environmental noise in a segment of audio, while the spectrum center can distinguish the non-environmental-noise parts. The short-term energy and the spectrum center are combined to filter the effective audio, that is, the voice information, out of the audio information.
In a preferred embodiment, when the audio information is preprocessed to obtain the voice information according to its short-term energy and spectrum center, multi-frame short-term signals are extracted from the audio information according to a preset rule, where the preset rule includes a preset signal-extraction time interval. Then, the short-term energy and the spectrum center of the multi-frame short-term signals are calculated according to the silence detection algorithm. Then, the short-term energy is compared with a first preset value stored in the database, and the spectrum center is compared with a second preset value stored in the database. When the short-term energy is higher than the first preset value and the spectrum center is higher than the second preset value, the audio information is judged to be voice information, and the voice information is acquired.
The short-term energy is calculated as:
E = Σ_{n=1}^{N} s(n)^2
where E represents the short-term energy, N represents the number of frames of the short-term signal, N ≥ 2, and s(n) represents the signal amplitude of the nth frame of the short-term signal in the time domain.
Illustratively, multi-frame short-term signals s(1), s(2), s(3), s(4), ..., s(N) are extracted from the audio information at a preset time interval (for example, 0.2 ms), and the short-term energy of the extracted multi-frame short-term signals is then calculated to determine the energy intensity of the audio information.
It should be noted that the short-term energy is the sum of the squares of each frame's signal and reflects the strength of the signal energy; when the signal energy is too weak, the signal is judged to be silence or environmental noise.
In another preferred embodiment, when the spectrum center is calculated from the multi-frame short-term signals, the frequencies corresponding to the multi-frame short-term signals are also obtained, and the spectrum center of the audio information is calculated from the frequencies and the multi-frame short-term signals according to the silence detection algorithm, where the spectrum center is calculated as:
C = (Σ_{k=1}^{K} k·S(k)) / (Σ_{k=1}^{K} S(k))
where C represents the spectrum center, K represents the number of frequencies corresponding to the N frames s(n), K ≥ 2 and is an integer, and S(k) represents the spectral energy distribution obtained by the discrete Fourier transform of s(n) in the frequency domain.
It should be noted that the spectrum center is also called the first-order moment of the spectrum. The smaller the value of the spectrum center, the more the spectrum energy is concentrated in the low-frequency range. The spectrum center can be used to remove the non-environmental-noise parts, for example, the sound of coughing, clicking a mouse, or typing on a keyboard.
It should be noted that when the short-term energy and the spectrum center are both above their set thresholds, the audio information is effective audio, that is, the speaker's voice information. Most of the environmental noise and non-environmental noise is removed, so the retained voice information is purer and of higher quality, which removes many interference factors from the voice recognition process. In the embodiment of the present application, high-quality voice information is obtained by setting the first preset value and the second preset value higher than conventional values.
In another preferred embodiment, after the spectral centroid is compared with the second preset value stored in the database, when the short-term energy is lower than the first preset value and/or the spectral centroid is lower than the second preset value, the audio information is determined to be invalid audio information, and the audio information is deleted. The invalid audio information includes at least silence, environmental noise, and non-environmental noise.
Exemplarily, if the short-term energy is lower than the first preset value, this represents a quiet environment, and the audio information is silence or environmental noise. If the spectral centroid is lower than the second preset value, this represents a non-quiet environment, and the audio information is non-environmental noise.
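A sketch of the dual-threshold decision described above, reusing the frame extraction sketched earlier; the two preset values are assumed to have been tuned and stored in the database beforehand:

    import numpy as np

    def spectral_centroid(frame):
        # C = sum_k k*S(k) / sum_k S(k), with S(k) the spectral energy distribution
        # obtained from the discrete Fourier transform of the frame.
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        k = np.arange(1, spectrum.size + 1)
        return float(np.sum(k * spectrum) / (np.sum(spectrum) + 1e-12))

    def is_voice(frame, first_preset, second_preset):
        # Valid (voice) audio requires BOTH indicators above their thresholds;
        # anything else is silence, environmental noise, or non-environmental noise.
        energy = float(np.sum(frame ** 2))
        return energy > first_preset and spectral_centroid(frame) > second_preset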
Step S104: Perform voice feature extraction on the voice information.
In a preferred embodiment, the voice information is windowed with a Hamming window with a window length of 10 frames (100 milliseconds) and a frame hop of 3 frames (30 milliseconds), and the corresponding voice features are then extracted.
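As an illustrative sketch of the windowing step, the following applies a Hamming window before spectral analysis; interpreting the window per sample within each analysis block is an assumption, and the block length is illustrative:

    import numpy as np

    def apply_hamming(blocks):
        # Weight each analysis block with a Hamming window to reduce spectral leakage.
        window = np.hamming(blocks.shape[1])
        return blocks * window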
It should be noted that the voice features include, but are not limited to, spectral features, voice quality features, and voiceprint features. Spectral features distinguish different voice data, such as target voice and interfering voice, according to the vibration frequency of the sound. Voice quality features and voiceprint features identify the speaker corresponding to the voice data under test according to the voiceprint and the timbre of the voice. Since voice discrimination is used to distinguish the target voice from the interfering voice in the voice data, only the spectral features of the voice information need to be obtained to complete the voice discrimination. The spectrum is short for frequency spectral density, and a spectral feature is a parameter that reflects the frequency spectral density.
In a preferred embodiment, the voice information includes multiple frames of single-frame voice data. When voice feature extraction is performed on the voice information, a fast Fourier transform is first applied to the single-frame voice data to obtain the power spectrum of the voice information; a mel filter bank is then used to perform dimensionality reduction on the power spectrum to obtain a mel spectrum; and finally a cepstral analysis is performed on the mel spectrum to obtain the voice features.
Exemplarily, since the human auditory perception system behaves like a complex nonlinear system, the obtained power spectrum cannot represent the nonlinear characteristics of the voice data well, so a mel filter bank is used to reduce the dimensionality of the spectrum, making the spectrum of the voice data under test closer to the frequencies perceived by the human ear. The mel filter bank is composed of multiple overlapping triangular band-pass filters, each carrying three frequencies: a lower limit frequency, a cutoff frequency, and a center frequency. The center frequencies of these triangular band-pass filters are equidistant on the mel scale; the mel scale grows linearly below 1000 Hz and logarithmically above 1000 Hz. The cepstrum is the inverse Fourier transform of the logarithm of the Fourier transform spectrum of a signal; since the Fourier spectrum is generally a complex spectrum, the cepstrum is also called the complex cepstrum.
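A sketch of the FFT → mel filter bank → cepstral analysis chain described above; librosa is used here for the triangular mel filters, and the sample rate, FFT size, and filter and coefficient counts are illustrative assumptions rather than values fixed by the disclosure:

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def mfcc_features(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
        # 1) Fast Fourier transform of the single-frame voice data -> power spectrum.
        power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2
        # 2) Mel filter bank (overlapping triangular band-pass filters, equidistant
        #    on the mel scale) reduces the dimensionality of the power spectrum.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        mel_spectrum = mel_fb @ power
        # 3) Cepstral analysis: logarithm followed by an inverse transform; the DCT
        #    is the usual real-valued stand-in for the inverse Fourier step.
        return dct(np.log(mel_spectrum + 1e-12), norm='ortho')[:n_ceps]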
Step S106: Process the voice features to obtain target voice features that are closer to the speaker.
In a preferred embodiment, the step of processing the voice features to obtain target voice features that are closer to the speaker specifically includes: normalizing the voice features using the Z-score standardization method to unify the voice features, where the normalization formula is:
x^{*} = \frac{x - \mu}{\sigma}
where μ is the mean of the multiple pieces of voice information, σ is the standard deviation of the multiple pieces of voice information, x is the multiple frames of single-frame voice data, and x* is the voice features after normalization. The normalized features are then spliced to form spliced frames with long overlapping parts. Finally, the spliced frames are input into a neural network and trained to obtain the target voice features, so as to reduce the loss of the voice information.
Exemplarily, referring to FIG. 2, a Hamming window with a window length of 10 frames and a hop of 3 frames is used to splice the normalized features to form 390-dimensional features. The frames are then spliced with every 10 frames as one splicing unit; see FIG. 3 for the specific splicing method.
It should be noted that since each frame is 39-dimensional, 10 frames spliced together give 390 dimensions. Since the hop is 3 frames, starting from the first frame and hopping 3 steps forward, the next group of frames to be spliced is frames 4 to 13, and so on.
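The normalization and splicing described above can be sketched as follows; the 39-dimensional frames, window of 10 frames, and hop of 3 frames follow the text, while the function and array names are illustrative:

    import numpy as np

    def zscore_normalize(features):
        # x* = (x - mu) / sigma, computed over all frames so the features are unified.
        mu = features.mean(axis=0)
        sigma = features.std(axis=0) + 1e-12
        return (features - mu) / sigma

    def splice_frames(features, window=10, hop=3):
        # Concatenate 10 consecutive 39-dim frames into one 390-dim spliced frame,
        # advancing 3 frames each time: frames 1-10, then 4-13, and so on.
        spliced = [features[i:i + window].reshape(-1)
                   for i in range(0, len(features) - window + 1, hop)]
        return np.stack(spliced)

For 39-dimensional input frames, each spliced frame is 10 × 39 = 390-dimensional, matching the example above.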
By unifying the voice features, the embodiment of the present application makes the data indicators comparable and reduces the varying effects caused by outlier samples, which helps to comprehensively compare and evaluate the voice features and achieves a better voice training effect.
In the embodiment of the present application, the features are spliced to form frames with longer overlapping parts in order to capture transition information and reduce the loss of information over longer durations.
In another preferred embodiment, after the voice features are processed to obtain target voice features that are closer to the speaker, the target voice features are also input into a pre-trained speaker detection model and an intruder detection model. It is then verified, according to the output result, whether the voice information is the voice of one of the multiple preset speakers stored in the speaker detection model; when the voice information is the voice of a preset speaker, the voice information is acquired.
Specifically, after the speaker's voice features are extracted, it is verified whether the voice features belong to one of the preset speakers in the pre-trained speaker detection model, and the speaker is accepted or rejected according to the verification result. If it is recognized that the voice features come from an intruder impersonating a speaker, the speaker's voice information is rejected.
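A minimal sketch of the accept/reject decision described above; the two models and their score() interface are hypothetical stand-ins, since the disclosure does not specify the model type or its API:

    def verify_speaker(target_feature, speaker_model, intruder_model):
        # Accept the speaker only when the speaker detection model assigns the
        # feature a higher score than the intruder detection model (assumed rule).
        return speaker_model.score(target_feature) > intruder_model.score(target_feature)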
Step S108: Match the target voice features with the speaker voice features stored in the database.
Exemplarily, the processed voice features are compared with the speaker voice features stored in the database to obtain the speaker voice features that match the voice features.
In a preferred embodiment, before the extracted voice features are matched with the speaker voice features stored in the database, the speaker's voice features are also collected in advance, and the voice features and the corresponding speaker's identity information are saved in the database.
Specifically, since the environment is quiet during the collection of the speaker's voice features, it is easy to obtain the speaker's voice features and save them, together with the corresponding speaker's identity information, in the database.
Step S110: According to the matching result, output the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
Specifically, when the extracted voice features match the voice features of the speaker identity information 1 stored in the database, the identity information 1 is output, and speaker A represented by the identity information 1 is thereby obtained.
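A matching sketch under stated assumptions: enrolled speaker voice features are stored as vectors keyed by identity information, cosine similarity is used as the comparison measure, and the acceptance threshold is illustrative, since the disclosure does not fix a particular distance measure:

    import numpy as np

    def match_speaker(target, enrolled, threshold=0.8):
        # enrolled: dict mapping identity information -> stored speaker feature vector.
        best_id, best_score = None, -1.0
        for identity, feature in enrolled.items():
            score = float(np.dot(target, feature) /
                          (np.linalg.norm(target) * np.linalg.norm(feature) + 1e-12))
            if score > best_score:
                best_id, best_score = identity, score
        # Output the matched identity only when the similarity is high enough,
        # so an unmatched voice (e.g. an intruder) is rejected.
        return best_id if best_score >= threshold else None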
Through the embodiments of the present application, the accuracy of speech recognition technology can be improved, and the user experience can be greatly improved.
Embodiment 2
Referring to FIG. 2, a schematic diagram of the hardware architecture of the computer device according to Embodiment 2 of the present application is shown. The computer device 2 includes, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus. FIG. 2 only shows the computer device 2 with components 21-23, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead.
The memory 21 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 2. Of course, the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and various application software installed on the computer device 2, such as the program code of the voice recognition authentication system 20. In addition, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example, to run the voice recognition authentication system 20.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.
Embodiment 3
Referring to FIG. 3, a schematic diagram of the program modules of the voice recognition authentication system according to Embodiment 3 of the present application is shown. In this embodiment, the voice recognition authentication system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the present application and implement the above voice recognition authentication method. The program module referred to in the embodiments of the present application is a series of computer-readable instruction segments capable of completing specific functions, and is more suitable than the program itself for describing the execution process of the voice recognition authentication system 20 in the storage medium. The following description specifically introduces the functions of each program module of this embodiment:
The acquisition module 201 is used to acquire audio information.
In a preferred embodiment, when a meeting is being recorded, the environment contains the speaker's speech, silence, environmental noise, and non-environmental noise, and the acquisition module 201 acquires these sounds, that is, the audio information.
It should be noted that non-environmental noise and the speaker's speech have different short-term energies and spectral centroids.
The preprocessing module 202 is used to preprocess the audio information to obtain voice information from the audio information according to the short-term energy and spectral centroid of the audio information.
Exemplarily, after the acquisition module 201 acquires the audio information, since the audio information includes the speaker's voice information, silence, environmental noise, and non-environmental noise, the preprocessing module 202 needs to process the audio information to obtain the voice information from it. Silence refers to the parts without utterance: for example, the speaker thinks and breathes while speaking, and makes no sound while thinking and breathing. Environmental noise includes, but is not limited to, sounds such as the opening and closing of doors and windows and the collision of objects. Non-environmental noise includes, but is not limited to, coughing, mouse clicks, or keyboard typing. The short-term energy and the spectral centroid are two important indicators of audio information in silence detection technology: the short-term energy reflects the strength of the signal energy and can distinguish silence and environmental noise within a segment of audio, while the spectral centroid can distinguish parts of the non-environmental noise. By combining the short-term energy and the spectral centroid, valid audio, that is, voice information, is filtered out of the audio information.
In a preferred embodiment, the preprocessing module 202 is further used to extract multiple frames of short-term signals from the audio information according to a preset rule, where the preset rule includes a preset signal extraction time interval; calculate the short-term energy and the spectral centroid from the multi-frame short-term signals according to the silence detection algorithm; compare the short-term energy with the first preset value stored in the database; and compare the spectral centroid with the second preset value stored in the database. When the short-term energy is higher than the first preset value and the spectral centroid is higher than the second preset value, the audio information is determined to be voice information, and the voice information is acquired.
The short-term energy is calculated as:
E = \sum_{n=1}^{N} s(n)^{2}
where E is the short-term energy, N is the number of frames of the short-term signal, N ≥ 2 and is an integer, and s(n) is the signal amplitude of the n-th frame of the short-term signal in the time domain.
Exemplarily, the preprocessing module 202 extracts multiple frames of short-term signals s(1), s(2), s(3), s(4), ..., s(N) from the audio information at a preset time interval (for example, 0.2 ms), and then calculates the short-term energy of the extracted multi-frame short-term signals to determine the energy strength of the audio information.
It should be noted that the short-term energy is the sum of the squares of the signal amplitudes over the frames and reflects the strength of the signal energy; when the signal energy is too weak, the signal is determined to be silence or environmental noise.
In another preferred embodiment, the preprocessing module 202 is further used to respectively obtain the frequencies corresponding to the multi-frame short-term signals, and to calculate the spectral centroid of the audio information from these frequencies and the multi-frame short-term signals according to the silence detection algorithm, where the spectral centroid is calculated as:
C = \frac{\sum_{k=1}^{K} k\,S(k)}{\sum_{k=1}^{K} S(k)}
where C is the spectral centroid, K is the number of frequency bins corresponding to the N frames s(n), K ≥ 2 and is an integer, and S(k) is the spectral energy distribution obtained by the discrete Fourier transform of s(n) in the frequency domain.
It should be noted that the spectral centroid is also called the first-order moment of the spectrum. The smaller the value of the spectral centroid, the more spectral energy is concentrated in the low-frequency range; the spectral centroid can therefore be used to remove non-environmental noise such as coughing, mouse clicks, or keyboard typing.
It should be noted that when the short-term energy and the spectral centroid are both above their set thresholds, the audio information is valid audio, that is, the voice information of the speaker. Most of the environmental noise and non-environmental noise is removed, so that the retained voice information is purer and of higher quality, which removes a large number of interference factors from the speech recognition process. In the embodiment of the present application, high-quality voice information is obtained by setting the first preset value and the second preset value higher than conventional values.
In another preferred embodiment, the preprocessing module 202 is further used to determine, when the short-term energy is lower than the first preset value and/or the spectral centroid is lower than the second preset value, that the audio information is invalid audio information, and to delete the audio information. The invalid audio information includes at least silence, environmental noise, and non-environmental noise.
Exemplarily, if the short-term energy is lower than the first preset value, this represents a quiet environment, and the audio information is silence or environmental noise. If the spectral centroid is lower than the second preset value, this represents a non-quiet environment, and the audio information is non-environmental noise.
The feature extraction module 203 is used to perform voice feature extraction on the voice information.
In a preferred embodiment, the feature extraction module 203 windows the voice information with a Hamming window with a window length of 10 frames (100 milliseconds) and a frame hop of 3 frames (30 milliseconds), and then extracts the corresponding voice features.
It should be noted that the voice features include, but are not limited to, spectral features, voice quality features, and voiceprint features. Spectral features distinguish different voice data, such as target voice and interfering voice, according to the vibration frequency of the sound. Voice quality features and voiceprint features identify the speaker corresponding to the voice data under test according to the voiceprint and the timbre of the voice. Since voice discrimination is used to distinguish the target voice from the interfering voice in the voice data, only the spectral features of the voice information need to be obtained to complete the voice discrimination. The spectrum is short for frequency spectral density, and a spectral feature is a parameter that reflects the frequency spectral density.
In a preferred embodiment, the voice information includes multiple frames of single-frame voice data, and the feature extraction module 203 is further used to first apply a fast Fourier transform to the single-frame voice data to obtain the power spectrum of the voice information, then use a mel filter bank to perform dimensionality reduction on the power spectrum to obtain a mel spectrum, and finally perform a cepstral analysis on the mel spectrum to obtain the voice features.
Exemplarily, since the human auditory perception system behaves like a complex nonlinear system, the obtained power spectrum cannot represent the nonlinear characteristics of the voice data well, so a mel filter bank is used to reduce the dimensionality of the spectrum, making the spectrum of the voice data under test closer to the frequencies perceived by the human ear. The mel filter bank is composed of multiple overlapping triangular band-pass filters, each carrying three frequencies: a lower limit frequency, a cutoff frequency, and a center frequency. The center frequencies of these triangular band-pass filters are equidistant on the mel scale; the mel scale grows linearly below 1000 Hz and logarithmically above 1000 Hz. The cepstrum is the inverse Fourier transform of the logarithm of the Fourier transform spectrum of a signal; since the Fourier spectrum is generally a complex spectrum, the cepstrum is also called the complex cepstrum.
The processing module 204 is used to process the voice features to obtain target voice features that are closer to the speaker.
在另一较佳实施例中,所述语音识别认证***还包括归一化模块207、拼接模块208及训练模块209。所述归一化模块207用于利用Z-score标准化方法对所述语音特征进行归一化处理,以将所述语音特征进行统一,其中,所述归一化处理的公式为:
Figure PCTCN2019117554-appb-000006
μ为多个所述语音信息的均值,σ为多个所述语音信息的标准差,x为所述多个单帧语音数据,x*为归一化处理之后的所述语音特征。所述拼接模块208用于将归一化处理结果特征进行拼接,以形成重叠部分长的拼接帧。所述训练模块209用于将所述拼接帧输入至神经网络中,以对所述拼接帧进行训练,以获取所述目标语音特征,以降低所述语音信息的损失。
In another preferred embodiment, the speech recognition authentication system further includes a normalization module 207, a splicing module 208, and a training module 209. The normalization module 207 is configured to use the Z-score standardization method to perform normalization processing on the voice features to unify the voice features, wherein the normalization processing formula is:
Figure PCTCN2019117554-appb-000006
μ is the mean value of the multiple voice information, σ is the standard deviation of the multiple voice information, x is the multiple single-frame voice data, and x* is the voice feature after normalization processing. The splicing module 208 is used for splicing the normalized processing result features to form a spliced frame with a long overlapping part. The training module 209 is configured to input the spliced frame into a neural network to train the spliced frame to obtain the target voice feature, so as to reduce the loss of the voice information.
Exemplarily, referring to FIG. 2, a Hamming window with a window length of 10 frames and a hop of 3 frames is used to splice the normalized features to form 390-dimensional features. The frames are then spliced with every 10 frames as one splicing unit; see FIG. 3 for the specific splicing method.
It should be noted that since each frame is 39-dimensional, 10 frames spliced together give 390 dimensions. Since the hop is 3 frames, starting from the first frame and hopping 3 steps forward, the next group of frames to be spliced is frames 4 to 13, and so on.
By unifying the voice features, the embodiment of the present application makes the data indicators comparable and reduces the varying effects caused by outlier samples, which helps to comprehensively compare and evaluate the voice features and achieves a better voice training effect. The features are spliced to form frames with longer overlapping parts in order to capture transition information and reduce the loss of information over longer durations.
在另一较佳实施例中,所述语音识别认证***还包括语音验证模块210,用于将所述语音特征输入至预先训练的说话人检测模型及入侵者检测模型中,并根据输出结果验证所述语音信息是否为所述说话人检测模型中保存的多个预设说话人中一个预设说话人的语音,当所述语音信息为所述预设说话人的语音时,则获取所述语音信息。In another preferred embodiment, the voice recognition authentication system further includes a voice verification module 210, which is used to input the voice features into a pre-trained speaker detection model and an intruder detection model, and verify according to the output result Whether the voice information is the voice of one preset speaker among a plurality of preset speakers stored in the speaker detection model, and when the voice information is the voice of the preset speaker, the voice message.
具体地,所述语音验证模块210通过验证该语音特征是否是预先训练的说话人检测模型中的预设说话人中的其中一个,并根据验证结果选择接受或拒绝该说话人。若识别出该语音特征被入侵者冒名顶替,则拒绝该说话人的语音信息。Specifically, the voice verification module 210 verifies whether the voice feature is one of the preset speakers in the pre-trained speaker detection model, and chooses to accept or reject the speaker according to the verification result. If it is recognized that the voice feature has been imposted by the intruder, the voice information of the speaker is rejected.
The matching module 205 is used to match the target voice features with the speaker voice features stored in the database.
Exemplarily, the matching module 205 compares the processed voice features with the speaker voice features stored in the database to obtain the speaker voice features that match the voice features.
在一较佳实施例中,所述语音识别认证***20还预先采集所述说话人的语音特征,并将所述语音特征及对应的说话人的身份信息保存于数据库中。In a preferred embodiment, the voice recognition and authentication system 20 also collects the speaker's voice features in advance, and saves the voice features and the corresponding speaker's identity information in a database.
具体地,由于在说话人语音特征采集的过程中,环境为安静环境,故很容易获取所述说话人的语音特征,并将该语音特征及对应的说话人的身份信息保存于数据库中。Specifically, since the environment is a quiet environment during the process of collecting the speaker's voice features, it is easy to obtain the speaker's voice feature, and save the voice feature and the corresponding speaker's identity information in the database.
The output module 206 is used to output, according to the matching result, the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
Specifically, when the voice features extracted by the feature extraction module 203 match the voice features of the speaker identity information 1 stored in the database, the output module 206 outputs the identity information 1, and speaker A represented by the identity information 1 is thereby obtained.
Through the embodiments of the present application, the accuracy of speech recognition technology can be improved, and the user experience can be greatly improved.
The present application further provides a computer device, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers) capable of executing programs. The computer device of this embodiment at least includes, but is not limited to, a memory and a processor that can be communicatively connected to each other through a system bus.
This embodiment further provides a non-volatile computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, and the like, on which computer-readable instructions are stored; the corresponding functions are implemented when the program is executed by a processor. The non-volatile computer-readable storage medium of this embodiment is used to store the voice recognition authentication system 20, and when executed by a processor, the following steps are implemented:
acquiring audio information;
preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectral centroid of the audio information;
performing voice feature extraction on the voice information;
processing the voice features to obtain target voice features that are closer to the speaker;
matching the target voice features with the speaker voice features stored in a database; and
according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. A voice recognition authentication method, comprising:
    acquiring audio information;
    preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectral centroid of the audio information;
    performing voice feature extraction on the voice information;
    processing the voice features to obtain target voice features that are closer to the speaker;
    matching the target voice features with speaker voice features stored in a database; and
    according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
  2. The voice recognition authentication method according to claim 1, wherein the step of preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectral centroid of the audio information comprises:
    extracting multiple frames of short-term signals from the audio information according to a preset rule, wherein the preset rule includes a preset signal extraction time interval;
    calculating the short-term energy from the multi-frame short-term signals according to a silence detection algorithm;
    calculating the spectral centroid from the multi-frame short-term signals;
    comparing the short-term energy with a first preset value stored in a database;
    comparing the spectral centroid with a second preset value stored in the database;
    when the short-term energy is higher than the first preset value and the spectral centroid is higher than the second preset value, determining that the audio information is voice information; and
    acquiring the voice information.
  3. The voice recognition authentication method according to claim 2, wherein the short-term energy is calculated as:
    E = \sum_{n=1}^{N} s(n)^{2}
    where E is the short-term energy, N is the number of frames of the short-term signal, N ≥ 2 and is an integer, and s(n) is the signal amplitude of the n-th frame of the short-term signal in the time domain.
  4. The voice recognition authentication method according to claim 2, wherein the step of calculating the spectral centroid from the multi-frame short-term signals comprises:
    respectively obtaining the frequencies corresponding to the multi-frame short-term signals; and
    calculating the spectral centroid of the audio information from the frequencies and the multi-frame short-term signals according to the silence detection algorithm, wherein the spectral centroid is calculated as:
    C = \frac{\sum_{k=1}^{K} k\,S(k)}{\sum_{k=1}^{K} S(k)}
    where C is the spectral centroid, K is the number of frequency bins corresponding to the N frames s(n), K ≥ 2 and is an integer, and S(k) is the spectral energy distribution obtained by the discrete Fourier transform of s(n) in the frequency domain.
  5. The voice recognition authentication method according to claim 2, wherein after the step of comparing the spectral centroid with the second preset value stored in the database, the method further comprises:
    when the short-term energy is lower than the first preset value and/or the spectral centroid is lower than the second preset value, determining that the audio information is invalid audio information, wherein the invalid audio information includes at least silence, environmental noise, and non-environmental noise; and
    deleting the audio information.
  6. The voice recognition authentication method according to claim 1, wherein the step of processing the voice features to obtain target voice features that are closer to the speaker comprises:
    normalizing the voice features using the Z-score standardization method to unify the voice features, wherein the normalization formula is:
    x^{*} = \frac{x - \mu}{\sigma}
    where μ is the mean of the multiple pieces of voice information, σ is the standard deviation of the multiple pieces of voice information, x is the multiple frames of single-frame voice data, and x* is the voice features after normalization;
    splicing the normalized features to form spliced frames with long overlapping parts; and
    inputting the spliced frames into a neural network to train the spliced frames to obtain the target voice features.
  7. The voice recognition authentication method according to claim 1, wherein after the step of processing the voice features to obtain target voice features that are closer to the speaker, the method further comprises:
    inputting the target voice features into a pre-trained speaker detection model and an intruder detection model;
    verifying, according to the output result, whether the voice information is the voice of one of the multiple preset speakers stored in the speaker detection model; and
    when the voice information is the voice of the preset speaker, acquiring the voice information.
  8. 一种语音识别认证***,包括:A speech recognition authentication system includes:
    获取模块,用于获取音频信息;Obtaining module for obtaining audio information;
    预处理模块,用于对所述音频信息进行预处理,以根据所述音频信息的短时能量与频 谱中心从所述音频信息中获取语音信息;A preprocessing module, configured to preprocess the audio information to obtain voice information from the audio information according to the short-term energy of the audio information and the center of the frequency spectrum;
    特征提取模块,用于将所述语音信息进行语音特征提取;The feature extraction module is used to perform voice feature extraction on the voice information;
    处理模块,用于将所述语音特征进行处理,以获取与说话人更为接近的目标语音特征;The processing module is used to process the voice features to obtain target voice features that are closer to the speaker;
    匹配模块,用于将所述目标语音特征与数据库中存储的说话人语音特征进行匹配;及The matching module is used to match the target voice feature with the speaker's voice feature stored in the database; and
    输出模块,用于根据匹配结果,将匹配出的与所述说话人语音特征对应的所述说话人的身份信息输出,以获取与所述语音信息对应的所述说话人。The output module is configured to output the matched identity information of the speaker corresponding to the voice feature of the speaker according to the matching result, so as to obtain the speaker corresponding to the voice information.
  9. 如权利要求8所述的语音识别认证***,所述预处理模块还用于:According to the speech recognition authentication system of claim 8, the preprocessing module is further used for:
    按照预设规则从所述音频信息中提取多帧短时信号,其中,所述预设规则包括预设信号提取时间间隔;及Extracting multiple frames of short-term signals from the audio information according to a preset rule, wherein the preset rule includes a preset signal extraction time interval; and
    将所述多帧短时信号按照静音检测算法计算所述短时能量;Calculating the short-term energy of the multi-frame short-term signal according to a silence detection algorithm;
    根据所述多帧短时信号计算所述频谱中心;Calculating the frequency spectrum center according to the multi-frame short-term signal;
    将所述短时能量与数据库中存储的第一预设值进行比较;Comparing the short-term energy with the first preset value stored in the database;
    将所述频谱中心与所述数据库中存储的第二预设值进行比较;Comparing the frequency spectrum center with a second preset value stored in the database;
    当所述短时能量高于所述第一预设值,且所述频谱中心高于所述第二预设值时,判定所述音频信息为语音信息;及When the short-term energy is higher than the first preset value and the center of the spectrum is higher than the second preset value, determining that the audio information is voice information; and
    获取所述语音信息。Obtain the voice information.
  10. 如权利要求9所述的语音识别认证***,所述短时能量的计算公式为:9. The speech recognition and authentication system according to claim 9, wherein the calculation formula of the short-term energy is:
    Figure PCTCN2019117554-appb-100004
    Figure PCTCN2019117554-appb-100004
    其中,E表示所述短时能量,N表示短时信号的帧数,N≥2,且为整数,s(n)表示时域上第n帧短时信号的信号幅度。Wherein, E represents the short-term energy, N represents the number of frames of the short-term signal, N≥2, and is an integer, and s(n) represents the signal amplitude of the nth frame of the short-term signal in the time domain.
  11. 如权利要求9所述的语音识别认证***,所述预处理模块还用于:According to the speech recognition authentication system according to claim 9, the preprocessing module is further used for:
    分别获取与所述多帧短时信号对应的频率;及Respectively acquiring the frequencies corresponding to the multi-frame short-term signals; and
    根据所述频率及所述多帧短时信号,按照所述静音检测算法计算所述音频信息的频谱中心,其中,所述频谱中心的计算公式为:According to the frequency and the multi-frame short-term signal, the frequency spectrum center of the audio information is calculated according to the silence detection algorithm, wherein the calculation formula of the frequency spectrum center is:
    Figure PCTCN2019117554-appb-100005
    Figure PCTCN2019117554-appb-100005
    其中,C表示所述频谱中心,K表示分别与N帧s(n)对应的频率个数,K≥2,且为整数,S(k)表示频域上与所述s(n)对应的离散傅里叶变换得到的频谱能量分布。Wherein, C represents the center of the spectrum, K represents the number of frequencies corresponding to N frames s(n), K≥2, and is an integer, S(k) represents the frequency domain corresponding to the s(n) The spectral energy distribution obtained by the discrete Fourier transform.
  12. 如权利要求9所述的语音识别认证***,所述预处理模块还用于:According to the speech recognition authentication system according to claim 9, the preprocessing module is further used for:
    当所述短时能量低于所述第一预设值,和/或所述频谱中心低于所述第二预设值时,判定所述音频信息为无效音频信息,其中,所述无效音频信息至少包括:静音、环境噪声、及非环境噪声;及When the short-term energy is lower than the first preset value, and/or the center of the spectrum is lower than the second preset value, it is determined that the audio information is invalid audio information, wherein the invalid audio The information includes at least: silence, environmental noise, and non-environmental noise; and
    将所述音频信息删除。Delete the audio information.
  13. 如权利要求8所述的语音识别认证***,所述语音识别认证***还包括:9. The voice recognition and authentication system according to claim 8, the voice recognition and authentication system further comprising:
    归一化模块,用于利用Z-score标准化方法对所述语音特征进行归一化处理,以将所述语音特征进行统一,其中,所述归一化处理的公式为:
    Figure PCTCN2019117554-appb-100006
    μ为多个所述语音信息的均值,σ为多个所述语音信息的标准差,x为所述多个单帧语音数据,x*为归一化处理之后的所述语音特征;
    The normalization module is configured to use the Z-score standardization method to perform normalization processing on the voice features to unify the voice features, wherein the normalization processing formula is:
    Figure PCTCN2019117554-appb-100006
    μ is the mean value of the multiple voice information, σ is the standard deviation of the multiple voice information, x is the multiple single-frame voice data, and x* is the voice feature after normalization processing;
    拼接模块,用于将归一化处理结果特征进行拼接,以形成重叠部分长的拼接帧;及The splicing module is used to splice the normalized processing result features to form a spliced frame with a long overlapping part; and
    训练模块,用于将所述拼接帧输入至神经网络中,以对所述拼接帧进行训练,以获取所述目标语音特征。The training module is used to input the spliced frame into a neural network to train the spliced frame to obtain the target voice feature.
  14. 如权利要求8所述的语音识别认证***,还包括语音验证模块,用于:The voice recognition authentication system according to claim 8, further comprising a voice verification module for:
    将所述目标语音特征输入至预先训练的说话人检测模型及入侵者检测模型中;Input the target voice feature into a pre-trained speaker detection model and an intruder detection model;
    根据输出结果验证所述语音信息是否为所述说话人检测模型中保存的多个预设说话人中一个预设说话人的语音;及Verifying whether the voice information is the voice of one preset speaker among a plurality of preset speakers stored in the speaker detection model according to the output result; and
    当所述语音信息为所述预设说话人的语音时,则获取所述语音信息。When the voice information is the voice of the preset speaker, the voice information is acquired.
  15. A computer device, comprising a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, wherein when the computer-readable instructions are executed by the processor, the following steps are implemented:
    acquiring audio information;
    preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectral centroid of the audio information;
    performing voice feature extraction on the voice information;
    processing the voice features to obtain target voice features that are closer to the speaker;
    matching the target voice features with speaker voice features stored in a database; and
    according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
  16. The computer device according to claim 15, wherein when the computer-readable instructions are executed by the processor, the following steps are further implemented:
    extracting multiple frames of short-term signals from the audio information according to a preset rule, wherein the preset rule includes a preset signal extraction time interval;
    calculating the short-term energy from the multi-frame short-term signals according to a silence detection algorithm;
    calculating the spectral centroid from the multi-frame short-term signals;
    comparing the short-term energy with a first preset value stored in a database;
    comparing the spectral centroid with a second preset value stored in the database;
    when the short-term energy is higher than the first preset value and the spectral centroid is higher than the second preset value, determining that the audio information is voice information; and
    acquiring the voice information.
  17. The computer device according to claim 16, wherein the short-term energy is calculated as:
    E = \sum_{n=1}^{N} s(n)^{2}
    where E is the short-term energy, N is the number of frames of the short-term signal, N ≥ 2 and is an integer, and s(n) is the signal amplitude of the n-th frame of the short-term signal in the time domain.
  18. The computer device according to claim 16, wherein when the computer-readable instructions are executed by the processor, the following steps are further implemented:
    respectively obtaining the frequencies corresponding to the multi-frame short-term signals; and
    calculating the spectral centroid of the audio information from the frequencies and the multi-frame short-term signals according to the silence detection algorithm, wherein the spectral centroid is calculated as:
    C = \frac{\sum_{k=1}^{K} k\,S(k)}{\sum_{k=1}^{K} S(k)}
    where C is the spectral centroid, K is the number of frequency bins corresponding to the N frames s(n), K ≥ 2 and is an integer, and S(k) is the spectral energy distribution obtained by the discrete Fourier transform of s(n) in the frequency domain.
  19. The computer device according to claim 16, wherein when the computer-readable instructions are executed by the processor, the following steps are further implemented:
    when the short-term energy is lower than the first preset value and/or the spectral centroid is lower than the second preset value, determining that the audio information is invalid audio information, wherein the invalid audio information includes at least silence, environmental noise, and non-environmental noise; and
    deleting the audio information.
  20. A non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions are executable by at least one processor to cause the at least one processor to perform the following steps:
    acquiring audio information;
    preprocessing the audio information to obtain voice information from the audio information according to the short-term energy and spectral centroid of the audio information;
    performing voice feature extraction on the voice information;
    processing the voice features to obtain target voice features that are closer to the speaker;
    matching the target voice features with speaker voice features stored in a database; and
    according to the matching result, outputting the matched identity information of the speaker corresponding to the speaker voice features, so as to obtain the speaker corresponding to the voice information.
PCT/CN2019/117554 2019-09-04 2019-11-12 Voice recognition authentication method and system WO2021042537A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910832042.4A CN110473552A (en) 2019-09-04 2019-09-04 Speech recognition authentication method and system
CN201910832042.4 2019-09-04

Publications (1)

Publication Number Publication Date
WO2021042537A1

Family

ID=68514996

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117554 WO2021042537A1 (en) 2019-09-04 2019-11-12 Voice recognition authentication method and system

Country Status (2)

Country Link
CN (1) CN110473552A (en)
WO (1) WO2021042537A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053695A (en) * 2020-09-11 2020-12-08 北京三快在线科技有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN112348527A (en) * 2020-11-17 2021-02-09 上海桂垚信息科技有限公司 Identity authentication method in bank transaction system based on voice recognition
CN112927680B (en) * 2021-02-10 2022-06-17 中国工商银行股份有限公司 Voiceprint effective voice recognition method and device based on telephone channel
CN113879931B (en) * 2021-09-13 2023-04-28 厦门市特种设备检验检测院 Elevator safety monitoring method
CN113716246A (en) * 2021-09-16 2021-11-30 安徽世绿环保科技有限公司 Resident rubbish throwing traceability system
CN114697759B (en) * 2022-04-25 2024-04-09 中国平安人寿保险股份有限公司 Virtual image video generation method and system, electronic device and storage medium
CN115214541B (en) * 2022-08-10 2024-01-09 海南小鹏汽车科技有限公司 Vehicle control method, vehicle, and computer-readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103705333A (en) * 2013-08-30 2014-04-09 李峰 Method and device for intelligently stopping snoring
CN104538036A (en) * 2015-01-20 2015-04-22 浙江大学 Speaker recognition method based on semantic cell mixing model
US10535000B2 (en) * 2016-08-08 2020-01-14 Interactive Intelligence Group, Inc. System and method for speaker change detection
CN106356052B (en) * 2016-10-17 2019-03-15 腾讯科技(深圳)有限公司 Phoneme synthesizing method and device
CN106782564B (en) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 Method and apparatus for handling voice data

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1662956A (en) * 2002-06-19 2005-08-31 皇家飞利浦电子股份有限公司 Mega speaker identification (ID) system and corresponding methods therefor
JP4392805B2 (en) * 2008-04-28 2010-01-06 Kddi株式会社 Audio information classification device
CN102820033A (en) * 2012-08-17 2012-12-12 南京大学 Voiceprint identification method
CN104078039A (en) * 2013-03-27 2014-10-01 广东工业大学 Voice recognition system of domestic service robot on basis of hidden Markov model
CN106782565A (en) * 2016-11-29 2017-05-31 重庆重智机器人研究院有限公司 A kind of vocal print feature recognition methods and system
CN108877775A (en) * 2018-06-04 2018-11-23 平安科技(深圳)有限公司 Voice data processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110473552A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
WO2021042537A1 (en) Voice recognition authentication method and system
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
EP2763134B1 (en) Method and apparatus for voice recognition
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
US20120143608A1 (en) Audio signal source verification system
CN110880329B (en) Audio identification method and equipment and storage medium
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
WO2021179717A1 (en) Speech recognition front-end processing method and apparatus, and terminal device
CN109599117A (en) A kind of audio data recognition methods and human voice anti-replay identifying system
US20230401338A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN113823293B (en) Speaker recognition method and system based on voice enhancement
CN113035202B (en) Identity recognition method and device
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
CN113223536A (en) Voiceprint recognition method and device and terminal equipment
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN113112992B (en) Voice recognition method and device, storage medium and server
CN112216285B (en) Multi-user session detection method, system, mobile terminal and storage medium
CN114171032A (en) Cross-channel voiceprint model training method, recognition method, device and readable medium
CN113838469A (en) Identity recognition method, system and storage medium
Komlen et al. Text independent speaker recognition using LBG vector quantization
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
Chakraborty et al. An improved approach to open set text-independent speaker identification (OSTI-SI)
CN117153185B (en) Call processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943879

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19943879

Country of ref document: EP

Kind code of ref document: A1