CN113593540B - Voice processing method, device and equipment - Google Patents

Voice processing method, device and equipment

Info

Publication number
CN113593540B
Authority
CN
China
Prior art keywords
voice
voice signal
signal
preset
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110867725.0A
Other languages
Chinese (zh)
Other versions
CN113593540A (en)
Inventor
陈小强
蒲胤华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Semiconductor Chengdu Co Ltd
Original Assignee
Spreadtrum Semiconductor Chengdu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Semiconductor Chengdu Co Ltd filed Critical Spreadtrum Semiconductor Chengdu Co Ltd
Priority to CN202110867725.0A priority Critical patent/CN113593540B/en
Publication of CN113593540A publication Critical patent/CN113593540A/en
Application granted granted Critical
Publication of CN113593540B publication Critical patent/CN113593540B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An embodiment of the present application provides a voice processing method, apparatus, and device, applied to a voice system that includes a microphone and a speaker. The method comprises: acquiring a first voice signal collected by the microphone within a preset time period, where the first voice signal includes a user voice signal and the voice signal played by the speaker during that period; acquiring, from a buffer, a second voice signal for the same preset time period; determining, according to the first voice signal and the second voice signal, the time delay with which the speaker plays the voice signal in the buffer; performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal; and processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal. This improves the accuracy of voice processing.

Description

Voice processing method, device and equipment
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method, an apparatus, and a device for processing speech.
Background
A user can control vehicle-mounted devices in a vehicle through voice commands. However, while the user issues a voice command, a multimedia device in the vehicle may be playing multimedia audio such as music or radio, so the user's voice command cannot be recognized accurately.
A multimedia device typically stores multimedia voice signals in a buffer before playing them. In the related art, after an in-vehicle microphone collects a voice signal to be recognized (containing both the voice command and the multimedia audio), that signal can be processed against the multimedia signal in the buffer to recover the voice command issued by the user. However, there is usually a certain time delay between the multimedia component of the signal collected by the microphone and the multimedia voice signal in the buffer, so the voice command cannot be extracted accurately from the signal to be recognized, and the accuracy of voice processing is poor.
Disclosure of Invention
The present application provides a voice processing method, apparatus, and device that reduce the time delay between the reference signal and the collected voice signal, thereby improving the accuracy of voice processing.
In a first aspect, an embodiment of the present application provides a voice processing method, which is applied to a voice system, where the voice system includes a microphone and a speaker, and the method includes:
acquiring a first voice signal collected by the microphone within a preset time period, where the first voice signal includes a user voice signal and the voice signal played by the speaker during the preset time period;
acquiring, from a buffer, a second voice signal for the preset time period;
determining, according to the first voice signal and the second voice signal, the time delay with which the speaker plays the voice signal in the buffer;
performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal;
and processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.
In one possible implementation manner, the calibrating the second voice signal according to the time delay to obtain a third voice signal includes:
acquiring sampling parameters of the microphone, where the sampling parameters include the audio sampling rate, the number of sampling bits, and the number of channels;
determining the preset signal insertion number N according to the time delay and the sampling parameters, wherein N is an integer greater than 1;
and adding N preset signals before the second voice signal to obtain the third voice signal.
In a possible implementation manner, determining the preset signal insertion number N according to the time delay and the sampling parameter includes:
according to the time delay and the sampling parameters, determining the preset signal insertion number N through the following formula I:
N = d × R × (B / 8) × C (formula I)
wherein d is the time delay (in seconds), R is the sampling rate, B is the number of sampling bits, and C is the number of channels.
In a possible implementation manner, adding N preset signals before the second voice signal to obtain the third voice signal includes:
determining a starting storage position corresponding to the second voice signal in a cache;
and adding the N preset signals before the initial storage position in the buffer memory to obtain the third voice signal, wherein the initial storage position of the third voice signal in the buffer memory is the storage position of the first preset signal in the N preset signals.
In one possible implementation manner, determining a delay of playing the voice signal in the buffer by the speaker according to the first voice signal and the second voice signal includes:
determining a first voice characteristic corresponding to the first voice signal;
determining a second voice characteristic corresponding to the second voice signal;
and carrying out matching processing on the first voice feature and the second voice feature to obtain the time delay.
In a possible implementation manner, the matching processing of the first voice feature and the second voice feature to obtain the time delay includes:
determining a first position in the first voice features, wherein the matching degree of the voice features after the first position in the first voice features and the second voice features is larger than or equal to a first threshold value;
and determining the voice playing duration between the starting position of the first voice feature and the first position as the time delay.
In one possible implementation manner, the voice system is an in-vehicle voice system, and the method further includes:
determining a control instruction according to the user voice signal;
and controlling corresponding vehicle-mounted equipment according to the control instruction.
In a second aspect, an embodiment of the present application provides a speech processing apparatus, including a first acquisition module, a second acquisition module, a determination module, a calibration module, and a processing module, where,
the first acquisition module is used for acquiring a first voice signal acquired by the microphone in a preset period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset period;
the second obtaining module is used for obtaining a second voice signal in the preset time period in the cache;
the determining module is used for determining the time delay of playing the voice signals in the buffer memory by the loudspeaker according to the first voice signals and the second voice signals;
the calibration module is used for performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal;
the processing module is used for processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.
In one possible embodiment, the calibration module is specifically configured to:
acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, sampling bits and channel numbers;
determining the preset signal insertion number N according to the time delay and the sampling parameters, wherein N is an integer greater than 1;
and adding N preset signals before the second voice signal to obtain the third voice signal.
In one possible embodiment, the calibration module is specifically configured to:
according to the time delay and the sampling parameters, determining the preset signal insertion number N through the following formula I:
N = d × R × (B / 8) × C (formula I)
wherein d is the time delay (in seconds), R is the sampling rate, B is the number of sampling bits, and C is the number of channels.
In one possible embodiment, the calibration module is specifically configured to:
determining a starting storage position corresponding to the second voice signal in a cache;
and adding the N preset signals before the initial storage position in the buffer memory to obtain the third voice signal, wherein the initial storage position of the third voice signal in the buffer memory is the storage position of the first preset signal in the N preset signals.
In one possible implementation manner, the determining module is specifically configured to:
determining a first voice characteristic corresponding to the first voice signal;
determining a second voice characteristic corresponding to the second voice signal;
and carrying out matching processing on the first voice feature and the second voice feature to obtain the time delay.
In one possible implementation manner, the determining module is specifically configured to:
determining a first position in the first voice features, wherein the matching degree of the voice features after the first position in the first voice features and the second voice features is larger than or equal to a first threshold value;
and determining the voice playing duration between the starting position of the first voice feature and the first position as the time delay.
In one possible implementation, the speech processing apparatus further comprises a control module,
the control module is used for determining a control instruction according to the user voice signal; and controlling corresponding vehicle-mounted equipment according to the control instruction.
In a third aspect, an embodiment of the present application provides a speech processing device, including a processor and a memory, wherein
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory, causing the processor to perform the speech processing method as described in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the speech processing method according to the first aspect when the computer-executable instructions are executed by a processor.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the speech processing method according to the first aspect.
An embodiment of the present application provides a voice processing method, apparatus, and device. A first voice signal collected by a microphone and a second voice signal (the reference signal) in a buffer are acquired; the time delay with which the speaker plays the buffered voice signal is determined from the first and second voice signals; the second voice signal (reference signal) is calibrated according to the time delay to obtain a third voice signal; and echo cancellation/noise reduction (EC/NR) processing is performed on the first voice signal using the third voice signal to obtain a clean user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be extracted accurately from the first voice signal according to the third voice signal, improving the accuracy of voice processing.
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of a voice processing method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another speech processing method according to an embodiment of the present application;
fig. 4 is a schematic flow chart of a matching process according to an embodiment of the present application;
FIG. 5 is a diagram illustrating adding N preset signals before a start storage location in a cache;
FIG. 6 is a schematic diagram of a speech processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a voice processing device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of another speech processing device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a voice processing device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to facilitate understanding, an application scenario to which the embodiment of the present application is applicable is described below with reference to fig. 1.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application. Referring to fig. 1, a voice system may be provided in a vehicle. A microphone, a voice processing module, a voice recognition module, and a multimedia player may be included in the voice system, and the multimedia player may include a buffer module and a speaker. The audio signal is buffered in the buffer module of the multimedia player, and when the multimedia player is started, the speaker plays the audio signal in the buffer module.
One or more vehicle-mounted devices may be provided in the vehicle, and a user can control them by issuing voice commands. When the user speaks, the speech signal collected by the microphone may include both the voice command issued by the user and the audio signal (background noise) played by the speaker of the multimedia player. The voice processing module performs EC/NR processing on the collected speech signal to obtain the clean voice command issued by the user, which is then passed to the voice recognition module for recognition.
In the related art, the voice processing module may perform EC/NR processing on the collected speech signal using a reference signal, namely the audio signal buffered in the buffer module of the multimedia player. However, there is usually a time delay between the reference signal and the background noise in the collected speech signal, so the reference signal cannot support good EC/NR processing of the collected signal; the voice command therefore cannot be extracted accurately from the signal to be recognized, and voice-processing accuracy is poor.
In order to obtain a pure voice command sent by a user, the embodiment of the application provides a voice processing method, which comprises the steps of firstly determining the time delay between a reference signal and a voice signal collected by a microphone, carrying out calibration processing on the reference signal according to the time delay, and carrying out EC/NR processing on the collected voice signal by using the calibrated reference signal. By carrying out calibration processing on the reference signal, the time delay between the calibrated reference signal and the collected voice signal can be reduced, so that a voice instruction can be accurately extracted from the voice signal to be recognized according to the calibrated reference signal, and the accuracy of voice processing is further improved.
The technical solution of the present application is described in detail below through specific embodiments. It should be noted that the following embodiments may exist independently or be combined with each other, and the same or similar content will not be repeated in different embodiments.
Fig. 2 is a flow chart of a voice processing method according to an embodiment of the present application. Referring to fig. 2, the method may include:
s201, acquiring a first voice signal acquired by a microphone in a preset period.
The execution body of the embodiment of the application can be a vehicle, and can also be a voice processing device arranged in the vehicle, and the voice processing device can be realized by software, and can also be realized by a combination of software and hardware.
The first voice signal comprises a user voice signal and the voice signal played by the speaker within the preset time period. The user voice signal may be a voice command issued by the user, and the voice signal played by the speaker is the audio signal from the buffer as rendered by the speaker.
The preset period may be a period from when the user voice signal starts to be emitted to when the user voice signal ends, for example, the preset period may be 20 seconds, 1 minute, or the like.
S202, acquiring a second voice signal in a preset period in the buffer memory.
The buffer may be a ring buffer that stores audio signals to be played; for example, the audio signals may be multimedia audio signals.
The second speech signal is an audio signal pre-stored in the buffer. The second speech signal may be a multimedia audio signal, for example, the second speech signal may be music, broadcast, or the like. The second speech signal may be used as a reference signal for processing the first speech signal.
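As a rough illustration of such a buffer, a minimal byte ring buffer might look like the following. This is an illustrative sketch only; the patent does not specify the buffer's implementation, and the class and method names here are invented for the example:

```python
class RingBuffer:
    """Minimal byte ring buffer for audio awaiting playback (illustrative only)."""

    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.read = 0    # next byte the speaker will consume
        self.write = 0   # next free byte
        self.size = 0    # bytes currently stored

    def push(self, data: bytes) -> None:
        """Store bytes produced by the multimedia player."""
        if len(data) > self.capacity - self.size:
            raise BufferError("ring buffer full")
        for b in data:
            self.buf[self.write] = b
            self.write = (self.write + 1) % self.capacity
        self.size += len(data)

    def pop(self, n: int) -> bytes:
        """Consume up to n bytes, as the speaker would when playing."""
        n = min(n, self.size)
        out = bytearray(n)
        for i in range(n):
            out[i] = self.buf[(self.read + i) % self.capacity]
        self.read = (self.read + n) % self.capacity
        self.size -= n
        return bytes(out)
```

In this picture, the multimedia player calls `push` while the speaker side calls `pop`; the second voice signal is simply a span of bytes read out of such a buffer.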
S203, determining the time delay of playing the voice signals in the buffer memory by the loudspeaker according to the first voice signals and the second voice signals.
The voice signal may have a certain delay from buffering to speaker playing, and the delay may be 200 ms, 2 seconds, etc.
The delay in playing the buffered speech signal by the speaker may be determined by: determining a first voice characteristic corresponding to the first voice signal; determining a second voice characteristic corresponding to the second voice signal; and carrying out matching processing on the first voice feature and the second voice feature to obtain time delay.
For example, suppose the duration of the first voice feature is T1 seconds and the duration of the second voice feature is T2 seconds. If, taking the starting time of the first voice feature as 0 seconds, the portion of the first voice feature after 1.5 s matches the second voice feature, then the time delay is determined to be 1.5 s.
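A delay estimate of this kind can be approximated by cross-correlating the two signals. The sketch below operates on raw samples with NumPy for simplicity; the patent matches voice features rather than raw samples, so this is only an analogous illustration:

```python
import numpy as np

def estimate_delay(mic: np.ndarray, ref: np.ndarray, rate: int) -> float:
    """Estimate, in seconds, how far `ref` is delayed inside `mic`."""
    corr = np.correlate(mic, ref, mode="full")
    # The peak index relative to zero lag gives the sample offset.
    lag = int(np.argmax(corr)) - (len(ref) - 1)
    return max(lag, 0) / rate

rate = 1000
ref = np.sin(2 * np.pi * 50 * np.arange(500) / rate)   # 0.5 s reference tone
mic = np.concatenate([np.zeros(200), ref])             # same tone, 0.2 s late
# estimate_delay(mic, ref, rate) -> 0.2
```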
S204, performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal.
A predetermined signal of a certain duration may be added before the second speech signal according to the time delay to obtain a third speech signal. For example, assuming that the time delay is t, a preset signal with a time length of t may be added before the second voice signal, so as to obtain a third voice signal.
The delay between the third speech signal and the first speech signal is smaller than the delay between the second speech signal and the first speech signal.
S205, processing the first voice signal according to the third voice signal to extract the user voice signal from the first voice signal.
The first speech signal may be EC/NR processed based on the third speech signal.
In the embodiment shown in fig. 2, the first voice signal collected by the microphone and the second voice signal (reference signal) in the buffer are acquired first; the time delay with which the speaker plays the buffered voice signal is determined from the first and second voice signals; the second voice signal (reference signal) is calibrated according to the time delay to obtain a third voice signal; and EC/NR processing is performed on the first voice signal using the third voice signal to obtain a clean user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be extracted accurately from the first voice signal according to the third voice signal, improving the accuracy of voice processing.
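The overall flow of S201-S205 can be illustrated with a toy example. This is a sketch under strong assumptions: real EC/NR uses adaptive filtering rather than plain subtraction, and the delay is taken as already estimated:

```python
import numpy as np

def calibrate_and_extract(mic: np.ndarray, ref: np.ndarray,
                          delay_samples: int) -> np.ndarray:
    """Pad the reference by the estimated delay, then crudely cancel it."""
    # S204: prepend `delay_samples` zeros (the "preset signals") to the reference.
    third = np.concatenate([np.zeros(delay_samples), ref])[: len(mic)]
    # S205: a toy stand-in for EC/NR -- subtract the aligned reference.
    return mic - third

user = np.ones(400)            # stand-in for the user's voice command
echo = 0.5 * np.ones(400)      # stand-in for the played-back audio
mic = user.copy()
mic[100:] += echo[:300]        # the echo reaches the microphone 100 samples late
extracted = calibrate_and_extract(mic, echo, delay_samples=100)
# `extracted` recovers the user signal alone in this idealized case
```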
The above-described speech processing method will be described in detail below with reference to the embodiment shown in fig. 3, based on any of the above-described embodiments.
Fig. 3 is a flowchart of another voice processing method according to an embodiment of the present application. Referring to fig. 3, the method may include:
s301, acquiring a first voice signal acquired by a microphone in a preset period.
It should be noted that the execution process of S301 may refer to the execution process of S201 and is not repeated here.
S302, acquiring a second voice signal in the preset time period from the buffer.
It should be noted that the execution process of S302 may refer to the execution process of S202 and is not repeated here.
S303, determining a first voice characteristic corresponding to the first voice signal.
The first speech feature may be, for example, a time-domain frame sequence, a short-time zero-crossing-rate sequence, a spectral-centroid sequence, and/or a mel-frequency coefficient sequence of the first speech signal.
S304, determining a second voice characteristic corresponding to the second voice signal.
The second speech feature may be, for example, a time-domain frame sequence, a short-time zero-crossing-rate sequence, a spectral-centroid sequence, and/or a mel-frequency coefficient sequence of the second speech signal.
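As an example of one of the named features, a short-time zero-crossing-rate sequence can be computed per frame as follows. This is an illustrative sketch; the frame length and the use of non-overlapping frames are arbitrary choices, not taken from the patent:

```python
import numpy as np

def short_time_zcr(signal: np.ndarray, frame_len: int) -> np.ndarray:
    """Zero-crossing rate per non-overlapping frame, as a feature sequence."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    signs = np.sign(frames)
    # A crossing is any change of sign between adjacent samples in a frame.
    crossings = np.abs(np.diff(signs, axis=1)) > 0
    return crossings.sum(axis=1) / frame_len

# A 100 Hz tone sampled at 8 kHz crosses zero about 200 times per second,
# i.e. roughly 20 crossings per 0.1 s frame.
tone = np.sin(2 * np.pi * 100 * np.arange(8000) / 8000)
zcr = short_time_zcr(tone, frame_len=800)
```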
S305, performing matching processing on the first voice feature and the second voice feature to obtain time delay.
The matching process may be performed in the following manner: determining a first position in the first voice features, wherein the matching degree of the voice features after the first position in the first voice features and the second voice features is larger than or equal to a first threshold value; and determining the voice playing time length between the starting position of the first voice feature and the first position as time delay.
The duration of the first speech feature is the same as the duration of the first speech signal.
Optionally, the second voice feature may be matched against the first voice feature as follows: if the similarity between the second voice feature and the first voice feature is below the first threshold, shift the starting position of the first voice feature back by P1 and recompute the matching degree; if the matching degree is now greater than or equal to the first threshold, take P1 as the first position; otherwise, continue shifting the starting position back by P2, repeating the process until the first position is determined.
The first threshold may be 98%, 100%, etc.
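The shift-and-match search described above might be sketched as follows, using the fraction of agreeing feature values as a stand-in for the matching degree. The patent does not specify how the matching degree is computed, so this helper and its parameters are hypothetical:

```python
import numpy as np

def find_first_position(first_feat: np.ndarray, second_feat: np.ndarray,
                        step: int, threshold: float = 0.98):
    """Shift the start of `first_feat` back by `step` at a time until the
    overlap matches `second_feat` well enough; return the shift found."""
    for offset in range(0, len(first_feat) - len(second_feat) + 1, step):
        window = first_feat[offset : offset + len(second_feat)]
        match = np.mean(np.isclose(window, second_feat))
        if match >= threshold:
            return offset   # the "first position"
    return None             # no sufficiently good alignment found

second = np.arange(50.0)
first = np.concatenate([np.full(10, -1.0), second])   # `second`, delayed by 10
# find_first_position(first, second, step=5) -> 10
```

Dividing the returned offset by the feature rate would give the time delay described in S305.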
In order to facilitate understanding, the matching process is described in detail below with reference to fig. 4.
Fig. 4 is a schematic flow chart of a matching process according to an embodiment of the present application. Referring to fig. 4, a first voice feature 401 and a second voice feature 402 may be first determined, where the duration of the first voice feature is the same as the duration of the first voice signal, and the duration of the second voice feature is the same as the duration of the second voice signal.
A first position may be determined on the first speech feature 401 such that the matching degree between the speech feature after the first position and the second speech feature is greater than or equal to the first threshold. Assuming the time corresponding to the first position is 1 second and the starting time of the first speech feature is 0 seconds, the time delay is determined to be 1 second.
S306, acquiring sampling parameters of the microphone.
The sampling parameters include audio sampling rate, sampling bit number, and channel number.
The audio sampling rate is the number of samples the recording device takes per unit time, e.g., 24000 Hz or 48000 Hz. The number of sampling bits is the resolution with which the sound card processes audio, e.g., 16, 24, or 32 bits. The number of channels is the number of audio channels, e.g., 1 (mono) or 2 (stereo).
S307, determining the preset signal insertion number N according to the time delay and the sampling parameters.
The preset signal may be zero data; n is an integer greater than 1.
The preset signal insertion number N can be determined by formula I:
N = d × R × (B / 8) × C (formula I)
where d is the time delay (in seconds), R is the sampling rate, B is the number of sampling bits, and C is the number of channels.
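Reading formula I as N = d × R × (B / 8) × C zero bytes (a reconstruction from the variable definitions; the original formula image is not reproduced in this text), the computation is straightforward:

```python
def preset_signal_count(delay_s: float, rate_hz: int, bits: int, channels: int) -> int:
    """Number N of zero bytes to insert for a delay of `delay_s` seconds,
    under the reconstruction N = d * R * (B / 8) * C."""
    return round(delay_s * rate_hz * (bits // 8) * channels)

# 0.8 s delay at 48 kHz, 16-bit, stereo:
# preset_signal_count(0.8, 48000, 16, 2) -> 153600
```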
S308, determining a starting storage position corresponding to the second voice signal in the buffer memory.
The buffer records the storage time of each voice signal, so the starting storage position corresponding to the second voice signal can be located in the buffer from the buffering time of the second voice signal.
S309, adding N preset signals before the initial storage position in the buffer memory to obtain a third voice signal.
The initial storage position of the third voice signal in the buffer memory is the storage position of the first preset signal in the N preset signals.
For ease of understanding, adding N preset signals before the starting storage location in the cache is described in detail below in connection with fig. 5.
Fig. 5 is a schematic diagram of adding N preset signals before a start storage position in the buffer. Referring to fig. 5, it includes an image 501 and an image 502. The signal between start position 1 and the end position in image 501 is the second speech signal. N pieces of zero data (preset signals) are inserted before start position 1 of the second speech signal to obtain the third speech signal, which is the signal from start position 2 to the end position in image 502.
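Prepending the N preset signals before the start position, as in fig. 5, can be sketched on a plain byte array; a real implementation would operate on the ring buffer in place, and the helper name here is invented for the example:

```python
def insert_preset_signals(buffer: bytearray, start: int, n: int) -> int:
    """Insert `n` zero bytes immediately before offset `start`; return the
    new start position, i.e. where the first preset signal is stored."""
    buffer[start:start] = bytes(n)   # bytes(n) is n zero bytes
    return start   # the third signal now begins at the first inserted zero

buf = bytearray(b"\x01\x02\x03\x04")   # second voice signal starts at offset 1
new_start = insert_preset_signals(buf, start=1, n=3)
# buf -> bytearray(b'\x01\x00\x00\x00\x02\x03\x04'), new_start -> 1
```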
S310, processing the first voice signal according to the third voice signal to extract the user voice signal from the first voice signal.
It should be noted that, the execution process of S310 may refer to the execution process of S205, which is not described herein.
In the embodiment shown in fig. 3, the first voice signal collected by the microphone and the second voice signal (reference signal) in the buffer are acquired first; the voice features of both signals are then extracted, and the time delay with which the speaker plays the buffered voice signal is determined from the features of the two signals. The number N of preset signals is determined from the time delay and the microphone's sampling parameters, and the second voice signal (reference signal) is calibrated with the preset signals to obtain a third voice signal. EC/NR processing is then performed on the first voice signal using the third voice signal to obtain a clean user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be extracted accurately from the first voice signal, improving the accuracy of voice processing. In addition, after the second voice signal has been calibrated once, only a periodic fine calibration (for example, every 3 or 5 minutes) is needed rather than a calibration before each voice recognition, which saves calibration time.
On the basis of any of the above embodiments, the voice processing method is described in detail below through the specific example shown in fig. 6. Fig. 6 is a schematic diagram of a voice processing method according to an embodiment of the present application. Referring to fig. 6, the example includes step 1, step 2 and step 3.
Step 1, a first voice signal 601 collected by the microphone in a preset period is acquired, and a second voice signal 602 in the same preset period is acquired from the buffer. Assume that the first position determined from the first voice signal is position A. The voice playing time of position A in the first voice signal 601 is 0.8 s, and the voice playing time of the start position of the first voice signal is 0 s; subtracting the two playing times gives the time delay, which is 0.8 s.
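One concrete way to realize the delay measurement of step 1 is to search for the offset at which the buffered reference best matches the microphone signal. The brute-force cross-correlation below is an illustrative stand-in for the patent's voice-feature matching; the signal values and the 8 kHz sampling rate are made up:

```python
def estimate_delay_samples(mic, ref):
    """Return the sample offset at which ref best aligns with mic.

    Brute-force cross-correlation, shown as one concrete realization of
    aligning the reference with the microphone signal; the patent's method
    matches voice features instead of raw samples.
    """
    best_offset, best_score = 0, float("-inf")
    for offset in range(len(mic) - len(ref) + 1):
        segment = mic[offset:offset + len(ref)]
        score = sum(m * r for m, r in zip(segment, ref))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset

# Made-up signals: the reference shows up 4 samples late in the microphone.
ref = [1, -2, 3]
mic = [0, 0, 0, 0, 1, -2, 3, 0]
offset = estimate_delay_samples(mic, ref)
print(offset / 8000)  # offset 4 at an assumed 8 kHz rate -> 0.0005 s
```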
Step 2, the sampling parameters of the microphone are acquired, and the insertion quantity N of zero data (preset signals) is determined from the time delay and the sampling parameters through the following formula I:

N = d × R × B × C / 8 (formula I)

where d is the time delay, R is the sampling rate, B is the sampling bit number, and C is the channel number. The second voice signal is calibrated by inserting N zero data (preset signals) at its start position, resulting in a third voice signal 603.
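Under the byte-count reading of formula I (an assumption: N = d × R × B × C / 8, with B / 8 converting the bit depth to bytes), the insertion quantity of step 2 can be computed as:

```python
def preset_insertion_count(d, rate, bits, channels):
    """Number N of preset (zero) data units to insert for a time delay of
    d seconds.

    Assumed reading of formula I: N = d * R * B * C / 8, i.e. the number of
    bytes of audio played during the delay, with B / 8 converting the
    sampling bit number to bytes.
    """
    return int(d * rate * bits * channels / 8)

# Fig. 6 example: 0.8 s delay; 16 kHz, 16-bit, mono are assumed parameters.
n = preset_insertion_count(0.8, 16000, 16, 1)
print(n)  # 25600
```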
Step 3, performing EC/NR processing on the first voice signal by using the third voice signal to obtain a pure user voice signal 604.
In the embodiment shown in fig. 6, the second voice signal is calibrated using the time delay and the sampling parameters, so that the time delay between the third voice signal and the first voice signal is small. The third voice signal can therefore support more accurate EC/NR processing of the first voice signal, yielding the pure voice instruction issued by the user and further improving the accuracy of voice processing. In addition, the calibration of the second voice signal performs no complex computation on the signal itself, so the calibration takes little time.
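The example does not fix a particular EC/NR algorithm. One common choice for the echo-cancellation part is a normalized least-mean-squares (NLMS) adaptive filter driven by the calibrated reference; the sketch below uses an illustrative tap count and step size, not parameters from the source:

```python
def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-6):
    """Subtract an adaptively filtered copy of the reference (third voice
    signal) from the microphone signal (first voice signal).

    NLMS is one standard echo-cancellation technique; the patent does not
    mandate a specific EC/NR algorithm.
    """
    w = [0.0] * taps                   # adaptive filter coefficients
    out = []
    for i in range(len(mic)):
        # Most recent `taps` reference samples (zero-padded at the start).
        x = [ref[i - k] if i - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[i] - echo_est          # residual, approximately the user voice
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out

ref = [1.0, -1.0, 1.0, -1.0] * 8
mic = [0.5 * r for r in ref]           # pure echo, no user voice
residual = nlms_echo_cancel(mic, ref)
print(abs(residual[-1]) < 0.05)        # echo largely cancelled: True
```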
Fig. 7 is a schematic structural diagram of a voice processing device according to an embodiment of the present application. Referring to fig. 7, the voice processing device 10 may include a first acquisition module 11, a second acquisition module 12, a determination module 13, a calibration module 14 and a processing module 15, wherein,
the first acquisition module 11 is configured to acquire a first voice signal collected by the microphone in a preset period, where the first voice signal includes a user voice signal and a voice signal played by the speaker in the preset period;
the second acquisition module 12 is configured to acquire, from the buffer, a second voice signal in the preset period;
the determination module 13 is configured to determine, according to the first voice signal and the second voice signal, the time delay with which the speaker plays the voice signal in the buffer;
the calibration module 14 is configured to perform calibration processing on the second voice signal according to the time delay to obtain a third voice signal;
the processing module 15 is configured to process the first voice signal according to the third voice signal to extract the user voice signal from the first voice signal.
In one possible implementation, the calibration module 14 is specifically configured to:
acquiring sampling parameters of a microphone, wherein the sampling parameters comprise an audio sampling rate, sampling bits and channel numbers;
determining the preset signal insertion quantity N according to the time delay and the sampling parameters, wherein N is an integer greater than 1;
adding N preset signals before the second voice signal to obtain the third voice signal.
In one possible implementation, the calibration module 14 is specifically configured to:
according to the time delay and the sampling parameters, the preset signal insertion quantity N is determined through the following formula I:

N = d × R × B × C / 8 (formula I)

where d is the time delay, R is the sampling rate, B is the sampling bit number, and C is the channel number.
In one possible implementation, the calibration module 14 is specifically configured to:
determining a start storage position corresponding to the second voice signal in the buffer;
adding the N preset signals before the start storage position in the buffer to obtain the third voice signal, where the start storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.
In one possible implementation, the determination module 13 is specifically configured to:
determining a first voice characteristic corresponding to the first voice signal;
determining a second voice characteristic corresponding to the second voice signal;
and carrying out matching processing on the first voice feature and the second voice feature to obtain time delay.
In one possible implementation, the determination module 13 is specifically configured to:
determining a first position in the first voice feature, where the degree of matching between the voice feature after the first position in the first voice feature and the second voice feature is greater than or equal to a first threshold;
determining the voice playing duration between the start position of the first voice feature and the first position as the time delay.
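The first-position search described above can be sketched as follows. The normalized-dot-product matching degree and the 0.9 threshold are illustrative assumptions, since the source does not specify a particular similarity measure:

```python
def find_first_position(first_feat, second_feat, threshold=0.9):
    """Return the earliest index p such that the features of first_feat
    starting at p match second_feat with a degree >= threshold.

    The matching degree here is an assumed cosine-like score; the voice
    playing duration from the feature start to p is then taken as the delay.
    """
    n = len(second_feat)
    for p in range(len(first_feat) - n + 1):
        window = first_feat[p:p + n]
        num = sum(a * b for a, b in zip(window, second_feat))
        den = (sum(a * a for a in window) * sum(b * b for b in second_feat)) ** 0.5
        if den and num / den >= threshold:
            return p
    return None  # no position matched above the threshold

first = [0.0, 0.1, 0.0, 1.0, -2.0, 3.0]   # made-up feature sequences
second = [1.0, -2.0, 3.0]
print(find_first_position(first, second))  # 3
```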
Fig. 8 is a schematic structural diagram of another voice processing device according to an embodiment of the present application. Referring to fig. 8 on the basis of fig. 7, the voice processing device 10 further includes a control module 16, wherein,
the control module 16 is used for determining a control instruction according to a user voice signal; and controlling the corresponding vehicle-mounted equipment according to the control instruction.
The voice processing device 10 provided in the embodiment of the present application may execute the technical solution shown in the foregoing method embodiment, and its implementation principle and beneficial effects are similar, and will not be described in detail.
Fig. 9 is a schematic structural diagram of a voice processing device according to an embodiment of the present application. Referring to fig. 9, the voice processing device 20 may include a memory 21 and a processor 22. Illustratively, the memory 21 and the processor 22 are interconnected by a bus 23.
The memory 21 is used for storing program instructions;
the processor 22 is configured to execute the program instructions stored in the memory, so as to cause the speech processing device 20 to execute the above-mentioned speech processing method.
The voice processing device shown in the embodiment of fig. 9 may execute the technical solution shown in the embodiment of the method, and its implementation principle and beneficial effects are similar, and will not be described herein again.
Embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions for implementing the above-described speech processing method when the computer-executable instructions are executed by a processor.
Embodiments of the present application may also provide a computer program product comprising a computer program which, when executed by a processor, implements the above-described speech processing method.
The above description is only a preferred embodiment of the present application and an illustration of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure involved in the present application is not limited to the specific combinations of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the disclosed concept, for example, solutions in which the above features are replaced with technical features having similar functions disclosed in (but not limited to) the present application.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the application. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (7)

1. A method of speech processing, for use in a speech system including a microphone and a speaker, the method comprising:
acquiring a first voice signal acquired by the microphone in a preset period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset period;
acquiring a second voice signal in the preset time period from a cache;
determining the time delay of playing the voice signals in the buffer memory by the loudspeaker according to the first voice signals and the second voice signals;
performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal;
processing the first voice signal according to the third voice signal to extract the user voice signal from the first voice signal;
and performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal, wherein the method comprises the following steps:
acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, sampling bits and channel numbers;
determining the preset signal insertion number N according to the time delay and the sampling parameters, wherein N is an integer greater than 1;
adding N preset signals before the second voice signal to obtain the third voice signal;
according to the time delay and the sampling parameter, determining a preset signal insertion number N comprises the following steps:
according to the time delay and the sampling parameter, determining the preset signal insertion quantity N through the following formula I:

N = d × R × B × C / 8 (formula I)

wherein d is the time delay, R is the sampling rate, B is the sampling bit number, and C is the channel number;
adding N preset signals before the second voice signal to obtain the third voice signal, including:
determining a starting storage position corresponding to the second voice signal in a cache;
and adding the N preset signals before the initial storage position in the buffer memory to obtain the third voice signal, wherein the initial storage position of the third voice signal in the buffer memory is the storage position of the first preset signal in the N preset signals.
2. The method of claim 1, wherein determining a delay in playing the buffered speech signal by the speaker based on the first speech signal and the second speech signal comprises:
determining a first voice characteristic corresponding to the first voice signal;
determining a second voice characteristic corresponding to the second voice signal;
and carrying out matching processing on the first voice feature and the second voice feature to obtain the time delay.
3. The method for processing speech according to claim 2, wherein performing matching processing on the first speech feature and the second speech feature to obtain the time delay includes:
determining a first position in the first voice features, wherein the matching degree of the voice features after the first position in the first voice features and the second voice features is larger than or equal to a first threshold value;
and determining the voice playing duration between the starting position of the first voice feature and the first position as the time delay.
4. The speech processing method of claim 1 wherein the speech system is an in-vehicle speech system, the method further comprising:
determining a control instruction according to the user voice signal;
and controlling corresponding vehicle-mounted equipment according to the control instruction.
5. A voice processing device is characterized by comprising a first acquisition module, a second acquisition module, a determination module, a calibration module and a processing module, wherein,
the first acquisition module is used for acquiring a first voice signal acquired by the microphone in a preset period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset period;
the second obtaining module is used for obtaining a second voice signal in the preset time period in the cache;
the determining module is used for determining the time delay of playing the voice signals in the buffer memory by the loudspeaker according to the first voice signals and the second voice signals;
the calibration module is used for performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal;
the processing module is used for processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal;
the calibration module is specifically configured to:
acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, sampling bits and channel numbers;
determining the preset signal insertion number N according to the time delay and the sampling parameters, wherein N is an integer greater than 1;
adding N preset signals before the second voice signal to obtain the third voice signal;
the calibration module is specifically configured to:
according to the time delay and the sampling parameter, determining the preset signal insertion quantity N through the following formula I:

N = d × R × B × C / 8 (formula I)

wherein d is the time delay, R is the sampling rate, B is the sampling bit number, and C is the channel number;
the calibration module is specifically configured to:
determining a starting storage position corresponding to the second voice signal in a cache;
and adding the N preset signals before the initial storage position in the buffer memory to obtain the third voice signal, wherein the initial storage position of the third voice signal in the buffer memory is the storage position of the first preset signal in the N preset signals.
6. A speech processing apparatus, comprising: a processor, a memory, and a memory,
the memory stores computer-executable instructions;
the processor executing computer-executable instructions stored in the memory, causing the processor to perform the speech processing method of any one of claims 1-4.
7. A computer readable storage medium having stored therein computer executable instructions for implementing the speech processing method of any of claims 1 to 4 when the computer executable instructions are executed by a processor.
CN202110867725.0A 2021-07-28 2021-07-28 Voice processing method, device and equipment Active CN113593540B (en)
Publications (2)

Publication Number Publication Date
CN113593540A CN113593540A (en) 2021-11-02
CN113593540B true CN113593540B (en) 2023-08-11