CN111599366A - Vehicle-mounted multi-sound-zone voice processing method and related device - Google Patents


Info

Publication number
CN111599366A
Authority
CN
China
Prior art keywords
audio, awakening, wake, vehicle, identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010424470.6A
Other languages
Chinese (zh)
Other versions
CN111599366B (en)
Inventor
王飞
蒋亚冲
钱俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by iFlytek Co Ltd
Priority to CN202010424470.6A
Publication of CN111599366A
Application granted
Publication of CN111599366B
Active legal status
Anticipated expiration

Classifications

    • G10L17/20 — Speaker identification or verification: pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22 — Speaker identification or verification: interactive procedures; man-machine interfaces
    • G10L21/0216 — Speech enhancement: noise filtering characterised by the method used for estimating noise
    • G10L21/0232 — Noise filtering: processing in the frequency domain
    • G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 — Microphone arrays; beamforming
    • Y02D30/70 — Reducing energy consumption in wireless communication networks


Abstract

The application discloses a vehicle-mounted multi-sound-zone voice processing method and a related device. The method comprises: obtaining position information, comprising at least one position direction, detected by each vehicle-mounted seat sensor; processing multi-channel microphone audio with echo cancellation and a narrow-beam algorithm to obtain multi-channel audio; and determining the target direction for speech recognition from the position information together with the multi-channel audio. By using the seat-sensor position information as auxiliary information on top of the multi-channel audio, the method effectively avoids sound-source localization interference during voice wake-up in harsh wake-up scenes, improves localization accuracy during wake-up in vehicle-mounted multi-sound-zone voice interaction, and thereby achieves more accurate interaction and a better user experience.

Description

Vehicle-mounted multi-sound-zone voice processing method and related device
Technical Field
The present application relates to the field of speech processing technologies, and in particular to a method and related apparatus for vehicle-mounted multi-sound-zone speech processing.
Background
With the rapid development of technology, voice interaction is increasingly applied in vehicle-mounted scenarios, and users are ever more accustomed to interacting with in-vehicle devices by voice, so requirements on vehicle-mounted voice interaction systems grow daily. To support voice interaction between every user in the vehicle and the in-vehicle devices, vehicle-mounted voice interaction systems provide multi-sound-zone voice interaction services that extend the range of voice interaction.
In existing vehicle-mounted multi-sound-zone voice interaction, system echoes in the multi-channel microphone audio picked up by the vehicle-mounted microphones are removed by echo cancellation, and noise reduction and speech separation are performed by a narrow-beam algorithm; voice wake-up and speech recognition are then run on the resulting multi-channel audio. If the multi-channel audio contains audio that triggers the wake-up callback, sound-source localization is performed at wake-up time to determine the direction for speech recognition, so that directed recognition can follow, thereby realizing vehicle-mounted multi-sound-zone voice interaction.
However, the inventors found that in harsh voice wake-up scenes, sound-source localization is easily disturbed during voice wake-up, causing localization errors, greatly reducing localization accuracy, and seriously degrading the effect, and hence the user experience, of vehicle-mounted multi-sound-zone voice interaction.
Disclosure of Invention
In view of this, embodiments of the present application provide a method and related device for vehicle-mounted multi-sound-zone voice processing that effectively avoid sound-source localization interference during voice wake-up in harsh scenes, improve localization accuracy at wake-up time, and thereby achieve more accurate vehicle-mounted multi-sound-zone voice interaction and a better user experience.
In a first aspect, an embodiment of the present application provides a method for vehicle-mounted multi-sound-zone speech processing, the method comprising:
obtaining position information detected by each vehicle-mounted seat sensor, the position information comprising at least one position direction;
performing echo cancellation and narrow-beam algorithm processing on multi-channel microphone audio to obtain multi-channel audio;
and determining the target direction for speech recognition based on the position information and the multi-channel audio.
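The three claimed steps amount to an occupancy-gated wake-up check. A minimal Python sketch, in which the function name, data layout, and wake-up predicate are illustrative placeholders rather than anything specified by the patent:

```python
def determine_target_direction(occupied_directions, zone_audio, triggers_wakeup):
    """occupied_directions: position directions reported by the seat sensors.
    zone_audio: dict mapping each direction to its beamformed channel, i.e.
    the multi-channel audio left after echo cancellation and narrow-beam
    processing.  triggers_wakeup: predicate telling whether a channel fired
    the wake-up callback.  Returns the single occupied direction whose audio
    woke the system, or None when zero or several candidates remain (the
    multi-candidate case is resolved by the claimed score/energy comparison).
    """
    woken = [d for d in occupied_directions if triggers_wakeup(zone_audio[d])]
    return woken[0] if len(woken) == 1 else None
```

Gating on `occupied_directions` is what removes unoccupied zones from sound-source localization entirely.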
Optionally, determining the target direction for speech recognition based on the position information and the multi-channel audio includes:
when the position information includes only one position direction, determining that position direction as the target direction if the corresponding audio in the multi-channel audio triggers the wake-up callback;
and when the position information includes a plurality of position directions, determining as wake-up audio each audio, among those corresponding to the plurality of position directions in the multi-channel audio, that triggers the wake-up callback, and determining the target direction based on the wake-up audio.
Optionally, determining the target direction based on the wake-up audio includes:
when there is a single channel of wake-up audio, determining its corresponding position direction as the target direction;
and when there are multiple channels of wake-up audio, determining a target wake-up audio from among them based on the wake-up score and spectral energy of each channel, and determining the position direction corresponding to the target wake-up audio as the target direction.
Optionally, determining a target wake-up audio from the multiple channels of wake-up audio based on the wake-up score and spectral energy of each channel includes:
determining the channels with the highest wake-up score and the highest spectral energy as a first wake-up audio and a second wake-up audio, respectively;
when the wake-up score difference between the first and second wake-up audio is greater than a preset wake-up score difference and the spectral energy difference is smaller than a first preset spectral energy difference, determining the first wake-up audio as the target wake-up audio;
and when the wake-up score difference is less than or equal to the preset wake-up score difference, or the spectral energy difference is greater than or equal to the first preset spectral energy difference, determining the second wake-up audio as the target wake-up audio.
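This selection rule can be written out directly. In the sketch below the tuple layout and both thresholds are illustrative, since the patent leaves the preset wake-up score difference and preset spectral energy difference as unspecified tuning parameters:

```python
def select_target_wakeup(wakeups, score_diff_thresh, energy_diff_thresh):
    """wakeups: list of (direction, wake_score, spectral_energy), one per
    channel of wake-up audio.  Returns the direction of the target wake-up
    audio under the claimed comparison rule."""
    first = max(wakeups, key=lambda w: w[1])   # first wake-up audio: highest score
    second = max(wakeups, key=lambda w: w[2])  # second wake-up audio: highest energy
    score_diff = first[1] - second[1]
    energy_diff = second[2] - first[2]
    # Prefer the high-score channel only when its score lead is decisive
    # and it is not clearly quieter than the high-energy channel.
    if score_diff > score_diff_thresh and energy_diff < energy_diff_thresh:
        return first[0]
    return second[0]
```

When the same channel holds both the highest score and the highest energy, both branches return that channel, so the rule is well defined in the degenerate case.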
Optionally, the method further includes:
when the number of position directions included in the position information is smaller than the total number of position directions in the vehicle, determining the audio corresponding to each remaining position direction in the multi-channel audio as noise-reduction reference audio;
correspondingly, determining the target direction based on the wake-up audio specifically includes:
applying an adaptive filtering algorithm to the wake-up audio based on the noise-reduction reference audio to obtain noise-reduced wake-up audio, and determining the target direction based on the noise-reduced wake-up audio.
Optionally, applying the adaptive filtering algorithm to the wake-up audio based on the noise-reduction reference audio to obtain the noise-reduced wake-up audio includes:
extracting state noise information from the noise-reduction reference audio;
and applying the adaptive filtering algorithm to the wake-up audio based on the state noise information to obtain the noise-reduced wake-up audio.
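The claims do not name a concrete adaptive filtering algorithm. The sketch below uses a normalized LMS (NLMS) filter, one common choice: it predicts the component of the wake-up audio that is explainable from the noise-reduction reference (the unoccupied zones) and subtracts it. Tap count and step size are illustrative:

```python
import numpy as np

def nlms_denoise(wake_audio, noise_ref, taps=16, mu=0.5, eps=1e-8):
    """Subtract from wake_audio whatever an adaptive FIR filter can predict
    from noise_ref; the residual is the noise-reduced wake-up audio."""
    w = np.zeros(taps)
    out = np.zeros_like(wake_audio)
    for n in range(taps, len(wake_audio)):
        x = noise_ref[n - taps:n][::-1]      # most recent reference samples
        e = wake_audio[n] - w @ x            # prediction error = cleaned sample
        w += mu * e * x / (x @ x + eps)      # normalized LMS weight update
        out[n] = e
    return out
```

When the noise reference is correlated with the interference leaking into the wake-up channel, the residual keeps the speech while the predictable noise is cancelled.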
Optionally, the method further includes:
obtaining multiple channels of audio to be recognized;
determining the channel corresponding to the target direction among them as the target-direction audio to be recognized;
and, based on the spectral energy of the main-beam-direction and non-main-beam-direction components of the target-direction audio to be recognized within a preset time, performing strong noise reduction on the target-direction audio to obtain strongly noise-reduced target-direction audio to be recognized, the main beam direction being the target direction.
Optionally, performing strong noise reduction on the target-direction audio to be recognized based on the spectral energy of its main-beam-direction and non-main-beam-direction components within the preset time includes:
obtaining, from those spectral energies, the spectral energy difference between the main-beam-direction and non-main-beam-direction components of the target-direction audio within the preset time;
and if that spectral energy difference is greater than or equal to a second preset spectral energy difference, eliminating the non-main-beam-direction component to obtain the strongly noise-reduced target-direction audio to be recognized.
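A sketch of this rejection rule, using mean squared FFT magnitude as the spectral energy measure; the energy measure and the threshold value are illustrative, as the patent only requires comparing spectral energies over the preset time:

```python
import numpy as np

def strong_denoise(main_beam, other_beams, energy_diff_thresh):
    """main_beam: samples of the main-beam-direction audio over the preset
    time window.  other_beams: dict direction -> non-main-beam audio.
    Channels whose spectral energy trails the main beam by at least the
    second preset spectral energy difference are treated as leakage and
    zeroed; the rest are kept unchanged."""
    def spectral_energy(audio):
        return float(np.mean(np.abs(np.fft.rfft(audio)) ** 2))

    main_energy = spectral_energy(main_beam)
    kept = {}
    for direction, audio in other_beams.items():
        if main_energy - spectral_energy(audio) >= energy_diff_thresh:
            kept[direction] = np.zeros_like(audio)  # rejected as leakage
        else:
            kept[direction] = audio
    return kept
```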
Optionally, the method further includes:
adjusting the preset time and/or the second preset spectral energy difference based on the user audio characteristics of the main-beam-direction and non-main-beam-direction components of the target-direction audio to be recognized.
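The patent does not specify how the parameters should depend on the user audio characteristics; the rule below is purely a hypothetical illustration, shrinking the observation window for fast speakers and scaling the threshold inversely:

```python
def adapt_parameters(preset_time, energy_diff_thresh, speech_rate, base_rate=4.0):
    """speech_rate: e.g. syllables per second estimated from the user's audio.
    base_rate: nominal speech rate (hypothetical).  Returns an adjusted
    (preset_time, energy_diff_thresh) pair."""
    factor = base_rate / max(speech_rate, 1e-6)  # fast speech -> factor < 1
    return preset_time * factor, energy_diff_thresh / factor
```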
In a second aspect, an embodiment of the present application provides an apparatus for vehicle-mounted multi-sound-zone speech processing, the apparatus comprising:
a position information obtaining unit, configured to obtain position information detected by each vehicle-mounted seat sensor, the position information comprising at least one position direction;
a multi-channel audio obtaining unit, configured to perform echo cancellation and narrow-beam algorithm processing on multi-channel microphone audio to obtain multi-channel audio;
and a target direction determining unit, configured to determine the target direction for speech recognition based on the position information and the multi-channel audio.
Optionally, the target direction determining unit includes:
a first determining subunit, configured to, when the position information includes only one position direction, determine that position direction as the target direction if the corresponding audio in the multi-channel audio triggers the wake-up callback;
and a second determining subunit, configured to, when the position information includes a plurality of position directions, determine as wake-up audio each audio, among those corresponding to the plurality of position directions in the multi-channel audio, that triggers the wake-up callback, and determine the target direction based on the wake-up audio.
Optionally, the second determining subunit includes:
a first determining module, configured to determine, when there is a single channel of wake-up audio, its corresponding position direction as the target direction;
and a second determining module, configured to determine, when there are multiple channels of wake-up audio, a target wake-up audio from among them based on the wake-up score and spectral energy of each channel, and determine the position direction corresponding to the target wake-up audio as the target direction.
Optionally, the second determining module includes:
a first determining submodule, configured to determine the channels with the highest wake-up score and the highest spectral energy among the multiple channels of wake-up audio as a first wake-up audio and a second wake-up audio, respectively;
a second determining submodule, configured to determine the first wake-up audio as the target wake-up audio when the wake-up score difference between the first and second wake-up audio is greater than a preset wake-up score difference and the spectral energy difference is smaller than a first preset spectral energy difference;
and a third determining submodule, configured to determine the second wake-up audio as the target wake-up audio when the wake-up score difference is less than or equal to the preset wake-up score difference, or the spectral energy difference is greater than or equal to the first preset spectral energy difference.
Optionally, the apparatus further comprises:
a second determining unit, configured to determine, when the number of position directions included in the position information is smaller than the total number of position directions in the vehicle, the audio corresponding to each remaining position direction in the multi-channel audio as noise-reduction reference audio;
correspondingly, the second determining subunit is specifically configured to:
apply an adaptive filtering algorithm to the wake-up audio based on the noise-reduction reference audio to obtain noise-reduced wake-up audio, and determine the target direction based on the noise-reduced wake-up audio.
Optionally, the second determining subunit includes:
an extraction module, configured to extract state noise information from the noise-reduction reference audio;
and an obtaining module, configured to apply the adaptive filtering algorithm to the wake-up audio based on the state noise information to obtain the noise-reduced wake-up audio.
Optionally, the apparatus further comprises:
a third obtaining unit, configured to obtain multiple channels of audio to be recognized;
a third determining unit, configured to determine the channel corresponding to the target direction among them as the target-direction audio to be recognized;
and a fourth obtaining unit, configured to perform, based on the spectral energy of the main-beam-direction and non-main-beam-direction components of the target-direction audio to be recognized within a preset time, strong noise reduction on the target-direction audio to obtain strongly noise-reduced target-direction audio to be recognized, the main beam direction being the target direction.
Optionally, the fourth obtaining unit includes:
a first obtaining subunit, configured to obtain, from those spectral energies, the spectral energy difference between the main-beam-direction and non-main-beam-direction components of the target-direction audio within the preset time;
and a second obtaining subunit, configured to eliminate the non-main-beam-direction component and obtain the strongly noise-reduced target-direction audio to be recognized if that spectral energy difference is greater than or equal to a second preset spectral energy difference.
Optionally, the apparatus further comprises:
an adjusting unit, configured to adjust the preset time and/or the second preset spectral energy difference based on the user audio characteristics of the main-beam-direction and non-main-beam-direction components of the target-direction audio to be recognized.
In a third aspect, an embodiment of the present application provides a terminal device comprising a processor and a memory:
the memory is used to store program code and transmit it to the processor;
the processor is configured to execute, according to instructions in the program code, the method for vehicle-mounted multi-sound-zone speech processing according to any one of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing program code for executing the method for vehicle-mounted multi-sound-zone speech processing according to any one of the first aspect.
Compared with the prior art, the method has the advantages that:
With the technical solution of the embodiments of the present application, position information comprising at least one position direction is detected by each vehicle-mounted seat sensor; multi-channel microphone audio is processed with echo cancellation and a narrow-beam algorithm to obtain multi-channel audio; and the target direction for speech recognition is determined from the position information together with the multi-channel audio. Thus, on top of the multi-channel audio, the seat-sensor position information serves as auxiliary information for determining the target direction, which effectively avoids sound-source localization interference during voice wake-up in harsh scenes, improves localization accuracy at wake-up time in vehicle-mounted multi-sound-zone voice interaction, and thereby achieves more accurate interaction and a better user experience.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a system framework related to an application scenario in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a method for vehicle-mounted multi-sound-zone voice processing according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of another method for vehicle-mounted multi-sound-zone voice processing according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a vehicle-mounted multi-sound-zone voice processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the existing vehicle-mounted multi-sound-zone voice interaction process, multi-channel microphone audio is first processed by echo cancellation and a narrow-beam algorithm to obtain multi-channel audio, and voice wake-up and speech recognition are then performed on that audio. When the multi-channel audio contains audio that triggers the wake-up callback, sound-source localization is performed at wake-up time to determine the direction for speech recognition, so that directed recognition can follow. However, the inventors found through research that in a harsh wake-up scene, for example when a user in the target direction interacts with the in-vehicle device by voice, the wake-up audio in the target direction leaks into other directions and is interfered with by mixed-noise audio from those directions; sound-source localization is therefore easily disturbed at wake-up time, localization errors occur, accuracy drops sharply, and the effect, and hence the user experience, of vehicle-mounted multi-sound-zone voice interaction is seriously affected.
To solve this problem, in the embodiments of the present application, position information comprising at least one position direction is detected by each vehicle-mounted seat sensor; the multi-channel microphone audio is processed by echo cancellation and a narrow-beam algorithm to obtain multi-channel audio; and the target direction for speech recognition is determined from the position information together with the multi-channel audio. Thus, on top of the multi-channel audio, the seat-sensor position information serves as auxiliary information for determining the target direction, which effectively avoids sound-source localization interference during voice wake-up in harsh scenes, improves localization accuracy at wake-up time, and thereby achieves more accurate vehicle-mounted multi-sound-zone voice interaction and a better user experience.
For example, the embodiments of the present application may be applied to the scenario shown in Fig. 1, which includes the vehicle-mounted seat sensors 101, the vehicle-mounted microphones 102, and the vehicle-mounted multi-sound-zone voice interaction system 103. When a user occupies a vehicle-mounted seat, the corresponding seat sensor 101 detects position information and sends it to the vehicle-mounted multi-sound-zone voice interaction system 103; when a user in the vehicle speaks, the vehicle-mounted microphones 102 pick up multi-channel microphone audio and send it to the system 103; the system 103 then determines the target direction for speech recognition in the manner of the embodiments of the present application and performs directed speech recognition accordingly.
It is to be understood that, in the application scenario described above, although the actions of the embodiment of the present application are described as being performed by the in-vehicle multi-zone voice interactive system 103, the present application is not limited in terms of the execution subject as long as the actions disclosed in the embodiment of the present application are performed.
It is to be understood that the above scenario is only one example of a scenario provided in the embodiment of the present application, and the embodiment of the present application is not limited to this scenario.
The following describes in detail a specific implementation manner of a vehicle-mounted multi-range speech processing method and a related apparatus in the embodiments of the present application by way of embodiments with reference to the accompanying drawings.
Exemplary method
Referring to Fig. 2, a flowchart of a method for vehicle-mounted multi-sound-zone speech processing in an embodiment of the present application is shown. In this embodiment, the method may include the following steps:
step 201: position information detected by each vehicle-mounted seat sensor is obtained, and the position information comprises at least one position direction.
It should be noted that, in a harsh voice wake-up scene, for example when the user in the main driving direction interacts with the in-vehicle device by voice, the wake-up audio of the main driving direction may leak into the co-driving direction, where it mixes with noise and interferes with the main-driving wake-up audio; sound-source localization is then easily disturbed at wake-up time, and the vehicle-mounted multi-sound-zone voice interaction system may wrongly localize the recognition direction to the co-driving direction. In other words, localization errors at wake-up time in harsh scenes greatly reduce localization accuracy and seriously degrade the effect, and hence the user experience, of vehicle-mounted multi-sound-zone voice interaction. Therefore, in the embodiments of the present application, to prevent wake-up audio leaking from the target direction from causing localization interference in other directions, the system considers whether a user is present in each position direction in the vehicle, i.e. in each vehicle-mounted seat, and uses that judgment to exclude localization interference from audio corresponding to unoccupied directions.
Specifically, a sensor, referred to as a vehicle-mounted seat sensor, is mounted below each vehicle-mounted seat in the vehicle. When a user is on a vehicle-mounted seat, the corresponding sensor detects position information and sends it to the vehicle-mounted multi-sound-zone voice interaction system; the position information comprises the position direction of the corresponding vehicle-mounted seat, referred to for short as the position direction. When no user is on a vehicle-mounted seat, its sensor detects no position information and sends nothing to the system. When only one vehicle-mounted seat in the vehicle is occupied, the position information obtained by the system comes from a single seat sensor and comprises only one position direction; when a plurality of vehicle-mounted seats are occupied, the position information comes from a plurality of seat sensors and comprises a plurality of position directions.
As an example of step 201, taking an on-vehicle four-tone zone voice interaction as an example, the on-vehicle seats in the vehicle are respectively a main driving on-vehicle seat, a sub-driving on-vehicle seat, a rear-row left-side on-vehicle seat and a rear-row right-side on-vehicle seat, and the sequentially corresponding on-vehicle seat sensors are respectively a main driving on-vehicle seat sensor, a sub-driving on-vehicle seat sensor, a rear-row left-side on-vehicle seat sensor and a rear-row right-side on-vehicle seat sensor. The vehicle-mounted seat sensors corresponding to users on the main driving vehicle-mounted seat, the auxiliary driving vehicle-mounted seat, the rear-row left side vehicle-mounted seat and the rear-row right side vehicle-mounted seat can detect position information and send the position information to the vehicle-mounted multi-tone-zone voice interaction system, wherein the position information comprises the position direction of the corresponding vehicle-mounted seat, such as the main driving direction, the auxiliary driving direction, the rear-row left side direction or the rear-row right side direction.
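The seat-sensor reading of step 201 can be sketched as follows. This is a minimal illustration with hypothetical names (`SEAT_DIRECTIONS`, `occupied_directions`); the embodiment does not specify any data format:

```python
# Hypothetical sketch of step 201: turn per-seat pressure-sensor readings
# into the "position information", i.e. the list of occupied position directions.
SEAT_DIRECTIONS = ("main_driving", "secondary_driving", "rear_left", "rear_right")

def occupied_directions(sensor_readings):
    """sensor_readings maps a position direction to True when its seat sensor fires."""
    return [d for d in SEAT_DIRECTIONS if sensor_readings.get(d, False)]
```

With all four seats monitored, an empty reading means no position direction can become the target direction of voice recognition.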
Step 202: and carrying out echo cancellation processing and narrow beam algorithm processing on the multi-path microphone audio to obtain the multi-path audio.
It should be noted that, when a user in the vehicle speaks, the vehicle-mounted microphones pick up multiple channels of microphone audio and send them to the vehicle-mounted multi-sound-zone voice interaction system. The multiple channels of microphone audio are first processed by an echo cancellation technique to cancel the system echoes they contain, and then by a narrow beam algorithm, which performs audio separation while reducing noise, yielding multiple channels of audio; any one of these channels contains, as far as possible, only the audio in its main beam direction. The multiple channels of audio correspond one-to-one to the position directions in the vehicle.
As an example of step 202, based on the example of step 201, the multiple channels of microphone audio picked up by the vehicle-mounted microphones and sent to the vehicle-mounted multi-sound-zone voice interaction system are the main driving microphone audio, the secondary driving microphone audio, the rear-row left side microphone audio and the rear-row right side microphone audio; performing echo cancellation processing and narrow beam algorithm processing on them yields the multiple channels of audio: the main driving audio, the secondary driving audio, the rear-row left side audio and the rear-row right side audio.
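The narrow beam algorithm itself is not detailed in the embodiment. As a hedged illustration of the underlying beamforming idea, the sketch below implements a basic integer-delay delay-and-sum beamformer, the simplest fixed beamformer, not necessarily the algorithm the embodiment uses:

```python
def delay_and_sum(channels, delays):
    """channels: list of equal-length sample lists, one per microphone;
    delays[k]: integer sample delay of channel k relative to the main-beam
    steering direction. Aligning and averaging reinforces the main-beam
    direction and attenuates sound from other directions."""
    n = len(channels[0])
    out = []
    for i in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            j = i + d  # advance each channel by its steering delay
            acc += ch[j] if 0 <= j < n else 0.0
        out.append(acc / len(channels))
    return out
```

Steering one such beam per seat direction yields the one-to-one channel-to-direction correspondence described above.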
Step 203: and determining the target direction of the voice recognition based on the position information and the multi-channel audio.
It should be noted that, on the basis of steps 201 to 202, when determining the target direction of voice recognition, it is not enough to rely only on sound source localization based on the multi-channel audio at wake-up time; the position information, which indicates whether a user is present in each position direction in the vehicle, should also be used as auxiliary information in a comprehensive determination. This manner effectively avoids sound source localization interference during voice wake-up in adverse wake-up scenes, and improves the accuracy of sound source localization during voice wake-up in vehicle-mounted multi-sound-zone voice interaction, thereby realizing more accurate interaction and improving the user experience.
In a specific application, the position information obtained by the vehicle-mounted multi-sound-zone voice interaction system may comprise only one position direction or a plurality of position directions. When the position information obtained in step 201 includes only one position direction, only one vehicle-mounted seat in the vehicle is occupied; only that position direction can be the target direction of voice recognition, and the other position directions cannot. In that case it is only necessary to judge whether the audio corresponding to that position direction in the multi-channel audio of step 202 triggers a wake-up callback; triggering the wake-up callback indicates that the audio contains the wake-up word, that is, it is a wake-up audio capable of realizing voice wake-up, and the position direction is then determined as the target direction of voice recognition. When the position information obtained in step 201 includes a plurality of position directions, a plurality of vehicle-mounted seats are occupied, and any one of those position directions may be the target direction of voice recognition. In that case it is first necessary to judge, for each of those position directions, whether its corresponding audio in the multi-channel audio triggers a wake-up callback, determining the audios that do as wake-up audios, and then to perform sound source localization on the basis of the determined wake-up audios to determine the target direction of voice recognition.
Therefore, in an optional implementation manner of this embodiment of this application, the step 203 may include the following steps:
step A: when the position information comprises only one position direction, if the audio corresponding to that position direction in the multi-channel audio triggers a wake-up callback, determining that position direction as the target direction.
As an example, on the basis of the above examples from step 201 to step 202, when the position information includes only the main driving direction and it is determined that the main driving audio corresponding to the main driving direction in the multi-channel audio triggers the wakeup callback, the main driving direction may be directly determined as the target direction of the voice recognition.
step B: when the position information comprises a plurality of position directions, determining the audios that trigger a wake-up callback among the audios corresponding to each of those position directions in the multi-channel audio as wake-up audios, and determining the target direction based on the wake-up audios.
The number of the plurality of position directions included in the position information may be smaller than or equal to the number of each position direction in the vehicle. As an example, on the basis of the example of the above step 201, the position information includes any two or three position directions of the main driving direction, the sub-driving direction, the rear-row left direction, and the rear-row right direction. As another example, on the basis of the example of step 201 described above, the position information includes four position directions of the main driving direction, the sub-driving direction, the rear-row left direction, and the rear-row right direction.
When the number of position directions included in the position information is smaller than the number of position directions in the vehicle, the audios corresponding to those position directions need to be screened out from the multi-channel audio, and each screened audio is then judged for whether it triggers a wake-up callback to determine the wake-up audios; when the number of position directions included in the position information equals the number of position directions in the vehicle, each audio in the multi-channel audio is directly judged for whether it triggers a wake-up callback to determine the wake-up audios.
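The screening just described can be sketched as follows; the function name and the `triggers_wake` predicate (standing in for the wake-up callback) are hypothetical:

```python
def candidate_wake_audios(occupied_dirs, channel_audio, triggers_wake):
    """channel_audio maps each position direction in the vehicle to its beam
    output. Only audios of occupied directions are screened, and those that
    trigger the wake-up callback are kept as wake-up audios."""
    screened = {d: channel_audio[d] for d in occupied_dirs if d in channel_audio}
    return {d: a for d, a in screened.items() if triggers_wake(a)}
```

When the position information covers every direction in the vehicle, `occupied_dirs` equals all channel keys and the screening step degenerates to checking every channel directly.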
The wake-up audio determined in step B may be one path of wake-up audio or multiple paths of wake-up audio. When the wake-up audio is one path of wake-up audio, only that audio can realize voice wake-up, that is, only the position direction corresponding to it can be the target direction of voice recognition; sound source localization during voice wake-up then simply determines that position direction as the target direction. When the wake-up audio is multiple paths of wake-up audio, each path may realize voice wake-up, that is, the position direction corresponding to each path may be the target direction of voice recognition; sound source localization during voice wake-up then compares the wake-up scores and spectral energies of the different paths of wake-up audio, determines one path as the target wake-up audio, and determines the position direction corresponding to the target wake-up audio as the target direction. The wake-up score of a wake-up audio is determined by the degree of matching between the wake-up word it contains and the preset wake-up word, and reflects the spectral characteristics of the wake-up audio; the spectral energy of a wake-up audio is calculated from its spectral information, and reflects the energy characteristics of the wake-up audio; combining the two realizes sound source localization. Therefore, in an optional implementation manner of this embodiment of the present application, the step of determining the target direction based on the wake-up audio in step B may include the following steps:
step B1: and when the awakening audio is one path of awakening audio, determining the position direction corresponding to the awakening audio as the target direction.
As an example, on the basis of the example of step 201, when the position information includes the primary driving direction and the secondary driving direction, and in the primary driving audio corresponding to the primary driving direction and the secondary driving audio corresponding to the secondary driving direction in the multi-channel audio, the primary driving audio triggers the wake-up callback, and the secondary driving audio cannot trigger the wake-up callback, that is, the wake-up audio is the primary driving audio, the primary driving direction corresponding to the primary driving audio is directly determined as the target direction of the voice recognition.
Step B2: when the wake-up audio is a multi-path wake-up audio, determining a target wake-up audio from the multi-path wake-up audio based on the wake-up score and the spectral energy of each path of wake-up audio in the multi-path wake-up audio, and determining a position direction corresponding to the target wake-up audio as the target direction.
When step B2 is specifically implemented, first, the wake-up audio corresponding to the highest wake-up score and the wake-up audio corresponding to the highest spectral energy are determined based on the wake-up score and spectral energy of each path of wake-up audio; then, the wake-up scores of these two paths are compared to obtain the wake-up score difference, and their spectral energies are compared to obtain the spectral energy difference; finally, the wake-up score difference is measured against the preset wake-up score difference, and the spectral energy difference against the first preset spectral energy difference. Because the spectral energy difference is more trustworthy than the wake-up score difference in sound source localization, whether the wake-up audio corresponding to the highest wake-up score or the one corresponding to the highest spectral energy is the target wake-up audio is determined with reference to the following table.
| Wake-up score difference | Spectral energy difference | Target wake-up audio |
| --- | --- | --- |
| Large | Large | Wake-up audio corresponding to the highest spectral energy |
| Large | Small | Wake-up audio corresponding to the highest wake-up score |
| Small | Large | Wake-up audio corresponding to the highest spectral energy |
| Small | Small | Wake-up audio corresponding to the highest spectral energy |
Therefore, in an optional implementation manner of this embodiment of this application, the step B2 of determining the target wake-up audio from the multiple wake-up audios based on the wake-up score and the spectral energy of each of the multiple wake-up audios may include the following steps, for example:
step B21: determining that the awakening audio corresponding to the highest awakening score and the highest spectral energy in the multi-path awakening audio is a first awakening audio and a second awakening audio respectively;
step B22: when the awakening score difference between the first awakening audio and the second awakening audio is larger than a preset awakening score difference and the frequency spectrum energy difference is smaller than a first preset frequency spectrum energy difference, determining the first awakening audio as the target awakening audio;
step B23: and when the awakening score difference between the first awakening audio and the second awakening audio is smaller than or equal to the preset awakening score difference or the frequency spectrum energy difference is larger than or equal to the first preset frequency spectrum energy difference, determining the second awakening audio as the target awakening audio.
As an example, the first wake-up audio is the wake-up audio corresponding to the highest wake-up score, with wake-up score A1 and spectral energy E1; the second wake-up audio is the wake-up audio corresponding to the highest spectral energy, with wake-up score A2 and spectral energy E2. The wake-up score difference between the first wake-up audio and the second wake-up audio is (A1 - A2 + 0.01)/(A1 + 0.01), and the spectral energy difference between the first wake-up audio and the second wake-up audio is (E2 - E1 + 0.01)/(E2 + 0.01). With a preset wake-up score difference of 0.13 and a first preset spectral energy difference of 0.19: when (A1 - A2 + 0.01)/(A1 + 0.01) > 0.13 and (E2 - E1 + 0.01)/(E2 + 0.01) < 0.19, the first wake-up audio is determined as the target wake-up audio; when (A1 - A2 + 0.01)/(A1 + 0.01) ≤ 0.13 or (E2 - E1 + 0.01)/(E2 + 0.01) ≥ 0.19, the second wake-up audio is determined as the target wake-up audio.
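The selection rule of steps B21 to B23, with the example thresholds of 0.13 and 0.19 above as defaults, can be sketched as follows (the dictionary-based interface is an assumption):

```python
def pick_target_wake_audio(scores, energies,
                           score_diff_thresh=0.13, energy_diff_thresh=0.19):
    """scores/energies map each wake-up audio's position direction to its
    wake-up score and spectral energy."""
    first = max(scores, key=scores.get)       # highest wake-up score (step B21)
    second = max(energies, key=energies.get)  # highest spectral energy (step B21)
    a1, a2 = scores[first], scores[second]
    e1, e2 = energies[first], energies[second]
    score_diff = (a1 - a2 + 0.01) / (a1 + 0.01)
    energy_diff = (e2 - e1 + 0.01) / (e2 + 0.01)
    # step B22: trust the score only when it leads clearly AND energy does not
    if score_diff > score_diff_thresh and energy_diff < energy_diff_thresh:
        return first
    # step B23: otherwise the spectral energy is the more trustworthy cue
    return second
```

Note that when one path leads in both score and energy, `first` and `second` coincide and either branch returns the same direction, matching the table above.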
It should be further noted that, when the number of position directions included in the position information is smaller than the number of position directions in the vehicle, the remaining position directions (referred to for short as other position directions) cannot be the target direction of voice recognition. The audios corresponding to the other position directions in the multi-channel audio may therefore be determined as noise reduction reference audio, and an adaptive filtering algorithm may use this reference audio to perform secondary noise reduction on the wake-up audio, yielding a secondarily denoised wake-up audio, recorded as the noise reduction wake-up audio; the target direction of voice recognition is then determined based on the noise reduction wake-up audio. Therefore, in an optional implementation manner of the embodiment of the present application, the method further includes step C: when the number of position directions included in the position information is smaller than the number of position directions in the vehicle, determining the audios in the multi-channel audio corresponding to the other position directions as noise reduction reference audio. Correspondingly, determining the target direction based on the wake-up audio in step B may specifically be: performing adaptive filtering algorithm processing on the wake-up audio based on the noise reduction reference audio to obtain a noise reduction wake-up audio, and determining the target direction based on the noise reduction wake-up audio.
Specifically, the secondary denoising of the wake-up audio is performed by the denoising reference audio through the adaptive filtering algorithm, which means that state noise information of the denoising reference audio is extracted first, and then the secondary denoising of the wake-up audio is performed by the adaptive filtering algorithm according to the state noise information. That is, in an optional implementation manner of the embodiment of the present application, the step B of performing adaptive filtering algorithm processing on the wake-up audio based on the noise reduction reference audio to obtain a noise reduction wake-up audio may include the following steps:
step B3: extracting state noise information of the noise reduction reference audio;
step B4: and carrying out adaptive filtering algorithm processing on the awakening audio based on the state noise information to obtain the noise reduction awakening audio.
Through various implementation manners provided by the embodiment, the position information comprising at least one position direction is detected through each vehicle-mounted seat sensor; processing the multi-path microphone audio by using an echo cancellation technology and a narrow beam algorithm to obtain multi-path audio; and comprehensively determining the target direction of the voice recognition by combining the position information and the multi-channel audio. Therefore, on the basis of multi-channel audio, position information is obtained by detecting each vehicle-mounted seat sensor and serves as auxiliary information, the target direction of voice recognition is comprehensively determined, sound source positioning interference during voice awakening in the vehicle-mounted multi-sound zone voice interaction process under a severe voice awakening scene can be effectively avoided, the accuracy of sound source positioning during voice awakening in the vehicle-mounted multi-sound zone voice interaction process is improved, more accurate vehicle-mounted multi-sound zone voice interaction is achieved, and user experience of vehicle-mounted multi-sound zone voice interaction is improved.
It should be further noted that, after the target direction of voice recognition is determined, if a user in the target direction and a user in a non-target direction speak simultaneously, multiple channels of audio to be recognized are obtained in the voice recognition process through echo cancellation and narrow beam algorithm processing. Because the narrow beam algorithm may suffer from leakage, the audio to be recognized corresponding to the target direction contains both the audio of the user in the target direction and the audio of the user in the non-target direction, which easily causes recognition crosstalk, greatly reduces the accuracy of voice recognition, and seriously affects the effect of vehicle-mounted multi-sound-zone voice interaction and thereby the user experience. Therefore, on the basis of the above embodiment, after the multiple channels of audio to be recognized are obtained, the audio to be recognized corresponding to the target direction is taken as the target-direction audio to be recognized, and strong noise reduction is performed on it according to the spectral energies of its main-beam-direction and non-main-beam-direction components within a period of time. This reduces the risk of recognition crosstalk, improves the accuracy of voice recognition, and improves the effect and user experience of vehicle-mounted multi-sound-zone voice interaction.
Referring to fig. 3, a flow chart of another method for vehicle-mounted multi-zone speech processing in the embodiment of the present application is shown. In this embodiment, the method may include, for example, the steps of:
step 301: position information detected by each vehicle-mounted seat sensor is obtained, and the position information comprises at least one position direction.
Step 302: and carrying out echo cancellation processing and narrow beam algorithm processing on the multi-path microphone audio to obtain the multi-path audio.
Step 303: and determining the target direction of the voice recognition based on the position information and the multi-channel audio.
It should be noted that, in this embodiment, steps 301 to 303 are the same as steps 201 to 203 in the above embodiment, and specific implementations of steps 301 to 303 may refer to specific implementations of steps 201 to 203 in the above embodiment, which are not described herein again.
Step 304: and obtaining a plurality of channels of audio to be identified.
Step 305: and determining the audio to be identified corresponding to the target direction in the multi-channel audio to be identified as the audio to be identified in the target direction.
As an example of steps 304 to 305, the target direction of voice recognition is the main driving direction, the obtained multiple channels of audio to be recognized are the main driving audio to be recognized and the secondary driving audio to be recognized, and based on the main driving direction, the audio to be recognized in the target direction is determined as the main driving audio to be recognized.
Step 306: based on the frequency spectrum energy of the audio to be identified in the main beam direction and the non-main beam direction in the audio to be identified in the target direction within the preset time, carrying out strong noise reduction on the audio to be identified in the target direction to obtain a strong noise reduction audio to be identified in the target direction; the main beam direction is the target direction.
Specifically, the spectral energy difference between the main-beam-direction and non-main-beam-direction components of the target-direction audio to be recognized within the preset time needs to be calculated, and its magnitude is measured against the second preset spectral energy difference. When the spectral energy difference is large, the non-main-beam-direction component is residual interference audio left over from the narrow beam algorithm processing and needs to be removed, so as to obtain the strongly denoised target-direction audio that actually needs voice recognition. Therefore, in an optional implementation manner of this embodiment of this application, the step 306 may include the following steps, for example:
step D: obtaining the frequency spectrum energy difference of the audio to be identified in the main beam direction and the non-main beam direction in the audio to be identified in the target direction based on the frequency spectrum energy of the audio to be identified in the main beam direction and the non-main beam direction in the audio to be identified in the target direction within the preset time;
step E: and if the frequency spectrum energy difference of the audio to be identified in the main beam direction and the non-main beam direction in the audio to be identified in the target direction is greater than or equal to a second preset frequency spectrum energy difference, eliminating the audio to be identified in the non-main beam direction in the audio to be identified in the target direction, and obtaining the audio to be identified in the strong noise reduction target direction.
As an example, on the basis of the above examples of step 304-step 305, the spectrum energy difference of the to-be-recognized audio in the main driving direction and the secondary driving direction in the main driving to-be-recognized audio is greater than or equal to a second preset spectrum energy difference, the to-be-recognized audio in the secondary driving direction in the main driving to-be-recognized audio is removed, and the strong noise-reduced main driving to-be-recognized audio is obtained.
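Steps D and E can be sketched as follows. The embodiment does not give the spectral energy or difference formulas, so mean squared amplitude over the window and a relative difference are assumed here, and "eliminating" the non-main-beam component is modeled as sample-wise subtraction:

```python
def strong_denoise(main_beam, non_main_beam, window, energy_diff_thresh):
    """Compare the spectral energies of the main-beam and non-main-beam
    components over the preset window (step D); if the difference reaches the
    second preset spectral energy difference, treat the non-main-beam
    component as residual leakage and remove it (step E)."""
    def energy(sig):
        # mean squared amplitude as a stand-in for spectral energy
        return sum(s * s for s in sig[:window]) / window
    e_main, e_side = energy(main_beam), energy(non_main_beam)
    diff = (e_main - e_side) / (e_main + 1e-9)  # assumed relative-difference form
    if diff >= energy_diff_thresh:
        return [m - s for m, s in zip(main_beam, non_main_beam)]
    return list(main_beam)
```

The `window` and `energy_diff_thresh` arguments correspond to the preset time and the second preset spectral energy difference, which step F below adjusts dynamically.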
It should be further noted that, because the main-beam-direction and non-main-beam-direction components of the target-direction audio to be recognized correspond to different users with different audio characteristics, the preset time, which represents the comparison duration of their spectral energies, and the second preset spectral energy difference, which measures the magnitude of the spectral energy difference, need to be dynamically adjusted according to those user audio characteristics. That is, in an optional implementation manner of the embodiment of the present application, before step 306, for example, step F may further be included: adjusting the preset time and/or the second preset spectral energy difference based on the user audio characteristics corresponding to the main-beam-direction and non-main-beam-direction components of the target-direction audio to be recognized.
Through various implementation manners provided by the embodiment, the position information comprising at least one position direction is detected through each vehicle-mounted seat sensor; processing the multi-path microphone audio by using an echo cancellation technology and a narrow beam algorithm to obtain multi-path audio; and comprehensively determining the target direction of the voice recognition by combining the position information and the multi-channel audio. The method comprises the steps of firstly determining audio to be identified in a target direction in multiple paths of audio to be identified according to the target direction, and then carrying out strong noise reduction treatment according to the frequency spectrum energy of the audio to be identified in a main beam direction and a non-main beam direction to obtain the audio to be identified in the strong noise reduction target direction. Therefore, on the basis of multi-channel audio, position information detected by each vehicle-mounted seat sensor is used as auxiliary information, the target direction of voice recognition is comprehensively determined, and sound source positioning interference during voice awakening in a vehicle-mounted multi-sound-zone voice interaction process under a severe voice awakening scene can be effectively avoided; and the spectrum energy of the audio frequency to be recognized of the main beam direction and the non-main beam direction in the audio frequency to be recognized corresponding to the target direction is subjected to strong noise reduction processing, so that the risk of recognition crosstalk occurring in voice recognition is reduced, the accuracy of sound source positioning during voice awakening in the vehicle-mounted multi-sound zone voice interaction process is improved, more accurate vehicle-mounted multi-sound zone voice interaction is realized, and the user experience of the vehicle-mounted multi-sound zone voice interaction is improved.
Exemplary devices
Referring to fig. 4, a schematic structural diagram of a vehicle-mounted multi-range speech processing device in the embodiment of the present application is shown. In this embodiment, the apparatus may specifically include:
a first obtaining unit 401 for obtaining position information detected by each in-vehicle seat sensor, the position information including at least one position direction;
a second obtaining unit 402, configured to perform echo cancellation processing and narrow-beam algorithm processing on multiple paths of microphone audio to obtain multiple paths of audio;
a first determining unit 403, configured to determine a target direction of speech recognition based on the position information and the multiple channels of audio.
In an optional implementation manner of the embodiment of the present application, the first determining unit 403 includes:
the first determining subunit is configured to, when the position information includes only one position direction, determine the position direction as the target direction if a wake-up callback is triggered by an audio frequency corresponding to the position direction in the multiple channels of audio frequencies;
and a second determining subunit, configured to determine, when the position information includes a plurality of position directions, an audio that triggers a wake-up callback in the audio corresponding to each of the plurality of position directions in the multi-path audio as a wake-up audio, and determine the target direction based on the wake-up audio.
In an optional implementation manner of the embodiment of the present application, the second determining subunit includes:
a first determining module, configured to, when the wake-up audio is a single channel of wake-up audio, determine the position direction corresponding to that wake-up audio as the target direction;
and a second determining module, configured to, when the wake-up audio comprises multiple channels of wake-up audio, determine a target wake-up audio from the multiple channels based on the wake-up score and the spectral energy of each channel of wake-up audio, and determine the position direction corresponding to the target wake-up audio as the target direction.
In an optional implementation manner of the embodiment of the present application, the second determining module includes:
a first determining submodule, configured to determine, among the multiple channels of wake-up audio, the channel with the highest wake-up score as a first wake-up audio and the channel with the highest spectral energy as a second wake-up audio;
a second determining submodule, configured to determine the first wake-up audio as the target wake-up audio when the wake-up score difference between the first wake-up audio and the second wake-up audio is greater than a preset wake-up score difference and the spectral energy difference is smaller than a first preset spectral energy difference;
and a third determining submodule, configured to determine the second wake-up audio as the target wake-up audio when the wake-up score difference between the first wake-up audio and the second wake-up audio is less than or equal to the preset wake-up score difference, or the spectral energy difference is greater than or equal to the first preset spectral energy difference.
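The arbitration performed by the three submodules above might look as follows. The helper name, the default thresholds, and the use of total FFT magnitude-squared as "spectral energy" are assumptions for illustration.

```python
import numpy as np

def spectral_energy(audio):
    """Total spectral energy of one channel: sum of squared FFT magnitudes."""
    return float(np.sum(np.abs(np.fft.rfft(np.asarray(audio, dtype=float))) ** 2))

def pick_target_wake_audio(wake_audios, wake_scores,
                           score_diff_threshold=0.2,
                           energy_diff_threshold=1.0):
    """Arbitrate among several channels of wake-up audio.

    wake_audios: dict direction -> wake-up audio samples
    wake_scores: dict direction -> wake-up score
    Returns the direction of the target wake-up audio.
    """
    energies = {d: spectral_energy(a) for d, a in wake_audios.items()}
    first = max(wake_scores, key=wake_scores.get)   # highest wake-up score
    second = max(energies, key=energies.get)        # highest spectral energy
    if first == second:
        return first
    score_diff = wake_scores[first] - wake_scores[second]
    energy_diff = energies[second] - energies[first]
    # A clear score margin together with only a small energy gap favours
    # the highest-scoring channel; otherwise the most energetic one wins.
    if score_diff > score_diff_threshold and energy_diff < energy_diff_threshold:
        return first
    return second
```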
In an optional implementation manner of the embodiment of the present application, the apparatus further includes:
a second determining unit, configured to, when the number of position directions included in the position information is smaller than the number of position directions in the vehicle, determine, from the multiple channels of audio, the audio corresponding to each position direction in the vehicle other than those included in the position information as noise-reduction reference audio;
correspondingly, the second determining subunit is specifically configured to:
perform adaptive filtering on the wake-up audio based on the noise-reduction reference audio to obtain noise-reduced wake-up audio, and determine the target direction based on the noise-reduced wake-up audio.
In an optional implementation manner of the embodiment of the present application, the second determining subunit includes:
an extraction module, configured to extract state noise information from the noise-reduction reference audio;
and an obtaining module, configured to perform adaptive filtering on the wake-up audio based on the state noise information to obtain the noise-reduced wake-up audio.
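The embodiment names only an "adaptive filtering algorithm" without fixing a particular one. A common concrete choice for cancelling noise against a reference channel is a normalized LMS (NLMS) canceller, sketched below; the function name, filter length, and step size are all assumptions, not from the patent.

```python
import numpy as np

def nlms_denoise(primary, reference, taps=8, mu=0.5, eps=1e-8):
    """NLMS adaptive noise canceller: adapt an FIR filter so the
    noise-reduction reference audio predicts the noise component of the
    wake-up (primary) audio, then subtract that prediction."""
    w = np.zeros(taps)                    # adaptive filter weights
    buf = np.zeros(taps)                  # recent reference samples
    out = np.zeros(len(primary))
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        noise_est = w @ buf               # noise predicted from reference
        e = primary[n] - noise_est        # error = denoised sample
        w += (mu / (eps + buf @ buf)) * e * buf   # normalized LMS update
        out[n] = e
    return out
```

When the primary channel is dominated by noise correlated with the reference, the residual shrinks as the weights converge.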
In an optional implementation manner of the embodiment of the present application, the apparatus further includes:
a third obtaining unit, configured to obtain multiple channels of audio to be recognized;
a third determining unit, configured to determine, among the multiple channels of audio to be recognized, the audio corresponding to the target direction as the audio to be recognized in the target direction;
and a fourth obtaining unit, configured to apply strong noise reduction to the audio to be recognized in the target direction, based on the spectral energy of its main-beam-direction and non-main-beam-direction components within a preset time, to obtain strongly noise-reduced audio to be recognized in the target direction; the main beam direction is the target direction.
In an optional implementation manner of the embodiment of the present application, the fourth obtaining unit includes:
a first obtaining subunit, configured to obtain, based on the spectral energy of the main-beam-direction and non-main-beam-direction components of the audio to be recognized in the target direction within the preset time, the spectral energy difference between those components;
and a second obtaining subunit, configured to, if that spectral energy difference is greater than or equal to a second preset spectral energy difference, reject the non-main-beam-direction component of the audio to be recognized in the target direction, thereby obtaining the strongly noise-reduced audio to be recognized in the target direction.
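The energy-difference gating performed by the two subunits above might be sketched like this; the function name, the returned rejection flag, and the threshold value are illustrative assumptions.

```python
import numpy as np

def strong_noise_reduction(main_audio, side_audio, energy_diff_threshold):
    """Compare the spectral energy of the main-beam (target-direction)
    component with that of the non-main-beam component over the analysis
    window, and reject the non-main-beam audio when the main beam
    dominates by at least the preset margin."""
    e_main = float(np.sum(np.abs(np.fft.rfft(main_audio)) ** 2))
    e_side = float(np.sum(np.abs(np.fft.rfft(side_audio)) ** 2))
    if e_main - e_side >= energy_diff_threshold:
        # Main beam clearly dominates: keep only the target-direction
        # channel, discarding the non-main-beam contribution.
        return main_audio, True
    return main_audio, False
```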
In an optional implementation manner of the embodiment of the present application, the apparatus further includes:
an adjusting unit, configured to adjust the preset time and/or the second preset spectral energy difference based on the user audio characteristics corresponding to the main-beam-direction and non-main-beam-direction components of the audio to be recognized in the target direction.
Through the implementations provided by this embodiment, the vehicle-mounted multi-sound-zone voice processing apparatus comprises a first obtaining unit, a second obtaining unit, and a first determining unit. The first obtaining unit obtains position information, including at least one position direction, detected by the vehicle-mounted seat sensors; the second obtaining unit processes the multiple channels of microphone audio with echo cancellation and a narrow-beam algorithm to obtain multiple channels of audio; and the first determining unit determines the target direction for speech recognition by combining the position information with the multi-channel audio. Because the position information detected by the seat sensors serves as auxiliary information on top of the multi-channel audio, sound-source-localization interference during voice wake-up in adverse wake-up scenarios of vehicle-mounted multi-sound-zone voice interaction is effectively avoided, the accuracy of sound-source localization during voice wake-up is improved, more accurate vehicle-mounted multi-sound-zone voice interaction is achieved, and the user experience is improved.
In addition, an embodiment of the present application further provides a terminal device, where the terminal device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the vehicle-mounted multi-sound-zone voice processing method according to the instructions in the program code.
An embodiment of the present application further provides a computer-readable storage medium configured to store program code for executing the vehicle-mounted multi-sound-zone voice processing method according to the foregoing method embodiments.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application in any form. Although the present application has been disclosed above by way of preferred embodiments, they are not intended to limit it. Those skilled in the art may, without departing from the scope of the technical solution of the present application, use the methods and technical content disclosed above to make many possible variations and modifications to the technical solution, or amend it into equivalent embodiments. Therefore, any simple amendment, equivalent change, or modification made to the above embodiments according to the technical essence of the present application, without departing from the content of the technical solution, still falls within the protection scope of the technical solution of the present application.

Claims (12)

1. A vehicle-mounted multi-sound-zone voice processing method is characterized by comprising the following steps:
obtaining position information detected by each vehicle-mounted seat sensor, wherein the position information comprises at least one position direction;
performing echo cancellation processing and narrow-beam algorithm processing on multiple channels of microphone audio to obtain multiple channels of audio;
and determining a target direction for speech recognition based on the position information and the multi-channel audio.
2. The method of claim 1, wherein the determining a target direction for speech recognition based on the position information and the multi-channel audio comprises:
when the position information includes only one position direction, determining that position direction as the target direction if the audio corresponding to that position direction in the multi-channel audio triggers a wake-up callback;
and when the position information includes a plurality of position directions, determining as wake-up audio the audio that triggers a wake-up callback among the audio corresponding to each of the plurality of position directions in the multi-channel audio, and determining the target direction based on the wake-up audio.
3. The method of claim 2, wherein the determining the target direction based on the wake-up audio comprises:
when the wake-up audio is a single channel of wake-up audio, determining the position direction corresponding to the wake-up audio as the target direction;
and when the wake-up audio comprises multiple channels of wake-up audio, determining a target wake-up audio from the multiple channels of wake-up audio based on the wake-up score and the spectral energy of each channel, and determining the position direction corresponding to the target wake-up audio as the target direction.
4. The method of claim 3, wherein the determining a target wake-up audio from the multiple channels of wake-up audio based on the wake-up score and the spectral energy of each channel comprises:
determining, among the multiple channels of wake-up audio, the channel with the highest wake-up score as a first wake-up audio and the channel with the highest spectral energy as a second wake-up audio;
when the wake-up score difference between the first wake-up audio and the second wake-up audio is greater than a preset wake-up score difference and the spectral energy difference is smaller than a first preset spectral energy difference, determining the first wake-up audio as the target wake-up audio;
and when the wake-up score difference between the first wake-up audio and the second wake-up audio is less than or equal to the preset wake-up score difference, or the spectral energy difference is greater than or equal to the first preset spectral energy difference, determining the second wake-up audio as the target wake-up audio.
5. The method of claim 2, further comprising:
when the number of position directions included in the position information is smaller than the number of position directions in the vehicle, determining, from the multi-channel audio, the audio corresponding to each position direction in the vehicle other than the plurality of position directions as noise-reduction reference audio;
correspondingly, the determining the target direction based on the wake-up audio specifically includes:
performing adaptive filtering on the wake-up audio based on the noise-reduction reference audio to obtain noise-reduced wake-up audio, and determining the target direction based on the noise-reduced wake-up audio.
6. The method of claim 5, wherein the performing adaptive filtering on the wake-up audio based on the noise-reduction reference audio to obtain noise-reduced wake-up audio comprises:
extracting state noise information from the noise-reduction reference audio;
and performing adaptive filtering on the wake-up audio based on the state noise information to obtain the noise-reduced wake-up audio.
7. The method of claim 1, further comprising:
obtaining multiple channels of audio to be recognized;
determining, among the multiple channels of audio to be recognized, the audio corresponding to the target direction as the audio to be recognized in the target direction;
and applying strong noise reduction to the audio to be recognized in the target direction, based on the spectral energy of its main-beam-direction and non-main-beam-direction components within a preset time, to obtain strongly noise-reduced audio to be recognized in the target direction; the main beam direction is the target direction.
8. The method of claim 7, wherein the applying strong noise reduction to the audio to be recognized in the target direction based on the spectral energy of its main-beam-direction and non-main-beam-direction components within a preset time comprises:
obtaining, based on the spectral energy of the main-beam-direction and non-main-beam-direction components of the audio to be recognized in the target direction within the preset time, the spectral energy difference between those components;
and if that spectral energy difference is greater than or equal to a second preset spectral energy difference, rejecting the non-main-beam-direction component of the audio to be recognized in the target direction to obtain the strongly noise-reduced audio to be recognized in the target direction.
9. The method of claim 8, further comprising:
adjusting the preset time and/or the second preset spectral energy difference based on the user audio characteristics corresponding to the main-beam-direction and non-main-beam-direction components of the audio to be recognized in the target direction.
10. A vehicle-mounted multi-sound-zone voice processing apparatus, comprising:
a position information obtaining unit, configured to obtain position information detected by each vehicle-mounted seat sensor, the position information including at least one position direction;
a multi-channel audio obtaining unit, configured to perform echo cancellation processing and narrow-beam algorithm processing on multiple channels of microphone audio to obtain multiple channels of audio;
and a target direction determining unit, configured to determine a target direction for speech recognition based on the position information and the multi-channel audio.
11. A terminal device, comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of vehicular multi-zone speech processing of any of claims 1-9 according to instructions in the program code.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store program code for performing the vehicle-mounted multi-sound-zone voice processing method of any of claims 1-9.
CN202010424470.6A 2020-05-19 2020-05-19 Vehicle-mounted multitone region voice processing method and related device Active CN111599366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010424470.6A CN111599366B (en) 2020-05-19 2020-05-19 Vehicle-mounted multitone region voice processing method and related device


Publications (2)

Publication Number Publication Date
CN111599366A true CN111599366A (en) 2020-08-28
CN111599366B CN111599366B (en) 2024-04-12

Family

ID=72187396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010424470.6A Active CN111599366B (en) 2020-05-19 2020-05-19 Vehicle-mounted multitone region voice processing method and related device

Country Status (1)

Country Link
CN (1) CN111599366B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599126A (en) * 2020-12-03 2021-04-02 海信视像科技股份有限公司 Awakening method of intelligent device, intelligent device and computing device
CN113192289A (en) * 2021-04-14 2021-07-30 恒大恒驰新能源汽车研究院(上海)有限公司 Monitoring and alarming system and method for personnel in vehicle
CN115346527A (en) * 2022-08-08 2022-11-15 科大讯飞股份有限公司 Voice control method, device, system, vehicle and storage medium

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB1316562A (en) * 1971-03-01 1973-05-09 Cossor Ltd A C Secondary radar systems
JPH02127692A (en) * 1988-11-08 1990-05-16 Casio Comput Co Ltd Sound source device
US20080071547A1 (en) * 2006-09-15 2008-03-20 Volkswagen Of America, Inc. Speech communications system for a vehicle and method of operating a speech communications system for a vehicle
US20150117669A1 (en) * 2013-10-25 2015-04-30 Hyundai Mobis Co., Ltd. Apparatus and method for controlling beamforming microphone considering location of driver seat
CN108986806A (en) * 2018-06-30 2018-12-11 上海爱优威软件开发有限公司 Sound control method and system based on Sounnd source direction
CN109192203A (en) * 2018-09-29 2019-01-11 百度在线网络技术(北京)有限公司 Multitone area audio recognition method, device and storage medium
CN109461449A (en) * 2018-12-29 2019-03-12 苏州思必驰信息科技有限公司 Voice awakening method and system for smart machine
CN109490834A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of sound localization method, sound source locating device and vehicle
CN109754803A (en) * 2019-01-23 2019-05-14 上海华镇电子科技有限公司 Vehicle multi-sound area voice interactive system and method
CN110010126A (en) * 2019-03-11 2019-07-12 百度国际科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium
CN110033775A (en) * 2019-05-07 2019-07-19 百度在线网络技术(北京)有限公司 Multitone area wakes up exchange method, device and storage medium
US20190311715A1 (en) * 2016-06-15 2019-10-10 Nuance Communications, Inc. Techniques for wake-up word recognition and related systems and methods
CN110366156A (en) * 2019-08-26 2019-10-22 科大讯飞(苏州)科技有限公司 Vehicle bluetooth communication processing method, onboard audio management system and relevant device
CN110475180A (en) * 2019-08-23 2019-11-19 科大讯飞(苏州)科技有限公司 Vehicle multi-sound area audio processing system and method
CN110554357A (en) * 2019-09-12 2019-12-10 苏州思必驰信息科技有限公司 Sound source positioning method and device
CN111098859A (en) * 2018-10-26 2020-05-05 福特全球技术公司 Vehicle-mounted digital auxiliary authentication



Also Published As

Publication number Publication date
CN111599366B (en) 2024-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant