WO2021031811A1 - Method and device for voice enhancement

Method and device for voice enhancement

Info

Publication number
WO2021031811A1
Authority
WO
WIPO (PCT)
Prior art keywords: sound, voice, information, electronic device, voiceprint
Prior art date
Application number
PCT/CN2020/105296
Other languages
French (fr)
Chinese (zh)
Inventor
王保辉
李伟
李晓建
胡伟湘
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021031811A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/04 Training, enrolment or model building
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0272 Voice signal separating
    • G10L 2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party

Definitions

  • the embodiments of the present application relate to the field of communication technologies, and in particular, to a method and device for speech enhancement.
  • Traditional speech enhancement algorithms apply the same processing to all input sounds in different environments, performing the same enhancement regardless of what the input is.
  • the sound received by a smart speaker includes various sounds such as human voice, TV sound, dog barking, water flow sound, etc.
  • Using an existing voice enhancement algorithm enhances all of these sounds in the same way, which may prevent the smart speaker from accurately learning user commands, so that the smart speaker fails to produce voice interaction output, or produces inaccurate output, and the user's smart interaction experience is poor. Therefore, existing speech enhancement algorithms cannot adapt to scenes with a more complex sound environment, and the user's intelligent interaction experience suffers.
  • The embodiments of the present application provide a voice enhancement method and device, which can accurately capture the voice of a target user in a scene with a relatively complex sound environment, thereby improving the user's intelligent interaction experience.
  • In a first aspect, a voice enhancement method includes: an electronic device collects a first sound, where the first sound includes at least one of a second sound and a background sound; the electronic device recognizes the first sound; when the second sound exists in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound; and the electronic device processes the third sound. Based on this solution, by recognizing the collected first sound and, when the second sound (a human voice) exists in the first sound, separating the second sound from the first sound, the third sound can be extracted from the second sound.
  • Since the third sound is the user's voice interaction command, the obtained voice interaction output is more accurate, the voice interaction can be completed accurately, and the user's intelligent interaction experience is improved.
  • This application recognizes and analyzes the sound and combines the attribute information of the sound to obtain the user's voice interaction command (the third sound). The voice enhancement method therefore does not perform the same enhancement on all input sounds; instead, it uses the attribute information of the currently collected sound for targeted enhancement, so it can adapt to a complex sound environment and improve the user's intelligent interaction experience in that environment.
  • The electronic device recognizing the first sound includes: the electronic device performs sound event recognition on the first sound according to a sound event recognition model to obtain the sound category information of the first sound. Based on this solution, by recognizing the first sound, the sound category information of the first sound can be obtained, so that the second sound can be extracted from the first sound according to the sound category information of the first sound.
  • The electronic device performing sound analysis on the second sound to obtain the third sound includes: separating the second sound from the first sound according to the sound category information of the first sound; analyzing the sound attribute information of the second sound, where the sound attribute information includes one or more of sound orientation information, voiceprint information, sound time information, and sound decibel information; and obtaining the third sound according to the sound attribute information of the second sound.
  • Based on this solution, the third sound can be extracted from multiple human voices according to the attribute information of the second sound, yielding a clean user voice interaction command and achieving targeted enhancement.
  • The voiceprint information of the third sound matches the voiceprint information of a registered user. Based on this solution, the sound whose voiceprint information matches that of a registered user can be determined to be the third sound.
  • The above method further includes: the electronic device clusters a fourth sound to obtain new sound category information, where the fourth sound is a sound in the first sound whose sound category information is not recognized by the sound event recognition model; and the sound event recognition model is updated according to the new sound category information to obtain an updated sound event recognition model.
  • Based on this solution, sounds not recognized by the sound event recognition model can be clustered and a new sound event recognition model can be trained. That is, the electronic device can update the sound event recognition model by self-learning from sounds that frequently appear in its environment, so it can adapt to the environment and improve the user's interaction experience.
  • The stability and robustness of the sound event recognition model improve as the user's usage time increases; the more it is used, the better it performs. A sketch of this self-learning step follows.
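A minimal sketch of this clustering step, assuming the unrecognized clips have already been reduced to fixed-length feature vectors and that k-means with a hand-picked cluster count is an acceptable stand-in for whatever clustering the patent intends:

    # Cluster features of sounds the event-recognition model could not classify;
    # each cluster becomes a provisional new sound category for retraining.
    import numpy as np
    from sklearn.cluster import KMeans

    def discover_new_categories(unrecognized: np.ndarray, n_clusters: int = 3):
        """unrecognized: (n_clips, feat_dim) features of 'fourth sound' clips."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = km.fit_predict(unrecognized)
        # Group the clips by cluster; each group is training data for one
        # new category when updating the sound event recognition model.
        return {f"new_category_{k}": unrecognized[labels == k]
                for k in range(n_clusters)}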
  • The above method further includes: the electronic device obtains the sound orientation information of the first sound; obtains the position information of unwanted sounds according to the sound orientation information and the sound category information of the first sound; and filters out sounds from those positions according to the position information of the unwanted sounds. Based on this solution, by learning the azimuth information of various sounds and combining it with the sound category information, the azimuths of unwanted sounds that frequently appear in a specific scene can be obtained, which better assists sound separation. A sketch of this direction-based filtering follows.
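A minimal sketch of remembering where unwanted sound categories habitually come from and filtering by direction; the bookkeeping structure and the 10-degree tolerance are illustrative assumptions:

    # Remember the azimuths at which non-voice categories keep appearing and
    # suppress new sounds arriving from those directions.
    unwanted_azimuths: dict[str, list[float]] = {}  # category -> azimuths seen

    def note_sound(category: str, azimuth_deg: float) -> None:
        if category != "human voice":
            unwanted_azimuths.setdefault(category, []).append(azimuth_deg)

    def is_unwanted_direction(azimuth_deg: float, tol: float = 10.0) -> bool:
        return any(abs(azimuth_deg - a) <= tol
                   for seen in unwanted_azimuths.values() for a in seen)

    note_sound("TV sound", 45.0)        # the TV keeps appearing near 45 degrees
    print(is_unwanted_direction(47.0))  # True: filter sound from this bearing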
  • The above method further includes: acquiring the interactive voice information of the third sound; and, for a third sound whose voiceprint is not registered, performing voiceprint registration in combination with the interactive voice information of the third sound.
  • the above method further includes: the electronic device outputs interactive information according to the sound attribute information of the third sound.
  • the interactive information can be voice interactive information or control signals.
  • the electronic device can combine the position information, voiceprint information, time information, age information, etc. of the third voice to output corresponding interactive information.
  • In a second aspect, a voice enhancement device includes: a processor for recognizing a first sound, the first sound being collected by a voice collection device and including at least one of a second sound and a background sound; when the second sound exists in the first sound, the processor performs sound analysis on the second sound to obtain a third sound; and the processor processes the third sound.
  • The above processor is further configured to perform sound event recognition on the first sound according to the sound event recognition model, and obtain the sound category information of the first sound.
  • The above-mentioned processor is further configured to: separate the second sound from the first sound according to the sound category information of the first sound; analyze the sound attribute information of the second sound, where the sound attribute information includes one or more of sound orientation information, voiceprint information, sound time information, and sound decibel information; and obtain the third sound according to the sound attribute information of the second sound.
  • the voiceprint information of the third voice mentioned above matches the voiceprint information of the registered user.
  • The above-mentioned processor is further configured to: cluster a fourth sound to obtain new sound category information, where the fourth sound is a sound in the first sound whose sound category information is not recognized by the sound event recognition model; and update the sound event recognition model according to the new sound category information to obtain an updated sound event recognition model.
  • The above-mentioned processor is further configured to: obtain the sound orientation information of the first sound; obtain the position information of unwanted sounds according to the sound orientation information and the sound category information of the first sound; and filter out sounds from those positions according to the position information of the unwanted sounds.
  • The above-mentioned processor is further configured to: obtain the interactive voice information of the third sound; and, for a third sound whose voiceprint is not registered, perform voiceprint registration in combination with the interactive voice information.
  • the above-mentioned processor is further configured to output interactive information according to the sound attribute information of the above-mentioned third sound.
  • The embodiments of the present application provide an electronic device that can implement the voice enhancement method described in the first aspect. The method can be implemented by software, by hardware, or by hardware executing corresponding software.
  • the electronic device may include a processor and a memory.
  • the processor is configured to support the electronic device to perform corresponding functions in the above-mentioned first aspect or second aspect method.
  • the memory is used for coupling with the processor, and it stores the necessary program instructions and data of the electronic device.
  • the embodiments of the present application provide a computer storage medium.
  • the computer storage medium includes computer instructions.
  • When the computer instructions run on an electronic device, the electronic device executes the voice enhancement method described in any of the above aspects and their possible designs.
  • The embodiments of the present application provide a computer program product that, when run on a computer, causes the computer to execute the voice enhancement method described in any of the foregoing aspects and possible designs.
  • FIG. 1 is a schematic diagram of an example sound environment scene provided by an embodiment of this application;
  • FIG. 2 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of the software architecture of an electronic device provided by an embodiment of this application;
  • FIG. 4 is a schematic diagram of a system architecture adapted to a voice enhancement method provided by an embodiment of this application;
  • FIG. 5 is a schematic flowchart of a voice enhancement method provided by an embodiment of this application;
  • FIG. 6 is a schematic diagram of the principle of a sound localization method provided by an embodiment of this application;
  • FIG. 7 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application;
  • FIG. 8 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application;
  • FIG. 9 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application;
  • FIG. 10 is a schematic diagram of the structural composition of an electronic device provided by an embodiment of this application.
  • At least one of a, b, or c can mean: a, b, c, a and b, a and c, b and c, or a and b and c, where a, b, and c can each be single or multiple.
  • words such as “first” and “second” are used to distinguish the same items or similar items that have substantially the same function and effect. Those skilled in the art can understand that words such as “first” and “second” do not limit the number and execution order.
  • the "first” in the first application and the "second” in the second application in the embodiments of the present application are only used to distinguish different applications.
  • The descriptions of "first", "second", etc. appearing in the embodiments of this application are only used to illustrate and distinguish the objects being described, imply no order, do not mean that the number of devices in the embodiments of this application is particularly limited, and cannot constitute any limitation on the embodiments of this application.
  • Voiceprint: a biometric feature composed of more than one hundred characteristic dimensions such as wavelength, frequency, and intensity.
  • the production of human language is a complex physiological and physical process between the human body's language center and the vocal organs. Because different people's vocal organs are different in size, shape and function, the voiceprint patterns of any two people will be different.
  • Voiceprint has the characteristics of specificity, relative stability, and variability. The specificity of the voiceprint means that different people have different voiceprints: even if a speaker deliberately imitates another person's voice and tone, or whispers softly, and even if the imitation is vivid, the voiceprint remains different.
  • the relative stability of the voiceprint means that the human voiceprint can remain relatively stable for a long time after adulthood.
  • The variability of the voiceprint may come from physiology, pathology, psychology, imitation, and disguise, and is also related to factors such as environmental interference. Since each person's vocal organs are not exactly the same, the voices of different people can be distinguished by their voiceprints, based on the specificity and relative stability of the voiceprint.
  • Voiceprint recognition: a type of biometric technology that extracts voiceprint features from the speaker's voice signal for identity recognition. Voiceprint recognition can also be called speaker recognition, which includes speaker identification and speaker verification. Speaker identification is used to determine which of several people spoke a given utterance; speaker verification is used to confirm whether a given utterance was spoken by a designated person.
  • In the embodiment of the present application, a deep learning algorithm can be used to extract the speaker's voiceprint features, for example, a speaker recognition system based on a classic deep neural network, or a speaker feature extraction system based on an end-to-end deep neural network. A sketch of such an extractor follows.
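A minimal sketch of a voiceprint (speaker-embedding) extractor; the patent does not specify an architecture, so the layer sizes, the 64-dimensional embedding, and the log-mel input below are all assumptions:

    # Map a variable-length log-mel spectrogram to a fixed-length, unit-norm
    # voiceprint embedding using a tiny 1-D CNN with average pooling over time.
    import torch
    import torch.nn as nn

    class VoiceprintExtractor(nn.Module):
        def __init__(self, n_mels: int = 40, embed_dim: int = 64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(128, embed_dim)

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            h = self.encoder(mel)   # (batch, 128, frames)
            h = h.mean(dim=2)       # average pooling over time
            return nn.functional.normalize(self.proj(h), dim=1)

    extractor = VoiceprintExtractor()
    embedding = extractor(torch.randn(1, 40, 200))  # one 200-frame utterance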
  • the electronic device may perform voiceprint registration before performing voiceprint recognition.
  • Voiceprint registration means that the user records a piece of voice through the pickup device of the electronic device (such as a microphone); this may be called a registered voice. The device extracts the voiceprint feature of the registered voice, and establishes and saves the correspondence between the voiceprint feature and the user who entered the voice.
  • a user who has registered a voiceprint may be referred to as a registered user.
  • the aforementioned voiceprint registration may be a voiceprint registration that is not related to the content of the voice entered by the user, that is, the electronic device may extract voiceprint features from the voice entered by the user for registration.
  • the embodiment of the present application does not limit the specific method for extracting voiceprint features from voice information.
  • the electronic device may support voiceprint registration of multiple users. For example, multiple users separately record a segment of voice through the pickup device of the electronic device. Exemplarily, user 1 enters voice 1, user 2 enters voice 2, and user 3 enters voice 3.
  • The electronic device separately extracts the voiceprint features of each segment of voice (voice 1, voice 2, and voice 3). Further, the electronic device establishes and saves the correspondence between the voiceprint feature of voice 1 and user 1, between the voiceprint feature of voice 2 and user 2, and between the voiceprint feature of voice 3 and user 3. In this way, the electronic device saves the correspondence between multiple users and the voiceprint features extracted from each user's registered voice, and these users are the registered users on the electronic device.
  • The electronic device may perform voiceprint recognition after receiving a user's voice. For example, after the electronic device receives the user's voice, it can extract the voiceprint features of that voice; the electronic device can also acquire the voiceprint features of the registered voice of each registered user; the electronic device then compares the voiceprint features of the received voice with the voiceprint features of each registered user's registered voice to determine whether they match. A sketch of this registration-and-matching flow follows.
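A minimal sketch of registration and matching, assuming the embeddings come from an extractor like the one sketched earlier and that cosine similarity is an acceptable stand-in for whatever comparison the patent intends:

    # Store one enrolled voiceprint per user; recognize by comparing an
    # incoming embedding against every registered user's embedding.
    import numpy as np

    registered: dict[str, np.ndarray] = {}  # user name -> enrolled voiceprint

    def register(user: str, embedding: np.ndarray) -> None:
        registered[user] = embedding / np.linalg.norm(embedding)

    def recognize(embedding: np.ndarray):
        e = embedding / np.linalg.norm(embedding)
        best_user, best_score = None, -1.0
        for user, ref in registered.items():
            score = float(np.dot(e, ref))  # cosine similarity
            if score > best_score:
                best_user, best_score = user, score
        return best_user, best_score

    # e.g. register("user 1", emb1); register("user 2", emb2)
    #      recognize(query_embedding) -> (best matching user, score)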
  • The permissions of the multiple users who have registered voiceprints in this embodiment of the application may be different. For example, for voiceprints registered in different ways, the usage rights can differ: if a voiceprint is actively registered by the user, the user has higher authority; if it is registered through adaptive learning, the user's authority is lower and confidential functions cannot be used.
  • the user authority of the registered voiceprint may also be preset by the user, which is not limited in the embodiment of the present application.
  • Voice enhancement technology: a technology that extracts useful voice signals from voice signals containing multiple noises in a complex sound environment, so as to suppress and reduce noise interference; that is, it extracts a voice that is as pure as possible from noisy speech.
  • Take as an example that the electronic device is a smart speaker and the sound environment in which the smart speaker is located is the scene shown in FIG. 1.
  • the smart speaker can receive a variety of sound inputs in the environment.
  • the sound input may include the user's asking "how is the weather today", the barking of pet dogs, the sound of TV, the sound of people talking, etc.
  • With a voice enhancement method that performs the same enhancement on the four sounds in the environment shown in FIG. 1, the sound input to the smart speaker includes both the user's command voice and the sounds in the environment, which may prevent the electronic device from correctly capturing the target user's voice command and results in a poor user experience.
  • the speech enhancement algorithm cannot adapt to the environment when the electronic device is in a complex sound environment.
  • the embodiment of the present application provides a voice enhancement method, which can accurately capture the voice of a target user in a complex sound environment in a scene with a relatively complex sound environment, and enhance the user's intelligent interaction experience.
  • the voice enhancement method adopted in the embodiments of this application can be applied to electronic devices.
  • The electronic devices can be smart home devices capable of voice interaction, such as smart speakers, smart TVs, and smart refrigerators; wearable electronic devices capable of voice interaction, such as smart watches, smart glasses, and smart helmets; or other forms of devices capable of voice interaction.
  • the embodiments of the present application do not impose special restrictions on the specific form of the electronic device.
  • FIG. 2 is a schematic structural diagram of an electronic device 100 provided by an embodiment of this application.
  • the electronic device 100 may include a processor 110, a memory 120, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 150, a wireless communication module 160, and an audio module 170 , Speaker 170A, microphone 170B, etc.
  • the electronic device 100 may further include a battery 180.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a controller, a memory, a video codec, and a digital signal processor (DSP).
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 110 is reduced, and the efficiency of the system is improved.
  • the memory 120 is used to store instructions and data.
  • the memory 120 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by running instructions stored in the memory 120.
  • the processor 110 may perform sound enhancement processing by executing instructions stored in the memory 120.
  • The memory 120 may include a storage program area and a storage data area. The storage program area can store an operating system, an application program required by at least one function (such as a sound playback function), and the like.
  • the storage data area can store data (such as audio data, etc.) created during the use of the electronic device 100.
  • the memory 120 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • In some embodiments, the memory 120 may also be provided in the processor 110, that is, the processor 110 includes the memory 120.
  • the embodiment of the present application does not limit this.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 140 may receive the charging input of the wired charger through the USB interface 130.
  • the charging management module 140 may receive the wireless charging input through the wireless charging coil of the electronic device 100. While the charging management module 140 charges the battery 180, it can also supply power to the electronic device through the power management module 150.
  • the power management module 150 is used to connect the battery 180, the charging management module 140 and the processor 110.
  • the power management module 150 receives input from the battery 180 and/or the charging management module 140, and supplies power to the processor 110, the memory 120, and the wireless communication module 160.
  • the power management module 150 can also be used to monitor parameters such as battery capacity, battery cycle times, and battery health status (leakage, impedance).
  • the power management module 150 may also be provided in the processor 110.
  • the power management module 150 and the charging management module 140 may also be provided in the same device.
  • The wireless communication module 160 can provide solutions for wireless communication applied to the electronic device 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • The wireless communication module 160 receives electromagnetic waves via an antenna, demodulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110.
  • the wireless communication module 160 may also receive the signal to be sent from the processor 110, perform frequency modulation, amplify, and convert it into electromagnetic waves to radiate through the antenna.
  • The electronic device 100 can implement audio functions, such as music playback and voice collection, through the audio module 170, the speaker 170A, and the application processor.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • The microphone 170B, also called a "mic", is used to convert sound signals into electrical signals. When the user performs voice interaction with the electronic device, the microphone 170B can receive the user's voice input.
  • the electronic device 100 may be provided with at least one microphone 170B. In some other embodiments, the electronic device 100 may be provided with two microphones 170B, which can implement noise reduction functions in addition to collecting sound signals. In some other embodiments, the electronic device 100 may also be provided with three, four or more microphones 170B to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions. Optionally, the electronic device 100 may be provided with six microphones 170B to form a microphone array, through which the azimuth information of the sound can be analyzed.
  • an electronic device 100 can be divided into different functional modules when implementing the voice enhancement method provided in the embodiments of the present application.
  • an electronic device 100 provided in an embodiment of the present application may include functional modules such as an audio collection module, an audio processing module, an audio recognition module, an audio synthesis module, and a voice interaction module.
  • the audio collection module is used to obtain audio data, and is responsible for storing and forwarding the obtained audio data.
  • the function of the audio collection module can be realized by the microphone 170B in FIG. 2.
  • The microphone 170B can receive a variety of sound inputs and convert them into audio data through the audio module 170; the audio collection module can obtain the audio data, store it, and forward it to other modules.
  • the audio processing module is used to enhance the audio data acquired by the audio collection module. For example, perform sound attribute analysis and sound separation and filtering on the received voice data.
  • the audio processing module may include a sound event recognition module, a sound localization module, a voiceprint recognition module, and so on.
  • the sound event recognition module is used to recognize multiple sounds collected by the audio collection module, and determine the category information of each sound.
  • The sound localization module is used to locate the multiple sounds collected by the audio collection module and analyze the orientation information of each sound.
  • the voiceprint recognition module is used to perform voiceprint recognition on various voice data collected by the audio collection module. In the embodiment of the present application, the voiceprint recognition module may be used to extract voiceprint features of the input voice to determine whether it matches the voiceprint of a registered user.
  • the audio recognition module is used to perform voice recognition on the processed audio data. For example, recognize wake words, recognize voice commands, etc.
  • the audio synthesis module is used to synthesize and play audio data.
  • For example, commands from the server can be synthesized into audio data and broadcast as voice through the speaker 170A.
  • the input of the voice interaction module is the voice of the target user that is enhanced by the audio processing module, and the voice interaction module is used to obtain the voice interaction output according to the voice of the target user.
  • the related functions of the audio processing module, audio recognition module, audio synthesis module, and voice interaction module described above may be implemented by program instructions stored in the memory 120.
  • the processor 110 may implement the voiceprint recognition function by executing program instructions related to the audio processing module stored in the memory 120.
  • the voice enhancement method provided in the embodiment of the present application may be applied to the system architecture shown in FIG. 4.
  • the sound input to the intelligent speech enhancement model includes both the sound in the daily environment (for example, the sound of water, the sound of TV, the sound of cooking, etc.), and the voice command input by the user.
  • the intelligent voice enhancement model is used to enhance the input multiple sounds, and its output only includes the voice commands input by the user.
  • the processing process of the intelligent voice enhancement model includes voice recognition, voice attribute analysis, and voice separation and filtering.
  • Voice recognition includes the recognition of sound category information. Combining the recognized sound category information, the user's voice can be separated from the background sound.
  • Voice attribute analysis is to perform attribute analysis such as sound location, time analysis, and voiceprint analysis of the separated user voice.
  • The enhancement model can determine the target user's voice interaction command from the user voices in combination with the voice attribute information, that is, it can extract the target user's voice interaction command from multiple human voices.
  • This application performs targeted enhancement processing on the sounds that the electronic device collects in a relatively complex sound environment, thereby extracting the target user's voice interaction command; according to this clean voice interaction command, more accurate interactive information can be output, improving the user's intelligent interaction experience.
  • the user interaction feedback in Figure 4 refers to the user's feedback on the interaction process. For example, after the user inputs a voice interactive command and the electronic device feeds back the interactive output, the user continues to perform voice interaction with the electronic device; or, the user outputs an interactive command, and the electronic device feeds back the voice interactive output, and the user stops the dialogue, etc., all of which can be used as enhancement results judgment.
  • If the user stops the dialogue, the enhancement result may not be good: the user was not satisfied with the voice interaction output and therefore stopped the dialogue.
  • the user continues to request voice interaction, indicating that the enhancement result is better and can meet the user's interaction needs.
  • the above interactive feedback can assist the intelligent enhancement model in self-learning, so that the enhancement model is more suitable for the sound environment in which it is located.
  • the speech enhancement method may include steps S501-S505:
  • S501, the electronic device collects the first sound.
  • the first sound includes at least one of a second sound and a background sound.
  • The second sound may include the user's speech (a human voice, "human voice" for short).
  • the second sound may include a voice interaction command input by the user.
  • the background sound may be the sound in the environment where the electronic device is currently located.
  • the sound in the environment where the electronic device is currently located may include daily environmental sounds such as pet calls, TV sounds, water flowing sounds, coffee machine sounds, microwave oven sounds, cooking sounds, air conditioning sounds, and so on.
  • the embodiments of the present application do not limit the specific types of background sounds, which are only exemplary descriptions here.
  • the first sound may include: a second sound and a background sound.
  • For example, the second sound includes the sound of people conversing and the user's voice command asking "how is the weather today".
  • Background sounds include TV sounds and dog barking.
  • the second sound and background sound included in the foregoing first sound may come from multiple different sound sources.
  • the sound sources of the four sounds of TV sound, dog barking, user voice command, and character conversation sound are different from each other.
  • the sound characteristics of different sound sources are different from each other.
  • collecting the first sound by the electronic device includes: collecting the first sound by a microphone array of the electronic device.
  • the microphone array can convert the first sound collected by the microphone 170B into audio data through the audio module 170.
  • S502, the electronic device recognizes the first sound.
  • the electronic device recognizing the first sound includes: the electronic device performs sound event recognition on the first sound according to the sound event recognition model, and obtains sound category information of the first sound.
  • the sound category information may include: human voice, television sound, air-conditioning sound, dog barking sound, water flow sound and other sound types.
  • The aforementioned sound event recognition model may be a neural network model obtained by training on a large amount of sound data; by inputting a sound into the model, the category information of the sound can be obtained.
  • the above-mentioned sound event recognition model can be stored locally in the electronic device, or can be stored in the cloud, which is not limited in the embodiment of the present application.
  • The specific method for obtaining the category information of the first sound may use a deep learning algorithm, such as a convolutional neural network, a fully connected neural network, a long short-term memory (LSTM) network, or a gated recurrent unit (GRU).
  • the embodiment of the present application does not limit the specific deep learning algorithm for obtaining sound category information, and is only an exemplary description here.
  • the sound category information of the first sound collected by the electronic device can be obtained.
  • For example, according to a convolutional neural network algorithm, the electronic device can determine that the category information of the first sound collected by the smart speaker in FIG. 1 includes human voice, dog barking, and TV sound. A sketch of such a classifier follows.
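A minimal sketch of sound event recognition as spectrogram classification; the category set, network architecture, and input shape are illustrative assumptions, not the patent's model:

    # Classify a fixed-size spectrogram patch into one of several categories
    # with a small 2-D CNN.
    import torch
    import torch.nn as nn

    CATEGORIES = ["human voice", "TV sound", "dog barking", "water flow"]

    model = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 10 * 25, len(CATEGORIES)),  # matches a (1, 40, 100) input
    )

    spectrogram = torch.randn(1, 1, 40, 100)    # (batch, channel, mels, frames)
    probs = model(spectrogram).softmax(dim=1)   # per-category probabilities
    print(CATEGORIES[int(probs.argmax())])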
  • the electronic device recognizes the first sound in step S502, and obtains the sound category information of the first sound.
  • When the electronic device determines that the second sound exists in the first sound, that is, when the first sound includes a human voice, the above method further includes step S503.
  • the third sound is a user's voice interaction command.
  • the foregoing second sound may include a third sound, or may be the same as the third sound.
  • When the second sound includes only one human voice, the second sound may be the same as the third sound, that is, the second sound may be the user voice interaction command. When the second sound includes at least two human voices, the second sound includes the third sound, and the electronic device can extract the third sound from the second sound, that is, the electronic device can extract the user voice interaction command from multiple human voices.
  • the electronic device performs sound analysis on the second sound to obtain the third sound, which may include steps S5031-S5033.
  • S5031, the electronic device separates the second sound from the first sound according to the sound category information of the first sound.
  • the electronic device may combine the sound category information to separate the second sound from the first sound. That is, when the first sound has the second sound, the electronic device can separate the second sound in the first sound from the background sound, filter out the background sound, and separate the human voice.
  • the electronic device can combine the sound category information and use a sound source separation algorithm to separate the human voice from the background sound of the daily environment, filter out the background sound in the daily environment, and retain only the human voice.
  • For example, the first sound collected by the electronic device includes the sound of people conversing, a user voice interaction command, TV sound, dog barking, and so on. The electronic device can combine the sound category information to separate the TV sound and dog barking from the conversation sound and the user voice interaction command, filter out the TV sound and dog barking, and retain only the conversation sound and the user voice interaction command.
  • the foregoing sound source separation algorithm may use a traditional separation algorithm for sound separation, or a deep learning algorithm for sound separation.
  • For example, traditional matrix factorization or singular value decomposition algorithms can be used, and deep learning algorithms such as deep neural networks, convolutional neural networks, and fully connected neural networks can also be used to achieve sound separation.
  • the embodiment of the present application does not limit the specific algorithm of sound separation, and is only an exemplary description here.
  • the embodiment of the present application performs sound separation based on a deep learning algorithm, which enables the electronic device to separate human voices from background sounds, so that human voices can be extracted from noisy daily environmental sounds.
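A minimal sketch of the matrix-factorization route to separation mentioned above: factor a magnitude spectrogram with NMF and rebuild one group of components through a soft mask. The component count and the choice of which components belong to the voice are illustrative assumptions:

    # Separate one source from a mixture via NMF on the magnitude spectrogram.
    import numpy as np
    from scipy.signal import stft, istft
    from sklearn.decomposition import NMF

    fs = 16000
    mixture = np.random.randn(fs * 2)           # stand-in for a 2 s recording

    f, t, Z = stft(mixture, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)

    nmf = NMF(n_components=4, init="random", max_iter=200, random_state=0)
    W = nmf.fit_transform(mag)                  # spectral bases (freq x comp)
    H = nmf.components_                         # activations (comp x time)

    voice_idx = [0, 1]                          # assume these bases are voice
    voice_mag = W[:, voice_idx] @ H[voice_idx, :]
    mask = voice_mag / (W @ H + 1e-9)           # soft mask for the voice part
    _, voice = istft(mag * mask * np.exp(1j * phase), fs=fs, nperseg=512)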
  • S5032, the electronic device analyzes the sound attribute information of the second sound.
  • the sound attribute information includes one or more of sound orientation information, voiceprint information, sound time information, and sound decibel information.
  • the sound position information refers to the position information of the sound source.
  • Voiceprint information includes voiceprint features, which are extracted from the voice information sent by the speaker, and the voiceprint features can be used for identity recognition.
  • Sound time information refers to the time period when a certain type of sound often appears. The embodiment of the present application does not limit the specific content of the attribute information of the sound, and is only an exemplary description here.
  • analyzing the sound attribute information of the second sound by the electronic device may include: the electronic device obtains sound orientation information of different sound sources in the second sound through a sound localization module.
  • the electronic device can collect sounds through the 6 microphones and analyze the sound orientation information of different human voices.
  • the embodiment of the present application does not limit the specific method of how the microphone array analyzes the sound orientation information, and reference may be made to the sound localization method based on the microphone array in the prior art. For example, a sound localization method based on beamforming, or a sound localization method based on high-resolution spectrum estimation, or a sound localization method based on a sound arrival delay difference, etc.
  • Take the electronic device determining sound position information based on the time difference of arrival as an example.
  • The microphone array of the smart speaker can first estimate the time difference of arrival of the sound from the source, obtaining the delays between the elements of the microphone array; it then uses the acquired time differences, combined with the known spatial positions of the microphone array, to further determine the location of the sound source.
  • the solid circles 1, 2, and 3 respectively represent the three microphones 170B of the smart speaker.
  • The time difference of arrival from the sound source to microphone 1 and microphone 3 is a constant; from this constant, the bold black hyperbola in Figure 6 can be drawn. The time difference of arrival from the sound source to microphone 2 and microphone 3 is also a constant; from this constant, the dashed hyperbola in Figure 6 can be drawn. The two hyperbolas intersect, and the intersection point is the position of the sound source, represented by a black solid square in Figure 6.
  • the smart speaker can determine the location information of the sound source according to the spatial position of the microphone array.
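A minimal sketch of the first step of this method, estimating the time difference of arrival for one microphone pair by cross-correlation; the sample rate and the synthetic signals are assumptions:

    # Estimate how much one microphone's signal lags another's.
    import numpy as np

    def estimate_delay(sig_a: np.ndarray, sig_b: np.ndarray, fs: int) -> float:
        """Return how much sig_a lags sig_b, in seconds."""
        corr = np.correlate(sig_a, sig_b, mode="full")
        lag = int(np.argmax(corr)) - (len(sig_b) - 1)
        return lag / fs

    fs = 16000
    src = np.random.randn(4000)            # stand-in source signal (0.25 s)
    mic1 = src
    mic3 = np.roll(src, 40)                # arrives 40 samples (2.5 ms) later
    print(estimate_delay(mic3, mic1, fs))  # ~0.0025 s

    # Multiplying the delay by the speed of sound (~343 m/s) gives the constant
    # path-length difference that defines one hyperbola; a second microphone
    # pair gives the second hyperbola, and their intersection is the source.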
  • Analyzing the sound attribute information of the second sound by the electronic device may further include: the electronic device separates the different human voices in the second sound according to a blind source separation algorithm, and extracts the voiceprint features of the different human voices to obtain the voiceprint feature of each human voice. Since the voiceprint characteristics of different sound sources differ, the electronic device can distinguish different human voices according to their voiceprint features.
  • the foregoing electronic device analyzing the sound attribute information of the second sound may further include: the electronic device analyzing the decibel level of the second sound.
  • S5033, the electronic device obtains the third sound according to the sound attribute information of the second sound.
  • the electronic device may extract a third sound (user voice interaction command) from at least one human voice according to the attribute information of the sound.
  • the voiceprint information of the third voice mentioned above matches the voiceprint information of the registered user in the electronic device.
  • the voiceprint of the registered user may be self-registered by the user, or the electronic device may be registered through an adaptive learning method, which is not limited in this embodiment of the application.
  • the voiceprint information may be voiceprint features.
  • The electronic device obtaining the third sound according to the sound attribute information of the second sound may include: the electronic device separates the different human voices in the second sound according to a blind source separation algorithm, extracts a voiceprint feature from each human voice, and matches the voiceprint features of the different human voices against the voiceprint features of registered users; if one matches, the matched human voice is determined to be the third sound.
  • the electronic device may use a convolutional neural network algorithm to extract voiceprint features from the human voice.
  • the blind source separation algorithm can use traditional separation algorithms for sound separation, or deep learning algorithms for sound separation.
  • the embodiment of the present application does not limit the specific algorithm of sound separation, and is only an exemplary description here.
  • the embodiment of the present application performs sound separation based on a deep learning algorithm, and when the second sound includes multiple human voices, the multiple human voices can be separated.
  • The above-mentioned voiceprint feature matching the voiceprint feature of a registered user may mean that the degree of matching between the voiceprint feature of the human voice and the voiceprint feature of the registered user exceeds a preset threshold, or that the voiceprint feature of the human voice has the highest matching degree among the voiceprint features of registered users.
  • the degree of confidence may be used to indicate the degree of matching between the input voice and the voiceprint features of the registered user. The higher the degree of confidence, the higher the degree of matching between the input voice and the voiceprint features of the registered user. It is understandable that the degree of confidence can also be described as a confidence probability, a confidence score, a credibility score, etc., which is a value used to characterize the credibility of the voiceprint recognition result.
  • the embodiment of the present application does not limit this, and the confidence level is used for description in the embodiment of the present application.
  • a confidence threshold can be set. If the confidence that the input voice belongs to each registered user is less than or equal to the confidence threshold, it is determined that the input voice does not belong to any registered user. If the confidence that the input voice belongs to a certain registered user is greater than or equal to the confidence threshold, it is determined that the input voice is input to the registered user.
  • the confidence threshold is 0.5
  • the confidence that the input voice belongs to user 1 is 0.9
  • the confidence that the input voice belongs to user 2 is 0.4
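A minimal sketch of the confidence-threshold decision just described, using the example numbers from the text (threshold 0.5; user 1 at 0.9, user 2 at 0.4); how the confidences themselves are computed is not shown here:

    # Decide which registered user, if any, the input voice belongs to.
    CONFIDENCE_THRESHOLD = 0.5

    def decide(confidences: dict):
        user, conf = max(confidences.items(), key=lambda kv: kv[1])
        return user if conf >= CONFIDENCE_THRESHOLD else None

    print(decide({"user 1": 0.9, "user 2": 0.4}))  # -> "user 1"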
  • the first sound collected by the electronic device includes voices such as character conversation, user voice interaction commands, TV sound, dog barking, etc.
  • The electronic device can, according to the sound category information, separate the TV sound and dog barking from the conversation sound and the user voice interaction command, filter out the TV sound and dog barking, and retain the conversation sound and the user voice interaction command.
  • The electronic device then separates the conversation sound from the user voice interaction command according to the blind source separation algorithm, extracts the voiceprint features of the conversation sound and of the user voice interaction command, and determines whether each matches the registered user's voiceprint features; if one matches, the electronic device determines the matched human voice to be the third sound.
  • That is, based on the sound category information, the smart speaker can separate the human voices from the sounds of the daily environment (human voices, TV sound, and dog barking), filter out the dog barking and TV sound, separate the different human voices (the conversation sound and the user voice interaction command) using the blind source separation algorithm, extract the voiceprint features of the conversation sound and of the user voice interaction command, and match them against the registered user's voiceprint features. If the voiceprint feature of the user's voice asking "What's the weather today" matches the registered user's voiceprint, the electronic device can determine that the user's voice asking "What's the weather today" is the third sound.
  • The electronic device may determine the louder of two human voices to be the third sound according to the decibel levels of the two human voices.
  • The electronic device may also determine the human voice closer to the electronic device to be the third sound according to the position information of the at least two human voices relative to the electronic device.
  • The above-mentioned electronic device obtaining the third sound according to the sound attribute information of the second sound may also include: the electronic device obtains the third sound from the second sound according to the orientation information, energy characteristics, or sound characteristics of the second sound.
  • the electronic device may determine the third sound from the second sound according to the position information of different human voices in the second sound.
  • For example, a smart speaker can filter the TV sound and dog barking out of the TV sound, dog barking, conversation sound, and user voice interaction command, separate the conversation sound from the user voice interaction command, and combine this with the position information of the conversation sound and of the user voice interaction command. If the position information of the user voice interaction command is at the sofa and the position information of the conversation sound is at the dining table, the smart speaker can determine the user voice interaction command from the sofa to be the third sound.
  • If the position information of the user voice interaction command and of the conversation sound are both at the sofa, but the decibel level of the user voice interaction command is much higher than that of the conversation sound, the smart speaker can likewise determine the user voice interaction command to be the voice of the target user.
  • the electronic device may also use a deep learning algorithm to determine the third sound from different human voices included in the second sound according to the energy characteristics or sound characteristics of the sound.
  • a neural network model can be trained through a large number of users' conversations at different positions relative to the electronic device and user voice interaction commands, and the neural network model is used to determine user voice interaction commands.
  • the electronic device may determine the user's voice interaction command as the third voice from the character's conversation sound and the user's voice interaction command according to the voice feature or the energy feature.
  • the electronic device may determine whether different human voices in the second sound are facing the electronic device or facing away from the electronic device, and determine the human voice facing the electronic device as the third sound.
  • The electronic device obtaining the third sound according to the sound attribute information of the second sound may further include: the electronic device determines the human voice with the higher decibel level in the second sound to be the third sound according to the decibel level of the second sound. A sketch of this decibel-based selection follows.
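A minimal sketch of picking the loudest of several separated human voices; the dBFS-style level computed from RMS amplitude is an assumption:

    # Pick the separated voice with the highest decibel level.
    import numpy as np

    def decibels(x: np.ndarray) -> float:
        rms = np.sqrt(np.mean(np.square(x)))
        return 20.0 * np.log10(rms + 1e-12)

    def pick_third_sound(voices: list) -> np.ndarray:
        return max(voices, key=decibels)  # the louder voice -> user's command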
  • the embodiment of the present application can extract the user's voice interaction commands from various human voices by analyzing the sound attributes of the second voice and combining the separation algorithm.
  • S504, the electronic device processes the third sound.
  • For example, the smart speaker can extract the user's voice interaction command "how is the weather today" from the sound environment shown in FIG. 1 and input it into the voice interaction module for processing. It is understandable that, since the sound input to the voice interaction module in the embodiment of the present application is a clean user voice interaction command, the voice interaction module can obtain a more accurate voice interaction output according to the voice interaction command.
  • the electronic device outputs interactive information according to the third sound.
  • the interactive information output by the electronic device may be voice interactive information, or may be a control signal.
  • the user's voice interactive command is "how is the weather today", and the electronic device can output voice interactive information "it is sunny today, the minimum temperature is 25 degrees, and the maximum temperature is 35 degrees.”
  • the user's voice interactive command is "turn on the living room light", and the electronic device may output a control signal to turn on the living room light.
  • the electronic device may also combine the sound attribute information of the third sound to output interactive information.
  • the electronic device may combine the position information and/or time information of the third sound to output interactive information.
  • the electronic device can use the position information of the third sound: if the third sound is "turn on the light" and the position information of the third sound is in the kitchen, the electronic device can output a control signal to turn on the kitchen light according to the position information of the third sound.
  • the electronic device may use the time information of the third sound: if the third sound is "turn on the TV", the electronic device may, after extensive learning, output a control signal that turns on the TV and adjusts the TV volume to suit the time.
  • likewise, if the third sound is "turn on the lights in the living room", the electronic device can combine the time information of the third sound to output a control signal that turns on the living room lights at a brightness suited to the time of day, as in the sketch below.
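  • A minimal sketch of how position and time information might be folded into such a control signal; the room names, the night-time brightness rule, and the signal format are illustrative assumptions only:

```python
from datetime import datetime

def light_control_signal(command: str, position: str, when: datetime) -> dict:
    """Build a hypothetical control signal from the third sound's attributes."""
    if "turn on the light" not in command:
        return {}
    # Assumed rule: dim the lights late at night, full brightness otherwise.
    brightness = 30 if when.hour >= 22 or when.hour < 6 else 100
    return {"device": f"{position}_light", "action": "on", "brightness": brightness}

# Example: a command located in the kitchen at 23:15 -> dim kitchen light.
print(light_control_signal("turn on the light", "kitchen",
                           datetime(2020, 7, 28, 23, 15)))
```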
  • the electronic device may also combine the voiceprint characteristics of the third sound to output interactive information.
  • for example, after extensive learning, the electronic device can combine the user's voiceprint features and the sound's time information to output a control signal that turns on the light in the second bedroom.
  • the electronic device can combine the user's voiceprint characteristics and sound time information to output a control signal for turning on the kitchen light.
  • the electronic device may also combine the age information of the third voice to output interactive information.
  • for example, the electronic device can recognize from the voice interaction command that the speaker is a child, and play cartoons or children's programs accordingly.
  • the electronic device can recognize that the speaker of the voice interaction command is an elderly person, and play the TV program they last watched; a minimal mapping of this kind is sketched below.
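  • The age-dependent behavior could be as simple as a lookup from a recognized age group to a playback action; the group names and actions below are assumed for illustration:

```python
def content_for_age_group(age_group: str) -> str:
    """Map a recognized age group to a playback action (assumed policy)."""
    policy = {
        "child": "play_cartoons_or_childrens_programs",
        "elderly": "play_last_watched_tv_program",
    }
    return policy.get(age_group, "play_default_recommendation")

print(content_for_age_group("child"))  # -> play_cartoons_or_childrens_programs
```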
  • the user rights of different voiceprints in the embodiments of the present application may be different.
  • for example, the third sound is the voice interaction command "turn on the TV"; the voiceprint feature of the third sound matches the voiceprint feature of a registered user, but that voiceprint does not carry the required permission, so the interactive information output by the electronic device can be "you do not have the permission to turn on the TV" or another voice prompt.
  • depending on how the voiceprint was registered, the user permissions can differ: users registered through adaptive learning have lower permissions, while users who actively register their voiceprints have higher permissions and can use high-confidentiality functions.
  • the user permissions for the above different voiceprints may also be preset by the user. The embodiment of the present application does not limit this.
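  • One way to realize this permission distinction is to record, with each registered voiceprint, how it was registered, and to gate high-confidentiality functions on that record; the in-memory registry and function names below are assumptions, not the embodiment's actual data model:

```python
HIGH_CONFIDENTIALITY = {"turn_on_tv", "unlock_door"}  # assumed function set

registry = {
    # voiceprint id -> how the voiceprint was registered
    "vp_alice": {"registered_via": "active"},    # user-initiated registration
    "vp_guest": {"registered_via": "adaptive"},  # learned automatically
}

def authorize(voiceprint_id: str, function: str) -> str:
    entry = registry.get(voiceprint_id)
    if entry is None:
        return "Voiceprint not registered."
    if function in HIGH_CONFIDENTIALITY and entry["registered_via"] != "active":
        return "You do not have the permission to " + function.replace("_", " ") + "."
    return "OK"

print(authorize("vp_guest", "turn_on_tv"))  # -> permission-denied prompt
```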
  • in this embodiment, the electronic device extracts a clean user voice interaction command (the third sound) from the complex sound environment (the first sound) and processes it, thereby outputting more accurate interaction information, so the user experience of intelligent interaction is better.
  • This embodiment provides a voice enhancement method in which an electronic device collects a first sound; the electronic device recognizes the first sound; when a second sound is present in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound; the electronic device processes the third sound; and the electronic device outputs interaction information.
  • human voices can be separated from multiple sounds including background sounds, and a clean user voice interaction command can be extracted from the human voices; the command is then processed and the resulting voice interaction output is relatively accurate, so voice interaction can be completed accurately, which improves the user's intelligent interaction experience.
  • when extracting the user voice interaction command from a complex sound environment, this application recognizes and analyzes the sound and combines the sound's attribute information to obtain the user voice interaction command (the third sound). The voice enhancement method therefore does not apply the same enhancement to all input sounds, but performs targeted enhancement based on the attribute information of the currently collected sound, so it can adapt to complex sound environments and improve the user's intelligent interaction experience in them.
  • the embodiment of the present application also provides a voice enhancement method. As shown in FIG. 7, after the above step S502, steps S701-S702 may be further included.
  • the electronic device clusters the fourth sound to obtain new sound category information.
  • the fourth sound is the portion of the first sound whose sound category information is not recognized by the sound event recognition model.
  • the electronic device may store the unidentified fourth sound in the memory.
  • the sound event recognition model can recognize limited types of sounds.
  • the sound event recognition model of the electronic device can recognize more common sounds.
  • for example, the model can recognize sounds such as human voices, water flow, dog barking, and TV sound, but cannot recognize the sound of a coffee machine grinding beans, the sound of an oven, or the sound of a washing machine.
  • the electronic device can collect and store the sounds of the coffee machine grinding beans, the sound of the oven, and the sound of the washing machine that are not recognized by the sound event recognition model.
  • the foregoing fourth sound may be an unrecognized sound that appears frequently, and the electronic device may collect a large number of fourth sounds.
  • the electronic device can cluster a large number of fourth sounds through unsupervised learning, discovering the intrinsic structure and regularities of the data, and thereby obtain new sound category information.
  • the electronic device can obtain new sound category information through clustering analysis.
  • the cluster analysis method can include hierarchical clustering algorithms, density-based clustering algorithms, expectation-maximization (EM), fuzzy clustering, K-means clustering, and the like.
  • the embodiment of the present application does not limit the clustering analysis method specifically adopted by the electronic device, and is only an exemplary description here.
  • for example, the electronic device can collect a large number of coffee machine grinding sounds and obtain a new sound category through a cluster analysis algorithm, as sketched below. The next time the sound input to the electronic device includes the coffee machine grinding beans, the electronic device can recognize that sound category.
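  • The clustering step could look like the following scikit-learn sketch, which groups fixed-length feature vectors (for example, averaged spectral features) of unrecognized clips into candidate new categories; the feature extraction and the choice of k are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_new_categories(features: np.ndarray, k: int = 3):
    """features: (n_clips, n_dims) array, one feature vector per stored
    unrecognized clip (the 'fourth sound'). Each sufficiently populated
    cluster is a candidate new sound category."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    return kmeans.labels_, kmeans.cluster_centers_

# Synthetic features standing in for coffee-grinder / oven / washing-machine
# clips: three well-separated clouds of 20 clips each.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(c, 0.1, size=(20, 8)) for c in (0.0, 1.0, 2.0)])
labels, centers = discover_new_categories(feats, k=3)
print(np.bincount(labels))  # roughly 20 clips per discovered category
```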
  • the electronic device updates the sound event recognition model according to the new sound category information, and obtains the updated sound event recognition model.
  • the electronic device may use the new sound category information obtained after clustering as training samples to retrain the sound event recognition model and obtain a new sound event recognition model; the updated model is more stable and robust. That is, after the electronic device obtains the new sound event recognition model, it can replace the original sound event recognition model with the new one.
  • after the electronic device updates the sound event recognition model, when it executes the voice enhancement method of steps S501-S505 again, in step S502 the multiple sounds and the category information of each sound can be recognized according to the updated sound event recognition model. Since the updated model can recognize more types of sound events, it achieves a better recognition effect; a retraining sketch follows.
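  • Under the assumption that the recognition model is an ordinary classifier over feature vectors, the retraining step might be sketched as follows; the classifier type and data shapes are illustrative, not the embodiment's actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def update_recognition_model(old_X, old_y, new_X, new_label):
    """Merge the original training data with clips from a newly discovered
    cluster (given the next free class label) and refit the classifier."""
    new_y = np.full(len(new_X), new_label)
    X = np.vstack([old_X, new_X])
    y = np.concatenate([old_y, new_y])
    return LogisticRegression(max_iter=1000).fit(X, y)

# Example: the old model knew classes 0..3; a clustered 'coffee grinder'
# category becomes class 4 in the updated model.
rng = np.random.default_rng(1)
old_X, old_y = rng.normal(size=(40, 8)), np.repeat(np.arange(4), 10)
model = update_recognition_model(old_X, old_y, rng.normal(size=(10, 8)) + 3, 4)
print(model.classes_)  # [0 1 2 3 4]
```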
  • steps S701-S702 can be executed after steps S503-S505, before steps S503-S505, or simultaneously with steps S503-S505; the embodiment of the present application does not limit their execution order.
  • the embodiment of the present application updates the sound event recognition model through new sound category information, so that the sound event recognition model can adapt to the environment.
  • This embodiment can cluster the sounds not recognized by the sound event recognition model and train a new sound event recognition model; that is, the electronic device can self-learn the sounds that frequently appear in its environment and update the sound event recognition model, adapting to the environment and improving the user's interaction experience. Moreover, the stability and robustness of the sound event recognition model improve the longer the user uses it: the more it is used, the better it performs.
  • after the above step S502, steps S801-S803 may further be included.
  • the electronic device obtains sound orientation information of the first sound.
  • the electronic device may obtain the position information of the first sound according to the sound localization method based on the microphone array in step S5032, and save the sound position information.
  • the electronic device can record the location information of a fixed sound source, such as TV sound, water flow sound, washing machine sound and other sounds.
  • the electronic device may store sound location information in the memory.
  • the electronic device obtains the orientation information of the useless sound according to the sound orientation information of the first sound and the sound category information of the first sound.
  • the electronic device may include a sound localization module, and the sound localization module is used to determine the position information of the sound source.
  • the sound localization module can determine the position information of the sound source through the microphone array in the electronic device.
  • the electronic device may perform adaptive learning according to the sound orientation information and the sound category information to obtain the orientation information of the useless sound.
  • the sound localization module combines the orientation information of the sound and the category information of the sound to determine a non-human-voice sound coming from a fixed direction as a useless sound.
  • the sound of TV, the sound of running water, the sound of cooking, the sound of coffee machine grinding, the sound of washing machine, etc. are all useless sounds.
  • the position of a useless sound is relatively fixed with respect to the electronic device, so the electronic device can obtain the position information of the useless sound.
  • the electronic device filters the sound coming from the direction of the useless sound according to the position information of the useless sound.
  • when enhancing the sound it receives, the electronic device may, according to the position information of the useless sound, filter out the sound coming from that direction.
  • the electronic device may filter the sound from that direction according to both the direction information of the useless sound and the category information of the sound. For example, if, among the sounds received by the electronic device, the sound coming from the direction of the useless sound is not a human voice, the electronic device filters out the sound from that direction. That is, the position information of the useless sound can assist the electronic device in separating and filtering sounds.
  • the electronic device may combine the position information of the sound and the sound category information to obtain the position of the sound that frequently appears in a specific scene.
  • for example, the position of the TV sound or of the sound of running water; the electronic device can shield the sound from that direction and thus better assist sound separation.
  • the azimuths of sounds frequently appearing in a specific scene can be obtained, so as to better assist in sound separation.
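  • A simplified sketch of this direction-based filtering: directions of arrival that repeatedly carry non-human-voice categories are remembered as useless, and later sources from those directions are discarded before separation. The angular bucketing, tolerance, and data layout are assumptions:

```python
from collections import Counter

def learn_useless_directions(observations, tolerance=10.0, min_count=2):
    """observations: (azimuth_deg, category) pairs collected over time.
    A direction becomes 'useless' once non-human sound has been observed
    from it at least min_count times (angles bucketed by the tolerance)."""
    buckets = Counter()
    for az, category in observations:
        if category != "human_voice":
            buckets[round(az / tolerance)] += 1
    return [b * tolerance for b, n in buckets.items() if n >= min_count]

def filter_sources(sources, useless, tolerance=10.0):
    """sources: (azimuth_deg, audio_id) tuples from the localization module."""
    return [s for s in sources
            if not any(abs(s[0] - u) <= tolerance for u in useless)]

useless = learn_useless_directions(
    [(90.0, "tv"), (90.5, "tv"), (10.0, "human_voice")])
kept = filter_sources([(90.2, "tv_audio"), (12.0, "voice_audio")], useless)
print(useless, kept)  # [90.0] -- only the 12.0 degree source is kept
```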
  • the traditional speech enhancement method cannot perform targeted enhancement of sounds in different directions, and cannot adapt to the environment.
  • in the ever-changing environment in which the electronic device is located, the method in this embodiment can identify the directions of frequently occurring useless sounds through the position information of the sound and ignore the sound from those directions during separation, thereby ensuring the enhancement effect, improving the user's interaction experience, and adapting to the environment.
  • steps S901-S902 may be included after the above steps S501-S505.
  • the voice interaction information of the third sound includes: the user voice interaction command (third sound) processed by the electronic device, and the command with which the user further interacts with the electronic device after the electronic device feeds back the voice interaction output. That is, when the voice interaction feedback between the electronic device and the user is good, the user actively continues the voice interaction with the electronic device.
  • the user's voice interaction command is "How is the weather today?"
  • the voice interaction output fed back by the electronic device is "today's weather is sunny, the minimum temperature is 25 degrees, the maximum temperature is 35 degrees"; because the fed-back voice interaction output is accurate, the user further asks "what kind of clothes are suitable to wear?". It is understandable that, since the electronic device can respond to the user's interaction command accurately, it can promote further voice interaction between the user and the electronic device.
  • the electronic device may collect voice interaction information of the third voice.
  • the voice interaction information of the third sound may include at least two user voice interaction commands, for example: "how is the weather today?" and "what kind of clothes are suitable to wear?".
  • S902: perform voiceprint registration for a third sound whose voiceprint is not registered in the interactive voice information of the third sound.
  • the electronic device may perform unsupervised mining on the interactive voice information it collects. If the voiceprint feature extracted from the interactive voice information is not registered, the electronic device will perform text-independent voiceprint registration.
  • the electronic device may collect a large amount of interactive voice information with good interactive feedback and, through unsupervised learning, mine from it user voices whose voiceprint features are not registered; it then extracts the voiceprint features from the user's voice information based on a deep learning algorithm and performs voiceprint registration.
  • this voiceprint registration process may be called adaptive voiceprint registration.
  • the voiceprint information of registered users in step S503 includes the voiceprint information registered through adaptive voiceprint registration.
  • the electronic device can match the voiceprint information of multiple sounds with the voiceprint information of the registered user to obtain the third voice.
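  • The matching-and-registration step could be sketched as follows: embed the voice with some external speaker-embedding model, compare against the stored embeddings by cosine similarity, and register the embedding if nothing matches. The embedding source and the 0.75 threshold are assumptions:

```python
import numpy as np

registered = {}  # name -> unit-norm speaker embedding

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_or_register(embedding: np.ndarray, threshold: float = 0.75) -> str:
    """embedding: speaker embedding of a voice with good interaction feedback,
    produced by some assumed external model."""
    for name, ref in registered.items():
        if cosine(embedding, ref) >= threshold:
            return name  # voiceprint already registered -> third sound found
    new_name = f"adaptive_user_{len(registered) + 1}"
    registered[new_name] = embedding / (np.linalg.norm(embedding) + 1e-12)
    return new_name  # newly, adaptively registered voiceprint

rng = np.random.default_rng(2)
e = rng.normal(size=64)
print(match_or_register(e))        # registers adaptive_user_1
print(match_or_register(e * 2.0))  # same direction -> matches adaptive_user_1
```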
  • by collecting, day to day, user interactive voices with good interactive feedback, and combining speech recognition with text-independent voiceprint registration, the electronic device can learn voiceprint information by itself and register it. The voice corresponding to that voiceprint information can then be strengthened during intelligent interaction, improving the intelligent interaction experience.
  • the traditional voice enhancement algorithm cannot specifically enhance the voices from different people, and it cannot perform adaptive voiceprint registration for people who have not registered voiceprints.
  • this embodiment can use adaptive voiceprint registration to ensure the enhancement effect as the environment becomes more and more complex, and in practical applications this method achieves better results the longer the user uses it.
  • the speech enhancement algorithm provided in this application can be applied to scenes with a more complex sound environment.
  • the first sound collected by the electronic device includes both the user's voice and the background sound of the daily environment in which the electronic device is located. The electronic device performs sound event recognition on the collected first sound and determines the sound category information of the first sound. Then, according to the sound category information of the first sound, the second sound is separated from the first sound; that is, the background sound of the daily environment is filtered out and only the user's voice is retained.
  • the electronic device may separate the multiple user voices included in the second sound according to a deep learning algorithm, and then combine the attribute information of the multiple user voices to determine the third sound, that is, the user voice interaction command, from the second sound.
  • the third voice is a voice interactive command of the target user. Therefore, a clean voice interaction command of the target user can be extracted from the complex sound environment, so that the voice interaction feedback output by the electronic device according to the third sound is more accurate.
  • when the solution extracts the user's voice interaction command from the complex sound environment, it recognizes and analyzes the sound and combines the sound's attribute information to obtain the user's voice interaction command (the third sound). The voice enhancement method therefore does not enhance all input sounds in the same way, but performs targeted enhancement in combination with the attribute information of the currently collected sound, so it can adapt to complex sound environments and improve the user's intelligent interaction experience in them.
  • the speech enhancement algorithm provided in the embodiment of the present application will be described in conjunction with the speech environment in which the smart speaker shown in FIG. 1 is located.
  • the smart speaker collects a first sound, which includes TV sound, dog barking, user conversation, and a user voice interaction command.
  • the electronic device recognizes the first sound according to the sound event recognition model, and determines sound category information of the first sound.
  • the category information of the first sound includes human voices, dog barking sounds, and television sounds.
  • the electronic device can combine the sound category information of the first sound to separate the human voice from the dog bark and TV sound according to the sound source separation algorithm, and filter out the dog bark and TV sound to obtain the human voice (second sound).
  • the electronic device performs blind source separation on the human voices (including the conversation voices and the user voice interaction command), that is, it separates the different human voices; then, based on the sound attribute information of the different human voices (voiceprint features and/or sound orientation information), it obtains the third sound, that is, it extracts the target user's voice interaction command from the multiple human voices.
  • the electronic device in the embodiments of the present application can identify and analyze the first sound collected from a complex sound environment; that is, through targeted enhancement of the sounds in the complex sound environment, it can extract a clean voice interaction command of the target user from that environment, which can improve the user's intelligent interaction experience. The overall flow is sketched below.
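  • Putting the stages of this scenario together over a toy representation (a "sound" as a list of (category, location, id) tuples): every step here stands in for a module described above and is assumed rather than specified by the embodiment:

```python
# Toy first sound for the FIG. 1 scene.
first_sound = [
    ("tv", "corner", "tv_0"),
    ("dog_bark", "door", "dog_0"),
    ("human_voice", "table", "chat_0"),
    ("human_voice", "sofa", "cmd_0"),
]

def separate_by_category(sound, keep="human_voice"):
    """Stand-in for S503's source separation: drop dog bark and TV sound."""
    return [s for s in sound if s[0] == keep]

def select_target_voice(voices, expected_location="sofa"):
    """Stand-in for attribute analysis: pick the voice at the expected spot."""
    at = [v for v in voices if v[1] == expected_location]
    return at[0] if at else None

second_sound = separate_by_category(first_sound)   # the human voices
third_sound = select_target_voice(second_sound)    # the target user's command
print(third_sound)  # ('human_voice', 'sofa', 'cmd_0') -> voice interaction module
```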
  • the above-mentioned electronic device includes hardware structures and/or software modules corresponding to each function.
  • the embodiments of the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer-software-driven hardware depends on the specific application and the design constraints of the technical solution. A skilled person may implement the described functions differently for each specific application, but such implementations should not be considered beyond the scope of the embodiments of the present application.
  • the embodiments of the present application may divide the above-mentioned electronic equipment into functional modules according to the above-mentioned method examples.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules. It should be noted that the division of modules in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 10 shows a schematic diagram of a possible structure of the electronic device involved in the foregoing embodiment.
  • the electronic device 1000 includes a processing unit 1001 and a storage unit 1002.
  • the processing unit 1001 is used to control and manage the actions of the electronic device 1000. For example, it can be used to perform the processing steps of S501-S505 in FIG. 5; or the processing steps of S701 and S702 in FIG. 7; or the processing steps of S801-S803 in FIG. 8; or the processing steps of S901-S902 in FIG. 9; and/or other processes used in the technology described herein.
  • the storage unit 1002 is used to store the program code and data of the electronic device 1000. For example, it can be used to store sound location information.
  • the unit modules in the aforementioned electronic device 1000 include but are not limited to the aforementioned processing unit 1001 and storage unit 1002.
  • the electronic device 1000 may also include an audio unit, a communication unit, and the like.
  • the audio unit is used to collect the voice uttered by the user and play the voice.
  • the communication unit is used to support communication between the electronic device 1000 and other devices.
  • the processing unit 1001 may be a processor or a controller, for example, a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
  • the processor may include an application processor and a baseband processor. It can implement or execute various exemplary logical blocks, modules and circuits described in conjunction with the disclosure of this application.
  • the processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
  • the storage unit 1002 may be a memory.
  • the audio unit may include a microphone, a speaker, and so on.
  • the communication unit may be a transceiver, a transceiver circuit, or a communication interface.
  • the processing unit 1001 is a processor (the processor 110 shown in FIG. 2), and the storage unit 1002 may be a memory (the internal memory 120 shown in FIG. 2).
  • the audio unit may include a speaker (the speaker 170A shown in FIG. 2) and a microphone (the microphone 170B shown in FIG. 2).
  • the communication unit includes a wireless communication module (the wireless communication module 160 shown in FIG. 2). Wireless communication modules can be collectively referred to as communication interfaces.
  • the electronic device 1000 provided by the embodiment of the present application may be the electronic device 100 shown in FIG. 2. Wherein, the foregoing processor, memory, and communication interface may be coupled together, for example, connected by a bus.
  • the embodiment of the present application also provides a computer storage medium storing computer program code; when the above processor executes the computer program code, the electronic device executes the relevant method steps in FIG. 5, FIG. 7, FIG. 8 or FIG. 9 to implement the method in any of the foregoing embodiments.
  • the embodiments of the present application also provide a computer program product; when the computer program product runs on a computer, the computer is caused to execute the relevant method steps in FIG. 5, FIG. 7, FIG. 8 or FIG. 9 to implement the method in any of the foregoing embodiments.
  • the electronic device 1000, the computer storage medium, and the computer program product provided in the embodiments of the present application are all used to execute the corresponding method provided above; for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding method provided above, which will not be repeated here.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the modules or units is only a logical function division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate parts may or may not be physically separate.
  • the parts displayed as units may be one physical unit or multiple physical units; that is, they may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product stored in a storage medium, including several instructions to enable a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or other media that can store program code.


Abstract

Disclosed in the embodiments of the present application are a voice enhancement method and device, relating to the field of communications technology and solving the prior-art problem that, in a relatively complex sound environment, it is impossible to adapt to the environment and the user experience of smart interaction is poor; the voice enhancement method uses artificial intelligence technology to enhance the user's smart interactive experience. The specific solution is: an electronic device acquires a first sound, the first sound comprising at least one of a second sound and a background sound; the electronic device recognizes the first sound; if the second sound is present in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound; and the electronic device processes the third sound.

Description

Method and device for speech enhancement
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office on August 21, 2019, with application number 201910774538.0 and application name "A method and device for speech enhancement", the entire content of which is incorporated herein by reference.
Technical field
The embodiments of the present application relate to the field of communication technologies, and in particular to a method and device for speech enhancement.
Background
At present, traditional speech enhancement algorithms apply one and the same processing to all input sounds in different environments, performing the same enhancement no matter what sound is input. For example, when the sound received by a smart speaker includes human voices, TV sound, dog barking, the sound of running water, and other sounds, an existing speech enhancement algorithm will enhance all of these sounds in the same way. The smart speaker may then be unable to learn the user's command accurately, so that the voice interaction produces no output, or an inaccurate output, giving the user a poor smart interaction experience. Therefore, existing speech enhancement algorithms cannot adapt to the environment in scenes where the sound environment is relatively complex, and the user's intelligent interaction experience is poor.
Summary of the invention
The embodiments of the present application provide a voice enhancement method and device, which can accurately capture the voice of a target user from a relatively complex sound environment, improving the user's intelligent interaction experience.
In a first aspect, the embodiments of the present application provide a voice enhancement method, including: an electronic device collects a first sound, the first sound including at least one of a second sound and a background sound; the electronic device recognizes the first sound; when the second sound is present in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound; and the electronic device processes the third sound. Based on this solution, by recognizing the collected first sound and, when the second sound exists in the first sound, separating the second sound (human voices) from the first sound and extracting the third sound (the user voice interaction command) from the second sound, the user voice interaction command can be processed and the resulting voice interaction output is relatively accurate; voice interaction can therefore be completed accurately, improving the user's intelligent interaction experience. Moreover, when extracting the user voice interaction command from a complex sound environment, this application recognizes and analyzes the sound and combines the sound's attribute information to obtain the user voice interaction command (the third sound); the voice enhancement method therefore does not apply the same enhancement to all input sounds, but performs targeted enhancement based on the attribute information of the currently collected sound, so it can adapt to complex sound environments and improve the user's intelligent interaction experience in them.
With reference to the first aspect, in a possible implementation, the electronic device recognizing the first sound includes: the electronic device performs sound event recognition on the first sound according to a sound event recognition model, and obtains the sound category information of the first sound. Based on this solution, by recognizing the first sound, the sound category information of the first sound can be obtained, so that the second sound can be extracted from the first sound according to that sound category information.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the electronic device performing sound analysis on the second sound to obtain the third sound includes: the electronic device separates the second sound from the first sound according to the sound category information of the first sound; the electronic device analyzes the sound attribute information of the second sound, where the sound attribute information includes one or more of sound orientation information, voiceprint information, sound time information, and sound decibel information; and the electronic device obtains the third sound according to the sound attribute information of the second sound. Based on this solution, by separating the second sound from the first sound and analyzing its attributes, a clean user voice interaction command can be extracted from multiple human voices according to the attribute information of the second sound, achieving targeted enhancement.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the voiceprint information of the third sound matches the voiceprint information of a registered user. Based on this solution, the voice matching the voiceprint information of a registered user can be determined as the third sound through voiceprint information.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the method further includes: the electronic device clusters a fourth sound to obtain new sound category information, the fourth sound being the sound in the first sound whose sound category information is not recognized by the sound event recognition model; and the sound event recognition model is updated according to the new sound category information to obtain an updated sound event recognition model. Based on this solution, sounds not recognized by the sound event recognition model can be clustered and a new sound event recognition model trained; that is, the electronic device can self-learn the sounds that frequently appear in its environment and update the sound event recognition model, adapting to the environment and improving the user's interaction experience. Moreover, the stability and robustness of the sound event recognition model improve the longer the user uses it: the more it is used, the better it performs.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the method further includes: the electronic device obtains the sound orientation information of the first sound; obtains the orientation information of useless sounds according to the sound orientation information and the sound category information of the first sound; and filters the sound coming from that orientation according to the orientation information of the useless sounds. Based on this solution, by learning the orientation information of various sounds and combining it with sound category information, the orientations of useless sounds that frequently appear in a specific scene can be obtained, which better assists sound separation.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the method further includes: obtaining the voice interaction information of the third sound; and performing voiceprint registration for a third sound whose voiceprint is not registered in the interactive voice information of the third sound. Based on this solution, by routinely collecting user interaction voices with good interactive feedback and combining speech recognition with text-independent voiceprint registration, the electronic device can learn voiceprint information by itself and register it, so that the voice corresponding to that voiceprint information can be strengthened during intelligent interaction, improving the intelligent interaction experience.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the method further includes: the electronic device outputs interaction information according to the sound attribute information of the third sound. The interaction information may be voice interaction information or a control signal. Based on this solution, the electronic device can combine the orientation information, voiceprint information, time information, age information, etc. of the third sound to output corresponding interaction information.
In a second aspect, the embodiments of the present application provide a voice enhancement device, including: a processor configured to recognize a first sound collected by a voice collection device, the first sound including at least one of a second sound and a background sound; when the second sound is present in the first sound, the processor performs sound analysis on the second sound to obtain a third sound; and the processor processes the third sound.
With reference to the second aspect, in a possible implementation, the processor is further configured to perform sound event recognition on the first sound according to a sound event recognition model, and obtain the sound category information of the first sound.
With reference to the second aspect or any possible implementation of the second aspect, in another possible implementation, the processor is further configured to: separate the second sound from the first sound according to the sound category information of the first sound; analyze the sound attribute information of the second sound, where the sound attribute information includes one or more of sound orientation information, voiceprint information, sound time information, and sound decibel information; and obtain the third sound according to the sound attribute information of the second sound.
With reference to the second aspect or any possible implementation of the second aspect, in another possible implementation, the voiceprint information of the third sound matches the voiceprint information of a registered user.
With reference to the second aspect or any possible implementation of the second aspect, in another possible implementation, the processor is further configured to: cluster a fourth sound to obtain new sound category information, the fourth sound being the sound in the first sound whose sound category information is not recognized by the sound event recognition model; and update the sound event recognition model according to the new sound category information to obtain an updated sound event recognition model.
With reference to the second aspect or any possible implementation of the second aspect, in another possible implementation, the processor is further configured to: obtain the sound orientation information of the first sound; obtain the orientation information of useless sounds according to the sound orientation information and the sound category information of the first sound; and filter the sound coming from that orientation according to the orientation information of the useless sounds.
With reference to the second aspect or any possible implementation of the second aspect, in another possible implementation, the processor is further configured to: obtain the interactive voice information of the third sound; and perform voiceprint registration for a third sound whose voiceprint is not registered in the interactive voice information.
With reference to the second aspect or any possible implementation of the second aspect, in another possible implementation, the processor is further configured to output interaction information according to the sound attribute information of the third sound.
In a third aspect, the embodiments of the present application provide an electronic device that can implement the voice enhancement method described in the first aspect, by software, by hardware, or by hardware executing corresponding software. In one possible design, the electronic device may include a processor and a memory. The processor is configured to support the electronic device in performing the corresponding functions of the method of the first or second aspect. The memory is coupled to the processor and stores the program instructions and data necessary for the electronic device.
In a fourth aspect, the embodiments of the present application provide a computer storage medium including computer instructions that, when run on an electronic device, cause the electronic device to execute the voice enhancement method described in any of the above aspects and their possible designs.
In a fifth aspect, the embodiments of the present application provide a computer program product that, when run on a computer, causes the computer to execute the voice enhancement method described in any of the above aspects and their possible designs.
For descriptions of the effects of the second, third, fourth, and fifth aspects, refer to the description of the corresponding effects of the first aspect, which will not be repeated here.
Description of the drawings
FIG. 1 is a schematic diagram of an example sound environment scene provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the software architecture of an electronic device provided by an embodiment of this application;
FIG. 4 is a schematic diagram of a system architecture to which a voice enhancement method provided by an embodiment of this application is applicable;
FIG. 5 is a schematic flowchart of a voice enhancement method provided by an embodiment of this application;
FIG. 6 is a schematic diagram of the principle of a sound localization method provided by an embodiment of this application;
FIG. 7 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application;
FIG. 8 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application;
FIG. 9 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application;
FIG. 10 is a schematic diagram of the structural composition of an electronic device provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In this application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist: for example, A and/or B may mean A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or multiple items. For example, at least one of a, b, or c may represent: a; b; c; a and b; a and c; b and c; or a, b and c; where each of a, b, and c may be single or multiple. In addition, in order to describe the technical solutions of the embodiments of the present application clearly, words such as "first" and "second" are used to distinguish identical or similar items whose functions and effects are substantially the same. Those skilled in the art can understand that words such as "first" and "second" do not limit quantity or execution order; for example, the "first" in "first application" and the "second" in "second application" are only used to distinguish different applications. Descriptions such as "first" and "second" in the embodiments of this application are only used to illustrate and distinguish the described objects; they imply no order, do not indicate any particular limitation on the number of devices, and cannot constitute any limitation on the embodiments of this application.
It should be noted that in this application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as preferable to, or more advantageous than, other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete manner.
First, the basic concepts in the embodiments of the present application are introduced.
Voiceprint: a biological feature composed of more than one hundred characteristic dimensions such as wavelength, frequency, and intensity. The production of human speech is a complex physiological and physical process between the human language center and the vocal organs. Because different people's vocal organs differ in size, shape, and function, the voiceprint patterns of any two people will differ. A voiceprint is specific, relatively stable, and variable. Specificity means that different people have different voiceprints: even if a speaker deliberately imitates another person's voice and tone, or whispers softly, and even if the imitation is vivid, the voiceprint remains different. Relative stability means that after adulthood, a person's voiceprint can remain relatively stable for a long time. Variability may arise from physiology, pathology, psychology, simulation, or disguise, and is also related to factors such as environmental interference. Since no two people's vocal organs are exactly the same, the specificity and relative stability of voiceprints make it possible to distinguish different people's voices by their voiceprints.
Voiceprint recognition: a type of biometric technology that extracts voiceprint features from a speaker's voice signal for identification. Voiceprint recognition may also be called speaker recognition, and includes speaker identification and speaker verification. Speaker identification determines which of several people spoke a given utterance, while speaker verification confirms whether an utterance was spoken by a designated person. Optionally, a deep learning algorithm can be used to extract the speaker's voiceprint features, for example, a speaker recognition system based on a classic deep neural network, or a speaker feature extraction system based on an end-to-end deep neural network.
示例性的,电子设备在进行声纹识别前,可以先进行声纹注册。声纹注册即,用户通过电子设备的拾音设备(比如,麦克风)录入一段语音,比如,可以称为注册语音;设备提取该注册语音的声纹特征;并建立和保存该声纹特征与录入语音的用户的对应关系。本申请实施例中,进行了声纹注册的用户可以称为已注册用户。可以理解的,上述声纹注册可以是进行与用户录入语音的内容无关的声纹注册,即电子设备可以从用户录入的语音中提取声纹特征进行注册。本申请实施例对于从语音信息中提取声纹特征的具体方法并不进行限定。Exemplarily, the electronic device may perform voiceprint registration before performing voiceprint recognition. Voiceprint registration means that the user records a piece of voice through the pickup device of the electronic device (such as a microphone), for example, it can be called a registered voice; the device extracts the voiceprint feature of the registered voice; and establishes and saves the voiceprint feature and input Correspondence of voice users. In the embodiment of the present application, a user who has registered a voiceprint may be referred to as a registered user. It is understandable that the aforementioned voiceprint registration may be a voiceprint registration that is not related to the content of the voice entered by the user, that is, the electronic device may extract voiceprint features from the voice entered by the user for registration. The embodiment of the present application does not limit the specific method for extracting voiceprint features from voice information.
示例性的,电子设备可以支持多个用户的声纹注册。比如,多个用户分别通过电子设备的拾音设备录入一段语音。示例性的,用户1录入语音1,用户2录入语音2,用户3录入语音3。电子设备分别提取各段语音(语音1、语音2和语音3)的声纹特征。进一步的,电子设备建立和保存语音1的声纹特征与用户1的对应关系,建立和保存语音2的声纹特征与用户2的对应关系,建立和保存语音3的声纹特征与用户3的对应关系。这样,电子设备保存了多个用户与该用户的注册语音提取的声纹特征的对应关系,这些用户即为该电子设备上的已注册用户。Exemplarily, the electronic device may support voiceprint registration of multiple users. For example, multiple users separately record a segment of voice through the pickup device of the electronic device. Exemplarily, user 1 enters voice 1, user 2 enters voice 2, and user 3 enters voice 3. The electronic device separately extracts the voiceprint features of each segment of voice (voice 1, voice 2 and voice 3). Further, the electronic device establishes and saves the corresponding relationship between the voiceprint feature of voice 1 and user 1, establishes and saves the corresponding relationship between the voiceprint feature of voice 2 and user 2, and establishes and saves the voiceprint feature of voice 3 with that of user 3. Correspondence. In this way, the electronic device saves the correspondence between multiple users and the voiceprint features extracted from the user's registered voice, and these users are the registered users on the electronic device.
Exemplarily, after receiving a user's voice, the electronic device may perform voiceprint recognition. For example, after receiving the voice, the electronic device may extract its voiceprint features, acquire the voiceprint features of the registration voice of each registered user, and compare the voiceprint features of the received voice with those of each registered user to determine whether they match.
Optionally, in the embodiments of this application, multiple users with registered voiceprints may have different permissions. For example, the usage permissions may differ depending on how the voiceprint was registered: if the voiceprint was actively registered by the user, that user has higher permissions; if the voiceprint was registered through adaptive learning, that user has lower permissions and cannot use confidential functions. The permissions of a user with a registered voiceprint may also be preset by the user, which is not limited in the embodiments of this application.
Speech enhancement technology refers to technology that, in a complex sound environment, extracts the useful speech signal from a voice signal containing multiple kinds of noise, suppressing and reducing the noise interference; that is, it extracts the original speech, as pure as possible, from noisy speech.
Exemplarily, take the electronic device being a smart speaker in the sound environment shown in FIG. 1 as an example. As shown in FIG. 1, the smart speaker may receive multiple sound inputs from the environment, including a user asking "How is the weather today?", a pet dog barking, TV sound, and the sound of people talking. One speech enhancement method applies the same enhancement to all four sounds in the environment of FIG. 1, so the sound input to the smart speaker includes both the user's command voice and the environmental sounds. This prevents the electronic device from correctly capturing the target user's voice command and results in a poor user experience. Moreover, such a speech enhancement algorithm cannot adapt to the environment when the electronic device is in a complex sound environment.
To solve the problem that prior-art speech enhancement algorithms provide a poor intelligent-interaction experience in complex sound environments, the embodiments of this application provide a speech enhancement method that can accurately capture the target user's voice from a complex sound environment and improve the user's intelligent interaction experience.
The speech enhancement method provided by the embodiments of this application can be applied to electronic devices. The electronic device may be a smart home device capable of voice interaction, such as a smart speaker, smart TV, or smart refrigerator; a wearable electronic device capable of voice interaction, such as a smart watch, smart glasses, or smart helmet; or another form of device capable of voice interaction. The embodiments of this application do not impose special restrictions on the specific form of the electronic device.
Please refer to FIG. 2, which is a schematic structural diagram of an electronic device 100 provided by an embodiment of this application. As shown in FIG. 2, the electronic device 100 may include a processor 110, a memory 120, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a microphone 170B, and the like. Optionally, the electronic device 100 may further include a battery 180.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a controller, a memory, a video codec, and a digital signal processor (DSP). The different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the electronic device 100. The controller can generate operation control signals according to instruction operation codes and timing signals, completing the control of instruction fetching and execution.
A memory may also be provided in the processor 110 to store instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory can hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
The memory 120 is used to store instructions and data. In some embodiments, the memory 120 is a cache. This memory can hold instructions or data that the processor 110 has used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from the memory 120, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
The memory 120 may be used to store computer-executable program code, where the executable program code includes instructions. The processor 110 executes the various functional applications and data processing of the electronic device 100 by running the instructions stored in the memory 120. For example, in the embodiments of this application, the processor 110 may perform sound enhancement processing by executing instructions stored in the memory 120. The program storage area may store an operating system and the application programs required by at least one function (for example, a sound playback function). The data storage area may store data created during the use of the electronic device 100 (for example, audio data). In addition, the memory 120 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or universal flash storage (UFS).
In some embodiments, the memory 120 may also be provided in the processor 110, that is, the processor 110 includes the memory 120. The embodiments of this application do not limit this.
The charging management module 140 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. While charging the battery 180, the charging management module 140 may also supply power to the electronic device through the power management module 150.
The power management module 150 is used to connect the battery 180 and the charging management module 140 to the processor 110. The power management module 150 receives input from the battery 180 and/or the charging management module 140, and supplies power to the processor 110, the memory 120, the wireless communication module 160, and so on. The power management module 150 may also be used to monitor parameters such as battery capacity, battery cycle count, and battery health (leakage, impedance). In some other embodiments, the power management module 150 may be provided in the processor 110. In other embodiments, the power management module 150 and the charging management module 140 may be provided in the same device.
The wireless communication module 160 can provide wireless communication solutions applied to the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR). The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via an antenna, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive signals to be sent from the processor 110, perform frequency modulation and amplification on them, and convert them into electromagnetic waves for radiation through the antenna.
The electronic device 100 can implement audio functions, such as music playback and voice collection, through the audio module 170, the speaker 170A, and the application processor.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 100 can be used to listen to music or to hands-free calls through the speaker 170A.
The microphone 170B, also called a "mike", is used to convert sound signals into electrical signals. When the user interacts with the electronic device by voice, the microphone 170B can receive the user's voice input. The electronic device 100 may be provided with at least one microphone 170B. In other embodiments, the electronic device 100 may be provided with two microphones 170B, which, in addition to collecting sound signals, can implement a noise reduction function. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170B to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on. Optionally, the electronic device 100 may be provided with six microphones 170B forming a microphone array, through which the azimuth information of a sound can be analyzed.
The methods in the following embodiments can all be implemented in an electronic device 100 having the above hardware structure.
It can be understood that the electronic device 100 may be divided into different functional modules when implementing the speech enhancement method provided by the embodiments of this application. Exemplarily, as shown in FIG. 3, an electronic device 100 provided by an embodiment of this application may include functional modules such as an audio collection module, an audio processing module, an audio recognition module, an audio synthesis module, and a voice interaction module.
The audio collection module is used to acquire audio data and is responsible for storing and forwarding the acquired audio data. For example, the function of the audio collection module may be implemented by the microphone 170B in FIG. 2: the microphone 170B can receive multiple sound inputs and convert them into audio data through the audio module 170; the audio collection module can then acquire the audio data, store it, and forward it to other modules.
The audio processing module is used to perform enhancement processing on the audio data acquired by the audio collection module, for example, sound attribute analysis and sound separation and filtering on the received voice data. Optionally, the audio processing module may include a sound event recognition module, a sound localization module, a voiceprint recognition module, and so on. The sound event recognition module is used to recognize the multiple sounds collected by the audio collection module and determine the category information of each sound. The sound localization module is used to locate the multiple sounds collected by the audio collection module and analyze the azimuth information of each sound. The voiceprint recognition module is used to perform voiceprint recognition on the various voice data collected by the audio collection module. In the embodiments of this application, the voiceprint recognition module may be used to extract voiceprint features from an input voice and determine whether they match the voiceprint of a registered user.
The audio recognition module is used to perform speech recognition on the processed audio data, for example, recognizing wake-up words and voice commands.
The audio synthesis module is used to synthesize and play audio data. For example, commands from a server can be synthesized into audio data and broadcast as speech through the speaker 170A.
The input of the voice interaction module is the target user's voice after enhancement processing by the audio processing module; the voice interaction module is used to obtain the voice interaction output according to the target user's voice.
Exemplarily, the related functions of the audio processing module, audio recognition module, audio synthesis module, and voice interaction module described above may be implemented by program instructions stored in the memory 120. For example, the processor 110 may implement the voiceprint recognition function by executing the program instructions related to the audio processing module stored in the memory 120.
Exemplarily, the speech enhancement method provided by the embodiments of this application may be applied to the system architecture shown in FIG. 4. As shown in FIG. 4, the sound input to the intelligent speech enhancement model includes both sounds of the daily environment (for example, the sound of running water, TV sound, and cooking sounds) and the voice commands input by the user. The intelligent speech enhancement model is used to apply enhancement processing to the multiple input sounds, and its output includes only the voice commands input by the user.
As shown in FIG. 4, the processing flow of the intelligent speech enhancement model includes sound recognition, sound attribute analysis, and sound separation and filtering. Sound recognition includes the recognition of sound category information; combining the recognized sound category information, the user's voice can be separated from the background sound. Sound attribute analysis performs attribute analyses such as sound localization, time analysis, and voiceprint analysis on the separated user voice. Combining the attribute information of the sound, the enhancement model can determine the target user's voice interaction command from the user voices, that is, it can extract the target user's voice interaction command from multiple human voices. It can be understood that this application applies targeted enhancement processing, through the intelligent speech enhancement model, to the sounds collected by the electronic device in a relatively complex sound environment, thereby extracting the target user's voice interaction command; based on this clean voice interaction command, the device can output more accurate interaction information and improve the user's intelligent interaction experience.
The user interaction feedback in FIG. 4 refers to the user's feedback on the interaction process. For example, after the user inputs a voice interaction command and the electronic device feeds back the interaction output, the user may continue the voice interaction with the electronic device; or, after the electronic device feeds back the voice interaction output, the user may stop the dialogue. Either behavior can be used to judge the enhancement result: the user stopping the dialogue suggests that the enhancement result may not be good and the user is not satisfied with the voice interaction output, while the user continuing to request voice interaction indicates that the enhancement result is good and meets the user's interaction needs. This interaction feedback can assist the intelligent enhancement model in self-learning, making the enhancement model better adapted to the sound environment in which it operates.
The following takes the above electronic device being a smart speaker as an example to describe in detail the technical solutions provided by the embodiments of this application. With reference to FIG. 4, as shown in FIG. 5, the speech enhancement method may include steps S501-S505:
S501. The electronic device collects a first sound.
The first sound includes at least one of a second sound and a background sound. The second sound may include the voice of a user speaking (a human voice for short), and may include a voice interaction command input by the user.
Exemplarily, the background sound may be the sound of the environment in which the electronic device is currently located, for example, daily environmental sounds such as pet calls, TV sound, running water, a coffee machine, a microwave oven, cooking, or an air conditioner. The embodiments of this application do not limit the specific types of background sounds; these are merely illustrative.
As shown in FIG. 1, taking the environment of the smart speaker being the sound environment shown in FIG. 1 as an example, the first sound may include a second sound and a background sound, where the second sound includes the sound of people talking and the user's voice command asking "How is the weather today?", and the background sound includes TV sound and dog barking.
Exemplarily, the second sound and the background sound included in the first sound may come from multiple different sound sources. For example, the four sounds of TV sound, dog barking, the user's voice command, and people talking have different sound sources. Optionally, the sound characteristics of different sound sources differ from one another.
Exemplarily, the electronic device collecting the first sound includes: a microphone array of the electronic device collecting the first sound. Optionally, the first sound collected by the microphones 170B of the array may be converted into audio data through the audio module 170.
S502. The electronic device recognizes the first sound.
Exemplarily, the electronic device recognizing the first sound includes: the electronic device performing sound event recognition on the first sound according to a sound event recognition model to obtain the sound category information of the first sound. For example, the sound category information may include sound types such as human voice, TV sound, air conditioner sound, dog barking, and running water.
For example, the sound event recognition model may be a neural network model obtained by training on a large number of sounds; according to this model, the category information of a sound can be obtained. Optionally, the sound event recognition model may be stored locally on the electronic device or in the cloud, which is not limited in the embodiments of this application.
Optionally, the category information of the first sound may be obtained using a deep learning algorithm, such as a convolutional neural network, a fully connected neural network, a long short-term memory network, or a gated recurrent unit. The embodiments of this application do not limit the specific deep learning algorithm used to obtain sound category information; these are merely illustrative. In this embodiment, the sound category information of the first sound collected by the electronic device can be obtained through a deep learning algorithm.
For example, with reference to FIG. 1, the electronic device can determine, according to a convolutional neural network algorithm, that the category information of the first sound collected by the smart speaker in FIG. 1 includes human voice, dog barking, and TV sound.
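For illustration only, the following is a minimal sketch of one such convolutional sound event classifier in PyTorch. The architecture, class list, and input shape are assumptions made for the sketch, not details from the application, and the untrained model is run on random log-mel input purely to show the data flow.

```python
import torch
import torch.nn as nn

class SoundEventNet(nn.Module):
    """Toy CNN sound event classifier over log-mel spectrogram patches."""
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.classifier(h)

# Hypothetical category list for the sketch.
CLASSES = ["human voice", "tv", "dog bark", "running water", "air conditioner"]

model = SoundEventNet(n_classes=len(CLASSES))
logmel = torch.randn(1, 1, 64, 100)          # (batch, channel, mel bins, frames)
probs = model(logmel).softmax(dim=-1)
print(CLASSES[probs.argmax().item()])        # predicted sound category
```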
Exemplarily, through the electronic device recognizing the first sound in step S502, the sound category information of the first sound is obtained. When the electronic device determines that a second sound exists in the first sound, that is, when the first sound includes a human voice, the above method further includes step S503.
S503. When a second sound exists in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound.
The third sound is a user voice interaction command. Optionally, the second sound may include the third sound or may be the same as the third sound. For example, when the second sound includes only one human voice, the second sound may be the same as the third sound, that is, the second sound may itself be the user voice interaction command; when the second sound includes at least two human voices, the second sound includes the third sound, and the electronic device can extract the third sound from the second sound, that is, extract the user voice interaction command from multiple human voices.
Exemplarily, step S503, in which the electronic device performs sound analysis on the second sound to obtain the third sound when a second sound exists in the first sound, may include steps S5031-S5033.
S5031. The electronic device separates the second sound from the first sound according to the sound category information of the first sound.
Exemplarily, the electronic device may combine the sound category information to separate the second sound from the first sound. That is, when a second sound exists in the first sound, the electronic device can separate the second sound from the background sound in the first sound, filtering out the background sound and isolating the human voice.
Optionally, the electronic device may combine the sound category information and use a sound source separation algorithm to separate the human voice from the background sound of the daily environment, filtering out the background sound and retaining only the human voice. For example, if the first sound collected by the electronic device includes people talking, a user voice interaction command, TV sound, and dog barking, the electronic device can combine the sound category information to separate the TV sound and dog barking from the people talking and the user voice interaction command, filter out the TV sound and dog barking, and retain only the people talking and the user voice interaction command.
Exemplarily, the sound source separation algorithm may be a traditional separation algorithm or a deep learning algorithm, for example, a traditional matrix factorization or singular value decomposition algorithm, or a deep learning algorithm such as a deep neural network, convolutional neural network, or fully connected neural network. The embodiments of this application do not limit the specific sound separation algorithm; these are merely illustrative. Performing sound separation based on a deep learning algorithm enables the electronic device to separate human voices from background sounds, and thus to extract human voices from noisy daily environmental sounds.
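As one concrete, hedged example of a traditional separation algorithm, the sketch below unmixes a two-source synthetic mixture with independent component analysis (FastICA from scikit-learn). The application does not name ICA specifically, and the sinusoid and square wave merely stand in for a voice and a background sound.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 220 * t)             # stand-in for a human voice
s2 = np.sign(np.sin(2 * np.pi * 3 * t))      # stand-in for a background sound
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((t.size, 2))

A = np.array([[1.0, 0.6], [0.4, 1.0]])       # unknown mixing matrix
X = S @ A.T                                   # two-microphone mixture

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)              # estimated sources, up to scale/order
```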
S5032. The electronic device analyzes the sound attribute information of the second sound.
The sound attribute information includes one or more of sound azimuth information, voiceprint information, sound time information, and sound decibel information. Exemplarily, the sound azimuth information refers to the azimuth information of the sound source. The voiceprint information includes voiceprint features extracted from the voice information uttered by the speaker; the voiceprint features can be used for identity recognition. The sound time information refers to the time period in which a certain type of sound often appears. The embodiments of this application do not limit the specific content of the sound attribute information; these are merely illustrative.
Exemplarily, the electronic device analyzing the sound attribute information of the second sound may include: the electronic device obtaining, through the sound localization module, the sound azimuth information of the different sound sources in the second sound. For example, taking a microphone array with six microphones as an example, the electronic device can collect sounds through the six microphones and analyze the azimuth information of different human voices. The embodiments of this application do not limit the specific method by which the microphone array analyzes sound azimuth information; reference may be made to prior-art microphone-array-based sound localization methods, for example, sound localization based on beamforming, on high-resolution spectral estimation, or on the time difference of arrival.
For example, take the electronic device determining the azimuth information of a sound based on the time difference of arrival. The microphone array of the smart speaker can first estimate the times of arrival from the sound source and obtain the delays between the elements of the microphone array; it can then use the obtained time differences, combined with the known spatial positions of the microphone array, to further determine the position of the sound source. For example, as shown in FIG. 6, the solid dots 1, 2, and 3 respectively represent three microphones 170B of the smart speaker. The delay between the sound source's arrival at microphone 1 and at microphone 3 is a constant, from which the bold black hyperbola in FIG. 6 can be drawn; the delay between its arrival at microphone 2 and at microphone 3 is also a constant, from which the dashed hyperbola in FIG. 6 can be drawn. The intersection of these two hyperbolas is the position of the sound source, represented by the black solid square in FIG. 6. Further, the smart speaker can determine the azimuth information of the sound source according to the spatial positions of the microphone array.
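For illustration, a minimal sketch of estimating one such pairwise delay follows, using the generalized cross-correlation with phase transform (GCC-PHAT), a common time-difference-of-arrival estimator; the application does not prescribe this particular estimator, and the signals here are synthetic.

```python
import numpy as np

def gcc_phat_tdoa(sig_a: np.ndarray, sig_b: np.ndarray, sr: int) -> float:
    """Estimate the delay between two microphone signals (negative: b lags a)."""
    n = sig_a.size + sig_b.size
    SA = np.fft.rfft(sig_a, n=n)
    SB = np.fft.rfft(sig_b, n=n)
    cross = SA * np.conj(SB)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / sr

# Toy check: microphone 3 receives the same noise burst 40 samples later.
sr = 16000
rng = np.random.default_rng(1)
src = rng.standard_normal(sr)
delay = 40
mic1 = src
mic3 = np.concatenate((np.zeros(delay), src[:-delay]))
print(gcc_phat_tdoa(mic1, mic3, sr))          # ~ -0.0025 s: mic3 lags by 40 samples
```

Delays estimated this way for each microphone pair give the constants from which the hyperbolas in FIG. 6 are drawn.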
Exemplarily, the electronic device analyzing the sound attribute information of the second sound may further include: the electronic device separating the different human voices in the second sound according to a blind source separation algorithm, performing voiceprint feature extraction on the different human voices, and obtaining the voiceprint features of each human voice. Since the voiceprint features of different sound sources differ from one another, the electronic device can distinguish different human voices according to their voiceprint features.
Exemplarily, the electronic device analyzing the sound attribute information of the second sound may further include: the electronic device analyzing the decibel level of the second sound.
S5033. The electronic device obtains the third sound according to the sound attribute information of the second sound.
Exemplarily, the electronic device may extract the third sound (the user voice interaction command) from at least one human voice according to the attribute information of the sound.
Exemplarily, the voiceprint information of the third sound matches the voiceprint information of a registered user on the electronic device. Optionally, the registered user's voiceprint may have been registered actively by the user or registered by the electronic device through adaptive learning, which is not limited in the embodiments of this application. Optionally, the voiceprint information may be voiceprint features.
In one implementation, the electronic device obtaining the third sound according to the sound attribute information of the second sound may include: the electronic device separating the different human voices in the second sound according to a blind source separation algorithm, extracting voiceprint features from the different human voices, matching the voiceprint features of the different human voices against the voiceprint features of registered users, and, if a match is found, determining the matched human voice as the third sound. Optionally, the electronic device may use a convolutional neural network algorithm to extract voiceprint features from a human voice. The blind source separation algorithm may be a traditional separation algorithm or a deep learning algorithm, for example, a traditional matrix factorization or singular value decomposition algorithm, or a deep learning algorithm such as a deep neural network, convolutional neural network, or fully connected neural network. The embodiments of this application do not limit the specific sound separation algorithm; these are merely illustrative. Performing sound separation based on a deep learning algorithm enables the multiple human voices to be separated when the second sound includes multiple human voices.
Exemplarily, the voiceprint features matching the voiceprint features of a registered user may mean that the degree of match between the voiceprint features of the human voice and those of the registered user exceeds a preset threshold, or that the match with that registered user's voiceprint features is the highest. For example, a confidence value may be used to represent the degree of match between the input voice and a registered user's voiceprint features; the higher the confidence, the higher the degree of match. It is understandable that the confidence may also be described as a confidence probability, a confidence score, a credibility score, and so on; it is a value used to characterize the credibility of the voiceprint recognition result. The embodiments of this application do not limit this, and confidence is used for the description in the embodiments of this application.
For example, if the confidence that the input voice belongs to user 1 is higher than the confidence that it belongs to any other registered user, it is determined that the input voice matches user 1. As another example, a confidence threshold may be set: if the confidence that the input voice belongs to each registered user is less than or equal to the confidence threshold, it is determined that the input voice does not belong to any registered user; if the confidence that the input voice belongs to a certain registered user is greater than or equal to the confidence threshold, it is determined that the input voice belongs to that registered user. Exemplarily, if the confidence threshold is 0.5, the confidence that the input voice belongs to user 1 is 0.9, and the confidence that it belongs to user 2 is 0.4, it can be determined that the input voice belongs to user 1, that is, the input voice matches the voiceprint of user 1.
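As a hedged sketch of this matching rule, the function below scores an input voiceprint against each registered user with cosine similarity (an assumption; the application does not specify the scoring function) and applies the confidence threshold of 0.5 from the example above.

```python
import numpy as np

def identify(voice_emb: np.ndarray,
             registered: dict[str, np.ndarray],
             threshold: float = 0.5) -> str | None:
    """Return the registered user whose voiceprint best matches, or None."""
    best_user, best_conf = None, threshold
    for user, ref_emb in registered.items():
        # Cosine similarity as a stand-in confidence score.
        conf = float(np.dot(voice_emb, ref_emb) /
                     (np.linalg.norm(voice_emb) * np.linalg.norm(ref_emb) + 1e-9))
        if conf > best_conf:
            best_user, best_conf = user, conf
    return best_user

u1, u2 = np.array([0.9, 0.1]), np.array([0.1, 0.9])
print(identify(np.array([0.8, 0.2]), {"user 1": u1, "user 2": u2}))  # -> user 1
```

In a real system the embeddings would come from the deep-neural-network voiceprint extractor described earlier, not from two-dimensional toy vectors.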
For example, if the first sound collected by the electronic device includes people talking, a user voice interaction command, TV sound, and dog barking, the electronic device can, according to the sound category information, separate the TV sound and dog barking from the people talking and the user voice interaction command, filter out the TV sound and dog barking, and retain the people talking and the user voice interaction command. The electronic device then separates the people talking from the user voice interaction command according to the blind source separation algorithm, extracts the voiceprint features of each, and determines whether they match the voiceprint features of a registered user; if they match, the electronic device determines the matched human voice as the third sound.
For example, taking the sound environment shown in FIG. 1 as an example, the smart speaker can, according to the sound category information, separate the human voices from the daily environmental sounds among the human voice, TV sound, and dog barking, and filter out the dog barking and TV sound; it can then separate the different human voices (the people talking and the user voice interaction command) using the blind source separation algorithm, extract the voiceprint features of each, and match them against the voiceprint features of registered users. If the voiceprint features of the user's voice asking "How is the weather today?" match the voiceprint of a registered user, the electronic device can determine the sound of the user asking "How is the weather today?" as the third sound.
Optionally, when the first sound collected by the electronic device includes multiple human voices and the voiceprint features of at least two of them match the voiceprint features of registered users, the electronic device may determine, according to the decibel levels of those at least two human voices, the louder human voice as the third sound. Optionally, in that case the electronic device may also determine, according to the position information of the at least two human voices relative to the electronic device, the human voice closer to the electronic device as the third sound.
In another implementation, the electronic device obtaining the third sound according to the sound attribute information of the second sound may further include: the electronic device obtaining the third sound from the second sound according to information such as the azimuth information, energy features, or sound features of the second sound.
Exemplarily, the electronic device may determine the third sound from the second sound according to the azimuth information of the different human voices in the second sound. For example, taking the sound environment shown in FIG. 1 as an example, the smart speaker can filter out the TV sound and dog barking from the TV sound, dog barking, people talking, and user voice interaction command, separate the people talking from the user voice interaction command, and then combine the azimuth information of the people talking and of the user voice interaction command. If the azimuth information of the user voice interaction command is at the sofa and the azimuth information of the people talking is at the dining table, the smart speaker can determine the user voice interaction command from the sofa as the third sound. As another example, if the azimuth information of both the user voice interaction command and the people talking is at the sofa, but the decibel level of the user voice interaction command is much greater than that of the people talking, the smart speaker can also determine the user voice interaction command as the target user's voice.
Exemplarily, the electronic device may also use a deep learning algorithm to determine the third sound from the different human voices included in the second sound according to the energy features or sound features of the sounds. For example, a neural network model may be trained on a large number of conversations and user voice interaction commands from users at different positions relative to the electronic device; this neural network model is used to identify user voice interaction commands. Based on the trained neural network model, the electronic device can determine, according to the sound features or energy features, the user voice interaction command as the third sound from among the people talking and the user voice interaction command. For example, the electronic device may determine whether the different human voices in the second sound are facing the electronic device or facing away from it, and determine the human voice facing the electronic device as the third sound.
In yet another implementation, the electronic device obtaining the third sound according to the sound attribute information of the second sound may further include: the electronic device determining, according to the decibel levels of the second sound, the louder human voice in the second sound as the third sound.
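For illustration, a minimal sketch of this decibel-based selection follows; the RMS-based level measure is an assumption made for the sketch.

```python
import numpy as np

def pick_loudest(voices: list[np.ndarray]) -> np.ndarray:
    """Among candidate human voices, keep the one with the highest level."""
    def dbfs(x: np.ndarray) -> float:
        rms = np.sqrt(np.mean(np.square(x)) + 1e-12)
        return 20.0 * np.log10(rms)           # dB relative to full scale
    return max(voices, key=dbfs)
```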
It is understandable that the embodiments of this application can extract the user voice interaction command from multiple human voices by performing sound attribute analysis on the second sound in combination with the separation algorithm.
S504. The electronic device processes the third sound.
Exemplarily, as shown in FIG. 1, the smart speaker can extract the user voice interaction command "How is the weather today?" from the sound environment shown in FIG. 1 and input "How is the weather today?" into the voice interaction module for processing. It is understandable that, since the sound input to the voice interaction module in the embodiments of this application is a clean user voice interaction command, the voice interaction module can obtain a more accurate voice interaction output according to that command.
S505. The electronic device outputs interaction information according to the third sound.
Exemplarily, the interaction information output by the electronic device may be voice interaction information or a control signal. For example, if the user's voice interaction command is "How is the weather today?", the electronic device may output the voice interaction information "It is sunny today, with a minimum temperature of 25 degrees and a maximum temperature of 35 degrees." As another example, if the user's voice interaction command is "Turn on the living room light", the electronic device may output a control signal to turn on the living room light.
Optionally, the electronic device may also output the interaction information in combination with the sound attribute information of the third sound. Exemplarily, the electronic device may output the interaction information in combination with the azimuth information and/or the time information of the third sound. For example, if the third sound is "Turn on the light" and its azimuth information is in the kitchen, the electronic device may output a control signal to turn on the kitchen light according to the azimuth information of the third sound. As another example, if the third sound is "Turn on the TV", the electronic device may, after an extensive learning process, output a control signal to turn on the TV and adjust the TV's volume according to the time information of the third sound. As another example, if the third sound is "Turn on the living room light", the electronic device may, in combination with the time information of the third sound, output a control signal to turn on the living room light at a brightness appropriate to that time.
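Purely as an illustration of combining the recognized command with azimuth and time information, the sketch below maps them to a control signal; the command strings, room mapping, and brightness rule are invented for the example and do not come from the application.

```python
def respond(command: str, room: str, hour: int) -> str:
    # Hypothetical dispatch combining the recognized command with the
    # command's azimuth (mapped to a room) and its time information.
    if command == "turn on the light":
        brightness = 30 if hour >= 22 else 100    # dimmer late at night
        return f"light/{room}/on/brightness={brightness}"
    return f"tts/answer:{command}"

print(respond("turn on the light", "kitchen", 23))  # light/kitchen/on/brightness=30
```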
Exemplarily, the electronic device may also output the interaction information in combination with the voiceprint features of the third sound. For example, when the voice interaction command of a user (a child in the home) is "Turn on the light", the electronic device may, in combination with that user's voiceprint features and the sound time information and after an extensive learning process, output a control signal to turn on the light of the second bedroom. As another example, when the mother inputs the voice interaction command "Turn on the light", the electronic device may, in combination with her voiceprint features and the sound time information, output a control signal to turn on the kitchen light.
Exemplarily, the electronic device may also output the interaction information in combination with the age information of the third sound. For example, when a child inputs the user voice interaction command "Turn on the TV", the electronic device can recognize from the command that the speaker is a child and play cartoons or children's programs according to the recognized age information. As another example, when a grandfather inputs the user voice interaction command "Turn on the TV", the electronic device can recognize that the speaker is elderly and play the previously watched TV program according to the recognized age information.
Optionally, in the embodiments of this application, users with different voiceprints may have different permissions. For example, when a child in the home does not have permission to turn on the TV, if the third sound is the voice interaction command "Turn on the TV" and its voiceprint features match the voiceprint features of a registered user, but the user with that voiceprint does not have permission to turn on the TV, the interaction information output by the electronic device may be "You do not have permission to turn on the TV" or another voice prompt.
Optionally, users may have different permissions depending on how they registered their voiceprints. For example, a user registered through adaptive learning has lower permissions, while a user who actively registered a voiceprint has higher permissions and can use highly confidential functions. The permissions of users with different voiceprints may also be preset by the user. The embodiments of this application do not limit this.
It is understandable that, in this embodiment, the electronic device extracts a clean user voice interaction command (the third sound) from a complex sound environment (the first sound) and processes the user voice interaction command, thereby outputting more accurate interaction information and providing a better intelligent interaction experience.
This embodiment provides a speech enhancement method: the electronic device collects a first sound; the electronic device recognizes the first sound; when a second sound exists in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound; the electronic device processes the third sound; and the electronic device outputs interaction information. Based on a deep learning algorithm, this embodiment can separate human voices from multiple sounds including background sounds and extract a clean user voice interaction command from the human voices, so that the user voice interaction command is processed and the voice interaction output obtained is more accurate; the voice interaction can therefore be completed accurately, improving the user's intelligent interaction experience. Moreover, when extracting the user voice interaction command from a complex sound environment, this application recognizes and analyzes the sounds and combines the attribute information of the sounds to obtain the user voice interaction command (the third sound). This speech enhancement method therefore does not apply the same enhancement to all input sounds, but performs targeted enhancement in combination with the attribute information of the currently collected sounds; it can thus adapt to complex sound environments and improves the user's intelligent interaction experience in such environments.
An embodiment of the present application further provides a voice enhancement method. As shown in FIG. 7, steps S701-S702 may further be included after the above step S502.
S701. The electronic device clusters a fourth sound to obtain new sound category information.
The fourth sound is a sound in the first sound whose sound category information is not recognized by the sound event recognition model. Optionally, the electronic device may store the unrecognized fourth sound in memory.
For example, when the electronic device leaves the factory, the types of sound that the sound event recognition model can recognize are limited; typically the model recognizes the more common sounds, such as human voices, running water, dog barking, and television sounds, but cannot recognize the sound of a coffee machine grinding beans, the sound of an oven, or the sound of a washing machine. The electronic device can collect and store such sounds whose categories the sound event recognition model does not recognize.
Optionally, the fourth sound may be an unrecognized sound that occurs frequently, and the electronic device may collect a large number of fourth sounds.
For example, the electronic device can cluster a large number of fourth sounds through unsupervised learning, discover the inherent properties and regularities of the data, and thereby obtain new sound category information. The clustering analysis may use, for example, a hierarchical clustering algorithm, a density-based clustering algorithm, expectation-maximization (EM), fuzzy clustering, or K-means clustering. The embodiments of the present application do not limit the specific clustering analysis method adopted by the electronic device; the above is merely an exemplary description.
For example, from the large number of coffee-machine grinding sounds it has collected, the electronic device can obtain a new sound category through a clustering analysis algorithm, so that the next time the input sound includes the sound of a coffee machine grinding beans, the electronic device can recognize that sound category.
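As a non-limiting sketch of step S701, and assuming each stored fourth sound has already been converted into a fixed-length feature vector (the feature extraction, the choice of scikit-learn's KMeans, and the cluster count are all assumptions made here for illustration):

    # Minimal sketch of step S701: cluster unrecognized sounds to discover
    # new sound categories. Feature extraction and cluster count are assumed.
    import numpy as np
    from sklearn.cluster import KMeans

    def discover_new_categories(features: np.ndarray, n_clusters: int = 3):
        """features: (num_clips, feature_dim) vectors of unrecognized sounds."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = km.fit_predict(features)
        # Each cluster center can serve as the prototype of a new category,
        # e.g. a frequently heard coffee grinder forms its own dense cluster.
        return labels, km.cluster_centers_

    # Example: 200 stored clips represented by 64-dimensional embeddings.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 64))
    labels, centers = discover_new_categories(feats)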
S702. The electronic device updates the sound event recognition model according to the new sound category information to obtain an updated sound event recognition model.
For example, the electronic device may take the new sound category information obtained after clustering as training samples and retrain the sound event recognition model to obtain a new sound event recognition model; the updated model is more stable and more robust. That is, after the electronic device obtains the new sound event recognition model, it can replace the original sound event recognition model with the new one.
Optionally, after the electronic device updates the sound event recognition model, when it executes the voice enhancement method of steps S501-S505 again, the multiple sounds can be recognized in step S502 according to the updated sound event recognition model to obtain the category information of each sound. Because the updated model can recognize more types of sound events, its recognition performance is better.
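Step S702 may likewise be pictured, again only as an assumed sketch, as retraining a classifier on the original training samples plus the newly clustered fourth sounds labeled with the new category; the MLP classifier and its parameters below are an illustrative choice, not a requirement of the embodiments:

    # Sketch of step S702: retrain the sound event recognition model after
    # a new category is discovered. The classifier choice is an assumption.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def update_recognition_model(old_X, old_y, new_X, new_category_id):
        """Retrain on the original samples plus the clustered fourth sounds."""
        X = np.vstack([old_X, new_X])
        y = np.concatenate([old_y, np.full(len(new_X), new_category_id)])
        model = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
        model.fit(X, y)  # the updated model now knows the new category
        return model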
Optionally, steps S701-S702 may be executed after steps S503-S505, before steps S503-S505, or concurrently with steps S503-S505; the embodiments of the present application do not limit the order of execution.
It can be understood that the embodiments of the present application update the sound event recognition model with the new sound category information, so that the model can adapt to its environment.
In this embodiment, the sounds not recognized by the sound event recognition model can be clustered and used to train a new sound event recognition model; that is, the electronic device can self-learn the sounds that frequently occur in its environment and update the model accordingly, adapting to the environment and improving the user's interaction experience. Moreover, the stability and robustness of the sound event recognition model increase the longer the user uses the device, so the model performs better with continued use.
The present application further provides another embodiment. As shown in FIG. 8, steps S801-S803 may further be included after the above step S502.
S801. The electronic device obtains sound orientation information of the first sound.
For example, the electronic device may obtain the orientation information of the first sound according to the microphone-array-based sound localization method of step S5032 and save the sound orientation information. For instance, the electronic device may record the orientation information of fixed sound sources, such as television sounds, running-water sounds, and washing-machine sounds. Optionally, the electronic device may store the sound orientation information in memory.
S802. The electronic device obtains orientation information of useless sounds according to the sound orientation information of the first sound and the sound category information of the first sound.
For example, the electronic device may include a sound localization module configured to determine the orientation information of a sound source, for instance by means of the microphone array in the electronic device.
For example, the electronic device may perform adaptive learning based on the sound orientation information and the sound category information to obtain the orientation information of useless sounds. The sound localization module may combine the orientation information of the sounds with their category information and determine sounds from fixed directions, other than human voices, to be useless sounds. For example, television sounds, running-water sounds, cooking sounds, coffee-machine grinding sounds, and washing-machine sounds are all useless sounds whose orientation is relatively fixed with respect to the electronic device, and the electronic device can obtain the orientation information of these useless sounds.
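One possible, non-limiting realization of the adaptive learning in S802 is to accumulate (direction, category) observations over time and mark directions dominated by fixed non-voice categories as useless; the sector width, category label, and count threshold below are hypothetical:

    # Sketch of step S802: learn the directions of useless (fixed, non-voice)
    # sounds from accumulated (direction, category) observations.
    from collections import Counter

    SECTOR_DEG = 10  # angular resolution of the learned direction map (assumed)

    class UselessDirectionLearner:
        def __init__(self):
            self.counts = Counter()  # (sector, category) -> occurrences

        def observe(self, azimuth_deg: float, category: str):
            sector = int(azimuth_deg // SECTOR_DEG)
            self.counts[(sector, category)] += 1

        def useless_sectors(self, min_count: int = 50):
            # A sector is useless when a non-voice category keeps appearing
            # there, e.g. a television or washing machine at a fixed position.
            return {s for (s, cat), n in self.counts.items()
                    if cat != "human_voice" and n >= min_count}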
S803. The electronic device filters out sounds coming from the orientation of a useless sound according to the orientation information of the useless sound.
For example, when enhancing the sounds it receives, the electronic device may filter out the sounds coming from the orientation indicated by the useless-sound orientation information.
Optionally, the electronic device may filter sounds from that orientation based on both the orientation information of the useless sound and the category information of the sounds. For example, if, among the sounds received by the electronic device, the sound arriving from the useless-sound orientation is not a human voice, the electronic device filters out the sound from that orientation. In other words, the orientation information of useless sounds can assist the electronic device in sound separation and filtering.
For example, by combining sound orientation information with sound category information, the electronic device can obtain the orientations of sounds that frequently occur in a specific scene, such as the position of the television sound or of the running-water sound, and can then suppress sounds from those directions to better assist sound separation.
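Given such learned directions, the filtering of S803 may be sketched as discarding separated sources whose estimated direction of arrival falls in a blocked sector unless they are human voices; the data layout below is assumed for illustration only:

    # Sketch of step S803: drop separated sources arriving from learned
    # useless directions, keeping human voices regardless of direction.
    def filter_by_direction(sources, blocked_sectors, sector_deg=10):
        """sources: list of dicts like {"audio": ..., "azimuth": 135.0,
        "category": "tv_sound"}; this layout is illustrative only."""
        kept = []
        for src in sources:
            sector = int(src["azimuth"] // sector_deg)
            if sector in blocked_sectors and src["category"] != "human_voice":
                continue  # e.g. ignore the television's fixed direction
            kept.append(src)
        return kept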
In this embodiment, by learning the orientation information of various sounds and combining it with their sound category information, the orientations of sounds that frequently occur in a specific scene can be obtained, which better assists sound separation.
It should be noted that traditional speech enhancement methods can neither perform targeted enhancement of sounds from different directions nor adapt to the environment. The method in this embodiment, however, can use sound orientation information to identify the directions of frequently occurring useless sounds even as the environment of the electronic device keeps changing, and can ignore sounds from those directions during separation, thereby ensuring the enhancement effect, improving the user interaction experience, and adapting to the environment.
An embodiment of the present application further provides a voice enhancement method. As shown in FIG. 9, steps S901-S902 may further be included after the above steps S501-S505.
S901. Obtain voice interaction information of the third sound.
The voice interaction information of the third sound includes the user voice interaction command processed by the electronic device (the third sound) and the commands with which the user further interacts with the electronic device by voice after the electronic device outputs its voice interaction feedback. In other words, the voice interaction feedback between the electronic device and the user is good, and the user continues to actively interact with the electronic device by voice.
For example, the user's voice interaction command is "How is the weather today?", and the voice interaction output fed back by the electronic device is "Today is sunny, with a low of 25 degrees and a high of 35 degrees". Because the feedback from the electronic device is relatively accurate, the user asks a further question: "What clothes should I wear?". It can be understood that, because the electronic device can respond accurately to the user's interaction commands, further voice interaction between the user and the electronic device is encouraged.
For example, the electronic device may collect the voice interaction information of the third sound, which may include at least two user voice interaction commands, such as "How is the weather today?" and "What clothes should I wear?".
S902. Perform voiceprint registration for a third sound whose voiceprint is not registered in the interactive voice information of the third sound.
For example, the electronic device may perform unsupervised mining on the interactive voice information it collects. If the voiceprint features extracted from the interactive voice information are not registered, the electronic device performs text-independent voiceprint registration.
For example, the electronic device may collect a large amount of interactive voice information with good interaction feedback and, through unsupervised learning, mine from it the voices of users whose voiceprint features are not yet registered. It then extracts voiceprint features from those users' voice information based on a deep learning algorithm and performs voiceprint registration. Optionally, this voiceprint registration process may be called adaptive voiceprint registration.
Optionally, after the electronic device performs adaptive voiceprint registration through steps S901-S902, when it executes the voice enhancement method of steps S501-S505 again, the voiceprint information of registered users in step S503 includes the adaptively registered voiceprint information. The electronic device can match the voiceprint information of the multiple sounds against the voiceprint information of the registered users to obtain the third sound.
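A minimal sketch of adaptive voiceprint registration follows, assuming a deep speaker-embedding extractor has already produced a fixed-length embedding for each well-received utterance, and using an assumed cosine-similarity threshold; none of the names, values, or storage choices below are prescribed by the embodiments:

    # Sketch of steps S901-S902: text-independent adaptive voiceprint
    # registration. The embedding is assumed to come from any deep
    # speaker-embedding model; the 0.7 cosine threshold is an assumption.
    import numpy as np

    registered = {}  # user_id -> enrolled voiceprint embedding

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def maybe_register(utterance_embedding: np.ndarray, threshold: float = 0.7):
        for user_id, emb in registered.items():
            if cosine(utterance_embedding, emb) >= threshold:
                return user_id  # this voiceprint is already registered
        new_id = "user_%d" % len(registered)
        registered[new_id] = utterance_embedding  # adaptive registration
        return new_id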
It can be understood that in this embodiment, by routinely collecting user interaction voices that receive good interaction feedback and combining speech recognition with text-independent voiceprint registration, the electronic device can learn voiceprint information by itself and register it. The voice corresponding to that voiceprint information can then be enhanced during intelligent interaction, improving the intelligent interaction experience.
It should be noted that traditional voice enhancement algorithms can neither perform targeted enhancement of voices from different people nor adaptively register the voiceprints of people who have not registered. This embodiment, however, uses adaptive voiceprint registration to ensure the enhancement effect even as the device's environment becomes increasingly complex, and in practical applications the method achieves better results the longer the user uses it.
The voice enhancement algorithm provided in the present application can be applied to scenes with a relatively complex sound environment. The first sound collected by the electronic device includes both the user's voice and the background sounds of the everyday environment in which the device is located. The electronic device can perform sound event recognition on the collected first sound and determine its sound category information. Then, according to the sound category information of the first sound, the second sound is separated from the first sound; that is, the background sounds of the everyday environment are filtered out and only the user voices are retained. When the second sound includes multiple user voices, the electronic device can separate them according to a deep learning algorithm and then, combining the attribute information of the multiple user voices, determine the third sound (the user voice interaction command) from the second sound; the third sound is the voice interaction command of the target user. A clean voice interaction command of the target user can thus be extracted from the complex sound environment, so that the voice interaction feedback output by the electronic device based on the third sound is more accurate. When extracting the user voice interaction command from a complex sound environment, this solution recognizes and analyzes the sounds and combines the result with the attribute information of the sounds to obtain the user voice interaction command (the third sound). The voice enhancement method therefore does not apply the same enhancement to all input sounds, but performs targeted enhancement based on the attribute information of the currently collected sounds, so it can adapt to complex sound environments and improves the user's intelligent interaction experience in such environments.
For example, the voice enhancement algorithm provided in the embodiments of the present application is described with reference to the voice environment of the smart speaker shown in FIG. 1. The smart speaker collects a first sound that includes television sounds, dog barking, user conversation, and a user voice interaction command. The electronic device recognizes the first sound according to the sound event recognition model and determines its sound category information, which includes human voices, dog barking, and television sounds. Combining the sound category information of the first sound with a sound source separation algorithm, the electronic device can separate the human voices from the dog barking and the television sounds, filter out the latter two, and obtain the human voices (the second sound). The electronic device then performs blind source separation on the human voices (including conversation and the user voice interaction command), separating the different human voices, and obtains the third sound based on the sound attribute information of the different voices (voiceprint features and/or sound orientation information); that is, it extracts the voice interaction command of the target user from multiple human voices. It can be understood that, by recognizing the collected first sound and analyzing its attributes, that is, by performing targeted enhancement of the sounds in a complex sound environment, the electronic device in the embodiments of the present application can extract a clean voice interaction command of the target user from the complex sound environment, thereby improving the user's intelligent interaction experience.
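For illustration, the smart-speaker scenario above corresponds to the following simplified pipeline sketch, in which the segment representation, category labels, and speaker labels are hypothetical stand-ins for real audio and for the event-recognition, separation, and voiceprint models that this application deliberately leaves open:

    # End-to-end sketch of the smart-speaker example; the segment dicts and
    # labels are illustrative placeholders, not a prescribed implementation.
    def enhance(first_sound, target_voiceprint):
        # Keep only segments recognized as human voice (the second sound);
        # television sounds and dog barking are filtered out here.
        voices = [s for s in first_sound if s["category"] == "human_voice"]
        # Blind source separation plus voiceprint matching yields the
        # third sound, i.e. the target user's voice interaction command.
        for seg in voices:
            if seg["speaker"] == target_voiceprint:
                return seg["command"]
        return None

    first_sound = [
        {"category": "tv_sound"},
        {"category": "dog_bark"},
        {"category": "human_voice", "speaker": "guest", "command": "chatting"},
        {"category": "human_voice", "speaker": "owner", "command": "play music"},
    ]
    print(enhance(first_sound, "owner"))  # -> "play music"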
It can be understood that, in order to realize the above functions, the electronic device includes corresponding hardware structures and/or software modules for executing each function. Those skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is executed by hardware or by computer-software-driven hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the embodiments of the present application.
The embodiments of the present application may divide the electronic device into functional modules according to the above method examples. For example, each functional module may correspond to a single function, or two or more functions may be integrated into one processing module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of the present application is illustrative and is only a logical division of functions; other division methods may be used in actual implementation.
In the case of an integrated unit, FIG. 10 shows a schematic diagram of a possible structure of the electronic device involved in the foregoing embodiments. The electronic device 1000 includes a processing unit 1001 and a storage unit 1002.
The processing unit 1001 is configured to control and manage the actions of the electronic device 1000. For example, it can be used to execute the processing steps S501-S505 in FIG. 5, the processing steps S701 and S702 in FIG. 7, the processing steps S801-S803 in FIG. 8, or the processing steps S901-S902 in FIG. 9, and/or other processes of the technology described herein.
The storage unit 1002 is configured to store the program code and data of the electronic device 1000, for example, sound orientation information.
Of course, the unit modules of the electronic device 1000 include but are not limited to the processing unit 1001 and the storage unit 1002. For example, the electronic device 1000 may also include an audio unit, a communication unit, and the like. The audio unit is configured to collect the voice uttered by the user and to play voice. The communication unit is configured to support communication between the electronic device 1000 and other apparatuses.
The processing unit 1001 may be a processor or a controller, for example a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may include an application processor and a baseband processor. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of the present application. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The storage unit 1002 may be a memory. The audio unit may include a microphone, a speaker, and so on. The communication unit may be a transceiver, a transceiver circuit, a communication interface, or the like.
For example, the processing unit 1001 is a processor (the processor 110 shown in FIG. 2), and the storage unit 1002 may be a memory (the internal memory 120 shown in FIG. 2). The audio unit may include a speaker (the speaker 170A shown in FIG. 2) and a microphone (the microphone 170B shown in FIG. 2). The communication unit includes a wireless communication module (the wireless communication module 160 shown in FIG. 2); wireless communication modules may be collectively referred to as a communication interface. The electronic device 1000 provided in the embodiments of the present application may be the electronic device 100 shown in FIG. 2. The processor, memory, communication interface, and so on may be coupled together, for example connected by a bus.
An embodiment of the present application further provides a computer storage medium storing computer program code. When the processor executes the computer program code, the electronic device executes the relevant method steps in FIG. 5, FIG. 7, FIG. 8, or FIG. 9 to implement the method in any of the foregoing embodiments.
An embodiment of the present application further provides a computer program product. When the computer program product runs on a computer, the computer executes the relevant method steps in FIG. 5, FIG. 7, FIG. 8, or FIG. 9 to implement the method in any of the foregoing embodiments.
The electronic device 1000, the computer storage medium, and the computer program product provided in the embodiments of the present application are all used to execute the corresponding methods provided above. Therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, which are not repeated here.
Through the description of the above implementations, those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional modules is used as an example. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the modules or units is only a logical division of functions, and other divisions may be used in actual implementation. For example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may be one physical unit or multiple physical units; that is, they may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

  1. A voice enhancement method, characterized in that the method comprises:
    an electronic device collecting a first sound, the first sound comprising at least one of a second sound and a background sound;
    the electronic device recognizing the first sound; when the second sound exists in the first sound, the electronic device performing sound analysis on the second sound to obtain a third sound;
    the electronic device processing the third sound.
  2. The method according to claim 1, characterized in that the electronic device recognizing the first sound comprises:
    the electronic device performing sound event recognition on the first sound according to a sound event recognition model to obtain sound category information of the first sound.
  3. The method according to claim 2, characterized in that the electronic device performing sound analysis on the second sound to obtain a third sound comprises:
    the electronic device separating the second sound from the first sound according to the sound category information of the first sound;
    the electronic device analyzing sound attribute information of the second sound, wherein the sound attribute information comprises one or more of sound orientation information, voiceprint information, sound time information, and sound decibel information;
    the electronic device obtaining the third sound according to the sound attribute information of the second sound.
  4. The method according to any one of claims 1-3, characterized in that the voiceprint information of the third sound matches the voiceprint information of a registered user.
  5. The method according to any one of claims 1-4, characterized in that the method further comprises:
    the electronic device clustering a fourth sound to obtain new sound category information, the fourth sound being a sound in the first sound whose sound category information is not recognized according to the sound event recognition model;
    updating the sound event recognition model according to the new sound category information to obtain an updated sound event recognition model.
  6. The method according to any one of claims 1-5, characterized in that the method further comprises:
    the electronic device obtaining sound orientation information of the first sound;
    obtaining orientation information of a useless sound according to the sound orientation information of the first sound and the sound category information of the first sound;
    the electronic device filtering out the sound from that orientation according to the orientation information of the useless sound.
  7. The method according to any one of claims 1-6, characterized in that the method further comprises:
    obtaining voice interaction information of the third sound;
    performing voiceprint registration for a third sound whose voiceprint is not registered in the interactive voice information of the third sound.
  8. The method according to any one of claims 1-7, characterized in that the method further comprises:
    the electronic device outputting interaction information according to the sound attribute information of the third sound.
  9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the memory being coupled to the processor and configured to store computer program code, the computer program code comprising computer instructions that, when executed by the processor, cause the electronic device to execute the method according to any one of claims 1-8.
  10. A computer storage medium, characterized in that the computer storage medium comprises computer instructions that, when run on an electronic device, cause the electronic device to execute the method according to any one of claims 1-8.
  11. A computer program product, characterized in that, when the computer program product runs on a computer, the computer is caused to execute the method according to any one of claims 1-8.
  12. A voice enhancement apparatus, characterized by comprising:
    a processor configured to recognize a first sound, the first sound being collected by a voice collection device and comprising at least one of a second sound and a background sound;
    when the second sound exists in the first sound, the processor performing sound analysis on the second sound to obtain a third sound;
    the processor processing the third sound.
  13. The apparatus according to claim 12, characterized in that the processor is further configured to perform sound event recognition on the first sound according to a sound event recognition model to obtain sound category information of the first sound.
  14. The apparatus according to claim 13, characterized in that the processor is further configured to:
    separate a second sound from the first sound according to the sound category information of the first sound;
    analyze sound attribute information of the second sound, wherein the sound attribute information comprises one or more of sound orientation information, voiceprint information, sound time information, and sound decibel information;
    obtain the third sound according to the sound attribute information of the second sound.
  15. The apparatus according to any one of claims 12-14, characterized in that the voiceprint information of the third sound matches the voiceprint information of a registered user.
  16. The apparatus according to any one of claims 12-15, characterized in that the processor is further configured to:
    cluster a fourth sound to obtain new sound category information, the fourth sound being a sound in the first sound whose sound category information is not recognized according to the sound event recognition model;
    update the sound event recognition model according to the new sound category information to obtain an updated sound event recognition model.
  17. The apparatus according to any one of claims 12-16, characterized in that the processor is further configured to:
    obtain sound orientation information of the first sound;
    obtain orientation information of a useless sound according to the sound orientation information of the first sound and the sound category information of the first sound;
    filter out the sound from that orientation according to the orientation information of the useless sound.
  18. The apparatus according to any one of claims 12-17, characterized in that the processor is further configured to:
    obtain interactive voice information of the third sound;
    perform voiceprint registration for a third sound whose voiceprint is not registered in the interactive voice information of the third sound.
  19. The apparatus according to any one of claims 12-18, characterized in that the processor is further configured to:
    output interaction information according to the sound attribute information of the third sound.
PCT/CN2020/105296 2019-08-21 2020-07-28 Method and device for voice enhancement WO2021031811A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910774538.0 2019-08-21
CN201910774538.0A CN112420063A (en) 2019-08-21 2019-08-21 Voice enhancement method and device

Publications (1)

Publication Number Publication Date
WO2021031811A1 true WO2021031811A1 (en) 2021-02-25

Family

ID=74660413

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105296 WO2021031811A1 (en) 2019-08-21 2020-07-28 Method and device for voice enhancement

Country Status (2)

Country Link
CN (1) CN112420063A (en)
WO (1) WO2021031811A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113794963B (en) * 2021-09-14 2022-08-05 深圳大学 Speech enhancement system based on low-cost wearable sensor
CN114722884B (en) * 2022-06-08 2022-09-30 深圳市润东来科技有限公司 Audio control method, device and equipment based on environmental sound and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1889172A (en) * 2005-06-28 2007-01-03 松下电器产业株式会社 Sound sorting system and method capable of increasing and correcting sound class
US20150242180A1 (en) * 2014-02-21 2015-08-27 Adobe Systems Incorporated Non-negative Matrix Factorization Regularized by Recurrent Neural Networks for Audio Processing
CN107146614A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 A kind of audio signal processing method, device and electronic equipment
CN107357875A (en) * 2017-07-04 2017-11-17 北京奇艺世纪科技有限公司 A kind of voice search method, device and electronic equipment
CN108172219A (en) * 2017-11-14 2018-06-15 珠海格力电器股份有限公司 The method and apparatus for identifying voice
CN109087631A (en) * 2018-08-08 2018-12-25 北京航空航天大学 A kind of Vehicular intelligent speech control system and its construction method suitable for complex environment
WO2019046151A1 (en) * 2017-08-28 2019-03-07 Bose Corporation User-controlled beam steering in microphone array

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7117145B1 (en) * 2000-10-19 2006-10-03 Lear Corporation Adaptive filter for speech enhancement in a noisy environment
CN105703978A (en) * 2014-11-24 2016-06-22 武汉物联远科技有限公司 Smart home control system and method
CN105280183B (en) * 2015-09-10 2017-06-20 百度在线网络技术(北京)有限公司 voice interactive method and system
CN105185378A (en) * 2015-10-20 2015-12-23 珠海格力电器股份有限公司 Voice control method, voice control system and voice-controlled air-conditioner
WO2019133732A1 (en) * 2017-12-28 2019-07-04 Knowles Electronics, Llc Content-based audio stream separation
CN108831440A (en) * 2018-04-24 2018-11-16 中国地质大学(武汉) A kind of vocal print noise-reduction method and system based on machine learning and deep learning
CN109712625A (en) * 2019-02-18 2019-05-03 珠海格力电器股份有限公司 Smart machine control method based on gateway, control system, intelligent gateway

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1889172A (en) * 2005-06-28 2007-01-03 松下电器产业株式会社 Sound sorting system and method capable of increasing and correcting sound class
US20150242180A1 (en) * 2014-02-21 2015-08-27 Adobe Systems Incorporated Non-negative Matrix Factorization Regularized by Recurrent Neural Networks for Audio Processing
CN107146614A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 A kind of audio signal processing method, device and electronic equipment
CN107357875A (en) * 2017-07-04 2017-11-17 北京奇艺世纪科技有限公司 A kind of voice search method, device and electronic equipment
WO2019046151A1 (en) * 2017-08-28 2019-03-07 Bose Corporation User-controlled beam steering in microphone array
CN108172219A (en) * 2017-11-14 2018-06-15 珠海格力电器股份有限公司 The method and apparatus for identifying voice
CN109087631A (en) * 2018-08-08 2018-12-25 北京航空航天大学 A kind of Vehicular intelligent speech control system and its construction method suitable for complex environment

Also Published As

Publication number Publication date
CN112420063A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
US11386905B2 (en) Information processing method and device, multimedia device and storage medium
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN111091824B (en) Voice matching method and related equipment
US20220317641A1 (en) Device control method, conflict processing method, corresponding apparatus and electronic device
CN113035227B (en) Multi-modal voice separation method and system
CN109189980A (en) The method and electronic equipment of interactive voice are carried out with user
WO2019134473A1 (en) Speech recognition system, method and apparatus
WO2021031811A1 (en) Method and device for voice enhancement
CN109999314A (en) One kind is based on brain wave monitoring Intelligent sleep-assisting system and its sleep earphone
CN108536418A (en) A kind of method, apparatus and wireless sound box of the switching of wireless sound box play mode
TW201821946A (en) Data transmission system and method thereof
CN102404278A (en) Song request system based on voiceprint recognition and application method thereof
CN110265012A (en) It can interactive intelligence voice home control device and control method based on open source hardware
CN105182763A (en) Intelligent remote controller based on voice recognition and realization method thereof
CN107507620A (en) A kind of voice broadcast sound method to set up, device, mobile terminal and storage medium
WO2022033556A1 (en) Electronic device and speech recognition method therefor, and medium
US20230319190A1 (en) Acoustic echo cancellation control for distributed audio devices
CN110248021A (en) A kind of smart machine method for controlling volume and system
CN109147787A (en) A kind of smart television acoustic control identifying system and its recognition methods
CN108534297A (en) A kind of intelligent air-conditioning system and control method based on speech recognition
CN113611318A (en) Audio data enhancement method and related equipment
WO2022007846A1 (en) Speech enhancement method, device, system, and storage medium
CN208724111U (en) Far field speech control system based on television equipment
US20230239800A1 (en) Voice Wake-Up Method, Electronic Device, Wearable Device, and System
CN106328154A (en) Front-end audio processing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20854399

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20854399

Country of ref document: EP

Kind code of ref document: A1