WO2022178852A1 - Auxiliary listening method and device - Google Patents

Auxiliary listening method and device

Info

Publication number
WO2022178852A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
user
audio signal
coordinates
microphone
Prior art date
Application number
PCT/CN2021/078222
Other languages
English (en)
French (fr)
Inventor
张立斌
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202180004382.3A (CN115250646A)
Priority to PCT/CN2021/078222
Publication of WO2022178852A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00 Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • the present application relates to the technical field of intelligent terminals, and in particular, to a method and device for assisting listening.
  • When there are multiple sound sources near a user, the sound sources include sound sources that the user wants to listen to, such as the sound source S1 shown in FIG. 1, and sound sources that the user does not want to listen to, such as the sound source S2 shown in FIG. 1. The sound source S2 is noise to the user. How to enable the user to listen more attentively to the audio content of the sound source S1 is a technical problem to be solved by the embodiments of the present application.
  • The present application provides an auxiliary listening method and device, so that the user's attention can be more focused on listening to the audio content of the sound source of interest, assisting the user in listening selection.
  • In a first aspect, a method for assisted listening is provided, including: determining the coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, where the first direction is a direction determined by detecting the user, and the first sound source is a sound source in a direction other than the first direction; determining, according to the coordinates of the first sound source and a preset head-related transfer function (HRTF), a first HRTF corresponding to the first sound source; and obtaining, according to the first HRTF and preset virtual noise, a noise signal corresponding to the first sound source, and playing the noise signal.
  • Take as an example that the first direction is the user's attention direction, the sound source in the first direction (that is, in the user's attention direction) is S1, and the first sound source is the sound source S2 in a direction the user does not pay attention to.
  • Through the above design, noise N can be superimposed on the audio content of the sound source S2 that the user is not concerned with or interested in, thereby reducing the clarity of the audio content of the sound source S2 so that the user cannot clearly hear it; the user's attention is thus more focused on listening to the audio content of the sound source S1, and listening selection is assisted.
  • The determining the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction includes: determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone; detecting the user to determine the first direction; and determining, according to the first direction, the coordinates of the first sound source among the coordinates of the at least one sound source near the user.
  • The determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone includes: there are multiple microphones, each microphone separately collects the external audio signal, and there is a time delay between the external audio signals collected by different microphones; the coordinates of at least one sound source near the user are determined according to the time delays between the external audio signals collected by different microphones.
  • The detecting the user to determine the first direction includes: detecting the user's gaze direction; or detecting the user's binaural current difference and determining the user's gaze direction according to the correspondence between the binaural current difference and the gaze direction, where the user's gaze direction is the first direction.
  • The determining the coordinates of the first sound source from the coordinates of at least one sound source near the user according to the first direction includes: analyzing the coordinates of the at least one sound source near the user to determine the directional relationship between each sound source and the user; determining, according to the first direction and the directional relationship between each sound source and the user, the deviation between each sound source and the first direction; and selecting, from the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold, where the coordinates of that sound source are the coordinates of the first sound source.
  • The external audio signal collected by the microphone is a mixed audio signal, and the mixed audio signal includes audio signals output by multiple sound sources; the external audio signal collected by the microphone is separated to obtain the first audio signal output by the first sound source.
  • The method further includes: analyzing the separated first audio signal to determine the content of the first audio signal; and determining the type of virtual noise to be added according to the content of the first audio signal.
  • For example, if the content of the first audio signal is human conversation, the type of virtual noise to be added is multi-person conversation babble noise.
  • the method further includes: determining the energy of the separated first audio signal; and determining the energy of the virtual noise to be added according to the energy of the first audio signal.
  • In this way, virtual noise with corresponding energy can be added according to the energy of the first audio signal output by the first sound source, which avoids adding virtual noise with excessive energy and reduces the power consumption of the electronic device.
  • an auxiliary listening device including corresponding functional modules or units for implementing the functions of the first aspect or any one of the designs of the first aspect.
  • This function can be implemented by hardware, or by executing corresponding software by hardware, and the hardware or software includes one or more modules or units corresponding to the above functions.
  • an auxiliary listening device including a processor and a memory.
  • The memory is used to store computer programs or instructions, and the processor is coupled to the memory; when the processor executes the computer programs or instructions, the apparatus is made to execute the method in the first aspect or any design of the first aspect.
  • an electronic device is provided, and the electronic device is used to execute the method in the first aspect or any one of the designs of the first aspect.
  • the electronic device may be an earphone (including a wired earphone or a wireless earphone, etc.), a smart phone, a vehicle-mounted device, a wearable device, or the like.
  • the wireless earphones include but are not limited to Bluetooth earphones, etc.
  • the wearable devices can be smart glasses, smart watches, or smart bracelets.
  • A fifth aspect provides a computer-readable storage medium in which a computer program or instructions are stored; when the computer program or instructions are executed by a device, the device is made to perform the method in the above first aspect or any design of the first aspect.
  • A sixth aspect provides a computer program product including a computer program or instructions; when the computer program or instructions are executed by a device, the device is made to execute the method in the first aspect or any design of the first aspect.
  • FIG. 1 is a schematic diagram of a user selecting listening to a sound provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the principle of assisted listening provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a coordinate system provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a model of a microphone and a sound source provided by an embodiment of the present application
  • FIG. 7 is a functional schematic diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of HRTF rendering provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an apparatus provided by an embodiment of the present application.
  • The cocktail party effect refers to a human auditory selection ability: one's attention is focused on the conversation of one person while other conversations and background noises are ignored. The effect reveals the remarkable ability of the human auditory system to follow speech in noise.
  • The cocktail party effect is an auditory version of the figure-ground phenomenon: the "figure" is the sound that we notice or that draws our attention, and the "ground" is every other sound.
  • In acoustics, the cocktail party effect refers to the masking characteristics of the human ear. In the noisy crowd at a cocktail party, two people can converse smoothly: although there is a lot of noise around, they hear each other's voice and seem unable to hear the various noises other than the content of the conversation, because each person has put his or her focus (this is the selectivity of attention) on the topic of the conversation.
  • A first solution is based on deep neural networks (DNNs) and brainwave detection: when a listener attends to a speech source, the neurons in the listener's brain record a spectrogram that can be used to reconstruct that speech source.
  • the solution is used for hearing aids, and can enhance playback of audio objects that the user pays attention to or is interested in, so that the audio objects that the user pays attention to or are interested in can be heard more clearly.
  • Because this method is based on brainwave technology, it identifies the external sound-source signal that the user is currently interested in or attending to (restored and recovered from the brainwaves), and enhances and plays the corresponding sound-source signal. From a technical point of view, this requires speech separation first, followed by matching with the brainwave-decoded signal; the effect depends on the accuracy of the front-end speech signal separation, which is a huge challenge and difficult to achieve.
  • The second solution includes: Step 1, detecting the user's direction of attention or interest, for example determining it based on the user's gaze or the orientation of the user's head.
  • Step 2, focused sound pickup: pick up the audio signal s(n) in the direction of the user's attention. Optionally, sound pickup refers to the specific action of collecting audio signals.
  • The method for picking up the audio signals that the user pays attention to may include: picking up all the audio signals around the user and then separating out the audio signals that the user pays attention to; alternatively, only the audio signals that the user pays attention to are picked up, for example using a beamforming method, so that only those audio signals are acquired.
  • Step 3, binaural playback: perform binaural rendering and playback of the sound in the pickup direction in combination with the head-related transfer function (HRTF), to improve the sense of presence of s(n).
  • The essence of the above second solution is to make the audio content of the sound source S1 clearer by picking up the sound of the attended sound source S1, so that the user perceives S1 more effectively. In essence, however, the signal-to-noise ratio of S2 relative to N is not changed, so S2 may still be perceived, affecting the user's effective attention to the sound source S1. For example, the user is talking with object A, but object B is also talking near the user, and object B is noise to the user.
  • In the second solution, the volume or clarity of the conversation content of object A is improved by focused sound pickup, but no processing is performed on the conversation content of object B.
  • If the conversation content of object B involves words that the user pays attention to, it may still attract the user's attention, so that the user cannot concentrate on listening to the conversation content of object A. For example, if the user pays particular attention to "salary increase", then during the conversation between the user and object A, if object B's conversation involves words such as "salary increase", it will attract the user's attention and prevent the user from focusing on object A's conversation.
  • the present application provides an auxiliary listening method and device.
  • The principle of the solution is to superimpose noise N on the audio content of the sound source S2 that the user is not concerned with or interested in, thereby reducing the clarity of the audio content of the sound source S2 and preventing the user from clearly hearing it, so that the user pays more attention to listening to the audio content of the sound source S1; listening selection is thus assisted.
  • noise N can be superimposed on the conversation content of the object B, thereby reducing the signal-to-noise ratio of the conversation content of the object B.
  • the assisted listening method provided in the embodiments of the present application can be applied to electronic devices, including but not limited to earphones, smart phones, in-vehicle devices, or wearable devices (such as smart glasses, smart watches, or wristbands, etc.).
  • the application scenario may be that a button for assisting listening is set in the earphone. When the user activates the button, the earphone can detect the user to determine the sound source of the user's attention.
  • Noise N is added to the audio signal of the sound source that the user does not care about, thereby reducing the signal-to-noise ratio of the audio signal of the unconcerned sound source, so that the user can concentrate more on listening to the content of the sound source that he cares about and assist in selecting listening.
  • An embodiment of the present application provides an electronic device, which can be used to implement the assisted listening method provided by the embodiments of the present application. The electronic device at least includes: a processor 301, a memory 302, at least one speaker 304, at least one microphone 305, a power supply 306, and the like.
  • the memory 302 may store program codes, and the program codes may include program codes for implementing assisted listening.
  • the processor 301 may execute the above program codes to implement the function of assisting listening in the embodiments of the present application.
  • The processor 301 can execute the program code in the memory 302 to implement the following functions: determining, according to the external audio signal collected by the microphone and the detected first direction of the user's attention, the coordinates of the first sound source in a direction the user does not pay attention to; determining, according to the coordinates of the first sound source and the predetermined HRTFs at different positions, the first HRTF corresponding to the first sound source; and obtaining, according to the first HRTF and the preset virtual noise, the noise signal corresponding to the first sound source.
  • the speaker 304 can be used to convert audio electrical signals into sounds and play them.
  • the speaker 304 is used to play the noise signal corresponding to the first sound source and so on.
  • The microphone 305, which may also be called a "mic" or a sound transducer, is used to convert a sound signal into an audio electrical signal.
  • the microphone 305 can collect sound signals in the vicinity of the user and convert them into audio electrical signals.
  • the audio electrical signal is the audio signal in this embodiment of the present application.
  • a power supply 306 can be used to supply power to various components included in the electronic device.
  • the power source 306 may be a battery, such as a rechargeable battery or the like.
  • the electronic device 300 may further include: a sensor 303, a wireless communication module 307, and the like.
  • Sensor 303 may be a proximity light sensor.
  • the processor 301 can determine whether the earphone is worn by the user through the sensor 303 .
  • the processor 301 can use the proximity light sensor to detect whether there is an object near the earphone, so as to determine whether the earphone is worn or the like.
  • the processor 301 may use the proximity light sensor to determine whether the charging box of the earphone is opened, so as to determine whether to control the earphone to be in a pairing state.
  • The earphone may further include a bone conduction sensor, which can acquire the vibration signal of the vibrating bone of the vocal part; the processor 301 parses out the voice signal to implement the control function corresponding to the voice signal.
  • the earphone may further include a touch sensor or a pressure sensor, etc., which are respectively used to detect the user's touch operation and pressing operation on the earphone.
  • the headset may further include a fingerprint sensor for detecting the user's fingerprint, identifying the user's identity, and the like.
  • the wireless communication module 307 is used to establish a wireless connection with other electronic devices, so that the headset can perform data interaction with other electronic devices.
  • the wireless communication module 307 may be a near field communication (near field communication, NFC) module, so that the headset can perform near field communication with other electronic devices having an NFC module.
  • The NFC module can store relevant information about the headset, such as its name, address information, or unique identifier, so that other electronic devices with an NFC module can establish an NFC connection with the headset according to this information and transfer data over the NFC connection.
  • The wireless communication module 307 may also be a Bluetooth module that stores the Bluetooth address of the headset, so that other electronic devices can establish a Bluetooth connection with the headset according to the Bluetooth address and transmit audio data and the like through the Bluetooth connection.
  • The Bluetooth module can support multiple Bluetooth connection types at the same time, such as the serial port profile (SPP) of traditional Bluetooth or the generic attribute profile (GATT) of Bluetooth low energy (BLE); this is not limited here.
  • The wireless communication module 307 may also be an infrared module or a wireless fidelity (Wi-Fi) module; the specific implementation of the wireless communication module 307 is not limited here.
  • only one wireless communication module 307 may be provided, or a plurality of wireless communication modules 307 may be provided as required.
  • two wireless communication modules may be provided in the headset, wherein one wireless communication module is a Bluetooth module, and the other wireless communication module is an NFC module.
  • the earphone can perform data communication through the two wireless communication modules respectively, and the number of the wireless communication modules 307 is not limited here.
  • The structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 300, which may have more or fewer components than shown in FIG. 3, combine two or more components, or have a different arrangement of components.
  • the electronic device 300 may also include components such as an indicator light (which can indicate states such as power), a dust filter (which can be used in conjunction with an earpiece), and the like.
  • The various components shown in FIG. 3 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing or application-specific integrated circuits.
  • a flow of an auxiliary listening method is provided, and the flow includes at least:
  • Step 401: The electronic device determines the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction, where the first direction is determined by detecting the user, and the first sound source is a sound source in a direction other than the first direction.
  • the first direction may be a direction in which the user pays attention
  • the first sound source may be a sound source in a direction that the user does not pay attention to.
  • S1 may be the sound source in the direction of the user's attention, and the sound sources S2 and S3 are sound sources in directions the user does not pay attention to.
  • the implementation process of the above step 401 may include: the electronic device determines the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone.
  • The electronic device is provided with a microphone array; the microphone array includes multiple microphones, each microphone separately collects external audio signals, and there is a time delay between the external audio signals collected by different microphones.
  • the electronic device may determine the coordinates of at least one sound source near the user according to the time delay between the external audio signals collected by different microphones.
  • the above-mentioned microphone may be a vector microphone or the like. For example, a sound source S1 and a sound source S2 exist near the user.
  • the microphone array of the electronic device can collect the external sound signal of the user and convert the sound signal into an audio signal, which includes the audio signal corresponding to the sound source S1 and the audio signal corresponding to the sound source S2.
  • the electronic device can determine the coordinates of the sound source S1 and the coordinates of the sound source S2 according to the time delays of the audio signals collected by different microphones in the microphone array.
  • For example, the microphone array is composed of N+1 microphones, denoted (M_0, M_1, ..., M_N).
  • The distance between the sound source S and the i-th microphone is d_i = √((x − x_i)² + (y − y_i)² + (z − z_i)²), where (x, y, z) represents the coordinates of the sound source S and (x_i, y_i, z_i) represents the coordinates of the i-th microphone.
  • d_ij = d_i − d_j represents the difference between the distance d_i from the sound source to the i-th microphone and the distance d_j from the sound source to the j-th microphone; it can be obtained from the measured time delay τ_ij between the signals collected by the two microphones as d_ij = c·τ_ij, where c is the speed of sound.
  • the distance between the various microphones is known, and the speed of sound is also known.
  • the specific algorithm may be the maximum likelihood estimation method, or the least square method, etc., which is not limited.
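  • As an illustration of the least-squares option above, the following sketch (an assumption for illustration, not the patent's specified algorithm) estimates source coordinates from the inter-microphone time delays; the microphone layout and the helper name locate_source are made up:

```python
# Hypothetical TDOA localization sketch: solve for the source position whose
# distance differences to the microphones best match the measured delays.
import numpy as np
from scipy.optimize import least_squares

C = 343.0  # assumed speed of sound, m/s

def locate_source(mic_pos, delays):
    """mic_pos: (N+1, 3) microphone coordinates M_0..M_N; delays: tau_i0,
    arrival-time differences of microphones 1..N relative to M_0 (seconds)."""
    d_i0 = C * np.asarray(delays)                 # distance differences d_i - d_0

    def residuals(p):
        d = np.linalg.norm(mic_pos - p, axis=1)   # distance from p to each mic
        return (d[1:] - d[0]) - d_i0              # mismatch with measured d_i0

    return least_squares(residuals, x0=mic_pos.mean(axis=0)).x

# toy check: 4 microphones, true source at (1.0, 2.0, 0.0)
mics = np.array([[0, 0, 0], [0.2, 0, 0], [0, 0.2, 0], [0, 0, 0.2]], float)
src = np.array([1.0, 2.0, 0.0])
delays = (np.linalg.norm(mics[1:] - src, axis=1)
          - np.linalg.norm(mics[0] - src)) / C
print(locate_source(mics, delays))                # ~ [1.0, 2.0, 0.0]
```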
  • Through the above process, the coordinates of at least one sound source near the user can be obtained. The following describes the process of detecting the first direction that the user pays attention to, and how to screen out, according to the first direction, the first sound source that the user does not pay attention to from the at least one sound source.
  • a user may wear an electronic device.
  • the electronic device is an earphone
  • the user can wear the electronic device at the ear position.
  • the electronic device can determine the first direction that the user pays attention to.
  • the electronic device detects the gaze direction of the user's eyes, and uses the gaze direction of the user's eyes as the first direction that the user pays attention to.
  • For example, an inertial measurement unit (IMU) may be deployed in the electronic device; the electronic device may use the IMU to determine the orientation of the user's head, and determine the gaze direction of the user's eyes according to the head orientation.
  • For example, if the user's head faces straight ahead, the gaze direction of the user's eyes, that is, the first direction the user pays attention to, is straight ahead, and so on.
  • a camera may be deployed in the electronic device, and the camera may collect an image of the user's head, and determine the gaze direction of the user's eyes according to the image of the user's head collected by the camera.
  • Optionally, an electroencephalogram detection sensor may be deployed in the electronic device; the sensor may detect the difference in current between the user's two ears and determine the direction of the user's gaze according to the correspondence between the binaural current difference and the gaze direction.
  • the electronic device may determine the coordinates of the first sound source among the coordinates of at least one sound source near the user according to the first direction. This process can be implemented in two specific ways:
  • Manner 1: The electronic device determines, among at least one sound source near the user, the sound source that the user pays attention to according to the above first direction; the remaining sound sources are first sound sources that the user does not pay attention to. For example, there are 5 sound sources near the user; according to the first direction the user pays attention to, sound source A among the 5 sound sources is determined to be the sound source the user pays attention to, and the remaining 4 sound sources are first sound sources that the user does not pay attention to.
  • the first sound source that the user does not pay attention to may include one sound source, or may include multiple sound sources, etc., which is not limited.
  • Manner 2: The electronic device directly determines, among at least one sound source near the user, a first sound source that the user does not pay attention to according to the above first direction.
  • the electronic device may analyze the coordinates of at least one sound source near the user, and determine the positional relationship between each sound source and the user.
  • the coordinates of the sound source may be coordinates relative to the user.
  • a coordinate system is established.
  • The coordinate system may be a three-dimensional coordinate system whose origin is the position of the user's head, where the X axis represents the user's left/right direction, the Y axis represents the user's front/back direction, and the Z axis represents the user's up/down direction.
  • The positive direction of the X axis is directly to the user's right and the negative direction is directly to the user's left; the positive direction of the Y axis is directly in front of the user and the negative direction is directly behind the user; the positive direction of the Z axis is directly above the user and the negative direction is directly below the user. It can be understood that the positive direction of each axis represents coordinate values greater than 0, and the negative direction represents coordinate values less than 0.
  • the electronic device can determine the positional relationship between each sound source and the user by analyzing the above-mentioned coordinates of each sound source relative to the user.
  • For example, if the coordinates of a sound source relative to the user are (1, 2, 0), the position of the sound source relative to the user is: 1 m directly to the user's right, 2 m directly in front of the user, and at the same height as the user.
  • the position of the sound source can be accurately located, so as to determine the azimuth relationship between the position and the user.
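  • For illustration, a minimal sketch (with a hypothetical helper to_spherical, not part of the patent) of how coordinates in the coordinate system above (origin at the user's head, X = right, Y = front, Z = up) can be converted into an azimuth relationship:

```python
# Assumed conversion from user-relative Cartesian coordinates to
# (distance, azimuth, elevation); azimuth is measured from straight
# ahead (+Y), positive to the user's right.
import math

def to_spherical(x, y, z):
    dist = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(x, y))
    elevation = math.degrees(math.asin(z / dist)) if dist else 0.0
    return dist, azimuth, elevation

# the (1, 2, 0) example above: ~2.24 m away, ~26.6 deg right of front, level
print(to_spherical(1, 2, 0))
```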
  • the deviation between each sound source and the first direction is determined.
  • the detected first direction that the user pays attention to is 15 degrees away from the front of the user. If the direction relationship between a certain sound source and the user is 20 degrees away from the front of the user, the deviation between the sound source and the first direction is 5 degrees.
  • a sound source whose deviation from the first direction is greater than a threshold is selected, and the sound source is the first sound source that the user does not pay attention to.
  • Alternatively, a sound source whose deviation from the first direction is less than or equal to the threshold is selected, and that sound source can be the sound source that the user pays attention to; then, among all the sound sources near the user, the sound sources other than the attended sound source are the sound sources that the user does not pay attention to.
  • In this way, the deviation between each sound source and the first direction can be determined. If the deviation of a sound source from the first direction is greater than the threshold, the sound source is a sound source that the user does not pay attention to; if the deviation is less than or equal to the threshold, the sound source is considered to be the sound source that the user pays attention to.
  • For example, if the deviation of the sound source S2 from the first direction is greater than the threshold, the sound source S2 is a sound source that the user does not pay attention to, and the like.
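  • The screening rule above can be sketched as follows; the 15-degree threshold and the helper name unattended_sources are illustrative assumptions only:

```python
# Hypothetical sketch: keep the sources whose azimuth deviates from the
# detected gaze direction by more than a threshold (unattended sources).
import math

def unattended_sources(sources, gaze_deg, threshold_deg=15.0):
    """sources: dict name -> (x, y, z) user-relative coordinates; returns
    the sources whose deviation from the gaze direction exceeds threshold."""
    picked = {}
    for name, (x, y, z) in sources.items():
        az = math.degrees(math.atan2(x, y))                 # azimuth vs. front
        deviation = abs((az - gaze_deg + 180) % 360 - 180)  # wrap to [0, 180]
        if deviation > threshold_deg:
            picked[name] = (x, y, z)
    return picked

sources = {"S1": (0.5, 2.0, 0.0), "S2": (-2.0, 1.0, 0.0)}
print(unattended_sources(sources, gaze_deg=15.0))           # -> {'S2': ...}
```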
  • Optionally, the above threshold may be set when the electronic device leaves the factory, or may later be synchronized or pushed to the electronic device through a server, etc.; this is not limited. It should be noted that in the descriptions of the embodiments of this application, "concerned about" and "interested in" are not distinguished and may be used interchangeably.
  • the above-mentioned sound source that the user pays attention to can also be replaced with a sound source that the user is interested in; the sound source that the user does not pay attention to can also be replaced with a sound source that the user is not interested in, and so on.
  • the sound source that the user pays attention to or the sound source that the user is interested in may be the sound source related to the activity that the user is currently engaged in. For example, if the user is currently engaged in watching TV, and the TV plays an audio signal as a sound source, the TV can be a sound source that the user is interested in or concerned about.
  • The sound source that the user pays attention to or is interested in may also be a sound source whose deviation from the user's eye gaze direction is less than or equal to a threshold, or the like.
  • the sound source that the user does not pay attention to or is not interested in may be a sound source that is not related to the activity currently engaged by the user, and the sound source is noise to the user.
  • the user is watching TV, and watching TV is the current activity of the user, and listening to music is not the current activity of the user. If there is a music player playing music near the TV, the music player may be a sound source that the user does not pay attention to.
  • the sound source that the user does not pay attention to may be a sound source whose deviation from the user's eye gaze direction is greater than a threshold, or the like.
  • Step 402 The electronic device determines a first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset HRTF.
  • the above-mentioned first HRTF may include two HRTFs, as shown in FIG. 7 , which may correspond to the HRTF of the left ear and the HRTF of the right ear respectively.
  • Step 403 The electronic device obtains a noise signal corresponding to the first sound source according to the first HRTF and the preset virtual noise, and plays the noise signal.
  • The electronic device may multiply the first HRTF with the preset virtual noise in the frequency domain to obtain the noise signal corresponding to the first sound source.
  • The first HRTF includes the HRTF of the left ear and the HRTF of the right ear. The HRTF of the left ear is multiplied with the preset virtual noise in the frequency domain to obtain the noise signal of the left ear, and the HRTF of the right ear is multiplied with the preset virtual noise in the frequency domain to obtain the noise signal of the right ear.
  • the noise signal of the left ear and the noise signal of the right ear can be called the noise signal of both ears.
  • The HRTF is an algorithm for sound localization; it corresponds to the head-related impulse response (HRIR) in the time domain. The above process of multiplying the HRTF with the preset virtual noise in the frequency domain corresponds, in the time domain, to convolving the HRIR with the preset virtual noise.
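  • A minimal sketch of this rendering step, assuming a pre-measured left/right HRIR pair for the unattended source's position (the arrays below are placeholders, not real measurements); the time-domain convolution shown is equivalent to the frequency-domain multiplication described above:

```python
# Convolve a mono virtual-noise signal with left/right HRIRs to obtain the
# binaural noise signal for the unattended source's position.
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

rng = np.random.default_rng(0)
noise = rng.standard_normal(48000)          # 1 s of white virtual noise @ 48 kHz
hrir_l = rng.standard_normal(256) * 0.01    # placeholder HRIRs, not real data
hrir_r = rng.standard_normal(256) * 0.01
left, right = render_binaural(noise, hrir_l, hrir_r)
```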
  • To facilitate understanding, the HRTF is first introduced.
  • The concept and meaning of the HRTF are as follows: humans have only two ears but can localize sounds from three-dimensional space, owing to the human ear's analysis system for sound signals; the HRTF can simulate this analysis system. The HRTF essentially contains the spatial orientation information of the sound source, and sound sources in different spatial orientations correspond to different HRTFs. For any monaural audio signal, after it is multiplied with the HRTF of the left ear and the HRTF of the right ear respectively in the frequency domain, the corresponding binaural audio signals can be obtained, and playing them through earphones gives an experience of 3-dimensional audio.
  • In this embodiment of the present application, the HRTFs of multiple positions may be pre-stored, and the first HRTF corresponding to the above first sound source can be obtained using the pre-stored HRTFs of multiple positions (for example, by performing an interpolation operation on those HRTFs). After that, the first HRTF is multiplied with the preset virtual noise in the frequency domain to obtain the noise signal.
  • the HRTF is related to the position, and the above-mentioned first HRTF is obtained according to the coordinates of the first sound source.
  • By playing the noise signal, the above virtual noise signal can be superimposed on the audio signal of the first sound source, reducing its signal-to-noise ratio, which makes the user focus more on listening to the audio signal of the attended sound source and assists the user's listening.
  • For example, the first HRTF is obtained according to the coordinates of the sound source S2, and the first HRTF is then multiplied with the preset virtual noise in the frequency domain to obtain the noise signal. By playing the noise signal, the noise signal can be superimposed on the sound source S2, thereby reducing the signal-to-noise ratio of the audio signal of the sound source S2, so that the user can listen to the audio signal of the sound source S1 more attentively.
  • the process of determining the first HRTF corresponding to the first sound source according to the coordinates of the first sound source and the preset HRTF in the above step 402 includes but is not limited to the following two implementations:
  • Manner 1: Accurately locate the coordinates of each sound source relative to the user, and pre-store a large number of HRTFs.
  • In this manner, the electronic device precisely locates the position of each sound source relative to the user.
  • For example, the coordinates of a sound source relative to the user are (1, 2, 0), meaning that the sound source is 1 meter directly to the user's right, 2 meters directly in front of the user, and at the same height as the user. Since the HRTF is related to position, in this case the electronic device may need to pre-store a larger number of HRTFs in order to calculate the HRTF corresponding to the position of the sound source.
  • The advantage of this manner is that a more accurate HRTF can be determined, and the noise signal determined from this HRTF can be superimposed on the unattended sound source more accurately.
  • Manner 2: Roughly locate the direction of each sound source relative to the user, and pre-store a small number of HRTFs.
  • In this manner, the electronic device can roughly locate the direction of each sound source relative to the user instead of accurately locating its specific position.
  • For example, the determined coordinates of a certain sound source relative to the user may be (1, 1, 0), which means that the sound source is located in the user's front-right direction; the specific position within that direction is not calculated further.
  • In this case, the electronic device can store HRTFs for four directions, such as front, back, left, and right, and calculate the HRTF for the front-right direction from the stored HRTFs. The advantage of this is that the storage space of the electronic device can be reduced, the calculation process can be simplified, and the power consumption of the electronic device can be saved.
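  • For illustration, one way the interpolation in Manner 2 could look (an assumed scheme, not one specified by the document): linearly blend the two stored HRTFs whose azimuths bracket the coarse source direction. Real HRTF interpolation is usually done more carefully, for example separately for magnitude and phase.

```python
# Hypothetical sketch: blend the two stored HRTFs nearest to the target azimuth.
import numpy as np

# placeholder HRTFs stored at four azimuths: front (0), right (90),
# back (180), left (270); real data would come from measurements
stored = {0: np.ones(128), 90: 0.8 * np.ones(128),
          180: 0.6 * np.ones(128), 270: 0.8 * np.ones(128)}

def interp_hrtf(az_deg):
    keys = sorted(stored)
    az = az_deg % 360
    lo = max(k for k in keys if k <= az)                          # lower bracket
    hi = min((k for k in keys if k > az), default=keys[0] + 360)  # upper bracket
    w = (az - lo) / (hi - lo)
    return (1 - w) * stored[lo] + w * stored[hi % 360]

hrtf_front_right = interp_hrtf(45)   # blend of the 0-deg and 90-deg HRTFs
```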
  • the virtual noise added in the foregoing step 403 may be white noise.
  • the virtual noise added in the above step 403 may also be noise that matches the content of the first audio signal of the first sound source.
  • the electronic device may analyze the first audio signal of the first sound source, determine the content of the first audio signal, and determine the type of virtual noise to be added according to the content of the first audio signal. For example, if the content of the first audio signal is human speech, the electronic device determines that the type of virtual noise to be added is multi-person conversation babble noise or the like. In a possible implementation manner, the electronic device may detect whether the content of the first audio signal contains a human voice.
  • the virtual noise added to the first audio signal may be multi-person conversation babble noise or the like.
  • Optionally, voice activity detection (VAD) techniques can be used, such as short-time energy (STE) and zero-crossing counter (ZCC) detection methods.
  • For example, the STE and ZCC of the first audio signal of the first sound source may be detected. The STE of a speech segment is relatively large and its ZCC relatively small, while the STE of a non-speech segment is relatively small and its ZCC relatively large; this is mainly because most of the energy of a speech signal is concentrated in the low frequency band, while a noise signal usually has less energy and carries information in higher frequency bands. Therefore, thresholds can be set: when the STE of the first audio signal of the first sound source is greater than or equal to a first threshold and its ZCC is less than or equal to a second threshold, the first audio signal of the first sound source can be considered to include human voice.
  • the STE of the first audio signal of the first sound source is less than the first threshold and the ZCC is greater than the second threshold, it may be considered that the first audio signal of the first sound source does not include human speech.
  • The ZCC refers to the rate of sign changes of a signal, that is, the number of times a frame of the speech time-domain signal crosses the time axis. It can be calculated by shifting all samples in the frame by one, multiplying corresponding points, and checking the sign of each product: a negative product indicates a zero crossing, so counting the negative products in the frame gives the zero-crossing count of the frame.
  • STE refers to the energy of a frame of speech signal.
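  • A rough sketch of the STE/ZCC check described above; the frame length and the thresholds are illustrative, not values from the document:

```python
# Hypothetical frame-level voice-activity check: speech has high short-time
# energy (STE) and a low zero-crossing count (ZCC); noise is the opposite.
import numpy as np

def is_speech_frame(frame, ste_thresh=0.01, zcc_thresh=0.3):
    ste = np.mean(frame ** 2)                   # short-time energy
    signs = np.sign(frame)
    zcc = np.mean(signs[1:] * signs[:-1] < 0)   # fraction of sign changes
    return ste >= ste_thresh and zcc <= zcc_thresh

t = np.arange(480) / 48000                       # one 10 ms frame at 48 kHz
tone = 0.5 * np.sin(2 * np.pi * 200 * t)         # voiced-like: high STE, low ZCC
hiss = 0.05 * np.random.default_rng(1).standard_normal(480)  # noise-like
print(is_speech_frame(tone), is_speech_frame(hiss))          # True False
```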
  • the electronic device may also determine the energy of the first audio signal; determine the energy of the virtual noise to be added according to the energy of the first audio signal, and the like.
  • the essence of the method is to control the signal-to-noise ratio of the first audio signal after adding virtual noise.
  • For example, suppose the ratio of the energy of the added virtual noise to the energy of the first audio signal is preset to 50%. If the energy of the first audio signal is W1, the energy of the virtual noise to be added may be half of the energy of the first audio signal, that is, 0.5 W1.
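  • A small sketch of this energy rule, scaling the virtual noise so that its energy is a preset fraction of the energy W1 of the first audio signal; scale_noise is a hypothetical helper, not part of the patent:

```python
# Scale the raw virtual noise so that its mean-square energy equals
# ratio * W1, following the 50% example above.
import numpy as np

def scale_noise(noise, signal, ratio=0.5):
    w1 = np.mean(signal ** 2)                 # energy of the first audio signal
    wn = np.mean(noise ** 2)                  # current energy of the raw noise
    return noise * np.sqrt(ratio * w1 / wn)   # scaled so its energy is ratio*W1

rng = np.random.default_rng(3)
scaled = scale_noise(rng.standard_normal(4800), 0.3 * rng.standard_normal(4800))
```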
  • the external audio signal collected by the microphone may be a mixed audio signal, and the mixed audio signal includes audio signals output by multiple sound sources.
  • the electronic device can separate the mixed audio signal collected by the microphone to obtain the first audio signal corresponding to the first sound source. Afterwards, the above method is used to determine the audio content in the first audio signal, and/or the energy of the first audio signal, and the like.
  • Optionally, a frequency-domain speech separation algorithm can separate the mixed audio signal collected by the microphone into multiple independent audio signals, which include the first audio signal corresponding to the first sound source.
  • Let the number of sound sources be N and the number of microphones be M, and let the length of the mixing filter be P. The convolutive mixing process can then be expressed as x(n) = H(n) * s(n), where s(n) is the vector of N source signals, x(n) is the vector of M microphone signals, and the mixing network H(n) is an M×N matrix sequence composed of the impulse responses of the mixing filters.
  • Let the length of the separation filter be L. The separation process is y(n) = W(n) * x(n), where the separation network W(n) is an N×M matrix sequence composed of the impulse responses of the separation filters, and * denotes the matrix convolution operation.
  • The separation network W(n) can be obtained by a frequency-domain blind source separation algorithm. After an L-point short-time Fourier transform (STFT), the time-domain convolutions become products per frequency bin: X(m, f) = H(f)S(m, f) and Y(m, f) = W(f)X(m, f), where m is obtained by down-sampling the time index n by a factor of L, X(m, f) and Y(m, f) are the STFTs of x(n) and y(n) respectively, H(f) and W(f) are the Fourier-transform forms of H(n) and W(n) respectively, and f ∈ [f_0, ..., f_{L/2}] is the frequency.
  • The Y(m, f) obtained after blind source separation is inverse-transformed back to the time domain to obtain the estimated sound source signals y_1(n), ..., y_N(n).
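  • A compact sketch of applying this frequency-domain separation, assuming the per-frequency separation matrices W(f) have already been estimated by a blind source separation algorithm (estimating W itself is outside this snippet):

```python
# Apply pre-estimated per-frequency separation matrices in the STFT domain,
# i.e. Y(m, f) = W(f) X(m, f), then inverse-transform back to the time domain.
import numpy as np
from scipy.signal import stft, istft

def apply_separation(x, W, nperseg=256):
    """x: (M, samples) microphone mixtures; W: (F, N, M), one N x M separation
    matrix per frequency bin. Returns (N, samples) estimated source signals."""
    _, _, X = stft(x, nperseg=nperseg)       # X: (M, F, T)
    X = np.transpose(X, (1, 0, 2))           # -> (F, M, T)
    Y = np.einsum('fnm,fmt->fnt', W, X)      # per-bin product W(f) X(m, f)
    _, y = istft(np.transpose(Y, (1, 0, 2)), nperseg=nperseg)
    return y

# pass-through demo with identity matrices (a real W comes from BSS estimation)
M, N, nperseg = 2, 2, 256
x = np.random.default_rng(2).standard_normal((M, 4800))
W = np.tile(np.eye(N, M), (nperseg // 2 + 1, 1, 1))
y = apply_separation(x, W)                   # y approximately reconstructs x
```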
  • one or more sound sources may be included in the first sound source that the user does not pay attention to.
  • Each such sound source is processed according to the above steps 402 and 403; that is, the HRTF corresponding to each sound source is obtained, and each HRTF is multiplied with the preset virtual noise in the frequency domain to obtain the noise signal corresponding to each sound source.
  • By playing each noise signal, the noise signal can be superimposed on the corresponding sound source, thereby reducing the signal-to-noise ratio of the sound sources the user does not pay attention to, allowing the user to focus more on listening to the audio signal of the attended sound source and assisting listening selection.
  • an embodiment of the present application provides an electronic device, the electronic device is used to implement the method for assisting listening provided by the embodiment of the present application, and the electronic device implements at least the following two functions:
  • the modules implementing detection and identification functions in the electronic device may include an environmental information acquisition module, an environmental information processing module and an attention detection module.
  • the environmental information collection module is used to collect audio signals in the environment, and may be environmental information collection sensors such as microphones and cameras deployed in electronic devices.
  • the environment information processing module can determine the position of the sound source corresponding to the audio signal based on the audio signal in the environment collected above.
  • the attention detection module is used to determine the direction of the user's attention. For example, the orientation of the user's head, or the gaze direction of the user's eyes, etc., may be an IMU, a camera, or the like deployed on the electronic device.
  • the modules that implement the functional processing in the electronic device may include an audio processing module and an audio playing module.
  • the audio processing module for adding noise to the audio signal that the user does not pay attention to, may be a processor or the like deployed in the electronic device.
  • the audio playback module is used to play the above-mentioned added noise signal, which can be a speaker or the like deployed in the electronic device.
  • the embodiment of the present application provides a specific example of a method for assisting listening, including at least the following steps:
  • Step 1: Orientation recognition. Detect the user's attention direction; how to detect it includes but is not limited to the following two methods:
  • The first method: a camera is deployed on the electronic device and is used to detect the gaze direction of the user wearing the electronic device; the detected gaze direction of the user is the user's attention direction.
  • The second method: an electroencephalogram detection sensor is deployed on the electronic device; the sensor can detect the current difference between the user's two ears and determine the user's gaze direction based on the correspondence between binaural current differences and gaze directions. Here too, the user's gaze direction is the user's attention direction.
  • Step 2: Detect and determine sound sources in directions other than the user's attention direction. For example, a microphone array or the like is deployed on the electronic device, and the coordinates of all sound sources near the user can be detected using the microphone array. According to the identified direction of the user's attention, the sound sources that the user does not pay attention to are screened out from all the sound sources near the user. Optionally, there may be one or more such sound sources; this is not limited. For example, the number of sound sources that the user does not pay attention to may be n, and the coordinates of the n unattended sound sources may be p1(x1, y1, z1), ..., pn(xn, yn, zn), where n is a positive integer greater than or equal to 1.
  • Step 3: Binaural rendering, as shown in FIG. 8.
  • Based on the coordinates of the sound sources that the user does not pay attention to, obtain the HRTFs of both ears.
  • HRTFs of multiple locations may be pre-stored in the electronic device.
  • the HRTF corresponding to the coordinates of the sound source that the user does not pay attention to can be obtained by performing an interpolation operation on the above-mentioned multiple HRTFs.
  • For example, if the coordinates of the n unattended sound sources are p1(x1, y1, z1), ..., pn(xn, yn, zn), then in this embodiment of the present application the binaural HRTFs of each of the above n unattended sound sources can be obtained.
  • the playback mode may be an acoustic mode, a bone conduction mode, or the like, which is not limited.
  • the above-mentioned virtual noise may be noise audio files, which may be stored in the electronic device or on the cloud.
  • the electronic device downloads these noise audio files from the cloud to the electronic device and renders them when needed.
  • the above n non-attention sound sources correspond to n binaural HRTFs
  • n binaural HRTFs can correspond to n virtual noises
  • In this embodiment of the present application, the n binaural HRTFs and the corresponding n virtual noises can be processed to obtain n binaural noise signals, and the n binaural noise signals are played.
  • The above n virtual noises may be n_1(n), ..., n_n(n) respectively, and the n virtual noises may be the same or different.
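  • A sketch of this final playback step under the same assumptions as the earlier rendering snippet: each of the n virtual noises is convolved with its source's binaural HRIR pair and the results are summed into one two-channel stream; mix_binaural is a hypothetical helper:

```python
# Render each virtual noise with its source's left/right HRIRs and mix
# everything into a single binaural (2-channel) output.
import numpy as np

def mix_binaural(noises, hrirs):
    """noises: list of n mono arrays; hrirs: list of n (hrir_l, hrir_r) pairs."""
    length = max(len(v) + len(hl) - 1 for v, (hl, _) in zip(noises, hrirs))
    out = np.zeros((2, length))
    for v, (hl, hr) in zip(noises, hrirs):
        l, r = np.convolve(v, hl), np.convolve(v, hr)
        out[0, :len(l)] += l                 # left channel
        out[1, :len(r)] += r                 # right channel
    return out

noises = [np.random.default_rng(i).standard_normal(1000) for i in range(2)]
hrirs = [(np.ones(8) / 8, np.ones(8) / 8)] * 2   # placeholder HRIR pairs
binaural = mix_binaural(noises, hrirs)           # shape (2, 1007)
```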
  • the methods provided by the embodiments of the present application are introduced from the perspective of an electronic device as an execution subject.
  • the electronic device may include a hardware structure and/or software modules, and implement the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above functions is performed in the form of a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
  • an embodiment of the present application provides an auxiliary listening device, which at least includes a processing unit 901 and a playback unit 902 .
  • The processing unit 901 is configured to determine the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction, where the first direction is a direction determined by detecting the user, and the first sound source is a sound source in a direction other than the first direction. The processing unit 901 is further configured to determine the first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset HRTF. The processing unit 901 is further configured to obtain a noise signal corresponding to the first sound source according to the first HRTF and preset virtual noise. The playing unit 902 is configured to play the noise signal.
  • The determining the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction includes: determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone; detecting the user to determine the first direction; and determining, according to the first direction, the coordinates of the first sound source among the coordinates of the at least one sound source near the user.
  • The determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone includes: there are multiple microphones, each microphone separately collects the external audio signal, and there is a time delay between the external audio signals collected by different microphones; the coordinates of at least one sound source near the user are determined according to the time delays between the external audio signals collected by different microphones.
  • The detecting the user and determining the first direction includes: detecting the user's gaze direction; or detecting the user's binaural current difference and determining the user's gaze direction according to the correspondence between binaural current differences and gaze directions, where the user's gaze direction is the first direction.
  • The determining the coordinates of the first sound source among the coordinates of at least one sound source near the user according to the first direction includes: analyzing the coordinates of the at least one sound source near the user to determine the directional relationship between each sound source and the user; determining the deviation between each sound source and the first direction according to the first direction and the directional relationship between the sound source and the user; and selecting, from the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold, where the coordinates of that sound source are the coordinates of the first sound source.
  • The external audio signal collected by the microphone is a mixed audio signal including audio signals output by multiple sound sources, and the processing unit 901 is further configured to separate the collected external audio signal to obtain the first audio signal output by the first sound source.
  • The processing unit 901 is further configured to: analyze the separated first audio signal to determine the content of the first audio signal; and determine, according to the content of the first audio signal, the type of virtual noise that needs to be added.
  • For example, if the content of the first audio signal is human conversation, the type of virtual noise that needs to be added is multi-person conversation babble noise.
  • the processing unit 901 is further configured to: determine the energy of the separated first audio signal; and determine the energy of the virtual noise to be added according to the energy of the first audio signal.
  • the embodiments of the present application further provide a computer-readable storage medium, including a program, and when the program is executed by a processor, the methods in the above method embodiments are executed.
  • a computer program product comprising computer program code, when the computer program code is run on a computer, causes the computer to implement the methods in the above method embodiments.
  • A chip is provided, including a processor coupled with a memory; the memory is used to store a program or instructions, and when the program or instructions are executed by the processor, the apparatus is caused to perform the methods in the above method embodiments.
  • "At least one of a, b, or c" can represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c may be singular or plural.
  • words such as “first” and “second” are used to distinguish the same or similar items with basically the same function and effect. Those skilled in the art can understand that the words “first”, “second” and the like do not limit the quantity and execution order, and the words “first”, “second” and the like are not necessarily different.
  • The processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which can implement or execute the methods disclosed in the embodiments of the present application.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
  • the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, for example random-access memory (RAM).
  • the memory may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the memory in this embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, for storing program instructions and/or data.
  • the methods provided in the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, the methods can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, network equipment, user equipment, or other programmable apparatus.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
  • the computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., digital video disc (DVD)), or semiconductor media (e.g., SSD), and the like.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An assisted listening method and apparatus, capable of assisting a user in listening selection. The method includes: determining the coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, where the first direction is a direction determined by detecting the user, and the first sound source is a sound source in a direction other than the first direction; determining a first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset head-related transfer function (HRTF); and obtaining a noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise, and playing the noise signal.

Description

Assisted listening method and apparatus
Technical Field
This application relates to the technical field of intelligent terminals, and in particular to an assisted listening method and apparatus.
Background
When there are multiple sound sources around a user, they include sources the user wants to listen to, such as sound source S1 shown in FIG. 1, and sources the user does not want to listen to, such as sound source S2 shown in FIG. 1; to the user, sound source S2 is noise. How to enable the user to listen more attentively to the audio content of sound source S1 is the technical problem addressed by the embodiments of this application.
Summary
This application provides an assisted listening method and apparatus, so that the user's attention can be focused more fully on the audio content of the attended sound source, assisting the user in listening selection.
In a first aspect, an assisted listening method is provided, including: determining the coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, where the first direction is a direction determined by detecting the user, and the first sound source is a sound source in a direction other than the first direction; determining a first HRTF corresponding to the first sound source according to the coordinates of the first sound source and a preset head-related transfer function (HRTF); and obtaining a noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise, and playing the noise signal.
Take the first direction as the direction the user attends to, the sound source in the first direction, i.e., the source in the attended direction, as S1, and the first sound source, a source in a non-attended direction, as S2. With the design of the first aspect, noise N can be superimposed on the audio content of the non-attended source S2, reducing the intelligibility of S2's audio content so that the user cannot hear it clearly; the user's attention therefore stays focused on listening to the audio content of S1, assisting selective listening.
In one possible design, determining the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction includes: determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone; detecting the user to determine the first direction; and determining the coordinates of the first sound source among the coordinates of the at least one sound source near the user according to the first direction.
In one possible design, determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone includes: there being multiple microphones, each collecting an external audio signal, with time delays between the signals collected by different microphones; and determining the coordinates of at least one sound source near the user according to the time delays of the external audio signals collected by the different microphones.
In one possible design, detecting the user to determine the first direction includes: detecting the user's gaze direction; or detecting the user's binaural current difference and determining the user's gaze direction according to the correspondence between binaural current differences and gaze directions, the user's gaze direction being the first direction.
In one possible design, determining the coordinates of the first sound source among the coordinates of at least one sound source near the user according to the first direction includes: analyzing the coordinates of the at least one sound source near the user to determine the directional relationship between each sound source and the user; determining the deviation between each sound source and the first direction according to the first direction and the directional relationship between the sound source and the user; and selecting, among the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold, the coordinates of that sound source being the coordinates of the first sound source.
In one possible design, the method further includes: the external audio signal collected by the microphone being a mixed audio signal that includes audio signals output by multiple sound sources, separating the external audio signal collected by the microphone to obtain a first audio signal output by the first sound source.
In one possible design, the method further includes: analyzing the separated first audio signal to determine the content of the first audio signal; and determining, according to the content of the first audio signal, the type of virtual noise to be added.
With this design, for different non-attended sources, i.e., the first sound source, different types of noise are added according to the content of the audio signal each source outputs, which helps mask the content of the first sound source's audio signal and assists the user in listening to the audio signal of the attended source.
In one possible design, if the content of the first audio signal is human conversation, the type of virtual noise to be added is multi-talker babble noise.
In one possible design, the method further includes: determining the energy of the separated first audio signal; and determining, according to the energy of the first audio signal, the energy of the virtual noise to be added.
With this design, virtual noise of corresponding energy can be added according to the energy of the first audio signal output by the first sound source, which avoids adding noise with excessive energy and reduces the power consumption of the electronic device.
In a second aspect, an assisted listening apparatus is provided, including corresponding functional modules or units for implementing the functions of the first aspect or any design of the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules or units corresponding to the above functions.
In a third aspect, an assisted listening apparatus is provided, including a processor and a memory. The memory stores a computer program or instructions and is coupled to the processor; when the processor executes the computer program or instructions, the apparatus is caused to perform the method of the first aspect or any design of the first aspect.
In a fourth aspect, an electronic device is provided for performing the method of the first aspect or any design of the first aspect. Optionally, the electronic device may be an earphone (wired or wireless), a smartphone, an in-vehicle device, or a wearable device. Wireless earphones include, but are not limited to, Bluetooth earphones; the wearable device may be smart glasses, a smart watch, a smart band, or the like.
In a fifth aspect, a computer-readable storage medium is provided, storing a computer program or instructions that, when executed by an apparatus, cause the apparatus to perform the method of the first aspect or any design of the first aspect.
In a sixth aspect, a computer program product is provided, including a computer program or instructions that, when executed by an apparatus, cause the apparatus to perform the method of the first aspect or any design of the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a user's listening selection according to an embodiment of this application;
FIG. 2 is a schematic diagram of the principle of assisted listening according to an embodiment of this application;
FIG. 3 is a schematic diagram of an electronic device according to an embodiment of this application;
FIG. 4 is a flowchart of an assisted listening method according to an embodiment of this application;
FIG. 5 is a schematic diagram of a coordinate system according to an embodiment of this application;
FIG. 6 is a schematic diagram of a model of microphones and a sound source according to an embodiment of this application;
FIG. 7 is a functional schematic diagram of an electronic device according to an embodiment of this application;
FIG. 8 is a schematic diagram of HRTF rendering according to an embodiment of this application;
FIG. 9 is a schematic diagram of an apparatus according to an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
Various sound sources exist in our environment. As shown in FIG. 1, sound sources S1 and S2 are near the user. People with normal hearing can still consciously choose the audio content they want to hear in a scene with multiple sound sources, for example the audio content of sound source S1; this ability is called the cocktail party effect.
The cocktail party effect refers to a human capability of listening selection: a person's attention is focused on one particular conversation while other conversations or noise in the background are ignored. The effect reveals a remarkable capability of the human auditory system, namely that we can converse in noise. The cocktail party effect is the auditory version of the figure-ground phenomenon: the "figure" is the sound we attend to or that attracts our attention, and the "ground" is every other sound.
In a noisy indoor environment, for example at a cocktail party, many different sound sources exist at the same time: several people speaking at once, the clatter of tableware, music, and the reflections of these sounds off the walls and objects in the room. During propagation, the waves emitted by different sources, as well as the direct and reflected sounds, superimpose in the propagation medium (usually air) to form a complex sound mixture. Consequently, the mixture reaching the listener's ear canal no longer contains the individual waves corresponding to the separate sources. Yet in such an acoustic environment the listener can, to a considerable extent, understand the target utterance of interest. How does the listener separate the speech of different talkers from the received mixture and understand the target utterance? This is the famous "cocktail party" problem posed in 1953. The cocktail party effect is usually explained in the industry with the following auditory attention principle and acoustic principle.
Auditory attention principle: when a person's auditory attention is focused on something, consciousness excludes irrelevant sound stimuli while the unconscious continues to monitor external stimuli; as soon as a special stimulus related to oneself appears, it immediately attracts attention. This effect is in fact an adaptive capability of the auditory system. Simply put, our brain makes some degree of judgment about every sound before deciding whether to listen to it.
Acoustic principle: in acoustics, the cocktail party effect refers to the masking effect of the human ear. In the noisy crowd of a cocktail party, two people can converse smoothly; although the surrounding noise is loud, each hears the other's voice, and people seem unable to hear the various noises outside the conversation, because they have each placed their focus (this is the selectivity of attention) on the topic of the conversation.
By exploiting the cocktail party effect, people's listening selection can be influenced by controlling the clarity of the audio objects in the environment. Generally speaking, the clearer the content of an audio object, the more it is perceived and the more attention it attracts. Existing solutions for assisting users in listening selection mainly include the following two.
The first solution proceeds roughly as follows:
1) The spectrogram of the mixed audio is fed to multiple deep neural networks (DNNs), each trained to separate a particular speech source.
2) While the user is listening to one of the speech sources, recordings of the user's neural activity can be used to reconstruct the spectrogram of that source.
3) The reconstructed spectrogram is compared with the output of each DNN; if they match, that source is amplified.
This solution is used in hearing aids and can enhance the playback of audio objects the user attends to or is interested in, so that those objects can be heard more clearly. However, because the method relies on brain-wave technology to identify (restore and recover from brain waves) the external source signal the user is currently interested in, and then enhances and plays the corresponding source signal, it must first perform speech separation and then match against the decoded brain-wave signal. Its effectiveness depends on the accuracy of the front-end speech separation, which is extremely challenging and difficult to implement.
The second solution includes the following steps. Step 1: detect the user's direction of attention or interest, for example based on the user's gaze or head orientation. Step 2: focused pickup, i.e., pick up the audio signal s(n) in the user's attended direction; optionally, pickup refers to the concrete act of collecting the audio signal. Focused pickup of the audio the user attends to may be done by picking up all audio around the user and then separating out the attended signal, or by picking up only the attended signal, for example using beamforming. Step 3: binaural playback, i.e., render the sound from the pickup direction binaurally, combined with a head-related transfer function (HRTF), to enhance the sense of presence of s(n).
The essence of the second solution is to make the audio content of the attended source S1 clearer through focused pickup, achieving more effective attention to S1. In essence, however, it does not change the signal-to-noise ratio of S2 relative to N, so S2 may still be perceived, interfering with the user's effective attention to S1. For example, the user is conversing with person A while person B is also talking nearby; to the user, B is noise. The second solution raises the volume or clarity of A's speech through focused pickup but does nothing to B's speech. If B's speech contains words the user cares about, it may still attract the user's attention and prevent the user from concentrating on A's speech. For instance, if the user is sensitive to the phrase "pay raise", then during the conversation with A, any mention of "pay raise" in B's speech will draw the user's attention away, and the user cannot concentrate on listening to A.
This application provides an assisted listening method and apparatus. As shown in FIG. 2, the principle of the solution is to superimpose noise (N) on the audio content of the source S2 that the user does not attend to or is not interested in, thereby reducing the intelligibility of S2's audio content so that the user cannot hear it clearly; the user's attention is thus focused more fully on listening to the audio content of S1, assisting selective listening. Continuing the example above, with the solution of the embodiments of this application, noise N can be superimposed on B's speech, lowering its signal-to-noise ratio. Even if B's speech again mentions words the user cares about, such as "pay raise", after the noise N is superimposed the user can no longer hear B's speech clearly; B's speech will no longer attract the user's attention, and the user can concentrate on listening to A's speech.
The assisted listening method provided in the embodiments of this application can be applied to an electronic device including, but not limited to, an earphone, a smartphone, an in-vehicle device, or a wearable device (for example smart glasses, a smart watch, or a smart band). Taking an earphone as an example, a possible scenario is that the earphone has an assisted-listening switch; when the user turns it on, the earphone detects the user to determine the attended sound source, and adds noise N to the audio signals of non-attended sources, lowering their signal-to-noise ratio so that the user can concentrate more fully on the content of the attended source, assisting selective listening.
As shown in FIG. 3, an embodiment of this application provides an electronic device that can be used to implement the assisted listening method provided in the embodiments of this application. The electronic device includes at least a processor 301, a memory 302, at least one speaker 304, at least one microphone 305, and a power supply 306.
The memory 302 may store program code, including program code implementing assisted listening. The processor 301 may execute the program code to implement the assisted listening functions in the embodiments of this application. For example, the processor 301 may execute the program code in the memory 302 to: determine the coordinates of a first sound source in a non-attended direction according to the external audio signal collected by the microphone and the detected first direction the user attends to; determine a first HRTF corresponding to the first sound source according to its coordinates and predetermined HRTFs for different positions; and obtain a noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise.
The speaker 304 can convert an audio electrical signal into sound and play it; for example, the speaker 304 plays the noise signal corresponding to the first sound source.
The microphone 305, also called a mic or transducer, converts a sound signal into an audio electrical signal. For example, the microphone 305 may collect the sound signal near the user and convert it into an audio electrical signal. It should be understood that the audio electrical signal is the audio signal in the embodiments of this application.
The power supply 306 can supply power to the components of the electronic device. In some embodiments, the power supply 306 may be a battery, for example a rechargeable battery.
Optionally, if the electronic device 300 is a wireless earphone, the electronic device 300 may further include a sensor 303, a wireless communication module 307, and the like.
The sensor 303 may be a proximity light sensor. The processor 301 can use the sensor 303 to determine whether the earphone is being worn; for example, the processor 301 can use the proximity light sensor to detect whether there is an object near the earphone. Alternatively, if the earphone is charging in its case, the processor 301 can use the proximity light sensor to determine whether the lid of the charging case is open and thus whether to put the earphone into pairing mode. In some embodiments, the earphone may also include a bone conduction sensor, which captures the vibration signal of the bone vibrating during speech; the processor 301 parses out the speech signal and performs the control function corresponding to the speech signal. In other embodiments, the earphone may also include a touch sensor or a pressure sensor for detecting the user's touch and press operations on the earphone. In still other embodiments, the earphone may also include a fingerprint sensor for detecting the user's fingerprint and identifying the user.
The wireless communication module 307 is used to establish wireless connections with other electronic devices so that the earphone can exchange data with them. In some embodiments, the wireless communication module 307 may be a near field communication (NFC) module, enabling near-field communication with other NFC-equipped electronic devices. The NFC module may store information about the earphone, such as its name, address information, or unique identifier, so that other NFC-equipped devices can establish an NFC connection with the earphone based on this information and transfer data over the NFC connection. In other embodiments, the wireless communication module 307 may be a Bluetooth module storing the earphone's Bluetooth address, so that other electronic devices can establish a Bluetooth connection using that address and transmit audio data over the Bluetooth connection. In the embodiments of this application, the Bluetooth module may support multiple Bluetooth connection types simultaneously, such as the serial port profile (SPP) of classic Bluetooth or the generic attribute profile (GATT) of Bluetooth low energy (BLE), without limitation here.
In still other embodiments, the wireless communication module 307 may also be an infrared module or a wireless fidelity (Wi-Fi) module; the specific implementation of the wireless communication module 307 is not limited here.
In addition, in the embodiments of this application, one wireless communication module 307 may be provided, or multiple may be provided as needed. For example, two wireless communication modules may be provided in the earphone, one being a Bluetooth module and the other an NFC module, so that the earphone can communicate data through each of them; the number of wireless communication modules 307 is not limited here.
It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 300. It may have more or fewer components than shown in FIG. 3, may combine two or more components, or may have a different component configuration. For example, the electronic device 300 may further include an indicator light (which can indicate states such as battery level) and a dust screen (which can be used with the earpiece). The components shown in FIG. 3 may be implemented in hardware including one or more signal-processing or application-specific integrated circuits, in software, or in a combination of hardware and software.
As shown in FIG. 4, a flow of an assisted listening method is provided, including at least:
Step 401: The electronic device determines the coordinates of a first sound source according to the external audio signal collected by the microphone and a first direction, where the first direction is determined by detecting the user, and the first sound source is a sound source in a direction other than the first direction. For example, the first direction may be the direction the user attends to, and the first sound source may be a source in a non-attended direction. Suppose sound sources S1, S2, and S3 are near the user; if the user attends to S1, for example the user is listening to its audio output, then S1 is the source in the attended direction, and S2 and S3 are sources in non-attended directions.
Optionally, step 401 may be implemented as follows: the electronic device determines the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone. For example, the electronic device is provided with a microphone array including at least one microphone; each microphone collects an external audio signal, and there are time delays between the signals collected by different microphones. The electronic device can determine the coordinates of at least one sound source near the user according to these delays. Optionally, the microphones may be vector microphones. For example, sound sources S1 and S2 are near a user wearing the electronic device; the microphone array collects the sound outside the user and converts it into an audio signal that includes the audio signals corresponding to both S1 and S2, and the device determines the coordinates of S1 and S2 according to the time delays between the signals collected by the different microphones in the array.
In one possible implementation, the model described in the thesis "Research on sound source localization algorithms based on microphone arrays" (Master's thesis, Nanjing University, Liu Chao) can be used to determine the coordinates of each sound source from the inter-microphone time delays as described above.
Specifically, as shown in FIG. 6, suppose the microphone array consists of N+1 microphones (M_0, M_1, ..., M_N). Let the spatial coordinates of the i-th microphone be r_i = (x_i, y_i, z_i), i = 0, 1, ..., N. Microphone M_0 is the origin of the spatial coordinates, i.e., r_0 = (0, 0, 0), and the sound source S has spatial coordinates r_s = (x, y, z). Then:
a. The distance from the source S to the i-th microphone is
d_i = √((x − x_i)² + (y − y_i)² + (z − z_i)²),
where d_i denotes the distance from S to the i-th microphone, (x, y, z) the coordinates of S, and (x_i, y_i, z_i) the coordinates of the i-th microphone.
b. The difference between the distances from the source S to two microphones is
d_ij = d_i − d_j,  i, j = 0, 1, ..., N,
where d_ij denotes the difference between the distance d_i from the source to microphone i and the distance d_j from the source to microphone j.
c. The relative distance between the i-th and the j-th microphone satisfies
d_ij = c · τ_ij,
where c denotes the speed of sound and τ_ij the time delay between the signals at microphones i and j.
In the expressions above, the distances between the microphones and the speed of sound are known. Solving these expressions jointly yields the spatial position of the source, s = (x_s, y_s, z_s). The solver may be, for example, maximum likelihood estimation or least squares, without limitation.
The above yields the coordinates of at least one sound source near the user. Next, the process of detecting the first direction the user attends to is described, together with how, according to that first direction, the non-attended first sound source is selected from the at least one source.
For example, the user may wear the electronic device; if the device is an earphone, the user wears it at the ear. By detecting the user, the electronic device can determine the first direction the user attends to. For example, the device detects the gaze direction of the user's eyes and takes that gaze direction as the attended first direction. In one possible implementation, an inertial measurement unit (IMU) may be deployed in the device; the device uses the IMU to determine the orientation of the user's head and, from the head orientation, the gaze direction. For example, if the IMU detects that the head faces straight ahead, then the gaze direction, i.e., the attended first direction, is straight ahead. Alternatively, a camera may be deployed to capture images of the user's head, and the binocular gaze direction is determined from the captured head images. Alternatively, a brain-wave sensor may be deployed to detect the difference between the electrical signals at the user's two ears; the user's gaze direction is determined from the correspondence between binaural current differences and gaze directions.
The electronic device can then determine the coordinates of the first sound source among the coordinates of the at least one source near the user according to the first direction. This can be implemented in two specific ways:
First: according to the first direction, the device determines, among the at least one source near the user, the source the user attends to; then, among the at least one source near the user, the attended source is excluded, and the remaining sources are the non-attended first sound source. For example, if there are five sources near the user and, according to the attended first direction, source A among them is determined to be the attended source, then the remaining four sources are the non-attended first sound source. It should be understood that in the embodiments of this application the non-attended first sound source may include one source or multiple sources, without limitation.
Second: according to the first direction, the device directly determines the non-attended first sound source among the at least one source near the user.
For example, the device may analyze the coordinates of the at least one source near the user and determine the positional relationship between each source and the user. Optionally, the coordinates of the sources may be relative to the user. As shown in FIG. 5, a coordinate system is established; optionally, it may be three-dimensional. Its origin is the position of the user's head, the X axis represents the user's left/right direction, the Y axis the front/back direction, and the Z axis the up/down direction. For example, in one possible implementation, the positive X direction is directly to the user's right, the negative X direction directly to the left, the positive Y direction directly ahead, the negative Y direction directly behind, the positive Z direction directly above, and the negative Z direction directly below; it can be understood that the positive direction of each axis is where its coordinate is greater than 0, and the negative direction where it is less than 0. By analyzing each source's coordinates relative to the user, the device can determine the positional relationship between each source and the user. For example, coordinates (1, 2, 0) relative to the user mean the source is 1 m directly to the user's right, 2 m directly ahead, and in the same plane as the user. From these coordinates the source's position can be located accurately, and the bearing of that position relative to the user determined.
The deviation between each sound source and the first direction is determined according to the detected first direction and the source-user directional relationship. For example, if the detected attended first direction deviates 15 degrees from straight ahead of the user and a certain source lies 20 degrees from straight ahead, the deviation between that source and the first direction is 5 degrees. Among the at least one source near the user, a source whose deviation from the first direction is greater than a threshold is selected; that source is the non-attended first sound source. Alternatively, among the at least one source near the user, a source whose deviation from the first direction is less than or equal to the threshold may be selected as the attended source; then, among all sources near the user, those remaining after excluding the attended source are the non-attended sources.
In this way, the deviation between each source and the first direction can be determined, as sketched below. When a source's deviation from the first direction is greater than the threshold, the source is regarded as non-attended; when the deviation is less than or equal to the threshold, the source is regarded as attended. Continuing the example of FIG. 1, sources S1 and S2 are near the user. If S1's deviation from the first direction is less than or equal to the threshold, S1 is regarded as the attended source; if S2's deviation from the first direction is greater than the threshold, S2 is regarded as a non-attended source. The threshold may be set at the factory, or later synchronized or notified to the electronic device by its server, without limitation. It should be noted that in the description of the embodiments of this application, "attended" and "of interest" are not distinguished and may be used interchangeably: the attended source may equally be called the source of interest, and a non-attended source a source of no interest. In one understanding, the attended source, or source of interest, may be the source related to the user's current activity. For example, if the user is currently watching television, the television set, as a source playing an audio signal, may be a source the user is interested in or attends to. Technically, the attended source or source of interest may be a source whose deviation from the user's gaze direction is less than or equal to a threshold. Likewise, a non-attended source may be a source unrelated to the user's current activity, which is noise to the user. Continuing the example, the user is watching television; watching television is the current activity, and listening to music is not. If a music player near the television is playing music, that player may be a non-attended source. Technically, a non-attended source may be a source whose deviation from the user's gaze direction is greater than a threshold.
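As an illustrative sketch of the deviation test just described (the names and the threshold value are assumptions, not part of the original disclosure), the non-attended sources can be filtered as follows:

```python
import numpy as np

def select_non_attended(source_coords, gaze_dir, threshold_deg=15.0):
    """Return the sources whose direction deviates from the gaze direction by
    more than a threshold; these are the 'first sound source(s)' in the text."""
    gaze = np.asarray(gaze_dir, float)
    gaze = gaze / np.linalg.norm(gaze)
    non_attended = []
    for p in source_coords:                 # coordinates relative to the head
        v = np.asarray(p, float)
        v = v / np.linalg.norm(v)
        # Angle between the source direction and the gaze direction, in degrees.
        dev = np.degrees(np.arccos(np.clip(np.dot(v, gaze), -1.0, 1.0)))
        if dev > threshold_deg:
            non_attended.append(p)
    return non_attended
```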
Step 402: The electronic device determines the first HRTF corresponding to the first sound source according to the coordinates of the first sound source and the preset HRTFs. Optionally, the first HRTF may include two HRTFs which, as shown in FIG. 7, correspond to the left-ear HRTF and the right-ear HRTF respectively.
Step 403: The electronic device obtains the noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise, and plays the noise signal.
Optionally, in the frequency domain, the device may multiply the first HRTF with the preset virtual noise to obtain the noise signal corresponding to the first sound source. For example, continuing the example above, the first HRTF includes the left-ear HRTF and the right-ear HRTF: multiplying the left-ear HRTF with the preset virtual noise in the frequency domain yields the left-ear noise signal, and multiplying the right-ear HRTF with the preset virtual noise in the frequency domain yields the right-ear noise signal. Together, the left-ear and right-ear noise signals may be called the binaural noise signal. The HRTF is an algorithm for sound localization, corresponding in the time domain to the head-related impulse response (HRIR). The frequency-domain multiplication of the HRTF with the preset virtual noise corresponds, in the time domain, to convolving the HRIR with the preset virtual noise.
To aid understanding, the HRTF is introduced first; its concept and significance are as follows. Humans have two ears yet can localize sound from three-dimensional space, thanks to the human ear's system for analyzing sound signals, which the HRTF can model. The HRTF inherently contains the spatial bearing information of a source; sources at different spatial bearings have different corresponding HRTFs. Multiplying any monaural audio signal with the left-ear HRTF and the right-ear HRTF in the frequency domain yields the binaural audio signals; played over earphones, they give a three-dimensional audio experience.
In one possible implementation, HRTFs for multiple positions can be stored in advance; using these pre-stored HRTFs (for example, interpolating over them), the first HRTF corresponding to the first sound source can be obtained. The first HRTF is then multiplied with the preset virtual noise in the frequency domain to obtain the noise signal. It should be noted that the HRTF is position dependent, and the first HRTF is obtained from the coordinates of the first sound source. Through this frequency-domain multiplication, the virtual noise signal is superimposed on the audio signal of the first sound source, lowering the signal-to-noise ratio of that audio signal so that the user can listen more attentively to the audio signal of the attended source, assisting the user's listening. For example, continuing the example of FIG. 1, the first HRTF is obtained from the coordinates of source S2 and multiplied with the preset virtual noise in the frequency domain to obtain the noise signal; playing that noise signal superimposes it on S2, lowering the signal-to-noise ratio of S2's audio signal so that the user can listen more attentively to the audio signal of S1.
In the embodiments of this application, the process in step 402 of determining the first HRTF corresponding to the first sound source according to its coordinates and the preset HRTFs includes, but is not limited to, the following two implementations:
Implementation 1: locate each source's coordinates relative to the user precisely, and store a larger number of HRTFs in advance.
In this implementation, the electronic device precisely locates each source's position relative to the user. Continuing the example above, coordinates (1, 2, 0) relative to the user mean the source is 1 m directly to the user's right, 2 m directly ahead, and in the same horizontal plane as the user. Since the HRTF is position dependent, the device may need a larger number of pre-stored HRTFs in this case, from which the HRTF corresponding to the source's position is derived. The advantage of this implementation is a more accurate HRTF, so that the noise signal derived from it can be superimposed more precisely on the non-attended source.
Implementation 2: locate each source's direction relative to the user roughly, and store a smaller number of HRTFs in advance.
In this implementation, the device only needs an approximate direction of the source relative to the user and no longer pinpoints the specific position of each source. For example, with this approach the determined coordinates of a certain source may be (1, 1, 0), indicating the source lies to the user's front right, without further deriving the specific position within that front-right direction. The device may store HRTFs for the four directions front, back, left, and right, and derive the HRTF for the front-right direction from these four. The benefit is reduced storage space on the device, a simplified computation, and lower power consumption.
Optionally, the virtual noise added in step 403 may be white noise. Alternatively, it may be noise matched to the content of the first audio signal of the first sound source. The device may analyze the first audio signal of the first sound source, determine the content of the first audio signal, and determine the type of virtual noise to be added according to that content. For example, if the content of the first audio signal is human conversation, the device determines that the type of virtual noise to be added is multi-talker babble noise. In one possible implementation, the device detects whether the content of the first audio signal contains human speech; if it does, the virtual noise added to the first audio signal may be multi-talker babble noise. Whether the first audio signal of the first sound source contains human speech can be detected with voice activity detection (VAD) techniques, for example the short-time energy (STE) and zero-crossing counter (ZCC) detection methods.
In one possible implementation, the STE and ZCC of the first audio signal of the first sound source are detected. Speech segments have relatively large STE and relatively small ZCC, whereas non-speech segments have relatively small STE and relatively large ZCC. This is mainly because most of the energy of a speech signal lies in the low band, while a noise signal usually has lower energy and contains information in higher bands. Thresholds can therefore be set: when the STE of the first audio signal is greater than or equal to a first threshold and the ZCC is less than or equal to a second threshold, the first audio signal is considered to contain human speech; when the STE is smaller than the first threshold and the ZCC is greater than the second threshold, the first audio signal is considered not to contain human speech. The ZCC is the rate of sign changes of a signal, i.e., the number of times a frame of the time-domain speech signal crosses the time axis. It is computed by shifting all samples in the frame by one, multiplying corresponding samples, and noting that a negative product indicates a zero crossing; counting all negative products in the frame gives the frame's zero-crossing rate. The STE is the energy of one frame of the speech signal.
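A minimal sketch of this STE/ZCC detection follows; the frame length, hop size, and thresholds are assumptions made for illustration:

```python
import numpy as np

def frame_ste_zcc(signal, frame_len=256, hop=128):
    """Per-frame short-time energy (STE) and zero-crossing count (ZCC)."""
    ste, zcc = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = np.asarray(signal[start:start + frame_len], dtype=float)
        ste.append(np.sum(frame ** 2))
        # A zero crossing makes the product of neighbouring samples negative.
        zcc.append(int(np.sum(frame[:-1] * frame[1:] < 0)))
    return np.array(ste), np.array(zcc)

def contains_speech(signal, ste_thresh, zcc_thresh):
    """Speech frames: STE above the first threshold and ZCC below the second."""
    ste, zcc = frame_ste_zcc(signal)
    return bool(np.any((ste >= ste_thresh) & (zcc <= zcc_thresh)))
```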
Optionally, the electronic device may also determine the energy of the first audio signal and, according to that energy, determine the energy of the virtual noise to be added. In essence, this method controls the signal-to-noise ratio of the first audio signal after the virtual noise is added. For example, the ratio of the added virtual noise to the first audio signal may be preset to 50%: if the energy of the first audio signal is W1, the energy of the virtual noise to be added may be half of the first audio signal's energy, i.e., 0.5 W1.
Optionally, in the embodiments of this application, the external audio signal collected by the microphone may be a mixed audio signal including audio signals output by multiple sound sources. The electronic device may separate the mixed audio signal collected by the microphone to obtain the first audio signal corresponding to the first sound source, and then determine the audio content of the first audio signal and/or its energy as described above.
In one possible implementation, a simple frequency-domain speech separation algorithm introduced in the thesis "Key technologies in multichannel speech signal processing: sound field reconstruction and speech separation" (PhD thesis, Dalian University of Technology, Wang Lin) can be used to separate the mixed audio signal collected by the microphones into multiple independent audio signals, which include the first audio signal corresponding to the first sound source.
Suppose there are N independent sound sources and M microphones, with source vector s(n) = [s_1(n), ..., s_N(n)]^T and observation vector x(n) = [x_1(n), ..., x_M(n)]^T, and let the mixing filters have length P. The convolutive mixing process can then be expressed as
x(n) = Σ_{p=0}^{P−1} H(p) s(n − p) = H(n) * s(n),
where the mixing network H(n) is an M × N matrix sequence formed by the impulse responses of the mixing filters. Let the separation filters have length L and the estimated source vector be y(n) = [y_1(n), ..., y_N(n)]^T, given by
y(n) = Σ_{l=0}^{L−1} W(l) x(n − l) = W(n) * x(n),
where the separation network W(n) is an N × M matrix sequence formed by the impulse responses of the separation filters, and * denotes matrix convolution.
The separation network W(n) can be obtained with a frequency-domain blind source separation algorithm. After an L-point short-time Fourier transform (STFT), the time-domain convolution becomes a frequency-domain product:
X(m, f) = H(f) S(m, f),
Y(m, f) = W(f) X(m, f),
where m is obtained by L-point downsampling of the time index n, X(m, f) and Y(m, f) are the STFTs of x(n) and y(n) respectively, H(f) and W(f) are the Fourier transforms of H(n) and W(n) respectively, and f ∈ [f_0, ..., f_{L/2}] is the frequency.
Transforming the Y(m, f) obtained after blind source separation back to the time domain yields the estimated source signals y_1(n), ..., y_N(n).
It should be noted that in the embodiments of this application the non-attended first sound source may include one or more sources. When there are multiple, each source is processed through steps 402 and 403: the HRTF corresponding to each source is obtained, and each HRTF is multiplied with the preset virtual noise in the frequency domain to obtain each source's noise signal. Playing each source's noise signal superimposes that noise signal on the corresponding source, lowering the signal-to-noise ratio of the user's non-attended sources so that the user can focus more fully on listening to the audio signal of the attended source, assisting selective listening.
By way of example, an embodiment of this application provides an electronic device for implementing the assisted listening method provided in the embodiments of this application. The electronic device implements at least the following two groups of functions:
Part 1: detection and recognition
The modules implementing detection and recognition in the electronic device may include an environment information collection module, an environment information processing module, and an attention detection module. The environment information collection module collects the audio signals in the environment and may be an environment information collection sensor deployed in the device, such as a microphone or camera. The environment information processing module can determine, based on the collected audio signals, the positions of the sources corresponding to those signals. The attention detection module determines the user's attended direction, for example the orientation of the user's head or the gaze direction of the user's eyes, and may be an IMU, camera, or the like deployed on the device.
Part 2: function processing
The modules implementing this function processing in the electronic device may include an audio processing module and an audio playback module. The audio processing module adds noise to the audio signals the user does not attend to and may be a processor deployed in the device. The audio playback module plays the added noise signal and may be a speaker deployed in the device.
An embodiment of this application provides a concrete example of an assisted listening method, including at least the following steps:
Step 1: direction recognition
Detect the user's attended direction, in ways including but not limited to the following two:
First way: a camera is deployed on the electronic device and used to detect the gaze direction of the user wearing the device; the detected gaze direction is the user's attended direction.
Second way: a brain-wave sensor is deployed on the device and used to detect the difference between the electrical signals at the user's two ears; the gaze direction is determined based on the binaural current differences for different gaze directions. Likewise, the gaze direction is the user's attended direction.
Step 2: position recognition
Detect and determine the sources in the user's non-attended directions. For example, a microphone array or the like is deployed on the device and used to detect the coordinates of all sources near the user. According to the recognized attended direction, the non-attended sources are selected from all the sources near the user. Optionally, there may be one or more non-attended sources, without limitation. For example, there may be n non-attended sources with coordinates p1(x1, y1, z1), ..., pn(xn, yn, zn), where n is a positive integer greater than 1.
Step 3: binaural rendering, as shown in FIG. 8.
1. Obtain the binaural HRTFs based on the coordinates of the non-attended sources. Optionally, HRTFs for multiple positions can be pre-stored in the device; interpolating over these yields the HRTFs corresponding to the coordinates of the non-attended sources, as sketched below. Continuing the example above, the coordinates of the n non-attended sources are p1(x1, y1, z1), ..., pn(xn, yn, zn); in the embodiments of this application, the binaural HRTFs of each of the n non-attended sources can be obtained.
2. Process the virtual noise with the binaural HRTFs, for example by time-domain convolution or frequency-domain multiplication, to obtain the binaural audio signals, and play them in real time. Playback may be acoustic or via bone conduction, without limitation.
The virtual noise may be a noise audio file, stored on the device or in the cloud; when needed, the device downloads these noise audio files from the cloud and renders and plays them. Continuing the example, the n non-attended sources correspond to n binaural HRTFs, and the n binaural HRTFs may correspond to n virtual noises; processing the n binaural HRTFs with their corresponding n virtual noises yields n binaural noise signals, which are then played. Optionally, the n virtual noises may be denoted n1(n), ..., nn(n), and they may be identical or different.
With the above method, virtual noise is superimposed on the audio signals of the non-attended sources, lowering the intelligibility of their audio content and thereby raising the perceived clarity of the audio signal of the attended source.
In the embodiments provided above, the method provided by the embodiments of this application has been introduced from the perspective of the electronic device as the executing entity. To implement the functions of the method provided by the embodiments of this application, the electronic device may include a hardware structure and/or software modules, and implement the functions in the form of a hardware structure, software modules, or a combination of both. Whether a particular function is executed as a hardware structure, software modules, or a combination of both depends on the specific application and design constraints of the technical solution.
As shown in FIG. 9, an embodiment of this application provides an assisted listening apparatus, including at least a processing unit 901 and a playback unit 902.
The processing unit 901 is configured to determine the coordinates of a first sound source according to the external audio signal collected by the microphone and a first direction, where the first direction is a direction determined by detecting the user, and the first sound source is a source in a direction other than the first direction. The processing unit 901 is further configured to determine, according to the coordinates of the first sound source and preset HRTFs, a first HRTF corresponding to the coordinates of the first sound source. The processing unit 901 is further configured to obtain the noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise. The playback unit 902 is configured to play the noise signal.
In one possible implementation, determining the coordinates of the first sound source according to the external audio signal collected by the microphone and the first direction includes: determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone; detecting the user to determine the first direction; and determining the coordinates of the first sound source among the coordinates of the at least one source near the user according to the first direction.
In one possible implementation, determining the coordinates of at least one sound source near the user according to the external audio signal collected by the microphone includes: there being multiple microphones, each collecting an external audio signal, with time delays between the signals collected by different microphones; and determining the coordinates of at least one sound source near the user according to the time delays of the external audio signals collected by the different microphones.
In one possible implementation, detecting the user to determine the first direction includes: detecting the user's gaze direction; or detecting the user's binaural current difference and determining the user's gaze direction according to the correspondence between binaural current differences and gaze directions, the user's gaze direction being the first direction.
In one possible implementation, determining the coordinates of the first sound source among the coordinates of at least one source near the user according to the first direction includes: analyzing the coordinates of the at least one source near the user to determine the directional relationship between each source and the user; determining the deviation between each source and the first direction according to the first direction and the source-user directional relationship; and selecting, among the at least one source near the user, a source whose deviation from the first direction is greater than a threshold, the coordinates of that source being the coordinates of the first sound source.
In one possible implementation, the processing unit 901 is further configured to: the external audio signal collected by the microphone being a mixed audio signal including audio signals output by multiple sources, separate the external audio signal collected by the microphone to obtain the first audio signal output by the first sound source.
In one possible implementation, the processing unit 901 is further configured to: analyze the separated first audio signal to determine the content of the first audio signal; and determine, according to the content of the first audio signal, the type of virtual noise to be added.
In one possible implementation, if the content of the first audio signal is human conversation, the type of virtual noise to be added is multi-talker babble noise.
In one possible implementation, the processing unit 901 is further configured to: determine the energy of the separated first audio signal; and determine, according to the energy of the first audio signal, the energy of the virtual noise to be added.
An embodiment of this application further provides a computer-readable storage medium including a program; when the program is run by a processor, the methods in the above method embodiments are executed.
A computer program product includes computer program code that, when run on a computer, causes the computer to implement the methods in the above method embodiments.
A chip includes a processor coupled to a memory; the memory is used to store a program or instructions that, when executed by the processor, cause an apparatus to perform the methods in the above method embodiments.
In the description of this application, unless otherwise stated, "/" indicates an "or" relationship between the associated objects; for example, A/B may mean A or B. "And/or" in this application merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. Moreover, in the description of this application, unless otherwise stated, "multiple" means two or more. "At least one of the following items" or similar expressions refer to any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, to describe the technical solutions of the embodiments of this application clearly, words such as "first" and "second" are used in the embodiments of this application to distinguish identical or similar items with essentially the same functions and effects. Those skilled in the art will understand that words such as "first" and "second" do not limit quantity or execution order, and do not necessarily indicate difference.
In the embodiments of this application, the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in conjunction with the embodiments of this application may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
In the embodiments of this application, the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, for example random-access memory (RAM). The memory may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory in the embodiments of this application may also be a circuit or any other apparatus capable of implementing a storage function, for storing program instructions and/or data.
The methods provided in the embodiments of this application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., digital video disc (DVD)), or semiconductor media (e.g., SSD), among others.
Obviously, those skilled in the art can make various modifications and variations to this application without departing from its scope. Thus, if these modifications and variations of this application fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include them.
It should be pointed out that a portion of this patent application document contains material subject to copyright protection. The copyright owner reserves all copyright except for the making of copies of the patent document or the recorded patent document contents of the Patent Office.

Claims (20)

  1. An assisted listening method, comprising:
    determining coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, wherein the first direction is a direction determined by detecting a user, and the first sound source is a sound source in a direction other than the first direction;
    determining, according to the coordinates of the first sound source and a preset head-related transfer function (HRTF), a first HRTF corresponding to the first sound source; and
    obtaining a noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise, and playing the noise signal.
  2. The method according to claim 1, wherein the determining coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction comprises:
    determining coordinates of at least one sound source near the user according to the external audio signal collected by the microphone;
    detecting the user to determine the first direction; and
    determining the coordinates of the first sound source among the coordinates of the at least one sound source near the user according to the first direction.
  3. The method according to claim 2, wherein the determining coordinates of at least one sound source near the user according to the external audio signal collected by the microphone comprises:
    there being a plurality of microphones, each collecting an external audio signal, with time delays between the external audio signals collected by different microphones; and
    determining the coordinates of the at least one sound source near the user according to the time delays of the external audio signals collected by the different microphones.
  4. The method according to claim 2 or 3, wherein the detecting the user to determine the first direction comprises:
    detecting a gaze direction of the user; or
    detecting a binaural current difference of the user, and determining the gaze direction of the user according to a correspondence between binaural current differences and gaze directions, the gaze direction of the user being the first direction.
  5. The method according to any one of claims 2 to 4, wherein the determining the coordinates of the first sound source among the coordinates of the at least one sound source near the user according to the first direction comprises:
    analyzing the coordinates of the at least one sound source near the user to determine a directional relationship between each sound source and the user;
    determining a deviation between each sound source and the first direction according to the first direction and the directional relationship between the sound source and the user; and
    selecting, among the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold, the coordinates of that sound source being the coordinates of the first sound source.
  6. The method according to any one of claims 1 to 5, further comprising:
    the external audio signal collected by the microphone being a mixed audio signal comprising audio signals output by a plurality of sound sources; and
    separating the external audio signal collected by the microphone to obtain a first audio signal output by the first sound source.
  7. The method according to claim 6, further comprising:
    analyzing the separated first audio signal to determine content of the first audio signal; and
    determining, according to the content of the first audio signal, a type of virtual noise to be added.
  8. The method according to claim 7, wherein when the content of the first audio signal is human conversation, the type of virtual noise to be added is multi-talker babble noise.
  9. The method according to any one of claims 6 to 8, further comprising:
    determining energy of the separated first audio signal; and
    determining, according to the energy of the first audio signal, energy of the virtual noise to be added.
  10. An assisted listening apparatus, comprising:
    a processing unit, configured to determine coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction, wherein the first direction is a direction determined by detecting a user, and the first sound source is a sound source in a direction other than the first direction;
    the processing unit being further configured to determine, according to the coordinates of the first sound source and a preset HRTF, a first HRTF corresponding to the coordinates of the first sound source;
    the processing unit being further configured to obtain a noise signal corresponding to the first sound source according to the first HRTF and a preset virtual noise; and
    a playback unit, configured to play the noise signal.
  11. The apparatus according to claim 10, wherein the determining coordinates of a first sound source according to an external audio signal collected by a microphone and a first direction comprises:
    determining coordinates of at least one sound source near the user according to the external audio signal collected by the microphone;
    detecting the user to determine the first direction; and
    determining the coordinates of the first sound source among the coordinates of the at least one sound source near the user according to the first direction.
  12. The apparatus according to claim 11, wherein the determining coordinates of at least one sound source near the user according to the external audio signal collected by the microphone comprises:
    there being a plurality of microphones, each collecting an external audio signal, with time delays between the external audio signals collected by different microphones; and
    determining the coordinates of the at least one sound source near the user according to the time delays of the external audio signals collected by the different microphones.
  13. The apparatus according to claim 11 or 12, wherein the detecting the user to determine the first direction comprises:
    detecting a gaze direction of the user; or
    detecting a binaural current difference of the user, and determining the gaze direction of the user according to a correspondence between binaural current differences and gaze directions, the gaze direction of the user being the first direction.
  14. The apparatus according to any one of claims 11 to 13, wherein the determining the coordinates of the first sound source among the coordinates of the at least one sound source near the user according to the first direction comprises:
    analyzing the coordinates of the at least one sound source near the user to determine a directional relationship between each sound source and the user;
    determining a deviation between each sound source and the first direction according to the first direction and the directional relationship between the sound source and the user; and
    selecting, among the at least one sound source near the user, a sound source whose deviation from the first direction is greater than a threshold, the coordinates of that sound source being the coordinates of the first sound source.
  15. The apparatus according to any one of claims 10 to 14, wherein the processing unit is further configured to:
    the external audio signal collected by the microphone being a mixed audio signal comprising audio signals output by a plurality of sound sources,
    separate the external audio signal collected by the microphone to obtain a first audio signal output by the first sound source.
  16. The apparatus according to claim 15, wherein the processing unit is further configured to:
    analyze the separated first audio signal to determine content of the first audio signal; and
    determine, according to the content of the first audio signal, a type of virtual noise to be added.
  17. The apparatus according to claim 16, wherein when the content of the first audio signal is human conversation, the type of virtual noise to be added is multi-talker babble noise.
  18. The apparatus according to any one of claims 15 to 17, wherein the processing unit is further configured to:
    determine energy of the separated first audio signal; and
    determine, according to the energy of the first audio signal, energy of the virtual noise to be added.
  19. An electronic device, comprising a memory and one or more processors, wherein the memory is configured to store computer program code, the computer program code comprising computer instructions; and when the computer instructions are executed by the processor, the electronic device is caused to perform the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, comprising a program or instructions that, when run on a computer, cause the method according to any one of claims 1 to 9 to be performed.
PCT/CN2021/078222 2021-02-26 2021-02-26 Assisted listening method and apparatus WO2022178852A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180004382.3A CN115250646A (zh) 2021-02-26 2021-02-26 Assisted listening method and apparatus
PCT/CN2021/078222 WO2022178852A1 (zh) 2021-02-26 2021-02-26 Assisted listening method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/078222 WO2022178852A1 (zh) 2021-02-26 2021-02-26 Assisted listening method and apparatus

Publications (1)

Publication Number Publication Date
WO2022178852A1 true WO2022178852A1 (zh) 2022-09-01

Family

ID=83047666

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078222 WO2022178852A1 (zh) 2021-02-26 2021-02-26 一种辅助聆听方法及装置

Country Status (2)

Country Link
CN (1) CN115250646A (zh)
WO (1) WO2022178852A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160162254A1 (en) * 2014-12-05 2016-06-09 Stages Pcs, Llc Communication system for establishing and providing preferred audio
CN108601519A (zh) * 2016-02-02 2018-09-28 电子湾有限公司 个性化的实时音频处理
CN108810719A (zh) * 2018-08-29 2018-11-13 歌尔科技有限公司 一种降噪方法、颈带式耳机及存储介质
WO2020159557A1 (en) * 2019-01-29 2020-08-06 Facebook Technologies, Llc Generating a modified audio experience for an audio system


Also Published As

Publication number Publication date
CN115250646A (zh) 2022-10-28

Similar Documents

Publication Publication Date Title
US10187740B2 (en) Producing headphone driver signals in a digital audio signal processing binaural rendering environment
US10659908B2 (en) System and method to capture image of pinna and characterize human auditory anatomy using image of pinna
US10585486B2 (en) Gesture interactive wearable spatial audio system
JP6665379B2 (ja) 聴覚支援システムおよび聴覚支援装置
TW201820315A (zh) 改良型音訊耳機裝置及其聲音播放方法、電腦程式
JP2017521902A (ja) 取得した音響信号のための回路デバイスシステム及び関連するコンピュータで実行可能なコード
US11184723B2 (en) Methods and apparatus for auditory attention tracking through source modification
US11221820B2 (en) System and method for processing audio between multiple audio spaces
US10979236B1 (en) Systems and methods for smoothly transitioning conversations between communication channels
WO2019109420A1 (zh) 左右声道确定方法及耳机设备
CN104168534A (zh) 一种全息音频装置及控制方法
US20200413190A1 (en) Processing device, processing method, and program
CN114339582B (zh) 双通道音频处理、方向感滤波器生成方法、装置以及介质
WO2022178852A1 (zh) 一种辅助聆听方法及装置
US11217268B2 (en) Real-time augmented hearing platform
JP6587047B2 (ja) 臨場感伝達システムおよび臨場感再現装置
US20220021998A1 (en) Method for generating sound and devices for performing same
Amin et al. Impact of microphone orientation and distance on BSS quality within interaction devices
WO2021086559A1 (en) Systems and methods for classifying beamformed signals for binaural audio playback

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927288

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21927288

Country of ref document: EP

Kind code of ref document: A1