CN113784274A - Three-dimensional audio system

Three-dimensional audio system

Info

Publication number
CN113784274A
CN113784274A
Authority
CN
China
Prior art keywords
sound
filters
tracks
audio tracks
hrtf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110585625.9A
Other languages
Chinese (zh)
Inventor
李齐
丁寅
J·奥兰
J·泰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
American Lct Co
Li Creative Tech Inc
Original Assignee
American Lct Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/227,067 external-priority patent/US11240621B2/en
Application filed by American Lct Co filed Critical American Lct Co
Publication of CN113784274A publication Critical patent/CN113784274A/en


Classifications

    • H: ELECTRICITY
      • H04: ELECTRIC COMMUNICATION TECHNIQUE
        • H04S: STEREOPHONIC SYSTEMS
          • H04S 7/00: Indicating arrangements; Control arrangements, e.g. balance control
            • H04S 7/30: Control circuits for electronic adaptation of the sound field
              • H04S 7/302: Electronic adaptation of stereophonic sound system to listener position or orientation
          • H04S 2400/00: Details of stereophonic systems covered by H04S but not provided for in its groups
            • H04S 2400/01: Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
            • H04S 2400/05: Generation or adaptation of centre channel in multi-channel audio systems
          • H04S 2420/00: Techniques used in stereophonic systems covered by H04S but not provided for in its groups
            • H04S 2420/13: Application of wave-field synthesis in stereophonic audio systems
        • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
          • H04R 3/00: Circuits for transducers, loudspeakers or microphones
            • H04R 3/005: Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
            • H04R 3/12: Circuits for transducers, loudspeakers or microphones for distributing signals to two or more loudspeakers
          • H04R 5/00: Stereophonic arrangements
            • H04R 5/02: Spatial or constructional arrangements of loudspeakers

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Stereophonic System (AREA)

Abstract

A specification may define a distribution network (mesh) of filters on a grid in a three-dimensional space presented in a user interface. Multiple audio tracks may be separated from a mono or stereo input, or determined from the output of a microphone array, where each audio track is associated with a respective sound source location in the three-dimensional space. The relative positions of the listener and the sound sources may be set by the listener through the user interface. One or more pairs of filters are selected from the distribution grid of filters based on the relative positions of the listener and the sound sources in the three-dimensional space. Each pair of filters is applied to a respective audio track to generate a plurality of filtered audio tracks, and three-dimensional sound is then generated from the filtered audio tracks for listening through headphones. If multiple loudspeakers are used, each separated or determined track may directly (i.e., all-pass filtered) drive a corresponding loudspeaker through an amplifier. The sound from the loudspeakers directly forms a three-dimensional sound field, so three-dimensional sound can be heard without wearing headphones.

Description

Three-dimensional audio system
Cross Reference to Related Applications
This application claims the benefit of the following patent applications: U.S. provisional application No. 63/008,723, filed on April 11, 2020; and U.S. provisional application No. 63/036,797, filed on June 9, 2020, each of which is incorporated herein by reference in its entirety.
Technical Field
The present description relates to the generation of three-dimensional sound, and in particular to a system and method for picking up a mixed soundtrack or processing it into separate sound types, then applying transfer functions to the separated sounds to generate three-dimensional sound containing spatial information about the sound sources, with the three-dimensional (3D) sound field recreated according to user settings.
Background
Billions of people around the world listen to music, but most listeners can only hear it in mono or stereo format. Stereo is a method of sound reproduction, but it typically refers to only two audio channels played through two speakers or headphones. More immersive sound technologies such as surround sound require that multiple tracks (e.g., 5.1 or 7.1 surround sound settings) be recorded and saved, and that the sound be played through an equal number of speakers. Each audio channel or track contains mixed sound from a plurality of sound sources. Thus, stereo sound differs from "real" sound (e.g., sound in front of a concert stage) in that spatial information about individual sound sources (e.g., the positions of instruments and human voices) is not reflected in the sound. The method of the present description may use multiple independent audio channels played by multiple speakers or headphones so that the sound from the speakers or headphones comes from various directions, as in natural hearing.
A person can perceive spatial information with both ears and hear "real" three-dimensional (3D) sound as binaural sound (i.e., sound perceived separately by the left and right ears), for example when listening to music in a concert hall or theater, or at a sporting event in a stadium or arena. However, as mentioned above, today's music technology typically provides only mono or stereo sound without spatial cues or spatial information. Thus, music experienced in theaters, arenas, and concert halls is often more pleasing than music played through headphones, earphones, or speakers, or even through multi-track, multi-speaker surround systems. Currently, 3D sound can be generated, for example, by a number of loudspeakers mounted on the walls of a cinema, each driven by a separate audio track recorded during production of the movie. However, such 3D audio systems can be very expensive and cannot be implemented in mobile devices (as application software) or even in most home theaters or car stereos. Thus, in today's music and entertainment industry, most music or other audio data is stored and played as mono or stereo sound, where all sound sources (such as human voices and various instruments) are pre-mixed into only one (mono) or two (stereo) tracks.
Most audio from a video conferencing device (such as a computer, laptop, smartphone, or tablet) is mono sound. Although a user (e.g., an attendee or participant) can see all the participants of the conference in separate windows on the display screen, the audio is typically a single mono channel with a narrow bandwidth. Using the videos of the various attendees, a virtual meeting room may be achieved, but the audio component cannot match the video component because the 3D sound needed for a more accurate (e.g., spatial) virtual reality sound experience cannot be provided. Furthermore, when two attendees have similar-sounding voices, the user may not be able to tell their speech apart when they speak simultaneously or even separately. This may occur, for example, when the user is looking at a shared document on another screen or video window rather than at the attendee's face. The problem may be exacerbated when more attendees are participating in a video conference (e.g., a remote learning classroom). The user needs spatial information, such as 3D sound, to help identify which attendee is speaking.
Drawings
The present description will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the description. The drawings, however, should not be taken to limit the description to the specific implementation examples, but are for explanation and understanding only.
FIGS. 1A-1B illustrate a system for generating three-dimensional sound implemented in accordance with the present description.
Fig. 2A to 2B illustrate a spatial relationship between a sound source and a listener in a three-dimensional space and a data structure and retrieval of HRTF filters for generating 3D sounds reflecting the spatial relationship, implemented according to the present description.
FIG. 3 illustrates a system implemented in accordance with the present description for training a machine learning model to separate mixed audio tracks.
Fig. 4 illustrates a system for separating and filtering mixed audio tracks using transformed sound signals, implemented in accordance with the present description.
Figs. 5A-5E illustrate, in waveforms and spectrograms, an original mixed sound and its separation into vocal, drum, bass, and other sounds, respectively, implemented in accordance with the present description.
Fig. 6 illustrates far-field speech control of a 3D binaural music system with music retrieved by speech and sound separation, implemented in accordance with the present description.
FIGS. 7A-7D illustrate GUIs for user settings of 3D sound, with the listener positioned within the band (7A-7C) and in front of the band (7D), implemented in accordance with the present description.
Fig. 8 illustrates a system for generating 3D sound using a microphone array, implemented in accordance with the present description.
Fig. 9A-9B illustrate beam patterns of a 3D microphone and a 3D microphone array with spatial noise cancellation functionality implemented according to the present description.
FIG. 10 illustrates a conference or virtual concert system for generating three-dimensional sound implemented in accordance with the present description.
FIG. 11 illustrates a virtual conference room implemented in accordance with the present description as a GUI display for a conferencing system for generating three-dimensional sound.
FIG. 12 illustrates a method for generating three-dimensional sound implemented in accordance with the present description.
FIG. 13 illustrates a method for generating three-dimensional sound implemented in accordance with the present description.
FIG. 14 illustrates a block diagram of hardware of a computer system in accordance with one or more implementations of the present description.
Detailed description of the invention
Three-dimensional (3D) settable soundstage audio systems, applications, and implementations are described herein. A three-dimensional (3D) sound field refers to sound that includes discrete sound sources located at different spatial locations. A 3D soundstage is a sound representation of a 3D sound field. For example, soundstage music may give a listener the auditory perception of the independent locations of instruments and vocal sources when listening to a given piece of music through headphones or speakers. In general, a 3D soundstage may give a listener the perception of spatial information. The soundstage is configurable, whether by the listener, a DJ, software, or the audio system. For example, the position of each instrument in the 3D sound field may be moved, and the position of the listener in the 3D sound field may be moved, statically or dynamically, near a preferred instrument position.
To listen to or play a 3D sound field, a listener may use binaural sound represented by two tracks, one for the left ear and one for the right ear, allowing the listener to perceive spatial position information associated with the sound sources. With headphones, headsets, or other such devices, binaural sound can provide a 3D-sound-like experience (e.g., sounds appear to come from different locations). In addition, a 3D soundstage can also be played directly as 3D sound. In direct 3D sound, sound is played from a set of speakers located at different 3D locations (e.g., corresponding to the desired locations of the various sound sources in the 3D sound field). Each speaker may play a separate, individual audio track, e.g., one speaker for the drums and another speaker for the bass. Listeners can hear the 3D sound field directly from the loudspeakers because they are situated in a real-world 3D sound field. In both the binaural and direct 3D sound cases, the listener's brain can perceive a 3D sound field and can recognize and track discrete individual sound sources, as in the real world; this may be referred to in this description as acoustic virtual reality.
Furthermore, another way to implement a 3D sound field may be to record binaural sounds directly with dedicated binaural/3D microphones. Most existing binaural microphones are only dummy heads with microphones in the ears, which may be too large in size and/or too expensive for many applications. Thus, a 3D microphone is described herein that can have a small size by using a very small microphone array and signal processing techniques. Such a small 3D microphone may be used with any handheld recording device, such as a smartphone or tablet. The sound output picked up by the 3D microphone may be a binaural, stereo or multitrack recording, where each track is for one spatial direction associated with a sound source of the 3D sound field.
In this description, the following three techniques are used to enhance the signal-to-noise ratio (SNR) of an audio signal. Noise reduction is the process of reducing background noise in an audio channel based on temporal information, such as the statistical properties of signal versus noise or the frequency distributions of the different signals. A microphone array uses one or more sound beam patterns to enhance sound from one beam direction while canceling sound from outside that direction. Acoustic echo cancellation (AEC) uses one or more reference signals to cancel the corresponding signals mixed into the signals acquired by the microphones; the reference signal is correlated with the signal that the AEC is to cancel.
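By way of illustration only, the sketch below shows one common form of single-channel noise reduction, spectral subtraction, in which a noise floor estimated from the first few frames is subtracted from every frame. The function name, the assumption that the opening frames contain only noise, and all parameter values are illustrative and not taken from the patent.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_frames=10, floor=0.05):
    """Reduce stationary background noise in a mono signal x.

    The noise magnitude spectrum is estimated from the first `noise_frames`
    STFT frames (assumed to contain noise only) and subtracted from every
    frame, with a spectral floor to limit musical-noise artifacts.
    """
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    _, x_clean = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return x_clean
```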
System
FIGS. 1A-1B illustrate systems 100A and 100B for generating three-dimensional sound implemented in accordance with the present description. Systems 100A and 100B may be stand-alone computer systems or networked computing resources implemented in a computing cloud.
Referring to fig. 1A, a system 100A may include a sound separation unit 102A, a storage unit 104A for storing a plurality of filters, such as Head Related Transfer Function (HRTF) filters, all-pass filters, or equalization filters, a signal processing unit 106A, and a 3D sound field setting unit 108A having a Graphical User Interface (GUI)110A to receive user input. For simplicity of discussion, the filters are referred to hereinafter as HRTF filters, although it will be appreciated that the filters may be any type of suitable filter, including all-pass filters or equalizer filters. The sound separation unit 102A, the storage unit 104A, and the 3D sound field setting unit 108A may be connected to the signal processing unit 106A. The signal processing unit 106A may be a programmable device that can be programmed according to settings on the GUI 110A of a user interface device (not shown) to generate three-dimensional sound.
In the example of fig. 1A, the input to the sound separation unit 102A is a mono or stereo signal or an original mixed track of audio components, while the output of the signal processing unit 106A is 3D binaural audio for the left and right ears, respectively. Each mixed input track or channel may first be separated by the sound separation unit 102A into a set of separated tracks (e.g., one per sound source or group of sound types), where each track represents a type (or category) of sound, such as vocals, drums, bass, or others (e.g., based on the properties of the respective sound source).
The signal processing unit 106A may then process each of the separated audio tracks using the paired HRTF filters from the storage unit 104A to output, for each of the separated audio tracks, two audio channels representing a left-ear channel and a right-ear channel, respectively. In one implementation, the above process may process the various input mixed tracks in parallel.
Each pair of HRTF filters (e.g., the pair of left and right HRTF filters 200B of fig. 2B described below) may be associated with a point on a grid in three-dimensional space (e.g., these HRTF filters may be stored in a database as an array of grid points), and each grid point may be represented by two parameters: the azimuth angle θ and the attitude (elevation) angle γ (e.g., 202B and 204B in fig. 2B, respectively). The distribution network (mesh) of HRTF filters (e.g., 200B) may be an array of pre-computed or pre-measured left and right HRTF filter pairs defined on a grid in three-dimensional space (e.g., 200A), where each point of the grid has an associated left and right HRTF filter pair. An HRTF filter pair may be retrieved by applying an activation function, where the inputs of the activation function may include the relative position and distance/range between the sound source and the listener, and the output of the activation function may be the HRTF database index used to retrieve the HRTF filter pair defined at the corresponding grid point. For example, in one activation function implementation, the inputs to the activation function may be the azimuth angle θ and the attitude angle γ, while the output is a database index used to retrieve the left and right HRTF filter pair. The retrieved HRTF filters can then be used to filter the separated audio tracks. For each separated audio track, the activation function is called to retrieve the corresponding HRTF filter pair. The values of the azimuth angle θ and the attitude angle γ can be determined according to the user setting specification. For example, as shown in fig. 7A, if the azimuth angle θ takes the values 0° (vocal), 30° (drum), 180° (bass), and 330° (keyboard) and the attitude angle γ is 0°, four pairs of HRTF filters need to be retrieved by the activation function to filter the four separated audio tracks, respectively.
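A minimal sketch of such an activation-function lookup is shown below, assuming the HRTF mesh is stored as a NumPy array indexed by quantized azimuth and attitude (elevation) angles. The 5-degree grid spacing, the 256-tap filter length, the angle conventions, and the 1/R volume law are illustrative assumptions rather than details fixed by the patent.

```python
import numpy as np

# Hypothetical HRTF mesh: filter pairs defined every 5 degrees in azimuth and
# elevation, each a pair of 256-tap impulse responses (left, right).
AZ_STEP, EL_STEP, FIR_LEN = 5, 5, 256
N_AZ, N_EL = 360 // AZ_STEP, 180 // EL_STEP
hrtf_mesh = np.zeros((N_AZ, N_EL, 2, FIR_LEN))   # filled from measurements in practice

def activation(azimuth_deg, elevation_deg):
    """Map a (theta, gamma) pair to the database index of the nearest grid point."""
    az_idx = int(round(azimuth_deg / AZ_STEP)) % N_AZ
    el_idx = min(int(round((elevation_deg + 90.0) / EL_STEP)), N_EL - 1)
    return az_idx, el_idx

def retrieve_hrtf_pair(azimuth_deg, elevation_deg, range_m, min_range=0.5):
    """Return (left_fir, right_fir, gain) for one separated track."""
    az_idx, el_idx = activation(azimuth_deg, elevation_deg)
    left_fir, right_fir = hrtf_mesh[az_idx, el_idx]
    gain = 1.0 / max(range_m, min_range)   # closer source -> louder output
    return left_fir, right_fir, gain

# Example from fig. 7A: vocal at 0 deg, drums at 30, bass at 180, keyboard at 330.
filters = {name: retrieve_hrtf_pair(az, 0.0, range_m=2.0)
           for name, az in [("vocal", 0), ("drum", 30), ("bass", 180), ("keys", 330)]}
```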
As shown in fig. 2A and 2B, the listener (e.g., 202A) and/or the sound source (e.g., 204A) may move, with the angles θ and γ changing over time. It may then be necessary to dynamically retrieve a new series of HRTF filter pairs (e.g., 200B) in order to output the correct binaural sound that virtually represents the sound received by the listener (e.g., 202A) in the 3D sound space (e.g., 200A). Dynamic retrieval of the HRTF filters is facilitated by storing these filters as a mesh, since the stored HRTF filter pairs are associated with any point on the mesh in three-dimensional space where the listener and/or sound source can be located during movement. The range R (210A) may be conveyed by the volume of the filtered sound: the closer the listener is to the sound source, the greater the volume.
The left audio tracks of all filter outputs may then be mixed to generate a left channel of binaural sound (e.g., binaural L), while all right channels may be mixed to generate a right channel of binaural sound (e.g., binaural R). When the L and R channels are played simultaneously through headphones or headsets, the listener can experience 3D binaural sound and perceive the spatial position of the sound source in the 3D sound field.
Further, the listener can set the position and/or volume of each sound source, and of the listener, in the 3D sound field through the GUI 110A. Virtually (e.g., in acoustic virtual reality), the listener and sound sources may be positioned anywhere within the 3D sound field, and the volume of each sound source may be determined by the distance in the 3D sound field from the listener's location to that sound source's location. For example, the sound source locations and/or volumes may be set via the GUI 110A, which may be presented via a user interface device. The user interface device may, for example, take the form of a touch screen on a smartphone (figs. 7A-7D) or tablet computer. In one implementation, the virtual location of the vocal source may be in front of the listener, the drum source in front of the listener on the right, the bass source behind the other sources (e.g., farther away) with respect to the listener, and the "other" instruments (e.g., unidentified sound types or categories) in front of the listener on the left; by positioning the listener (virtual head) near the drums and bass, the drum and bass sources are made louder while the vocal and "other" sources become softer (fig. 7C). The listener can then hear the 3D sound field from the binaural outputs (e.g., binaural L and binaural R) according to the listener's own settings. If the virtual head and a musical instrument are placed at the same position (for example, fig. 7B), the listener hears that instrument's solo.
In one implementation, to generate the binaural outputs (e.g., binaural L + R), as shown in fig. 1A, for each separated track associated with a respective sound source position, a pair of corresponding HRTF filters may be selected (e.g., from storage unit 104A) to process (e.g., by signal processing unit 106A) that track into two outputs: L and R audio. Finally, a mixer (not shown) may mix all L tracks and all R tracks, respectively, to output the binaural L and R signals. The selection of the corresponding HRTF filters is discussed in further detail below (e.g., see the description of fig. 2). If the mixed track is stereo (two tracks), each track needs to be processed as described above to generate mixed binaural sound. When the L and R channels are played simultaneously through headphones or a headset, the listener can experience 3D binaural sound and perceive the 3D sound field.
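The per-track filtering and mixing just described can be written as one convolution per ear per track followed by a sum. The sketch below assumes separated tracks and (left, right, gain) filter tuples such as those returned by the lookup sketch above; it is an illustration, not the patent's implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(tracks, hrtf_pairs, gains):
    """Mix separated tracks into a binaural (L, R) pair.

    tracks     : list of 1-D float arrays, one per separated sound source
    hrtf_pairs : list of (left_fir, right_fir) impulse-response pairs
    gains      : per-track gains derived from the source-to-listener range
    """
    n = max(len(t) for t in tracks) + max(len(h[0]) for h in hrtf_pairs) - 1
    left, right = np.zeros(n), np.zeros(n)
    for track, (h_l, h_r), g in zip(tracks, hrtf_pairs, gains):
        yl = g * fftconvolve(track, h_l)        # left-ear channel for this source
        yr = g * fftconvolve(track, h_r)        # right-ear channel for this source
        left[:len(yl)] += yl
        right[:len(yr)] += yr
    peak = max(np.abs(left).max(), np.abs(right).max(), 1e-9)
    return left / peak, right / peak            # normalized binaural L and R
```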
Referring to fig. 1B, the system 100B may include a sound separation unit 102B, a 3D signal processing unit 104B, an amplifier 106B, a speaker 108B, and a 3D sound field setting unit 110B having a Graphical User Interface (GUI)112B for receiving user input. The sound separation unit 102B and the 3D sound field setting unit 110B may be connected to the signal processing unit 104B. The signal processing unit 104B may be a programmable device that may be programmed to implement three-dimensional sound generation according to settings received via a GUI 112B presented on a user interface device (not shown).
In the example of fig. 1B, the input to the sound separation unit 102B is an original mono or stereo signal or a mixed track of audio, while the output of the 3D signal processing unit 104B is a set of tracks that drive a plurality of speakers 108B through an amplifier 106B. Each mixed input track or channel may first be separated by the sound separation unit 102B into a set of separated tracks (e.g., one per sound source or type), where each track represents one type (or category) of sound, such as vocals, drums, bass, or others (e.g., based on the properties of the respective sound source). Each separated track may then be processed by the 3D signal processing unit 104B, and each processed track may be output as a single track that drives a respective speaker 108B through an amplifier 106B. In one implementation, the above process may be performed in parallel for each input mixed track. All output tracks can then be played through the speakers 108B (e.g., at different locations in the real world) to form a real-world 3D sound field at the listener's real-world location.
As noted above with respect to fig. 1A, the listener can set the position and/or volume of each sound source, and of the listener, in the 3D sound field via GUI 112B. Virtually (e.g., in acoustic virtual reality), the listener and sound sources may be positioned anywhere within the 3D sound field, and the volume of each sound source may be determined by the distance in the 3D sound field from the listener's location to that sound source's location. For example, the sound source locations and/or volumes may be set through the GUI 112B, which may be presented via a user interface device. The user interface device may, for example, be in the form of a touch screen on a smartphone or tablet computer. The listener can then hear the 3D sound field from the output of the speakers 108B according to the listener's own settings.
An implementation of GUI 110A or GUI 112B can be seen in FIGS. 7A-7D. These will be described in detail below.
Fig. 2A-2B illustrate a spatial relationship between a sound source 204A and a listener 202A in a 3D space 200A implemented in accordance with the present description, and the selection of HRTF filters 200B for generating 3D sounds reflecting the spatial relationship.
Head-related transfer function (HRTF) filters (e.g., similar to those stored in storage unit 104A of fig. 1A) can characterize how a human listener (with external ears on the head) at a first specified location in three-dimensional space receives sound from a sound source at a second specified location in the same 3D space. When sound waves reach a listener, the size and shape of the head, ears, ear canal, hair density, and nasal and oral cavities all alter the sound and affect the listener's perception by enhancing some frequencies and attenuating others. But the envelope of the response spectrum may be more complex than a simple enhancement or attenuation: it may span a broad frequency range and/or vary significantly with the direction of the sound.
With two ears (e.g., binaural hearing), a listener can localize sound in three dimensions: range (distance), elevation (up-down), and direction (front-back and side-to-side). This is possible because the brain, inner ear, and outer ear (pinna) work together to infer location. The listener can estimate the location of a sound source by acquiring cues from each ear (monaural cues) and by comparing the cues received at both ears (difference cues or binaural cues). What the brain perceives are the differences in arrival time and intensity of the sound at each ear. Monaural cues come from the interaction between the sound source and the listener's anatomy, where the original source sound is altered by the outer ear (pinna) before entering the ear canal for processing by the cochlea and brain. These changes encode the sound source position and can be captured via the relationship between the sound source position and the listener position. Soundtrack filters based on this relationship are referred to herein as HRTF filters. Sound is convolved with a pair of HRTF filters to transform the sound of a track into binaural signals for the left and right ears, respectively, wherein the binaural sound signals (e.g., binaural L + R of fig. 1A) correspond to the real-world 3D sound field signals that would be heard at the listener's location if the source sound were played at the location associated with that pair of HRTF filters.
A pair of binaural tracks for the left and right ears of a listener may be used to generate binaural sound from mono or stereo sound, which appears to be from a particular location in space. HRTF filters are transfer functions that describe how sound from a particular location in 3D space reaches the location of the listener (usually at the outer end of the listener's ear canal). The HRTF filters can be implemented as convolution calculations in the time domain or multiplications in the frequency domain to save computation time, as shown in fig. 4 (described more fully below). Pairs of HRTF filters may be applied to multiple audio tracks from multiple sound sources to generate a 3D sound field represented as binaural sound signals. The corresponding HRTF filter can be selected based on the listener's settings (i.e., the relative position of the desired sound source to the listener).
Referring to fig. 2A, a 3D sound space 200A in which a sound source (e.g., 204A) and a listener 202A are located may be represented as a grid having a polar coordinate system. The relative position and distance from the listener 202A to the sound source 204A can be determined from the following three parameters: azimuth θ (202B of fig. 2B), attitude angle γ (204B of fig. 2B), and radius R (210A).
Referring to fig. 2B, the listener's corresponding HRTF filters 200B for each position in the 3D space 200A may be measured or generated and then stored as a function of the polar coordinate system representing the 3D space 200A. Each HRTF filter 200B (e.g., a pair of left and right HRTF filters) may be associated with a point on a mesh (e.g., the HRTF filters are stored as a mesh), and each mesh point may be represented by two parameters: an azimuth angle θ 202B and an attitude angle γ 204B. Depending on the user's settings, the system (e.g., 100A of fig. 1A) knows the spatial relationship between each sound source (e.g., 204A) and the listener 202A, i.e., α 206A, β 208A, and R 210A are known. Thus, based on θ = α and γ = β, the system can retrieve, for the separated audio track associated with the sound source 204A, the corresponding HRTF filter pair 200B for the listener's left and right ears (e.g., HRTF-Left and HRTF-Right). The track of sound source 204A may then be processed (e.g., by signal processing unit 106A of fig. 1A) using the retrieved HRTF filters 200B. The output volume of the generated 3D sound may be a function of the radius R 210A: the shorter R 210A, the greater the output volume.
In one embodiment, for multiple sound sources (e.g., sound source 204A), the system may repeat the above-described filter retrieval and filtering operations for each sound source, and then combine (e.g., mix) the filtered soundtracks together for final binaural output or stereo (rather than mono) output to both speakers.
As noted above with respect to fig. 1A, the listener 202A and/or the sound source 204A may move, with the angles θ and γ changing over time. It may then be necessary to dynamically retrieve a new series of HRTF filter pairs 200B in order to output the correct binaural sound that virtually represents the sound received by the listener 202A in the 3D sound space 200A. Dynamic retrieval of the HRTF filters 200B is facilitated by storing these filters as a mesh, since the stored HRTF filter pairs are associated with any point on the mesh in three-dimensional space where the listener and/or sound source may be located during the movement.
Fig. 3 illustrates a system 300 for training a machine learning model 308 to separate mixed audio tracks according to one embodiment of the present description.
Although multiple microphones may be used to record music on multiple tracks, with each individual track representing one instrument or voice recorded in the studio, the music streams most often available to consumers are mixed into stereo sound. The costs of recording, storing, transmitting, and playing back multi-track audio (and its bandwidth) can be very high, and therefore most existing music recording and playback devices (radios or smartphones) are set up to use only mono or stereo sound. To generate a 3D soundstage from the conventional mixed-track formats (mono and stereo), the system (e.g., system 100A of fig. 1A or 100B of fig. 1B) requires separation of each mixed track into multiple tracks, where each track represents a type (or category) of sound or instrument. The separation may be performed according to a mathematical model and a corresponding software or hardware implementation, where the input is a mixed audio track and the outputs are the separated audio tracks. In one embodiment, for stereo input, the left and right tracks may be processed jointly or separately (e.g., by sound separation unit 102A of fig. 1A or sound separation unit 102B of fig. 1B).
Machine learning in this description refers to a software-implemented method on a hardware processing device that uses statistical techniques and/or artificial neural networks to give a computer the ability to "learn" from data (i.e., gradually improve performance on a particular task) without being explicitly programmed. Machine learning may use parameterized models (referred to as "machine learning models"), which may be implemented using supervised/semi-supervised learning, unsupervised learning, or reinforcement learning methods. The supervised/semi-supervised learning approach may train a machine learning model using labeled training samples. To perform a task using a supervised machine learning model, a computer may train the machine learning model using samples (often referred to as "training data") and adjust the parameters of the machine learning model based on performance measurements (e.g., error rates). The process of adjusting the parameters of a machine learning model (often referred to as "training the machine learning model") may generate a specific model to perform the actual task for which it is trained. After training, the computer may receive new data inputs associated with the task and compute an estimated output of the machine learning model that predicts the task result based on the trained machine learning model. Each training sample may comprise input data and corresponding desired output data, such as audio tracks, where the data may be represented in a suitable form (such as alphanumeric symbols or vectors of numeric values).
The learning process may be an iterative process. The process may include a forward propagation process to compute an output based on the machine learning model and input data fed into the machine learning model, and then compute a difference between the desired output data and the computed output data. The process may further include a back propagation process to adjust parameters of the machine learning model based on the calculated differences.
The parameters of the machine learning model 308 for separating mixed audio tracks may be trained by machine learning, statistical, or signal processing techniques. As shown in fig. 3, the machine learning model 308 may have two phases: a training period and a separation period. In the training period of the machine learning model 308, an audio or music recording of the mixed sound may be used as the input to the feature extraction unit 302, and the corresponding separated tracks may be used as the targets of the separation model training unit 304, i.e., as samples of the desired separated output. The separation model training unit 304 may include a data processing unit, comprising a data normalization/data perturbation unit 306 and the feature extraction unit 302. Data normalization normalizes the input training data so that they have similar dynamic ranges. Data perturbation generates reasonable data variations to cover more signal cases than are available in the training data, so that more data is available for training. Data normalization and perturbation may be optional, depending on the amount of data available.
Feature extraction unit 302 may extract features from raw input data (e.g., mixed sounds) to aid in training and separation calculations. The training data may be processed in the time domain (raw data), frequency domain, feature domain, or time-frequency domain by a Fast Fourier Transform (FFT), a Short Time Fourier Transform (STFT), a spectrogram, an auditory transform, a wavelet, or other transform. Fig. 4 (described more fully below) shows how the soundtrack separation and HRTF filtering is done in the transform domain.
The model structure and training algorithm for the machine learning model 308 may be a Neural Network (NN), Convolutional Neural Network (CNN), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), or any model and/or algorithm that may be used to separate sound sources in a mixed soundtrack. After training, in a separation period, the input music data may be separated into a plurality of tracks, each separated track corresponding to a separate sound, by the trained separation model calculation unit 310. In one embodiment, multiple separate audio tracks may be mixed in different ways by user settings (e.g., via GUI 110A of fig. 1A) for different sound effects.
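The patent leaves the choice of model open; purely as an illustration, the sketch below trains a small mask-estimation network on magnitude spectrograms with PyTorch, where the mixture spectrogram is the input and the separated stems are the targets. The layer sizes, STFT parameters, number of sources, and loss function are assumptions made for the example.

```python
import torch
import torch.nn as nn

N_FFT, HOP = 1024, 256
N_BINS = N_FFT // 2 + 1
N_SOURCES = 4   # e.g. vocals, drums, bass, other

def magnitude(x):
    """STFT magnitude of a batch of mono waveforms, shape (batch, bins, frames)."""
    window = torch.hann_window(N_FFT, device=x.device)
    spec = torch.stft(x, N_FFT, HOP, window=window, return_complex=True)
    return spec.abs()

class MaskNet(nn.Module):
    """Frame-wise mask estimator: mixture magnitude -> one mask per source."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_BINS, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, N_BINS * N_SOURCES), nn.Sigmoid())

    def forward(self, mix_mag):                      # (batch, bins, frames)
        x = mix_mag.transpose(1, 2)                  # (batch, frames, bins)
        masks = self.net(x).reshape(x.shape[0], x.shape[1], N_SOURCES, N_BINS)
        return masks.permute(0, 2, 3, 1)             # (batch, sources, bins, frames)

model = MaskNet()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(mix_wave, stem_waves):
    """One forward/backward pass; stem_waves is a list of per-source waveform batches."""
    mix_mag = magnitude(mix_wave)                          # (batch, bins, frames)
    target = torch.stack([magnitude(s) for s in stem_waves], dim=1)
    est = model(mix_mag) * mix_mag.unsqueeze(1)            # masked mixture magnitudes
    loss = loss_fn(est, target)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

At separation time, the predicted masks would be multiplied with the mixture spectrogram and the result inverted back to waveforms, which corresponds to the transform-domain processing of fig. 4 discussed below.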
In one embodiment, the machine learning model 308 may be a DNN or CNN, which may include multiple layers: an input layer for receiving the data inputs (e.g., during the training period), an output layer for generating the outputs (e.g., during the separation period), and one or more hidden layers, each including linear or non-linear computational elements (referred to as neurons) that perform the DNN or CNN computations propagating from the input layer to the output layer, thereby transforming the data inputs into outputs. Two adjacent layers may be connected by connections. Each connection may be associated with a parameter value (referred to as a synaptic weight value) that scales the output value of a neuron in the preceding layer before it is provided as input to one or more neurons in the subsequent layer.
Fig. 5 (described more fully below) illustrates the waveforms and corresponding spectrograms associated with a mixed track of music (e.g., the mixed sound input) separated by the trained machine learning model 308, as well as the separated tracks of vocal, drum, bass, and other sounds. The separation calculation may be performed in accordance with the system 400 shown in fig. 4.
Fig. 4 shows a system 400 for separating and filtering mixed tracks using transform domain sound signals according to one implementation of the present description.
Training data (e.g., a time-domain mixed sound signal) may be processed by the separation unit 404 (e.g., sound separation unit 102A of fig. 1A) in the time domain (e.g., on the raw data), or a forward transform 402 may be used so that the data is processed in the frequency, feature, or time-frequency domain via a Fast Fourier Transform (FFT), a Short-Time Fourier Transform (STFT), a spectrogram, an auditory transform, a wavelet transform, or another transform. The HRTF filters 406 (e.g., those stored in storage unit 104A of fig. 1A) may be implemented as convolution calculations in the time domain, or, to save computation time, as multiplications in the frequency domain, with an inverse transform 408 converting the result back to the time domain. Thus, both soundtrack separation and HRTF filtering can be done in the transform domain.
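Concretely, once the signal is in the STFT domain, both the separation masks and the HRTF filters reduce to element-wise multiplications. The sketch below assumes per-source masks (e.g., from a separation model) and per-source HRTF frequency responses are already available; frame-wise multiplication only approximates convolution, which is acceptable here as a sketch when the filters are short relative to the frame length.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_and_spatialize(mix, masks, hrtf_freq_pairs, fs, n_fft=1024):
    """Separate and spatialize a mono mixture entirely in the STFT domain.

    masks           : list of (bins, frames) arrays from the separation model
    hrtf_freq_pairs : list of (H_left, H_right) frequency responses of length `bins`
    """
    _, _, MIX = stft(mix, fs=fs, nperseg=n_fft)        # bins = n_fft // 2 + 1
    left = np.zeros_like(MIX)
    right = np.zeros_like(MIX)
    for mask, (h_l, h_r) in zip(masks, hrtf_freq_pairs):
        source = mask * MIX                            # separation: multiply by mask
        left += h_l[:, None] * source                  # HRTF: multiply per frequency bin
        right += h_r[:, None] * source
    _, l = istft(left, fs=fs, nperseg=n_fft)           # inverse transform back to time
    _, r = istft(right, fs=fs, nperseg=n_fft)
    return l, r
```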
Figs. 5A-5E illustrate, in waveforms and spectrograms, the original mixed sound and its separation into vocal, drum, bass, and other sounds, respectively, implemented in accordance with the present description.
Shown in fig. 5A are waveforms and corresponding spectrograms (e.g., mixed sound inputs of the system 100A of fig. 1A) associated with mixed tracks of music.
Shown in fig. 5B are waveforms and corresponding spectrograms associated with separate tracks of human voice from a mixed track of music.
Shown in fig. 5C are waveforms and corresponding spectrograms associated with separate tracks from the drumbeats of the mixed track of music.
Shown in fig. 5D are waveforms and corresponding spectrograms associated with separate tracks of bass sounds from a mixed track of music.
Shown in fig. 5E are waveforms and corresponding spectrograms associated with separate tracks of other sounds (e.g., unrecognized sound types) from a mixed track of music.
In one embodiment of the present description, the mixed audio tracks are separated using a trained machine learning model 308. The separation calculation may be performed in accordance with the system 400 described above with respect to fig. 4.
Fig. 6 illustrates far-field speech control of a 3D binaural music system 600 with sound separation implemented in accordance with the present description.
First, the microphone array 602 may pick up a voice command. Preamplifier/analog-to-digital converter (ADC)604 may amplify the analog signal and/or convert it to a digital signal. Both the preamplifier and ADC are optional depending on what kind of microphone is used in the microphone array 602. For example, digital microphones may not require them.
The acoustic beamformer 606 forms one or more acoustic beams to enhance the speech or speech commands and suppress background noise. The acoustic echo canceller (AEC) 608 also uses a reference signal to cancel loudspeaker sound (e.g., from the speaker 630) picked up by the microphone array 602. The reference signal may be picked up by one or more reference microphones 610 near the speaker 630, or obtained from the audio signal (e.g., from the setting/equalizer unit 624) before that signal is sent to the amplifier 628 for the speaker 630. The output of the AEC may then be sent to a noise reduction unit 612 to further reduce background noise.
The processed clean speech is then sent to a wake phrase recognizer 614, which may recognize a wake phrase predefined by the system 600. The system 600 may mute the speaker 630 to further improve the speech quality. Next, an automatic speech recognizer (ASR) 616 may recognize a speech command (e.g., a song or music title) and then instruct the music retrieval unit 618 to retrieve music from the music library 620. In one embodiment, the wake phrase recognizer 614 and the ASR 616 may be combined into one unit. The retrieved music may then be separated by a sound separation unit 622, which may be similar to the sound separation unit 102A in fig. 1A.
Then, the setting/equalizing unit 624 may adjust the volume of each sound source and/or perform equalization (a gain per frequency band, or per instrument or voice) for each track. Finally, as shown in the system 100B of fig. 1B, the separated music tracks may be played as direct 3D sound (via amplifier 628) from the speakers 630, or, as shown in the system 100A of fig. 1A, HRTF filters 626 may be used to process the separated tracks to generate binaural sound.
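One simple way to realize such a setting/equalizing stage is an overall per-track gain plus FFT-domain band gains, as sketched below. The band edges, the use of an FFT-domain weighting rather than conventional filter banks, and the example values are assumptions for illustration.

```python
import numpy as np

def equalize_track(track, fs, band_gains_db, track_gain_db=0.0):
    """Apply an overall track gain and per-band gains via FFT-domain weighting.

    band_gains_db: list of (low_hz, high_hz, gain_db) tuples, e.g. a bass boost.
    """
    spec = np.fft.rfft(track)
    freqs = np.fft.rfftfreq(len(track), d=1.0 / fs)
    weights = np.ones_like(freqs)
    for low, high, gain_db in band_gains_db:
        weights[(freqs >= low) & (freqs < high)] = 10.0 ** (gain_db / 20.0)
    out = np.fft.irfft(spec * weights, n=len(track))
    return out * 10.0 ** (track_gain_db / 20.0)

# Example: boost the bass track's low end and raise its overall level slightly.
# bass = equalize_track(bass, fs=44100, band_gains_db=[(20, 250, 6.0)], track_gain_db=2.0)
```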
FIGS. 7A-7D illustrate a GUI 700 for user setting of 3D sound, with the selected listener position within the band (figs. 7A-7C) and in front of the band (fig. 7D), respectively, implemented in accordance with the present description.
In one implementation, the GUI 700 may be arranged such that all sound sources (e.g., from a music band on a stage) are represented by band member icons on a virtual stage, and the listener is represented by a listener head icon (wearing headphones to highlight the positions of the left and right ears) that may be freely moved around the stage by the user of the GUI 700. In another embodiment, all of the icons in FIGS. 7A-7D may be freely moved on the stage by user touch of the GUI 700.
In fig. 7A, when the listener head icon is placed in the center of the virtual stage, the listener can hear binaural sound and perceive the sound field: the vocal sounds are perceived as coming from the front, the drumbeats from the right, the bass sounds from the back, and other instruments (e.g., keyboard instruments) from the left.
In fig. 7B, when the listener's head icon is placed over the band drummer icon, the listener will be able to hear the separate drum solo track.
In fig. 7C, when the listener's head icon is placed close to the drummer and bassist icons, the sounds of the drums and bass may be enhanced (e.g., increased in volume) while the sounds from the other sources (e.g., the vocals and others) may be relatively decreased (e.g., lowered in volume); thus, the listener may experience enhanced bass and beat effects through these settings via GUI 700.
In fig. 7D, another virtual 3D sound field setup is shown. In this arrangement, the listener can virtually feel and hear that the band is in front of him or her, even though this is not the case in real-world music stage recordings. The positions of all band member icons and listener head icons can be moved anywhere on the display of GUI 700 to set up and alter the virtual sound field and listening experience.
GUI 700 may also be adapted for use with a remote control to control a television with a direct 3D sound system, or for other similar applications. For example, when a user is watching a movie, she or he may move the listener's head icon closer to the vocal icon so that the volume of the dialogue is enhanced while the volume of other background sounds (e.g., music) is lowered, allowing the user to hear the dialogue more clearly.
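A sketch of how icon positions on such a GUI could be turned into the azimuth and range that drive HRTF selection and volume is given below. The pixel-to-meter scaling and the convention that upward on the screen is straight ahead of the listener icon are assumptions made for the example.

```python
import math

def icon_to_polar(listener_xy, source_xy, heading_deg=0.0, pixels_per_meter=50.0):
    """Convert two GUI icon positions into (azimuth_deg, range_m) for HRTF lookup.

    With heading_deg = 0, upward on the screen is taken as straight ahead of the
    listener icon, and azimuth increases clockwise (a source to the right is 90).
    """
    dx = source_xy[0] - listener_xy[0]
    dy = listener_xy[1] - source_xy[1]          # screen y grows downward
    azimuth = (math.degrees(math.atan2(dx, dy)) - heading_deg) % 360.0
    range_m = math.hypot(dx, dy) / pixels_per_meter
    return azimuth, range_m

# Dragging the drummer icon to the listener's right at the same height:
# icon_to_polar((200, 300), (300, 300)) -> (90.0, 2.0)
```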
Fig. 8 illustrates a system 800 for generating 3D sound with a microphone array 802 according to one implementation of the present description.
System 800 may be described as a 3D microphone system that can directly pick up and output 3D and binaural sound. As mentioned herein, a 3D microphone system may comprise a microphone array system that can pick up sound from different directions together with spatial information about the locations of the sound sources. The system 800 can produce two outputs: (1) a plurality of tracks, each corresponding to sound from one direction, where each of the plurality of tracks can drive one speaker or a set of speakers to generate a three-dimensional sound field; and (2) binaural L and R tracks for earbuds or headphones, representing the 3D sound field in virtual form.
Each microphone of the microphone array 802 may have its signal processed by the preamplifier/ADC unit 804. The preamplifier and analog-to-digital converter (ADC) may amplify the analog signal and/or convert it to a digital signal. Both the preamplifier and the ADC are optional, depending on the microphone components selected for the microphone array 802. For example, they may not be necessary for digital microphones.
The acoustic beamformer 806 may simultaneously form acoustic beam patterns that point in different directions or toward different sound sources, as shown in fig. 9B. Each beam enhances sound from its "look" direction while suppressing sound from other directions, thereby improving the signal-to-noise ratio (SNR) and isolating the sound from the "look" direction from the sound from other directions. A noise reduction unit 808 may further reduce the background noise at the beamformer output if desired. The output of the beamformer may include a plurality of tracks corresponding to sounds from different directions.
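For illustration, the sketch below shows a basic delay-and-sum beamformer for a uniform linear array, the simplest way to realize one such "look" direction. The array geometry, the far-field plane-wave assumption, and the use of integer-sample delays are simplifications; a practical beamformer would typically use fractional delays or frequency-domain weights.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(mic_signals, mic_positions_m, look_direction_deg, fs):
    """Steer a beam toward look_direction_deg (0 = broadside of the assumed array).

    mic_signals     : (n_mics, n_samples) array of simultaneously sampled data
    mic_positions_m : microphone positions along the array axis, in meters
    """
    angle = np.radians(look_direction_deg)
    delays_s = np.asarray(mic_positions_m) * np.sin(angle) / SPEED_OF_SOUND
    shifts = np.round((delays_s - delays_s.min()) * fs).astype(int)
    n = mic_signals.shape[1]
    out = np.zeros(n)
    for sig, s in zip(mic_signals, shifts):
        out[: n - s] += sig[s:]                 # time-align each channel to the look direction
    return out / mic_signals.shape[0]
```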
To generate direct 3D sound, multiple tracks may drive multiple amplifiers and speakers to construct a 3D sound field for the listener.
To generate binaural outputs, multiple tracks may pass through multiple pairs of selected HRTF filters 810 to convert spatial tracks to binaural sound. The HRTF filters may be selected based on user settings (e.g., via the output audio setting unit 814) or based on the actual spatial location of the sound source in the real world. Further, mixer 812 may then combine the HRTF outputs into a pair of binaural outputs for the left and right ears, respectively. The final binaural output represents the 3D sound field recorded by the microphone array 802.
In a special case of the 3D microphone, the microphone array 802 forms only two acoustic beam patterns, pointing to the left and to the right, respectively; the microphone array then works as a stereo microphone.
Fig. 9A-9B illustrate beam patterns of a 3D microphone 902 and a 3D microphone array 904, respectively, with spatial noise cancellation implemented in accordance with the present description.
Fig. 9A shows a beam pattern of a 3D microphone 902, which can pick up sound from different directions and spatial information about a sound source.
Fig. 9B shows a microphone array 904 (e.g., comprising a plurality of microphones 902) having beam patterns A and B formed by beamformers A and B, respectively; the microphone array 904 is arranged to pick up sound from two different sound sources A and B. Sound picked up from sound source A in the "look" direction of one acoustic beam (e.g., beam pattern A) is often mixed with sound picked up from other directions (e.g., the direction of sound source B). To cancel sound from the other directions, the 3D microphone array 904 may form one or more additional beam patterns, such as beam pattern B, using the same microphone array 904. The sound picked up by beam pattern B can be used to cancel the unwanted sound mixed into the sound picked up by beam pattern A. The sound from the direction of sound source B that has been mixed with the sound from the "look" direction of beam pattern A can then be removed from the output of beam pattern A. The cancellation algorithm may be provided by an acoustic echo canceller (AEC) unit 906.
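Such cancellation is commonly realized with an adaptive filter that treats the beam-B output as the reference signal and the beam-A output as the primary signal. The sketch below uses a normalized LMS (NLMS) update; the filter length and step size are assumptions, and the patent does not specify a particular adaptive algorithm.

```python
import numpy as np

def nlms_cancel(primary, reference, taps=256, mu=0.5, eps=1e-8):
    """Remove the component of `reference` (beam B) that leaks into `primary` (beam A).

    Returns the error signal, i.e. beam A with the beam-B sound cancelled.
    """
    w = np.zeros(taps)                           # adaptive filter weights
    out = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]          # most recent reference samples
        y = w @ x                                # estimate of the leaked beam-B signal
        e = primary[n] - y                       # cancelled output sample
        w += mu * e * x / (x @ x + eps)          # NLMS weight update
        out[n] = e
    return out
```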
FIG. 10 illustrates a conferencing system 1000 for generating three-dimensional sound implemented in accordance with the present description.
The conferencing system 1000 may include a signal processing and computation unit 1002, a repository 1004 of head-related transfer function (HRTF) filters, a display unit 1006 with a graphical user interface (GUI), an amplifier 1008, a headset or headphones 1010, and speakers 1012. For example, the system 1000 may be implemented as software on a user's laptop, tablet, computer, or smartphone with a headset connected to it. Video and audio conferencing (hereinafter "conferencing") may also be referred to as teleconferencing, virtual conferencing, web seminars, or video conferencing. One such meeting may include multiple local and/or multiple remote attendees. In one implementation, the attendees may be connected through the Internet and a telephone network 1014. In another implementation, the conference may be controlled by a cloud server or a remote server via the Internet and the telephone network 1014.
A user of system 1000 may be one of the attendees of a meeting or virtual concert. She or he is the owner of a laptop, tablet, computer or smartphone running conferencing software with video and audio, and may be wearing a headset 1010. The term "speaker" or "attendee" refers to a person participating in a meeting. The speaker 1012 may be any device that can convert audio signals into audible sound. Within the amplifier 1008 may be electronics or circuitry for increasing the signal power to drive a speaker 1012 or a headset 1010. The headset 1010 may be a headphone, earmuff, or in-ear audio device.
The input signal (e.g., from the cloud via 1014) may include video, audio, and Identification (ID) of the speaker. The speaker's ID may associate video and audio inputs to the attending person who is speaking. When there is no speaker ID, a new speaker ID may be generated by speaker ID unit 1016, as described below.
Speaker ID unit 1016 may obtain a speaker ID from conferencing software based on the speaker ID for the speaker's video conferencing session. Further, the speaker ID unit 1016 may obtain the speaker ID from a microphone array (e.g., the microphone array 802 of fig. 8 or 904 of fig. 9). For example, the microphone array beam patterns (e.g., beam patterns a and B) in fig. 9B may detect the direction of a speaker relative to the microphone array. Based on the detected direction, system 1000 can detect a speaker ID. Still further, speaker ID unit 1016 may obtain the speaker ID based on a speaker ID algorithm. For example, based on a track consisting of the voices of multiple speakers, a speaker ID system may have two periods: training and inference. During training, the speech of each speaker is used to train speaker-dependent models, one for each speaker, using the available labels. If not tagged, the speaker ID system may first perform unsupervised training, then annotate the speech from the soundtrack with a speaker ID, followed by supervised training to generate a model for each speaker. During inference, given the conference audio, the speaker recognition unit 1016 may process the input sound using a trained model and recognize the corresponding speaker. The model may be a Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), DNN, CNN, LSTM, or RNN.
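As one concrete form of the GMM variant of such a speaker-ID system, the sketch below fits one Gaussian mixture model per speaker on that speaker's feature frames and, at inference time, picks the model with the highest likelihood. The use of MFCC-style feature frames, the number of mixture components, and the scikit-learn implementation are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=8):
    """Fit one GMM per speaker on that speaker's feature frames (e.g. MFCCs)."""
    models = {}
    for speaker_id, frames in features_by_speaker.items():   # frames: (n_frames, n_dims)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[speaker_id] = gmm.fit(frames)
    return models

def identify_speaker(models, frames):
    """Return the speaker whose model gives the highest average log-likelihood."""
    return max(models, key=lambda sid: models[sid].score(frames))
```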
For an attendee who is speaking, the video window associated with that attendee may be visually highlighted in the display/GUI 1006 so the user knows which attendee of the meeting is speaking, such as attendee 2 in fig. 11 described below. Based on the speaker's position, e.g., at a 50-degree angle from the user, the system 1000 may retrieve a pair of corresponding HRTF filters from the pre-stored database or memory 1004. The signal processing unit 1002 may perform a convolution calculation on the input mono signal using the HRTF filters from the pre-stored database or memory 1004. The output of the signal processing and calculation unit 1002 may have two channels of binaural sound, for the left and right ears, respectively. The user or attendee may wear the headset unit 1010 in order to hear the binaural sound and experience the 3D sound effect. For example, even when the user is not looking at the display 1006 but is wearing the headset 1010, she or he can still perceive from the 3D sound which attendee is speaking, so that the user can feel as if she or he were in a real meeting room.
When multiple displays/GUIs 1006 and multiple speakers 1012 are used in a real conference room, each speaker 1012 may be dedicated to the voice of one conference speaker shown in one display/GUI 1006 at one location. In this case, the user does not need to use the headset 1010, and she or he can experience 3D sound from the speakers 1012. The plurality of speakers may be placed in a home theater, movie theater, sound bar, television, smart speaker, smartphone, mobile device, handheld device, laptop, PC, automobile, or any location having more than one speaker or sound generator.
Fig. 11 illustrates a virtual conference room 1100 displayed by a GUI1006 of a conferencing system 1000 for generating three-dimensional sound according to an implementation of the invention.
The virtual meeting room 1100 can have multiple windows that include the videos of the user and the meeting attendees (1102-1112). The locations of the windows (1102-1112) may be arranged by the conferencing software (e.g., running on a laptop) or by the user (e.g., via the display/GUI 1006 of fig. 10). For example, the user may move the windows (1102-1112) around to arrange the virtual meeting room 1100. In one embodiment, the center of the conference room 1100 may include a virtual conference table.
As noted above, the virtual meeting room 1100 can also be set up by the user such that the attendees' video windows (1104-1112) can be virtually placed anywhere in the virtual meeting room 1100 using a mouse, keypad, touch screen, or the like. Based on the location of a speaker (e.g., attendee 2) relative to the user (e.g., from attendee 2's video window 1106 to the user's video window 1102), the relevant HRTF filters may be selected and automatically applied to attendee 2's audio when they speak.
Method
For purposes of simplicity of explanation, the methodologies of the present description are depicted and described as a series of acts. However, acts in accordance with this description may occur in various orders and/or concurrently, and with other acts not presented and described herein. Moreover, not all illustrated acts may be required to implement a methodology in accordance with the described subject matter. In addition, those skilled in the art will understand and appreciate that the methodologies could alternatively be represented as a series of interrelated states via a state diagram or events. In addition, it should be appreciated that the methodologies described in this specification are capable of being stored on an article of manufacture to facilitate transporting and transferring such methodologies to computing devices. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device or storage media.
The described methods may be performed by a processing device (e.g., circuitry, dedicated logic), which may comprise hardware, computer readable instructions (e.g., as run on a general purpose computer system or a dedicated machine), or a combination of both. The method and its individual functions, routines, subroutines, or operations may be performed by one or more processors of a computer device executing the method. In some embodiments, the method may be performed by a single processing thread. Alternatively, the method may be performed by two or more processing threads, each thread performing one or more separate functions, routines, subroutines, or operations.
FIG. 12 illustrates a method 1200 for generating three-dimensional sound according to one implementation of the invention.
In one implementation, the method 1200 may be performed by a signal processing unit of the system 100A of fig. 1A or the subsystem 100B of fig. 1B.
At 1202, the method includes receiving a specification of a three-dimensional space (e.g., 200A of fig. 2A) and head-related transfer function (HRTF) filters (e.g., 200B of fig. 2B) defined on a grid in the three-dimensional space, wherein the three-dimensional space is presented in a user interface (e.g., GUI 110A of fig. 1A) of a user interface device.
At 1204, the method includes determining (e.g., by the sound separation unit 102A of fig. 1A) a plurality of audio tracks (e.g., separated audio tracks), wherein each of the plurality of audio tracks corresponds to a respective sound source (e.g., a human voice).
At 1206, the method includes representing a listener (e.g., the listener 202A of fig. 2A) and sound sources of a plurality of tracks (e.g., the sound source 204A of fig. 2A) in three-dimensional space.
At 1208, the method includes, corresponding to user settings (e.g., via the GUI 110A of fig. 1A) for at least one of a listener position or a sound source position in the three-dimensional space, generating a plurality of HRTF filters (e.g., 200B of fig. 2B) based on a network of HRTF filters (e.g., stored in the storage unit 104A of fig. 1A) and the locations of the sound sources and the listener in the three-dimensional space.
At 1210, the method includes applying each filter of the plurality of HRTF filters (e.g., 200B of fig. 2B) to a respective one of the plurality of separated audio tracks to generate a plurality of filtered audio tracks; and
at 1212, the method includes generating three-dimensional sound based on the filtered audio tracks.
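A minimal end-to-end sketch of steps 1208-1212 is given below, assuming the network of HRTF filters is stored as an array of grid directions with one left/right impulse-response pair per point. A nearest-neighbor lookup stands in for the grid-index computation, distance-dependent gain is omitted for brevity, and all names are illustrative.

```python
# Sketch of steps 1208-1212 under the stated assumptions: the HRTF network is
# an array of grid directions with one (left, right) impulse-response pair per
# point; a nearest-neighbor lookup replaces the grid-index computation.
import numpy as np
from scipy.signal import fftconvolve

def select_hrtf(grid_points, hrir_pairs, listener_pos, source_pos):
    # grid_points: (N, 3) array of unit direction vectors;
    # hrir_pairs: list of (left, right) impulse responses, one per grid point
    rel = np.asarray(source_pos, float) - np.asarray(listener_pos, float)
    direction = rel / (np.linalg.norm(rel) + 1e-9)
    idx = int(np.argmin(np.linalg.norm(grid_points - direction, axis=1)))
    return hrir_pairs[idx]                                           # step 1208

def render_3d(tracks, source_positions, listener_pos, grid_points, hrir_pairs):
    # tracks: list of 1-D mono arrays, one per separated sound source
    max_h = max(len(h) for pair in hrir_pairs for h in pair)
    n = max(len(t) for t in tracks) + max_h - 1
    mix = np.zeros((n, 2))
    for track, pos in zip(tracks, source_positions):
        hl, hr = select_hrtf(grid_points, hrir_pairs, listener_pos, pos)
        mix[:len(track) + len(hl) - 1, 0] += fftconvolve(track, hl)  # step 1210
        mix[:len(track) + len(hr) - 1, 1] += fftconvolve(track, hr)
    return mix / (np.max(np.abs(mix)) + 1e-9)                        # step 1212
```

For example, given two separated tracks and their source positions, render_3d returns a two-channel array that can be played back over headphones as binaural sound.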
FIG. 13 illustrates a method 1300 for generating three-dimensional sound according to one embodiment of the present description.
At 1302, the method includes picking up sound from a plurality of sound sources with a microphone array (e.g., microphone array 802 of fig. 8) that includes a plurality of microphones (e.g., microphone 902 of fig. 9A).
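A conventional delay-and-sum beamformer is one way to realize the directional pickup of step 1302. The sketch below is illustrative only; the array geometry, sample rate, and steering direction are assumed values rather than parameters of the present description.

```python
# Illustrative delay-and-sum beamformer for the microphone-array pickup of
# step 1302. Array geometry, sample rate, and steering direction are assumed.
import numpy as np

def delay_and_sum(mic_signals, mic_positions, direction, fs=16000, c=343.0):
    # mic_signals: (num_mics, num_samples); mic_positions: (num_mics, 3) in meters;
    # direction: vector pointing from the array toward the desired sound source
    direction = np.asarray(direction, float)
    direction /= np.linalg.norm(direction)
    delays = mic_positions @ direction / c   # arrival-time lead of each mic (s)
    delays -= delays.min()                   # delay needed to align with the latest arrival
    n = mic_signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for sig, d in zip(mic_signals, delays):
        aligned = np.fft.irfft(np.fft.rfft(sig) * np.exp(-2j * np.pi * freqs * d), n=n)
        out += aligned                       # fractional-delay time alignment
    return out / len(mic_signals)
```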
At 1304, the method includes rendering three-dimensional sound with one or more speakers (e.g., speaker 108B of fig. 1B).
At 1306, the method includes removing echoes in the plurality of tracks with an acoustic echo cancellation unit (e.g., AEC 608 of fig. 6).
At 1308, the method includes reducing noise components in the plurality of audio tracks with a noise reduction unit (e.g., noise reduction unit 612 of fig. 6).
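The noise reduction of step 1308 could, for example, be approximated by simple spectral subtraction, as sketched below. The noise estimate taken from the first few frames, the frame and hop sizes, and the subtraction floor are assumptions; the present description does not mandate a particular noise-reduction method.

```python
# Spectral-subtraction sketch for the noise-reduction step 1308. The noise
# estimate from the first frames, frame/hop sizes, and the spectral floor are
# illustrative assumptions only.
import numpy as np

def spectral_subtract(signal, frame=512, hop=256, noise_frames=10, floor=0.05):
    window = np.hanning(frame)
    n_frames = 1 + (len(signal) - frame) // hop
    spectra = np.array([np.fft.rfft(window * signal[i * hop:i * hop + frame])
                        for i in range(n_frames)])
    noise_mag = np.mean(np.abs(spectra[:noise_frames]), axis=0)   # noise estimate
    mag = np.maximum(np.abs(spectra) - noise_mag, floor * np.abs(spectra))
    clean = mag * np.exp(1j * np.angle(spectra))                  # keep noisy phase
    out = np.zeros(len(signal))
    for i, spec in enumerate(clean):                              # overlap-add
        out[i * hop:i * hop + frame] += np.fft.irfft(spec, n=frame)
    return out
```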
At 1310, the method includes processing the plurality of audio tracks with a sound equalizer unit (e.g., the setup/equalizer unit 624 of fig. 6).
At 1312, the method includes picking up a reference signal with a reference sound pickup circuit (e.g., reference microphone 610 of fig. 6) located near the one or more speakers (e.g., speaker 630 of fig. 6), wherein the acoustic echo cancellation unit (e.g., AEC 608 of fig. 6) removes echo using the picked-up reference signal.
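The echo cancellation of steps 1306 and 1312 can be illustrated with a normalized LMS (NLMS) adaptive filter that estimates the echo of the loudspeaker reference signal present in the microphone signal and subtracts it. The filter length and step size below are illustrative assumptions.

```python
# Normalized-LMS echo-cancellation sketch for steps 1306/1312: an adaptive
# filter estimates the echo of the loudspeaker reference signal present in the
# microphone signal and subtracts it. Filter length and step size are assumed.
import numpy as np

def nlms_echo_cancel(mic, reference, taps=256, mu=0.1, eps=1e-6):
    # mic, reference: 1-D float arrays of equal length; returns echo-reduced signal
    w = np.zeros(taps)                          # adaptive estimate of the echo path
    buf = np.zeros(taps)                        # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        e = mic[n] - w @ buf                    # error = mic minus estimated echo
        out[n] = e
        w += mu * e * buf / (buf @ buf + eps)   # normalized LMS update
    return out
```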
At 1314, the method includes recognizing the voice command with a voice recognition unit (e.g., voice recognizer 616 of fig. 6).
Hardware
FIG. 14 depicts a block diagram of a computer system 1400 that operates according to one or more aspects of the present description. In various examples, the computer system 1400 may correspond to any signal processing unit/apparatus described in relation to systems presented herein (such as the system 100A of fig. 1A or the system 100B of fig. 1B).
In some implementations, the computer system 1400 can be connected to other computer systems (e.g., via a network, such as a Local Area Network (LAN), an intranet, an extranet, or the internet). Computer system 1400 can execute in the capacity of a server or a client computer in a client-server environment, or as a peer computer in a peer-to-peer or distributed network environment. Computer system 1400 may be provided by: a Personal Computer (PC), a tablet, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a Web appliance, a server, a network router, switch, or bridge, a computing device in a vehicle, home, room, or office, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device. Furthermore, the term "computer" shall include any collection of computers, processors, chips, or SoCs that individually or collectively execute a set (or multiple sets) of instructions to perform any one or more of the methodologies described herein.
In another implementation, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in server-client or cloud network environment, or as a peer computer in a peer-to-peer (or distributed) network environment. The machine may be an in-vehicle system, a wearable device, a Personal Computer (PC), a tablet PC, a hybrid tablet PC, a Personal Digital Assistant (PDA), a mobile telephone, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. Similarly, the term "processor-based system" shall be taken to include any collection of one or more machines controlled or operated by a processor (e.g., a chip, computer, or cloud server) to individually or collectively execute instructions to perform one or more of the methodologies discussed herein.
The example computer system 1400 includes at least one processor 1402 (e.g., a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) or both, a processor core, a computing node, a cloud server, etc.), a main memory 1404, and a static memory 1406, which communicate with each other via a link 1408 (e.g., a bus). The computer system 1400 may further include a video display unit 1410, an alphanumeric input device 1412 (e.g., a keyboard), and a User Interface (UI) navigation device 1414 (e.g., a mouse). In one implementation, the video display unit 1410, the input device 1412, and the UI navigation device 1414 are incorporated into a touch screen display. The computer system 1400 may additionally include a storage device 1416 (e.g., a drive unit), a sound generation device 1418 (e.g., a speaker), a network interface device 1420, and one or more sensors 1422 such as, for example, a Global Positioning System (GPS) sensor, an accelerometer, a gyrometer, a position sensor, a motion sensor, a magnetometer, or other sensors.
The storage device 1416 includes a machine-readable medium 1424 on which is stored one or more sets of data structures and instructions 1426 (e.g., software) embodying or utilized by one or more of the methods or functions described herein. The instructions 1426 may also reside, completely or at least partially, within the main memory 1404, the static memory 1406, and/or the processor 1402 during execution thereof by the computer system 1400, with the main memory 1404, the static memory 1406, and the processor 1402 also constituting machine-readable media.
While the machine-readable medium 1424 is shown in an example embodiment to be a single medium, the term "machine-readable medium" may include a single medium or multiple media (e.g., a centralized, cloud, or distributed database, and/or associated caches and servers) that store the one or more instructions 1426. The term "machine-readable medium" shall also be taken to include any tangible medium that is capable of storing, encoding or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present description, or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. Particular examples of a machine-readable medium include volatile or non-volatile memory, including, but not limited to, semiconductor memory devices (e.g., electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM)) and flash memory devices, to name a few; magnetic disks such as internal hard disks and removable disks; and CD-ROM and DVD-ROM disks.
The instructions 1426 may further be transmitted or received over a communication network 1428 using a transmission medium via the network interface device 1420 using any of a variety of well-known transmission protocols (e.g., HTTP). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), the Internet, mobile telephone networks, Plain Old Telephone Service (POTS) networks, and wireless data networks (e.g., Bluetooth, Wi-Fi, 3G, and 4G LTE/LTE-A or WiMAX networks). The term "transmission medium" shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog signals or other intangible medium to facilitate communication of such software.
The example computer system 1400 may also include an input/output controller 1430 to receive input and output requests from the at least one central processor 1402 and then send device-specific control signals to the devices it controls. The input/output controller 1430 may free the at least one central processor 1402 from having to deal with the details of controlling each individual kind of device.
Language(s)
Unless specifically stated otherwise, terms such as "receiving," "associating," "determining," "updating," or the like, refer to the action and processes performed or effected by a computer system that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. Also, as used herein, the terms "first," "second," "third," "fourth," and the like are intended to be used as labels to distinguish between different elements, and may not have an ordinal numerical meaning according to their numerical designation.
Examples described herein also relate to an apparatus for performing the methods described herein. The apparatus may be specially constructed for carrying out the methods described herein, or it may comprise a general purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program may be stored in a tangible storage medium readable by a computer.
The methods and illustrative examples described herein are not specific to any particular computer or other apparatus. Various general purpose systems may be used with the teachings described herein, or it may prove convenient to construct a more specialized apparatus to perform the described methods and/or each of their individual functions, routines, subroutines, or operations. Examples of structures for various of these systems are set forth in the description above.
The above description is illustrative and not restrictive. While the present description has been described with reference to particular illustrative examples and embodiments, it will be appreciated that the description is not limited to the examples and embodiments described. The scope of the description should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (22)

1. A three-dimensional sound generation system comprising:
a user interface device; and
processing means connected to the user interface means for:
receiving a specification of a three-dimensional space and a distribution network of filters defined on a grid in the three-dimensional space, wherein the three-dimensional space is presentable in a user interface of the user interface device;
determining one or more audio tracks, wherein each of the one or more audio tracks is associated with a respective sound source;
representing a listener and respective sound sources of the one or more audio tracks in the three-dimensional space;
selecting a plurality of filters from the filter distribution network corresponding to user settings of at least one of a listener position or a sound source position in the three-dimensional space;
applying each of the plurality of filters to a respective audio track of the one or more audio tracks to generate one or more filtered audio tracks; and
generating three-dimensional sound from the one or more filtered audio tracks.
2. The sound generation system according to claim 1, wherein each of the one or more audio tracks was previously separated from a mixed audio track, or,
wherein the processing device is configured to:
receive a mixed sound stream;
separate the sound stream into the one or more audio tracks using a machine learning model; and
determine a sound type for each of the one or more audio tracks using the machine learning model.
3. The sound generation system of claim 2, wherein the sound type is a voice sound, singing sound, musical instrument sound, car sound, helicopter sound, airplane sound, vehicle sound, gunshot sound, or other ambient noise.
4. The sound generation system of claim 1, further comprising:
a microphone array comprising a plurality of microphones, wherein the processing device is configured to:
implement a plurality of beamformers, wherein each of the plurality of beamformers is arranged to pick up sound from a respective sound source in a respective direction with the microphone array; and
generate the one or more audio tracks, each corresponding to an output of one of the plurality of beamformers.
5. The sound generation system of claim 1, wherein, to represent a respective sound source of one or more audio tracks in three-dimensional space, for each audio track, the processing device is to:
present, on the user interface, an icon representing the labeled sound type at a position in the three-dimensional space set by the user, wherein the icon may be a graphic, an image of the corresponding sound source, a video of the corresponding sound source, or an animated representation of the corresponding sound source.
6. The sound generation system of claim 1, wherein the filters in the distribution network of filters are head-related transfer function (HRTF) filters, wherein the distribution network of HRTF filters contains an array of pre-computed left and right HRTF filter pairs defined on a grid in the three-dimensional space, and each point on the grid is associated with a left and right HRTF filter pair, and wherein the processing device is configured to select the plurality of HRTF filters by, for each respective sound source:
determining a relative position and distance between a sound source and a listener;
determining an index or point of the grid from which to retrieve the corresponding HRTF filters by applying an activation function to the relative position and distance between the sound source and the listener;
selecting the left and right HRTF filter pair associated with the determined point of the grid as one of the plurality of HRTF filters; and
repeating the above operations in response to movement of the listener or the sound source.
7. The sound generation system of claim 6 wherein to apply each of the plurality of HRTF filters to a respective one of the one or more audio tracks to generate one or more filtered audio tracks, the processing device applies a left HRTF filter to generate a left filtered audio track for a left channel receptor and a right HRTF filter to generate a right filtered audio track for a right channel receptor, the left and right receptors to transmit output signals to headphones, or speakers.
8. The sound generation system of claim 1 wherein to generate three-dimensional sound from the one or more filtered tracks, the processing device is to generate binaural sound for headphones, or speakers.
9. The sound generation system of claim 1 wherein to generate three-dimensional sound from the one or more filtered tracks, the processing device is to provide the one or more filtered tracks to a plurality of speakers.
10. The sound generation system of claim 1, wherein the sound generation system is a conference or virtual concert system, wherein each of the respective sound sources comprises a conference or concert participant, and wherein the processing device is to:
present the respective sound source positions in a three-dimensional space representing a virtual meeting room or a virtual concert.
11. The sound generation system of claim 1, further comprising at least one of:
a microphone array including a plurality of microphones for picking up sounds from a plurality of sound sources in different directions;
one or more speakers for rendering three-dimensional sound;
an acoustic echo cancellation unit for removing echo in one or more tracks;
a noise reduction unit for reducing noise components in one or more tracks;
a set of sound equalizer units for processing one or more audio track outputs;
a reference sound pickup circuit located in proximity to the one or more speakers for picking up a reference signal, wherein an acoustic echo cancellation unit is for canceling echo based on the picked-up reference signal; and
a voice recognition unit for recognizing voice commands.
12. The sound generation system of claim 1, wherein the filters on the distribution network comprise at least one of HRTF filters, all-pass filters, or equalization filters.
13. A computer-implemented method for generating three-dimensional sound, comprising:
receiving a specification of a three-dimensional space and a distribution network of filters defined on a grid in the three-dimensional space, wherein the three-dimensional space is presented in a user interface of a user interface device;
determining one or more audio tracks, wherein each of the one or more audio tracks is associated with a respective sound source;
representing a listener and respective sound sources of the one or more audio tracks in three-dimensional space;
selecting a plurality of filters from the filter distribution network corresponding to user settings of at least one of a listener position or a sound source position in said three-dimensional space;
applying each of a plurality of filters to a respective audio track of the one or more audio tracks to generate one or more filtered audio tracks; and
generating three-dimensional sound from the one or more filtered audio tracks.
14. The computer-implemented method of claim 13, wherein each of the one or more audio tracks was previously separated from a mixed audio track, wherein the method further comprises:
receiving a pre-mixed sound stream containing one or more audio tracks;
separating the sound stream into one or more audio tracks using a machine learning model; and
determining a sound type for each of the one or more audio tracks using the machine learning model.
15. The computer-implemented method of claim 14, wherein the sound type is one of a voice sound, a human sound, a musical instrument sound, an automobile sound, a helicopter sound, a vehicle sound, a gunshot sound, an airplane sound, or an ambient noise.
16. The computer-implemented method of claim 13, further comprising:
implementing a plurality of beamformers, wherein each of the plurality of beamformers is arranged to pick up sound from a respective sound source in a respective direction with a microphone array; and
generating one or more audio tracks, each corresponding to an output of one of the plurality of beamformers.
17. The computer-implemented method of claim 13, wherein representing the respective sound sources of the one or more audio tracks in the three-dimensional space further comprises:
presenting, on the user interface, an icon representing the labeled sound type at a position in the three-dimensional space set by the user, wherein the icon may be a graphic, an image of the corresponding sound source, a video of the corresponding sound source, or an animated representation of the corresponding sound source.
18. The computer-implemented method of claim 13, wherein the filters in the distribution network of filters are head-related transfer function (HRTF) filters, wherein the distribution network of HRTF filters comprises an array of pre-computed left and right HRTF filter pairs defined on a grid in the three-dimensional space, and each point of the grid is associated with a left and right HRTF filter pair, and wherein selecting the plurality of HRTF filters further comprises, for each respective sound source:
determining a relative position and distance between a sound source and a listener;
determining a point of the grid by applying an activation function to the relative position and distance between the sound source and the listener;
selecting the left and right HRTF filter pair associated with the determined point of the grid as one of the plurality of HRTF filters; and
repeating the above process in response to movement of the listener or the sound source.
19. The computer-implemented method of claim 18, wherein applying each of the plurality of HRTF filters to a respective one of the one or more audio tracks to generate one or more filtered audio tracks further comprises applying a left HRTF filter to generate a left filtered audio track for a left channel receptor and a right HRTF filter to generate a right filtered audio track for a right channel receptor, the left and right sound receptors to transmit output signals to headphones, or speakers.
20. The computer-implemented method of claim 13, wherein generating three-dimensional sound from the one or more filtered audio tracks further comprises providing one or more HRTF filtered or all-pass filtered audio tracks to a plurality of speakers.
21. The computer-implemented method of claim 13, wherein each respective sound source comprises a participant of a conference or virtual concert, further comprising:
presenting the respective sound source positions in a three-dimensional space representing a virtual meeting room or a virtual concert.
22. The computer-implemented method of claim 13, further comprising at least one of:
picking up sounds from a plurality of sound sources in different directions using a microphone array including a plurality of microphones;
rendering three-dimensional sound with one or more speakers;
removing echo in one or more tracks with an acoustic echo cancellation unit;
reducing noise components in one or more audio tracks using a noise reduction unit;
processing one or more audio track outputs using a set of sound equalizer units;
picking up a reference signal with a reference sound pickup circuit located in the vicinity of one or more speakers, wherein an acoustic echo cancellation unit is configured to cancel echo based on the picked-up reference signal; or
recognizing a voice command with a voice recognition unit.
CN202110585625.9A 2020-06-09 2021-05-27 Three-dimensional audio system Pending CN113784274A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063036797P 2020-06-09 2020-06-09
US63/036797 2020-06-09
US17/227067 2021-04-09
US17/227,067 US11240621B2 (en) 2020-04-11 2021-04-09 Three-dimensional audio systems

Publications (1)

Publication Number Publication Date
CN113784274A true CN113784274A (en) 2021-12-10

Family

ID=78835713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110585625.9A Pending CN113784274A (en) 2020-06-09 2021-05-27 Three-dimensional audio system

Country Status (1)

Country Link
CN (1) CN113784274A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140198918A1 (en) * 2012-01-17 2014-07-17 Qi Li Configurable Three-dimensional Sound System
US20170034641A1 (en) * 2015-07-31 2017-02-02 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for utilizing adaptive rectangular decomposition (ard) to generate head-related transfer functions
US9584653B1 (en) * 2016-04-10 2017-02-28 Philip Scott Lyren Smartphone with user interface to externally localize telephone calls
US20180359294A1 (en) * 2017-06-13 2018-12-13 Apple Inc. Intelligent augmented audio conference calling using headphones
US9900555B1 (en) * 2017-06-27 2018-02-20 The Florida International University Board Of Trustees VRT: virtual round table
WO2019176950A1 (en) * 2018-03-14 2019-09-19 Casio Computer Co., Ltd. Machine learning method, audio source separation apparatus, audio source separation method, electronic instrument and audio source separation model generation apparatus

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022179440A1 (en) * 2021-02-28 2022-09-01 International Business Machines Corporation Recording a separated sound from a sound stream mixture on a personal device
GB2619229A (en) * 2021-02-28 2023-11-29 Ibm Recording a separated sound from a sound stream mixture on a personal device
WO2023206518A1 (en) * 2022-04-29 2023-11-02 Zoom Video Communications, Inc. Providing spatial audio in virtual conferences
CN115174959A (en) * 2022-06-21 2022-10-11 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN115174959B (en) * 2022-06-21 2024-01-30 咪咕文化科技有限公司 Video 3D sound effect setting method and device

Similar Documents

Publication Publication Date Title
US11681490B2 (en) Binaural rendering for headphones using metadata processing
US10645518B2 (en) Distributed audio capture and mixing
JP6121481B2 (en) 3D sound acquisition and playback using multi-microphone
JP4921470B2 (en) Method and apparatus for generating and processing parameters representing head related transfer functions
US11240621B2 (en) Three-dimensional audio systems
KR101984356B1 (en) An audio scene apparatus
CN109891503B (en) Acoustic scene playback method and device
CN113784274A (en) Three-dimensional audio system
Rafaely et al. Spatial audio signal processing for binaural reproduction of recorded acoustic scenes–review and challenges
US11979723B2 (en) Content based spatial remixing
Lee et al. A real-time audio system for adjusting the sweet spot to the listener's position
Gupta et al. Augmented/mixed reality audio for hearables: Sensing, control, and rendering
US10142760B1 (en) Audio processing mechanism with personalized frequency response filter and personalized head-related transfer function (HRTF)
US20100266112A1 (en) Method and device relating to conferencing
US9794678B2 (en) Psycho-acoustic noise suppression
KR101111734B1 (en) Sound reproduction method and apparatus distinguishing multiple sound sources
US20230319492A1 (en) Adaptive binaural filtering for listening system using remote signal sources and on-ear microphones
Moore et al. Measuring audio-visual speech intelligibility under dynamic listening conditions using virtual reality
JP2022128177A (en) Sound generation device, sound reproduction device, sound reproduction method, and sound signal processing program
CN115696170A (en) Sound effect processing method, sound effect processing device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination