GB2601114A - Audio processing system and method - Google Patents

Audio processing system and method

Info

Publication number
GB2601114A
GB2601114A
Authority
GB
United Kingdom
Prior art keywords
sound source
sound
audio
environment
operable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2017819.0A
Other versions
GB202017819D0 (en)
Inventor
Villanueva Barreiro Marina
Schembri Danjeli
Cappello Fabio
Armstrong Calum
Ashton Derek Smith Alexei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Interactive Entertainment Inc
Original Assignee
Sony Interactive Entertainment Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Interactive Entertainment Inc filed Critical Sony Interactive Entertainment Inc
Priority to GB2017819.0A priority Critical patent/GB2601114A/en
Publication of GB202017819D0 publication Critical patent/GB202017819D0/en
Publication of GB2601114A publication Critical patent/GB2601114A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/11 Application of ambisonics in stereophonic audio systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Stereophonic System (AREA)

Abstract

A sound source separation system for separating audio corresponding to one or more sound sources within an environment. The system comprises an image obtaining unit operable to obtain one or more images of the environment, a sound obtaining unit operable to obtain audio comprising one or more sounds associated with the environment, an identification unit operable to identify, from one or more of the obtained images, one or more sound sources within the environment and to identify one or more properties associated with each sound source, and a separation unit operable to associate one or more sounds with a respective sound source within the environment in dependence upon one or more properties of the identified sound sources. Preferably, the properties associated with a respective sound source include location, motion, loudness, size, and frequency profile. Preferably the audio is in an Ambisonics format.

Description

AUDIO PROCESSING SYSTEM AND METHOD

BACKGROUND OF THE INVENTION
Field of the invention
This disclosure relates to an audio processing system and method.
Description of the Prior Art
The "background" description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly or impliedly admitted as prior art against the present invention.
The generation of immersive audio for reproduction has attracted increasing interest in recent years. While surround sound systems have long been in use, these are often set up with a static configuration and a static user who views content on a display, such as a television or cinema screen. However, an increase in the availability of more immersive video content (such as interactive videos or virtual reality experiences) has provided a different set of circumstances under which immersive audio is desired.
These new circumstances can be problematic for audio reproduction because the viewpoint can be varied freely (or at least to some extent), which can vary the relative position of the viewpoint to a sound source. This means that if the sound does not change to account for this, it will not appear to be coming from the correct source. A change in the orientation of the viewpoint can have a similar effect, as the relative direction of the sound source to the viewpoint will change. This therefore poses a significantly different problem to that of traditional media, in which the viewpoint is substantially fixed and/or predetermined.
In existing arrangements, it is common for a large number of microphones to be placed in an environment when seeking to capture audio that is suitable for an immersive reproduction. This can be problematic, particularly when capturing video of the environment at the same time, as the presence of microphones can impact the immersiveness of the video content if they are visible. However, the high number of microphones is often considered necessary so that an effective sound separation process can be performed; this sound separation enables sounds to be mixed separately, thereby enabling more control over the sound reproduction process for an environment.
Alternatively, or in addition, in traditional capture arrangements it can be common to perform multiple captures of audio to assist with isolating specific sounds or sound sources within an environment. This can lead to alternative problems, such as the reproducibility of the events within the environment and the time taken to perform the capture process. Traditional sound capture systems may therefore be considered impractical or problematic for a number of reasons.
In addition to those issues associated with the capturing of audio as identified above, sound separation processes may also have a number of associated issues; due to the required processing and audio capture conditions it is common that the sounds become distorted. In some arrangements, it is only possible to perform the sound separation under heavily constrained conditions; this limits the usefulness of the separation process to specific arrangements of audio capture hardware and sound sources, for instance.
In view of the above problems, it is considered that an improved method for generating audio data is desired. It is in the context of these problems that the present disclosure arises.
SUMMARY OF THE INVENTION
This disclosure is defined by claim 1.
Further respective aspects and features of the disclosure are defined in the appended claims.
It is to be understood that both the foregoing general description of the invention and the following detailed description are exemplary, but are not restrictive, of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
Figure 1 schematically illustrates a simplified method;
Figure 2 schematically illustrates a two-dimensional representation of higher-order Ambisonics;
Figure 3 schematically illustrates an example of sound source localisation;
Figure 4 schematically illustrates a sound source separation method;
Figure 5 schematically illustrates a content capture and processing arrangement;
Figure 6 schematically illustrates a sound source separation system; and
Figure 7 schematically illustrates a sound source separation method.
DESCRIPTION OF THE EMBODIMENTS
Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views, embodiments of the present disclosure are described.
Embodiments of the present disclosure provide for an improved sound source separation process in which a different approach is taken to that of traditional arrangements. Sound source separation is a process by which an audio input is analysed to determine a number of distinct sound sources within the audio. Once identified, the sound source separation process is then configured to generate a respective audio stream for each of the identified sources, with the respective audio stream comprising audio exclusively from the corresponding sound source (or at least as exclusively as possible, given technological or implementation constraints that can render the separation less-than-perfect). Such separated audio may be useful for generating immersive content, as noted above, and may be of particular use when generating a virtual environment based on a real environment as this may enable sound sources to be added, removed, and/or modified with greater ease and with less risk of reducing a sense of immersion.
In traditional arrangements, such as those discussed above, sound source separation is a process in which audio is analysed to identify different sound sources. However, in embodiments of the present disclosure such as those described below, sound sources and/or one or more other properties of those sources are identified prior to considering the audio, and the sound source separation instead effectively comprises a search for those sound sources within the audio.
This is therefore a reversal of the methods described above.
Figure 1 illustrates a simplified method in line with this approach. The method begins at step 100 with an identification of one or more sound sources. This identification may be performed based upon images of those sound sources, and/or the environment in which the sound sources are present. At a step 110, the sound sources are separated based upon this identification; for instance, based upon location information or properties of the sound source that has been identified. Examples of the use of these properties are provided in more detail below. Finally, at a step 120 the separated audio is stored for reproduction at a later time, although in some cases it may also (or instead) be used to generate an audio output immediately.
The techniques outlined in the present disclosure are described primarily with respect to an Ambisonics-based arrangement, but this should not be considered essential. Any audio recording or storage scheme which has a spatial dependency may be used in the manner described below, as the teachings of the present disclosure may be adapted freely according to the requirements of specific recording/storage schemes. For instance, any audio capture that is performed with a multi-channel microphone array may be considered suitable. An example of an alternative basis for such an arrangement is one in which beamforming or frequency-domain independent component analysis is performed; these are other implementations in which a spatial dependency may be utilised according to the present disclosure.
Ambisonics, and in particular higher-order Ambisonics, refers to an audio format in which a spatial dependency is encoded for the audio; this is analogous to older surround sound formats, but for three-dimensional sound reproduction. The Ambisonics format comprises a speaker-independent representation of a sound field which can then be used to generate an audio output for a specific loudspeaker layout or the like.
Figure 2 schematically illustrates a two-dimensional representation of some of the harmonic functions that make up higher-order Ambisonics. While these are typically used in a three-dimensional manner, as noted above, only two dimensions are shown here to aid the clarity of the following discussion. Each Ambisonic component represents a capture of a soundfield in which the sensitivity of the capture to a particular spatial direction is defined by the order of the harmonic function such that higher-order harmonics generally correspond to captures that are more sensitive to a particular spatial direction.
In Figure 2, the circle 200 and each of the pairs of ovals 210 represent examples of first-order Ambisonic components in the x-y plane. First-order Ambisonics has a relatively low spatial resolution, meaning that sound sources are not able to be particularly well localised. This means that it can be difficult to generate an accurate sound reproduction for particular arrangements. In view of this, higher-order Ambisonics is often implemented; these enable a higher resolution to be obtained. The elements 220 and 230 are examples of two-dimensional representations of second- and third-order Ambisonic components respectively.
Ambisonic components can be summed together with different weightings in order to sample a corresponding soundfield, which has been recorded in three dimensions, in a particular direction. The resolution and the specific shape of this directional sampling are dependent upon both the Ambisonic components that are utilised to perform the sampling and the respective weightings of those components.
Figure 3 schematically illustrates a pair of two-dimensional examples of the identification of a target sound source 300 and the separation of sounds associated with the sound source 300 from an audio recording as part of a sound source separation process. In the first example, the sound source 300 is identified as being at the indicated location and first-order Ambisonic components (corresponding to reference 210 of Figure 2) are combined to sample the soundfield in the direction of the sound source 300; these components are selected such that the soundfield is sampled with a sensitivity represented by the element 310. In the second example, third-order Ambisonic components (corresponding to reference 230 of Figure 2) are combined to sample the soundfield in the direction of the source 300 such that the soundfield is sampled with a sensitivity represented by the element 320.
As is clear from consideration of the relative shape and size of the elements 310 and 320, the use of higher-order Ambisonic components results in a more directional sampling of the soundfield. Therefore, by locating the sound sources within the environment, it is possible to use higher-order Ambisonic components for the sound separation process. By sampling the soundfield in this manner, sounds associated with a sound source may be identified more easily due to the directional consideration of the sound; this directionality enables the audio from any sound sources that do not lie in the same (or a similar) direction to be omitted.
The sound source separation process may therefore be improved as fewer sounds that are not associated with a target sound source will be present in a re-sampled recording. This re-sampled recording may also form the basis of a more accurate and/or efficient sound separation process in which sounds associated with non-target sound sources are removed or reduced. This is because the amount of noise is already reduced, thereby making it easier to identify the desired audio content (that is, the sounds corresponding to the target sound source).
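By way of illustration only (no such code forms part of the disclosure), the following Python sketch shows the weighted-sum sampling described above for the two-dimensional (circular-harmonic) case of Figure 2; the channel ordering, function name, and plain in-phase weighting are assumptions made for the example.

```python
# A minimal 2-D (circular-harmonic) sketch of the directional sampling
# described above: Ambisonic components are summed with direction-
# dependent weights so the sampling "points" at an identified source.
import numpy as np

def steer_beam(channels: np.ndarray, order: int, azimuth: float) -> np.ndarray:
    """Sample a 2-D Ambisonic soundfield in the direction `azimuth`.

    channels: array of shape (2 * order + 1, n_samples), holding the
              omnidirectional component followed by one (cos, sin)
              pair per order (an assumed channel layout).
    Returns the beamformed mono signal.
    """
    weights = [1.0]                       # order-0 (omni) weight
    for m in range(1, order + 1):
        weights += [np.cos(m * azimuth),  # weight for the cos(m*theta) channel
                    np.sin(m * azimuth)]  # weight for the sin(m*theta) channel
    w = np.asarray(weights)[:, None]
    # The weighted sum yields a lobe centred on `azimuth`; higher orders
    # sharpen it, mirroring elements 310 and 320 of Figure 3.
    return (w * channels).sum(axis=0)
```

Sampling the same recording with order 1 and with order 3 reproduces the contrast between the sensitivities 310 and 320 discussed with reference to Figure 3.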
An Ambisonics representation of a sound field may be generated for each of a number of microphones within an environment, rather than simply relying upon a single microphone (or microphone arrangement). In such embodiments, the audio from a number of microphones and/or microphone arrangements may be considered when performing the sound source separation process. This may enable an improved separation to be performed, as the audio associated with the sound source can be gathered from a range of different directions, which can lead to a more accurate representation of the sound source's output. By determining the location of a sound source within the environment, the contribution from each of the microphones/microphone arrangements may be weighted accordingly such that those nearer the source provide a correspondingly higher contribution to the separated audio.
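A minimal sketch of such proximity weighting follows, assuming simple inverse-distance weights and per-microphone signals that are already time-aligned; the function name and the 1/d law are illustrative assumptions rather than a prescribed implementation.

```python
# Hedged sketch of distance-based weighting: each microphone's capture
# of a source contributes in proportion to its proximity to the source.
import numpy as np

def combine_by_proximity(mic_signals, mic_positions, source_position):
    """Weighted sum of per-microphone captures for one sound source."""
    d = np.linalg.norm(np.asarray(mic_positions) - np.asarray(source_position),
                       axis=1)
    w = 1.0 / np.maximum(d, 1e-3)         # nearer microphones weigh more
    w /= w.sum()                          # normalise the contributions
    # mic_signals has shape (n_mics, n_samples); result is (n_samples,)
    return np.tensordot(w, np.asarray(mic_signals), axes=1)
```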
The above discussion illustrates how spatial information can be used in conjunction with the Ambisonics encoding scheme to perform a sound separation process. However, as noted above, it is not considered essential that Ambisonics is used in embodiments of the present disclosure. Any suitable audio capture arrangements and/or recording formats may be used as appropriate, provided that a directional dependence can be determined for the audio.
Figure 4 schematically illustrates a sound source separation method according to a number of embodiments of the present disclosure. It is considered that the method shown may be varied as appropriate for a given implementation, and that the steps may be performed in any suitable order or with any suitable timing; for instance, it is considered that the image capture and audio capture steps may be performed substantially simultaneously with the sound separation, if the processing is performed at the time of capture.
A step 400 comprises capturing images of the environment in which one or more sound sources are present. In some embodiments, multiple images may be captured at each time interval so as to provide multiple viewpoints within the environment. Each of the images may be a two- or three-dimensional image, as appropriate.
A step 410 comprises processing the captured images. This processing may comprise any suitable image processing to assist with the identification of sound sources, and any number of consecutive images may be considered together as a part of the processing as appropriate (for example, to detect motion of an object). This processing step 410 comprises at least an identification of one or more sound sources, for example via an image recognition process, and a location of the sound source. This location may be determined relative to the camera, relative to a microphone position, or with respect to a coordinate system associated with the environment, for example.
The identification may be performed using any suitable computer vision techniques, for example. Alternatively, or in addition, one or more sound sources may be identified with a predetermined tag or marker to aid with identification. In some embodiments the identification of the sound source comprises only identifying that the sound source exists, and then localising that sound source; the identification need not include the assignment of a tag or attribute indicative of the type of sound source. Should such a tag or attribute be desired, the identification may be performed to any appropriate degree of specificity in a given implementation; for example, in some embodiments a sound source may be identified as 'bird', while in others a more specific identification (such as 'blackbird') or less specific identification (such as 'animal') may be considered suitable.
A step 420 comprises capturing audio in the environment using one or more microphones and/or microphone arrangements. In some embodiments, the captured audio is stored before being processed, while in others the processing (discussed below) is performed such that only separated audio is stored for future use; of course, combinations of these may also be considered appropriate in some embodiments. As noted above, the audio is captured so as to have a directional dependence; for instance, by using a distributed selection of microphones with known positions, and/or by capturing audio in accordance with a spatial audio scheme such as Ambisonics.
A step 430 comprises processing the captured audio so as to separate the sound sources in dependence upon the processing performed in step 410. That is to say that the sound sources are separated within the captured audio in dependence upon at least a determined location within the environment. An example of such a separation is discussed above with reference to Figure 3, although any suitable separation process may be performed. Further factors that may be considered when performing the separation, based upon the identification of step 410, are discussed below.
As noted above, this sound separation step 430 is effectively performed as a search based upon one or more parameters determined by step 410; that is to say that a particular sound source is being looked for within the captured audio. For instance, step 410 may determine that a bird is located at position X; the sound separation process would then be performed so as to extract bird-related sounds that emanate from position X within the captured audio.
A step 440 comprises outputting the audio corresponding to the separated sound sources. In some embodiments this includes outputting the audio to respective audio files for each sound source for later reproduction; alternatively, or in addition, the outputting includes generating an audio output for one or more audio reproduction devices (such as loudspeakers) using the audio corresponding to the separated sound sources.
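The flow of steps 400 to 440 can be summarised in a short orchestration sketch; every helper passed in below is a hypothetical placeholder for whichever detector, localiser, and separator a given implementation employs, and the dict-based source description is likewise an assumption of the example.

```python
# Skeleton of the Figure 4 flow (steps 400-440); the callables and the
# {"label": ...} source records are assumed, not defined by this disclosure.
from typing import Callable, Dict, List
import numpy as np

def separate_sources(
    images: List[np.ndarray],   # step 400: captured images of the environment
    audio: np.ndarray,          # step 420: captured (spatial) audio
    detect: Callable,           # image-based identification and localisation
    extract: Callable,          # search of the audio for one identified source
) -> Dict[str, np.ndarray]:
    sources = detect(images)    # step 410: identify sources and properties
    separated = {}
    for src in sources:         # step 430: per-source search of the audio
        separated[src["label"]] = extract(audio, src)
    return separated            # step 440: per-source audio for output/storage
```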
In some embodiments, an additional step may be performed in which the captured audio is also analysed to identify a sound source; this can be used to either confirm or modify the identification performed in step 410 above. For instance, computer vision techniques may fail to differentiate between a cat and a dog, and a sound separation may be performed to account for either; once the separation is performed, however, it is possible to determine from the audio which animal it is.
Similarly, corrections may be provided where appropriate (such as if a cat is identified from the images, but it is determined from the audio that it is actually a dog). Such a step may only be performed when confidence in the identification is low, in some embodiments, to reduce the amount of redundant processing that may be performed. The results of such processing can also be used to update the image processing model, rather than simply confirming/modifying the results, so as to improve the efficiency of future processing.
Once the audio has been separated into a number of different audio streams corresponding to individual sound sources, these may be subjected to any desired processing. In some cases, audio processing may be performed to enhance the sound quality using a process that is tailored to the specific sound source. Alternatively, or in addition, the audio for individual sources may be modified or replaced with alternative sounds as appropriate for the given implementation; for instance, when recording a dialogue scene for a movie the audio of an actor speaking may be replaced with an alternative voiceover or the like. The audio may also be mixed on a per-source basis so as to generate an improved reproduction for a given listener position.
As noted above, any of a number of different characteristics or properties of a sound source (in addition to the location) may be considered as a part of the identification processing for a sound source. Each of these characteristics or properties may be used to refine or improve the sound source separation process in a corresponding fashion so as to improve the obtained outputs. Many of the below properties or characteristics may be dependent upon at least a classification of an object, although some may only be dependent upon physical characteristics such as size or position.
A first example of a property is that of the motion of the sound source within the environment. This may be useful for interpolation purposes, given that image capture takes place at a much slower rate than the sampling rate of typical audio. This can enable an improved sound source localisation to be performed even when a relatively low frame rate is used for image capture. In some cases, this motion may also be used to inform expectations about the audio; for instance, an object may have a different sound profile dependent on whether it is moving towards or away from a listener. An example of this is the engine sound of a vehicle, which may be muffled or otherwise modified when driving away from a listener relative to driving towards them. As the rate of motion increases, this may also lead to a Doppler shift which can significantly vary the sound that is to be expected from a source; a common example is the variation of the sound of an emergency vehicle in dependence upon its speed and direction of travel.
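As a hedged illustration of this interpolation and of the Doppler consideration, the sketch below upsamples per-frame source positions to the audio sampling rate and derives the expected observed/emitted frequency ratio from the radial velocity; linear interpolation and a speed of sound of 343 m/s are assumptions of the example, not values taken from the disclosure.

```python
# Sketch: positions detected at video frame rate (e.g. 30 fps) are
# interpolated to one position per audio sample, and the expected
# Doppler factor is derived from the resulting radial velocity.
import numpy as np

def positions_at_audio_rate(frame_positions, fps, sample_rate, n_samples):
    """Linearly interpolate per-frame 3-D positions to one per sample."""
    frame_times = np.arange(len(frame_positions)) / fps
    sample_times = np.arange(n_samples) / sample_rate
    pos = np.asarray(frame_positions, dtype=float)
    return np.stack(
        [np.interp(sample_times, frame_times, pos[:, k]) for k in range(3)],
        axis=1)

def doppler_factor(positions, listener, sample_rate, c=343.0):
    """Expected observed/emitted frequency ratio at each audio sample."""
    d = np.linalg.norm(positions - np.asarray(listener), axis=1)
    v_radial = np.gradient(d) * sample_rate   # positive when receding
    return c / (c + v_radial)                 # <1 receding, >1 approaching
```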
Consideration of the Doppler shift leads on to a second property that may be identified: the expected frequency of the sounds associated with a sound source. This may be an advantageous property to consider as it can significantly reduce the frequency range that is considered in the sound separation process. The expected frequency may be determined based upon a characterisation of an object (such as identifying a particular instrument or the like), or other factors such as the size of the object (for instance, larger objects are often associated with lower-frequency audio, although this may be dependent upon at least an approximate characterisation of a sound source so as to identify at least a type of object). In a simple example, if a guitar is identified as a sound source then the frequency profile of the associated sounds can be identified rather reliably. Returning to an earlier example, the differentiation between the sounds of a dog and a cat can also be simplified by consideration of the expected frequencies associated with each animal, even when both are present in the same audio capture.
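A sketch of this frequency-based restriction follows, assuming a classification-to-band lookup has already produced expected band edges; the band edges used in the usage comment (roughly a guitar's fundamental range) are illustrative assumptions.

```python
# Sketch: narrow the separation search to a source's expected band
# before looking for the source within the captured audio.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def restrict_to_expected_band(audio, sample_rate, low_hz, high_hz):
    """Band-limit captured audio to the source's expected frequency range."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    return sosfiltfilt(sos, audio)   # zero-phase, so timing is preserved

# e.g. for an identified guitar (assumed band of ~80 Hz to ~1.2 kHz):
# guitar_band = restrict_to_expected_band(audio, 48000, 80.0, 1200.0)
```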
The size of an identified sound source may also be considered as a part of the sound separation process. This is because it can be useful to ascertain the level of precision that should be applied when identifying the sounds associated with that source. For instance, a smaller object may be associated with a smaller area from which sound emanates, and as such increased precision may be applied when separating the sounds. In the example discussed with reference to Figure 3, this may comprise the use of higher-order Ambisonics than for a larger object, or the abandoning of lower-order Ambisonics, as appropriate.
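One possible (assumed) mapping from apparent angular size to the Ambisonic order used when sampling toward a source, in the spirit of Figure 3, is sketched below; the thresholds are illustrative and not taken from the disclosure.

```python
# Smaller apparent source -> sharper beam wanted -> higher Ambisonic order.
import numpy as np

def order_for_source(size_m: float, distance_m: float, max_order: int = 3) -> int:
    """Choose a sampling order from a source's apparent angular size."""
    angular_size = 2.0 * np.arctan2(size_m / 2.0, distance_m)  # radians
    if angular_size > np.radians(30):
        return 1              # large/near source: a broad beam suffices
    if angular_size > np.radians(10):
        return 2
    return max_order          # small/far source: use the sharpest beam
```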
The expected loudness of a sound source may also be considered useful as an input to the sound source separation process; a measure of this may be based upon a characterisation of the object (for instance, a plane is expected to be louder than a clock), or physical properties such as the size (such as a large dog being louder than a small dog) or location (as the nearer an object is to a listener, the louder it is for a given sound output).
In some embodiments, the identification may include an identification of when sounds are output by the sound source. One example of this is an identification of when a person speaks, based upon mouth motion or the like. With this information, the sound source separation process can be made more efficient as audio need only be analysed at times at which an audio input is expected. A further improvement may also consider the time-of-flight of the sound, effectively an estimate of the time taken for the output sound to reach each of one or more microphones/microphone arrangements. This can then assist with both specifying a more precise expected time in which the sound source is active within the captured audio, and combining the audio that is captured from multiple microphones/microphone arrangements.
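A minimal sketch of this time-of-flight adjustment, assuming a known source position, known microphone positions, and a visually detected emission time; the speed of sound value is an assumption of the example.

```python
# Sketch: the expected onset of a source's sound in each microphone's
# recording is the visually detected emission time plus propagation delay.
import numpy as np

def expected_onsets(emission_time_s, source_pos, mic_positions, c=343.0):
    """Per-microphone time at which the sound should appear."""
    delays = np.linalg.norm(
        np.asarray(mic_positions) - np.asarray(source_pos), axis=1) / c
    return emission_time_s + delays   # analyse only around these instants
```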
Figure 5 schematically illustrates a content capture and processing arrangement configured to implement one or more embodiments of the present disclosure. This arrangement includes a camera arrangement 500, a microphone arrangement 510, and a processing device 520. The components of the system are shown as being linked; however, this should not be regarded as limiting, as the components may be connected in any suitable configuration and with any suitable type of connection (such as wired or wireless).
The camera arrangement 500 comprises one or more cameras arranged to capture images of an environment in which audio is recorded. The one or more cameras may be provided in any suitable configuration -cameras may be placed individually within the environment, or provided in groups. In some embodiments, one or more of the cameras may be three-dimensional. Alternatively, or in addition, the arrangement of cameras may enable each (or at least a selection) of the sound sources to be captured in images by two or more cameras so as to improve the accuracy and/or precision of the localisation and identification.
In some embodiments, cameras may be provided that utilise a wide-angle lens so as to enable images to be captured with a greater field of view than if a normal lens were used. The arrangement of cameras should be selected so as to provide coverage of a substantial portion of the environment, or at least the sound sources within the environment, so as to be able to provide robust inputs to the sound source separation process. An alternative, or additional, variation is that of using infra-red cameras instead of visible light cameras, as this may aid an identification process in some embodiments; it may also be appropriate to use both visible light and infra-red cameras.
The microphone arrangement 510 comprises one or more microphones that are configured so as to capture audio within the environment. This arrangement may comprise any suitable combination of devices; for instance, a single unit may comprise multiple microphones so as to enable a directional and comprehensive audio capture with that unit. The arrangement may comprise one such unit, or multiple, instead of or as well as any number of single-microphone units. One or more of the microphones may be configured as a directional microphone as appropriate. In some embodiments, one or more microphones may be configured to capture particular frequency ranges, for instance so as to enable a less noisy audio capture over a desired range of frequencies.
While discussed as separate arrangements, in some embodiments it is considered that at least some of the microphones and cameras are provided in a combined fashion such that a single unit houses at least one camera and at least one microphone. This can simplify the capture process, and limit the number of individual units that are required for audio/image capture within the environment.
The processing unit 520 is operable to perform processing so as to separate sound sources from the audio captured by the microphone arrangement 510 based upon the content of the images captured by the camera arrangement 500. This processing may be performed according to the method of Figure 4, for example, with further detail provided below.
Figure 6 schematically illustrates a sound source separation system for separating audio corresponding to one or more sound sources within an environment; this system may be embodied in a processing unit or any suitable computing device (such as the processing unit 520 of Figure 5). In some embodiments, this processing may be performed by a cloud computing arrangement such that the processing is performed remotely from the content (audio and image) capture arrangement. Similarly, the system may perform the processing at a time later than the capture time, and as such it is not required that the system is directly associated with any content capture devices.
The system of Figure 6 comprises an image obtaining unit 600, a sound obtaining unit 610, an identification unit 620, and a separation unit 630, together with an optional output unit 640.
While shown as separate units here, in practice these units may each be embodied within a single processor or distributed between a number of different devices as appropriate; for instance, image capture and processing may be performed by a first device with the sound separation being performed by a second device at a later time.
The image obtaining unit 600 is operable to obtain one or more images of the environment. In some embodiments the images may be obtained directly from one or more cameras; alternatively, or in addition, one or more images may be obtained from a storage medium or another device that stores imagery from an earlier image capture process. In some embodiments the obtained images may be in a video format, with processing being performed on some or all of the individual frames (and/or groups of frames where appropriate).
The sound obtaining unit 610 is operable to obtain audio comprising one or more sounds associated with the environment. In some embodiments the audio may be obtained directly from one or more audio capture elements such as microphones; alternatively, or in addition, this may comprise the obtaining of one or more audio files from a storage medium or another device that stores audio from an earlier audio capture process. An audio capture element may refer to a single microphone, or an array of microphones. In some embodiments, a number of audio capture elements may be embodied as a single audio capture device; for example, a device comprising a multi-microphone array, or an Ambisonics microphone.
As has been discussed above, in some embodiments the obtained audio may be in an Ambisonics format; however, this is not regarded as essential, as the teachings of this disclosure can be modified so as to be compatible with a range of different audio formats without undue burden upon the skilled person. That is to say that the skilled person would be capable of adapting the teachings of the present disclosure as appropriate for a particular format, as the processing that is disclosed is independent of any particular format.
The identification unit 620 is operable to identify, from one or more of the obtained images, one or more sound sources within the environment and to identify one or more properties associated with each sound source. In some embodiments, this identification is simply a determination of the existence of the sound source and the property is its location (and/or one or more additional physical properties such as size or motion). In other embodiments, the identification unit 620 is operable to classify one or more of the sound sources and to use this classification to identify one or more properties associated with a respective sound source; for instance, classifying a sound source (such as a particular bird) and then consulting a database to obtain one or more properties associated with that classification (such as typical sounds of that bird).
In some embodiments, the identification unit 620 is operable to determine a motion of one or more of the sound sources as one of the properties associated with a respective sound source. The motion of a sound source may be identified by comparing the location of the sound source in multiple successive image captures (or a sample of images over time). This motion may be used by the separation unit 630 to identify one or more of an expected frequency, loudness, frequency profile, and/or timing of a sound, for example.
In some embodiments, the identification unit 620 is operable to determine a size of one or more of the sound sources as one of the properties associated with a respective sound source. The size of a sound source may be identified based upon a classification and lookup as discussed above, or a determination of the size based upon the size of the sound source within the image (for instance, relative to a known object or based upon a known distance from the camera or the like). The size of the sound source may be used by the separation unit 630 to identify one or more of an expected frequency, loudness, and/or frequency profile of a sound source, for example.
In some embodiments, the identification unit 620 is operable to determine an expected loudness of one or more of the sound sources as one of the properties associated with a respective sound source. The expected loudness may be based upon a classification and lookup of properties of a sound source, and/or consideration of one or more other properties such as distance from a microphone or the like. The expected loudness may be used by the separation unit 630 to differentiate between different sound sources, for example.
In some embodiments, the identification unit 620 is operable to determine an expected frequency profile of one or more of the sound sources as one of the properties associated with a respective sound source. The frequency profile may be determined using a classification and lookup of properties of a sound source, for example; for instance, a particular bird may be identified and the frequency profile of its call may be used as an input for the separation unit 630 so as to assist with separating the sounds of the bird from the rest of the sounds in the environment.
In some embodiments, the identification unit 620 is operable to determine an expected timing of sounds from one or more of the sound sources as one of the properties associated with a respective sound source. An example of this that was described above was that of identifying when a person is speaking in the environment based upon a detected mouth movement; similarly, any audio event that may be detected from images may be considered in this context. For instance, an object being dropped or a vehicle beginning to move may be easy to identify from images. The timing information may also include a factor corresponding to a distance of a sound source from each of one or more microphones. This timing information may be used by the separation unit 630 to identify when a sound source is expected to appear in each of one or more audio recordings, and thereby assist with at least the efficiency of the separation by identifying which parts of an audio recording should be analysed.
The separation unit 630 is operable to associate one or more sounds with a respective sound source within the environment in dependence upon one or more properties of the identified sound sources. This association may comprise a marked-up audio file or the generation of metadata to be associated with an audio file such that different sound sources can be identified from the audio file. Alternatively, or in addition, the separation unit 630 may be operable to generate a respective audio stream for each sound source (or a group of sound sources, where appropriate) that can be played back or utilised on an individual basis.
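One possible (assumed) shape for the separation unit's per-source output is sketched below, combining the audio stream with the metadata that lets it be matched back to the identified source; the type and field names are illustrative, not defined by the disclosure.

```python
# An assumed container for one separated source: its stream plus the
# identification metadata (label, location, other properties).
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SeparatedSource:
    label: str             # e.g. "bird", from the identification unit
    position: tuple        # location of the source within the environment
    audio: np.ndarray      # the source's separated audio stream
    properties: dict = field(default_factory=dict)  # loudness, band, timing...
```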
In some embodiments the identification unit 620 may be operable to analyse the sounds associated with a respective sound source (that is, the output of the separation unit 630) to identify the sound source in dependence upon the sound. If it is determined that the initial identification (the image-based identification) performed by the identification unit 620 was inaccurate or was not suitably precise, then the identification unit 620 may be operable to update the process of identifying sound sources from one or more obtained images in dependence upon the identification of the sound source in dependence upon the sound. That is to say that the identification process can be modified or improved in an iterative manner based upon the separated audio so as to aid further sound source separation processing.
One example of this is an initial identification determining that a bird is a sound source. This identification can be used to improve the sound separation process, as general information about a bird can be used to determine expected sounds. Once the audio has been at least partially separated (for instance, after a predetermined time has elapsed in the audio), an analysis of the separated audio may be performed to identify a specific bird or group of birds that are likely to correspond to the sound source. This information can then be used to improve the sound source identification process by either updating the computer vision model or by tagging an object and tracking it within the images. This can then result in an improved input for the remainder of the separation process and/or future separation processes.
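A hedged sketch of this feedback loop, reusing the SeparatedSource shape assumed above; audio_classifier, refine_visual_tracker, and the confidence threshold are hypothetical stand-ins for whichever models a given implementation uses.

```python
# Sketch: a classifier run on partially separated audio confirms or
# refines the image-based label, and the refined label feeds back into
# the visual identification/tracking for later separation.
def refine_identification(source, audio_classifier, refine_visual_tracker):
    audio_label, confidence = audio_classifier(source.audio)
    if confidence > 0.8 and audio_label != source.label:  # assumed threshold
        source.label = audio_label          # e.g. "bird" -> "blackbird"
        refine_visual_tracker(source)       # tag and track the refined object
    return source
```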
In some embodiments, the sound obtaining unit 610 is operable to obtain audio corresponding to each of two or more audio capture elements in the environment. In such embodiments, the separation unit 630 may be operable to assign a weighting to the audio corresponding to each of the audio capture elements (such as microphones or microphone arrangements) in dependence upon one or more of the identified properties, the weighting being able to be selected for each sound source independently. This weighting can be used to reflect the relative proximity of the sound source to each microphone in an environment (such that those closer have a higher weighting, as the audio data is expected to be more useful) or the relative direction if the microphones are at least somewhat directional in their audio capture.
Alternatively, or in addition, if audio capture elements are configured to capture different frequencies or the like then it is considered that an appropriate weighting may be assigned to the audio from each device in dependence upon the relative frequencies of the expected sound from a sound source and the audio capture element.
The optional output unit 640 is operable to output one or more respective audio streams for each identified sound source to an audio storage unit. This audio storage unit may be any suitable device or storage medium; examples include cloud storage, a local hard drive, and/or a removable storage medium such as a disc. The output unit 640 may instead, or as well, be operable to output one or more audio streams for reproduction by loudspeakers or the like.
The arrangement of Figure 6 is an example of a processor (for example, a GPU and/or CPU located in a games console or any other computing device) that is operable to separate audio corresponding to one or more sound sources within an environment, and in particular is operable to: obtain one or more images of the environment; obtain audio comprising one or more sounds associated with the environment; identify, from one or more of the obtained images, one or more sound sources within the environment; identify one or more properties associated with each sound source; and associate one or more sounds with a respective sound source within the environment in dependence upon one or more properties of the identified sound sources.
Figure 7 schematically illustrates a sound source separation method for separating audio corresponding to one or more sound sources within an environment. Such a method may be implemented using an arrangement such as that discussed above with reference to Figure 6.
A step 700 comprises obtaining one or more images of the environment; these may be obtained directly from a camera or they may be images that have been stored from an earlier image capture process.
A step 710 comprises obtaining audio comprising one or more sounds associated with the environment; these may be obtained directly from an audio capture element such as a microphone or they may be derived from one or more audio files that have been stored from an earlier audio capture process.
A step 720 comprises identifying, from one or more of the obtained images, one or more sound sources within the environment. This may comprise a determination of the location of those sound sources, and/or a classification of the objects to determine what they are.
A step 730 comprises identifying one or more properties associated with each sound source. These properties may be identified based upon the images themselves, or they may be obtained from a lookup of properties based upon a classification of a sound source; for instance, a database of audio properties of different sound sources.
A step 740 comprises associating one or more sounds with a respective sound source within the environment in dependence upon one or more properties of the identified sound sources. This may comprise the generation of separate audio streams for each of one or more sound sources within the environment, and/or a labelling of the obtained audio to assist with identifying sound sources within that audio.
The techniques described above may be implemented in hardware, software or combinations of the two. In the case that a software-controlled data processing apparatus is employed to implement one or more features of the embodiments, it will be appreciated that such software, and a storage or transmission medium such as a non-transitory machine-readable storage medium by which such software is provided, are also considered as embodiments of the disclosure.
Thus, the foregoing discussion discloses and describes merely exemplary embodiments of the present invention. As will be understood by those skilled in the art, the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting of the scope of the invention, as well as other claims. The disclosure, including any readily discernible variants of the teachings herein, defines, in part, the scope of the foregoing claim terminology such that no inventive subject matter is dedicated to the public.

Claims (15)

CLAIMS

1. A sound source separation system for separating audio corresponding to one or more sound sources within an environment, the system comprising: an image obtaining unit operable to obtain one or more images of the environment; a sound obtaining unit operable to obtain audio comprising one or more sounds associated with the environment; an identification unit operable to identify, from one or more of the obtained images, one or more sound sources within the environment and to identify one or more properties associated with each sound source; and a separation unit operable to associate one or more sounds with a respective sound source within the environment in dependence upon one or more properties of the identified sound sources.
2. A system according to claim 1, wherein the identification unit is operable to classify one or more of the sound sources and to use this classification to identify one or more properties associated with a respective sound source.
3. A system according to any preceding claim, wherein the identification unit is operable to determine a location of one or more of the sound sources as one of the properties associated with a respective sound source.
4. A system according to any preceding claim, wherein the identification unit is operable to determine a motion of one or more of the sound sources as one of the properties associated with a respective sound source.
5. A system according to any preceding claim, wherein the identification unit is operable to determine a size of one or more of the sound sources as one of the properties associated with a respective sound source.
6. A system according to any preceding claim, wherein the identification unit is operable to determine an expected loudness of one or more of the sound sources as one of the properties associated with a respective sound source.
7. A system according to any preceding claim, wherein the identification unit is operable to determine an expected frequency profile of one or more of the sound sources as one of the properties associated with a respective sound source.
8. A system according to any preceding claim, wherein the identification unit is operable to determine an expected timing of sounds from one or more of the sound sources as one of the properties associated with a respective sound source.
9. A system according to any preceding claim: wherein the identification unit is operable to analyse the sounds associated with a respective sound source to identify the sound source in dependence upon the sound; and wherein the identification unit is operable to update the process of identifying sound sources from one or more obtained images in dependence upon the identification of the sound source in dependence upon the sound.
10. A system according to any preceding claim: wherein the sound obtaining unit is operable to obtain audio corresponding to each of two or more audio capture elements in the environment; and wherein the separation unit is operable to assign a weighting to the audio corresponding to each of the audio capture elements in dependence upon one or more of the identified properties, the weighting being able to be selected for each sound source independently.
11. A system according to any preceding claim, wherein the obtained audio is in an Ambisonics format.
12. A system according to any preceding claim, comprising an output unit operable to output one or more respective audio streams for each identified sound source to an audio storage unit.
13. A sound source separation method for separating audio corresponding to one or more sound sources within an environment, the method comprising: obtaining one or more images of the environment; obtaining audio comprising one or more sounds associated with the environment; identifying, from one or more of the obtained images, one or more sound sources within the environment; identifying one or more properties associated with each sound source; and associating one or more sounds with a respective sound source within the environment in dependence upon one or more properties of the identified sound sources.
14. Computer software which, when executed by a computer, causes the computer to carry out the method of claim 13.
15. A non-transitory machine-readable storage medium which stores computer software according to claim 14.
GB2017819.0A 2020-11-11 2020-11-11 Audio processing system and method Pending GB2601114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2017819.0A GB2601114A (en) 2020-11-11 2020-11-11 Audio processing system and method


Publications (2)

Publication Number Publication Date
GB202017819D0 GB202017819D0 (en) 2020-12-23
GB2601114A true GB2601114A (en) 2022-05-25

Family

ID=74046380

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2017819.0A Pending GB2601114A (en) 2020-11-11 2020-11-11 Audio processing system and method

Country Status (1)

Country Link
GB (1) GB2601114A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011071686A (en) * 2009-09-25 2011-04-07 Nec Corp Video sound processor, and video sound processing method, and program
US20160064000A1 (en) * 2014-08-29 2016-03-03 Honda Motor Co., Ltd. Sound source-separating device and sound source -separating method
US20170374453A1 (en) * 2016-06-23 2017-12-28 Canon Kabushiki Kaisha Signal processing apparatus and method
US20170374463A1 (en) * 2016-06-27 2017-12-28 Canon Kabushiki Kaisha Audio signal processing device, audio signal processing method, and storage medium
US20190394423A1 (en) * 2018-06-20 2019-12-26 Casio Computer Co., Ltd. Data Processing Apparatus, Data Processing Method and Storage Medium


Also Published As

Publication number Publication date
GB202017819D0 (en) 2020-12-23

Similar Documents

Publication Publication Date Title
CN112400325B (en) Data driven audio enhancement
US20220159403A1 (en) System and method for assisting selective hearing
CN112088315B (en) Multi-mode speech localization
JP6464449B2 (en) Sound source separation apparatus and sound source separation method
Donley et al. Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments
US20170366896A1 (en) Associating Audio with Three-Dimensional Objects in Videos
US20160140964A1 (en) Speech recognition system adaptation based on non-acoustic attributes
JP2022538511A (en) Determination of Spatialized Virtual Acoustic Scenes from Legacy Audiovisual Media
JP2016178652A (en) Audio processing apparatus
WO2020022055A1 (en) Information processing device and method, and program
JP2015019371A5 (en)
JP5618043B2 (en) Audiovisual processing system, audiovisual processing method, and program
US11875770B2 (en) Systems and methods for selectively providing audio alerts
US20230164509A1 (en) System and method for headphone equalization and room adjustment for binaural playback in augmented reality
JP2023508063A (en) AUDIO SIGNAL PROCESSING METHOD, APPARATUS, DEVICE AND COMPUTER PROGRAM
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
US20120242860A1 (en) Arrangement and method relating to audio recognition
JP5383056B2 (en) Sound data recording / reproducing apparatus and sound data recording / reproducing method
GB2601114A (en) Audio processing system and method
CN105741852B (en) Attention adaptive audio time domain adjusting method
US11513762B2 (en) Controlling sounds of individual objects in a video
CN115810209A (en) Speaker recognition method and device based on multi-mode feature fusion network
JP7321736B2 (en) Information processing device, information processing method, and program
JP7464730B2 (en) Spatial Audio Enhancement Based on Video Information
JP7493412B2 (en) Audio processing device, audio processing system and program