CN118235435A - Distributed audio device ducking - Google Patents

Distributed audio device ducking

Info

Publication number
CN118235435A
Authority
CN
China
Prior art keywords
audio
examples
audio processing
echo
environment
Prior art date
Legal status
Pending
Application number
CN202280075182.1A
Other languages
Chinese (zh)
Inventor
B. Southwell
D. Gunawan
A. J. Seefeldt
Current Assignee
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2022/048956 (WO2023086273A1)
Publication of CN118235435A

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio processing method may involve receiving an output signal from each of a plurality of microphones in an audio environment, the output signal corresponding to a current utterance of a person. The method may involve determining, responsive to the output signal and based at least in part on the audio device location information and the echo management system information, one or more audio processing variations to apply to audio data rendered as loudspeaker feed signals for two or more audio devices in the audio environment. The audio processing variations may involve a reduction in loudspeaker reproduction level of one or more loudspeakers in the audio environment. The method may involve applying one or more types of audio processing changes. The audio processing variations may have the effect of increasing the speech-to-echo ratio at one or more microphones.

Description

Distributed audio device ducking
Cross Reference to Related Applications
The present application claims priority from U.S. provisional application 63/278,003, filed November 10, 2021, U.S. provisional application 63/362,842, filed April 12, 2022, and European application 22167857.6, filed April 12, 2022, all of which are incorporated herein by reference in their entireties.
Technical Field
The present disclosure relates to systems and methods for orchestrating and implementing audio devices, such as smart audio devices, and controlling speech-to-echo ratio (SER) in such audio devices.
Background
Audio devices, including but not limited to smart audio devices, have been widely deployed and are becoming a common feature for many households. While existing systems and methods for controlling audio devices provide benefits, improved systems and methods would still be desirable.
Symbols and terms
Throughout this disclosure, including in the claims, the terms "speaker (speaker)", "loudspeaker (loudspeaker)" and "audio reproduction transducer" are synonymously used to denote any sound producing transducer (or set of transducers). A typical set of headphones includes two speakers. The speakers may be implemented to include multiple transducers (e.g., woofers and tweeters) that may be driven by a single common speaker feed or multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuit branches coupled to different transducers.
Throughout this disclosure, including in the claims, the expression "performing an operation on" a signal or data (e.g., filtering, scaling, transforming, or applying gain to a signal or data) is used in a broad sense to mean performing an operation directly on a signal or data or on a processed version of a signal or data (e.g., a version of a signal that has undergone preliminary filtering or preprocessing prior to performing an operation thereon).
Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem implementing a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, where the subsystem generates M inputs and the other X-M inputs are received from external sources) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to mean a system or device that is programmable or otherwise configurable (e.g., in software or firmware) to perform operations on data (e.g., audio or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chip sets), digital signal processors programmed and/or otherwise configured to perform pipelined processing of audio or other sound data, programmable general purpose processors or computers, and programmable microprocessor chips or chip sets.
Throughout this disclosure, including in the claims, the term "coupled" or "coupled" is used to mean a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections.
As used herein, a "smart device" is an electronic device that may operate interactively and/or autonomously to some degree, typically configured to communicate with one or more other devices (or networks) via various wireless protocols such as bluetooth, zigbee, near field communication, wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, and the like. Some well-known smart device types are smart phones, smart cars, smart thermostats, smart doorbell, smart locks, smart refrigerators, tablet phones and tablet computers, smart watches, smart bracelets, smart key chains, and smart audio devices. The term "smart device" may also refer to a device that exhibits some properties of pervasive computing such as artificial intelligence.
The expression "smart audio device" is used herein to denote a smart device that is a single-purpose audio device or a multi-purpose audio device (e.g., an audio device implementing at least some aspects of the virtual assistant functionality). A single-use audio device is a device that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera) and is designed largely or primarily to achieve a single use, such as a Television (TV). For example, while a TV may generally play (and be considered capable of playing) audio from program material, in most instances, modern TVs run some operating system on which applications (including television-watching applications) run locally. In this sense, single-use audio devices having speaker(s) and microphone(s) are typically configured to run local applications and/or services to directly use the speaker(s) and microphone(s). Some single-use audio devices may be configured to be combined together to enable playback of audio over a zone or user-configured area.
One common type of multi-purpose audio device is an audio device that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured to communicate. Such multi-purpose audio devices may be referred to herein as "virtual assistants." A virtual assistant is a device (e.g., a smart speaker or voice assistant integrated device) that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera). In some examples, the virtual assistant may provide the ability to use multiple devices (other than the virtual assistant) for applications that in a sense support the cloud or that are otherwise not fully implemented in or on the virtual assistant itself. In other words, at least some aspects of the virtual assistant functionality (e.g., speech recognition functionality) may be implemented (at least in part) by one or more servers or other devices with which the virtual assistant may communicate via a network (e.g., the internet). Virtual assistants can sometimes work together, for example, in a discrete and conditionally defined manner. For example, two or more virtual assistants may work together in the sense that one of them (e.g., the virtual assistant that is most confident that the wake word has been heard) responds to the wake word. In some implementations, the connected virtual assistants may form a constellation that may be managed by a host application, which may be (or implement) a virtual assistant.
In this document, the "wake word" is used in a broad sense to mean any sound (e.g., a word spoken by a human or other sound), wherein the smart audio device is configured to wake up in response to detecting ("hearing") the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, "wake-up" means a state in which the device enters a waiting (in other words, listening) sound command. In some examples, a so-called "wake word" herein may include more than one word, e.g., a phrase.
Herein, the expression "wake word detector" means a device (or means including software for configuring the device to continuously search for an alignment between real-time sound (e.g., speech) features and a training model). Typically, a wake word event is triggered whenever the wake word detector determines that the probability of detecting a wake word exceeds a predefined threshold. For example, the threshold may be a predetermined threshold that is adjusted to give a reasonable tradeoff between false acceptance rate and false rejection rate. After the wake word event, the device may enter a state (which may be referred to as an "awake" state or an "attention" state) in which the device listens for commands and passes the received commands to a larger, more computationally intensive recognizer.
As used herein, the terms "program stream" and "content stream" refer to a collection of one or more audio signals, and in some instances, a collection of video signals, at least portions of which are intended to be heard together. Examples include selection of music, movie soundtracks, movies, television programs, audio portions of television programs, podcasts, live voice calls, synthesized voice responses from intelligent assistants, and the like. In some examples, the content stream may include multiple versions of at least a portion of the audio signal, e.g., the same dialog in more than one language. In such instances, only one version of the audio data or portion thereof (e.g., a version corresponding to a single language) is intended to be reproduced at a time.
Disclosure of Invention
At least some aspects of the present disclosure may be implemented via one or more audio processing methods. In some examples, the method(s) may be implemented at least in part by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media. Some such methods may involve receiving, by a control system, output signals from one or more microphones in an audio environment. In some examples, the output signal may include a signal corresponding to a current utterance of a person. In some such examples, the current utterance may be or may include a wake-up word utterance.
Some such methods may involve determining, by the control system, one or more audio processing variations to apply to audio data rendered as loudspeaker feed signals for two or more audio devices in the audio environment responsive to the output signals and based at least in part on the audio device location information and echo management system information. In some examples, the audio processing variation may involve a reduction in loudspeaker reproduction level of one or more loudspeakers in the audio environment. Some such methods may involve the application of one or more types of audio processing changes by the control system.
In some examples, at least one of the audio processing variations may correspond to an increased speech-to-echo ratio. According to some such examples, the echo management system information may include a model of echo management system performance. For example, the model of echo management system performance may include an Acoustic Echo Canceller (AEC) performance matrix. In some examples, the model of echo management system performance may include a measure of expected echo return loss enhancement provided by the echo management system.
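One way the "measure of expected echo return loss enhancement" might be quantified is sketched below in Python. The computation of ERLE from signal powers is a standard definition, but the function and variable names are assumptions made for illustration, not an excerpt from this disclosure.

    import numpy as np

    def erle_db(echo_at_mic, residual_after_ems):
        """Echo return loss enhancement in dB: how much echo energy the echo
        management system (AEC and/or AES) removes from the microphone signal."""
        p_in = np.mean(np.square(echo_at_mic))
        p_out = np.mean(np.square(residual_after_ems))
        return 10.0 * np.log10(p_in / p_out)

    # Example: the EMS attenuates a synthetic echo by a factor of 10 in amplitude.
    echo = np.random.randn(48000)
    residual = 0.1 * echo
    print(round(erle_db(echo, residual), 1))   # prints 20.0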
According to some examples, determining the one or more types of audio processing variations may be based at least in part on optimization of a cost function. Alternatively or additionally, in some examples, the one or more types of audio processing variations may be based at least in part on acoustic models of inter-device echoes and intra-device echoes. Alternatively or additionally, in some examples, one or more types of audio processing variations may be based at least in part on a mutual audibility of audio devices in the audio environment, e.g., based on a mutual audibility matrix.
In some examples, one or more types of audio processing variations may be based at least in part on the estimated location of the person. In some such examples, the estimated location of the person may be based at least in part on output signals from a plurality of microphones in the audio environment. According to some such examples, the audio processing variation may involve changing a rendering process to warp the rendering of the audio signal away from the estimated location of the person.
Alternatively or additionally, in some examples, the one or more types of audio processing variations may be based at least in part on a listening objective. In some such examples, the listening objective may include a spatial component, a frequency component, or both a spatial component and a frequency component.
Alternatively or additionally, in some examples, the one or more types of audio processing variations may be based at least in part on one or more constraints. In some such examples, the one or more constraints may be based at least in part on a perceptual model. Alternatively or additionally, the one or more constraints may be based at least in part on audio content energy preservation, audio spatial preservation, audio energy vectors, regularization constraints, or a combination thereof. Alternatively or additionally, some examples may involve updating an acoustic model of the audio environment, a model of echo management system performance, or both, after applying the one or more types of audio processing changes.
In some examples, the one or more types of audio processing variations may involve spectral modification. In some such examples, the spectral modification may involve reducing the level of audio data in a frequency band between 500 Hz and 3 kHz.
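The kind of spectral modification described above can be sketched as a frequency-dependent ducking gain. In the example below, the 500 Hz to 3 kHz band edges come from the preceding paragraph, but the FFT-mask implementation and the 10 dB reduction depth are illustrative assumptions; a practical system might instead use a filter bank or shelving/peaking filters.

    import numpy as np

    def duck_speech_band(x, sample_rate, f_lo=500.0, f_hi=3000.0, reduction_db=10.0):
        """Attenuate only the [f_lo, f_hi] band of a playback signal."""
        spectrum = np.fft.rfft(x)
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
        gain = np.ones_like(freqs)
        in_band = (freqs >= f_lo) & (freqs <= f_hi)
        gain[in_band] = 10.0 ** (-reduction_db / 20.0)   # linear gain for the band
        return np.fft.irfft(spectrum * gain, n=len(x))

    # Duck the speech band of one second of playback audio at 48 kHz.
    playback = np.random.randn(48000)
    ducked = duck_speech_band(playback, 48000)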
Aspects of some disclosed embodiments include a control system configured (e.g., programmed) to perform one or more of the disclosed methods or steps thereof, and a tangible, non-transitory computer-readable medium (e.g., a disk or other tangible storage medium) storing code for performing (e.g., code executable to perform) one or more of the disclosed methods or steps thereof. For example, some disclosed embodiments may be or include a programmable general purpose processor, digital signal processor, or microprocessor programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including one or more of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, memory, and a processing subsystem programmed (and/or otherwise configured) to perform one or more of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some or all of the operations, functions, and/or methods described herein may be performed by one or more devices in accordance with instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include those memory devices described herein, including but not limited to Random Access Memory (RAM) devices, Read Only Memory (ROM) devices, and the like. Thus, some innovative aspects of the subject matter described in the present disclosure can be implemented in non-transitory media having software stored thereon.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Drawings
Fig. 1A is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the present disclosure.
Fig. 1B illustrates an example of an audio environment.
Fig. 2 shows echo paths between three of the audio devices of fig. 1B.
Fig. 3 is a system block diagram representing components of an audio device according to one example.
Fig. 4 illustrates elements of a ducking module according to one example.
Fig. 5 is a block diagram illustrating an example of an audio device including a ducking module.
Fig. 6 is a block diagram illustrating an alternative example of an audio device including a ducking module.
Fig. 7 is a flowchart outlining one example of a method for determining a ducking solution.
Fig. 8 is a flowchart outlining another example of a method for determining a ducking solution.
Fig. 9 is a flow chart summarizing an example of the disclosed method.
Fig. 10 is a flowchart outlining one example of a method that may be performed by an apparatus, such as the apparatus shown in fig. 1A.
FIG. 11 is a block diagram of elements configured to implement one example of an embodiment of a zone classifier.
FIG. 12 is a flowchart outlining one example of a method that may be performed by an apparatus, such as apparatus 150 of FIG. 1A.
Fig. 13 is a flowchart outlining another example of a method that may be performed by an apparatus, such as apparatus 150 of fig. 1A.
Fig. 14 is a flowchart outlining another example of a method that may be performed by an apparatus, such as apparatus 150 of fig. 1A.
Fig. 15 and 16 are diagrams illustrating an example set of speaker activation and object rendering positions.
FIG. 17 is a flowchart outlining one example of a method that may be performed by an apparatus or system, such as the apparatus or system shown in FIG. 1A.
Fig. 18 is a diagram of speaker activation in an example embodiment.
FIG. 19 is a diagram of object rendering locations in an example embodiment.
Fig. 20 is a diagram of speaker activation in an example embodiment.
FIG. 21 is a diagram of object rendering locations in an example embodiment.
Fig. 22 is a diagram of speaker activation in an example embodiment.
FIG. 23 is a diagram of object rendering locations in an example embodiment.
Detailed Description
Some embodiments are configured to implement a system that includes coordinated audio devices, also referred to herein as orchestrated audio devices. In some implementations, the orchestrated audio device may comprise a smart audio device. According to some such embodiments, two or more of the smart audio devices may be wake-up word detectors or may be configured to implement wake-up word detectors. Thus, in such examples, multiple microphones (e.g., asynchronous microphones) may be present in the audio environment.
Currently, designers often view audio devices as a single-point interface to audio, which may be a mix of entertainment, communications, and information services. The use of audio for notification and voice control has the advantage of avoiding visual or physical intrusion. In all forms of interactive audio, the problem of achieving full-duplex (simultaneous input and output) audio capability remains a challenge. When there is audio output in the room that is unrelated to the information-based capture or transmission taking place in the room, it is desirable to remove that audio from the captured signal (e.g., by echo cancellation and/or echo suppression).
Some disclosed embodiments provide a method for managing a listener or "user" experience to improve a key criterion of successful full duplex at one or more audio devices. This criterion is referred to herein as the speech-to-echo ratio (SER), which may be defined as the ratio between a speech signal or other desired signal to be captured in an audio environment (e.g., a room) via one or more microphones and the "echo" present at the audio device, the echo comprising signals from the one or more microphones corresponding to output program content, interactive content, etc., being played back by one or more loudspeakers of the audio environment. Those skilled in the art will appreciate that in this context, an "echo" is not necessarily reflected before being captured by a microphone.
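The SER criterion can be made concrete with the short sketch below. The dB formulation is standard, but the variable names and the example levels are assumptions for illustration; note that reducing playback by G dB reduces the echo term by G dB, so in an echo-dominated capture the SER improves by up to the same G dB.

    import numpy as np

    def speech_to_echo_ratio_db(speech_at_mic, echo_at_mic):
        """SER in dB: desired (speech) power over playback 'echo' power at a microphone."""
        p_speech = np.mean(np.square(speech_at_mic))
        p_echo = np.mean(np.square(echo_at_mic))
        return 10.0 * np.log10(p_speech / p_echo)

    speech = 0.05 * np.random.randn(48000)
    echo = 0.20 * np.random.randn(48000)
    before = speech_to_echo_ratio_db(speech, echo)
    after = speech_to_echo_ratio_db(speech, echo * 10.0 ** (-6.0 / 20.0))  # 6 dB of ducking
    print(round(after - before, 1))   # prints 6.0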
Such an embodiment may be useful where there is more than one audio device within acoustic range of the user, such that each audio device is able to present audio program material that will sound appropriate at the user's location for the desired entertainment, communication, or information service. The value of such an embodiment may be particularly high when there are three or more audio devices that are similarly close to the user. An audio device that is closer to the user may be more advantageous in terms of its ability to accurately locate sound or deliver specific audio signals and images to the user. Moreover, if the audio devices include one or more microphones, one or more of those audio devices may also have a microphone system that is more suitable for picking up the user's speech.
An audio device may often need to respond to a user's voice command while the audio device is playing content, in which case the microphone system of the audio device will detect content played back by the audio device: in other words, the audio device will hear its own "echo." Due to the specialized nature of wake word detectors, such devices may be able to perform better than more general speech recognition engines in the presence of such echoes. One common mechanism implemented in these audio devices is commonly referred to as "ducking," which involves reducing the playback level of the audio device after a wake word is detected so that the audio device can better recognize the command spoken by the user after the wake word. Such ducking typically results in an improvement in SER, which is a common indicator for predicting speech recognition performance.
In the distributed and orchestrated audio device context, where multiple audio devices are located in a single acoustic space (also referred to herein as an "audio environment"), merely ducking the playback of a single audio device may not be the best solution. This may be true, in part, because "echoes" (detected audio playback) from other, un-ducked audio devices in the audio environment may limit the maximum achievable SER obtained by ducking the playback of only a single audio device.
Thus, some disclosed embodiments may cause audio processing changes of two or more audio devices of an audio environment in order to increase SER at one or more microphones of the audio environment. In some examples, the audio processing variation(s) may be determined from the results of the optimization process. According to some examples, the optimization process may involve a tradeoff between objective sound capture performance goals and constraints that preserve one or more aspects of the user's listening experience. In some examples, the constraint may be a perceptual constraint, an objective constraint, or a combination thereof. Some disclosed examples relate to implementing a model describing perceived effects of echo management signal chains (which may also be referred to herein as "capture stacks"), acoustic space, and audio processing variation(s), and explicitly weighing the model (e.g., seeking a solution that takes into account all such factors).
According to some examples, the process may involve a closed loop system in which, for example, the acoustic and capture stack model is updated after each audio processing change (e.g., each change in one or more rendering parameters). Some such examples may involve iteratively improving the performance of the audio system over time.
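A minimal sketch of this closed-loop behavior appears below. The helper callables (predict_ser, choose_change, and so on) are hypothetical placeholders for the acoustic model, capture-stack model, and optimizer discussed later in this disclosure, and the toy stand-ins merely show the update-after-each-change loop.

    def closed_loop_ducking(predict_ser, choose_change, apply_change, update_models,
                            target_ser_db, max_iterations=5):
        """Apply audio processing changes until the predicted SER reaches the target,
        refreshing the acoustic and capture-stack (EMS) models after every change."""
        for _ in range(max_iterations):
            if predict_ser() >= target_ser_db:
                break
            apply_change(choose_change())
            update_models()

    # Toy stand-ins: each 3 dB ducking step buys roughly 3 dB of predicted SER.
    state = {"ser_db": -6.0}
    closed_loop_ducking(
        predict_ser=lambda: state["ser_db"],
        choose_change=lambda: {"duck_db": 3.0},
        apply_change=lambda change: state.update(ser_db=state["ser_db"] + change["duck_db"]),
        update_models=lambda: None,
        target_ser_db=0.0,
    )
    print(state["ser_db"])   # prints 0.0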
Some disclosed embodiments may be based at least in part on one or more of the following factors, or a combination thereof:
A model of the acoustic environment (in other words, the acoustics of the audio environment) and the echo management signal chain, which model can predict the realized SER for a given configuration and output solution;
Constraints that limit the output solution according to both objective and subjective metrics, which may include:
Content energy preservation;
Spatial preservation;
Energy vectors; and/or
Regularization, such as:
■ Level-1 (L1), level-2 (L2), or level-N (LN) regularization of distortions of a two- or three-dimensional array of loudspeaker activations, which array may be referred to herein as a "waffle"; and/or
■ L1, L2, or LN regularization of the ducking gains;
A listening objective, which can determine both spatial and level components of the target solution.
According to some examples, the ducking solution may be based at least in part on one or more of the following factors, or a combination thereof:
Simple gains, which in some examples may be applied to audio content that has already been rendered;
These gains may be full-band or frequency-dependent, depending on the particular implementation;
Input to the renderer, which in some examples may use a combination of the input and the waffle; and/or
Input to a device, module, etc. that generates the waffle, which may be referred to herein as a waffle maker and in some instances may be a component of a renderer. In some examples, the waffle maker can use such inputs to generate a ducked waffle.
In some implementations, such input to the waffle maker and/or renderer can be used to alter the audio playback so that audio objects may appear to be "pushed" away from the location where the wake word was detected. Some such implementations may involve determining the relative activation of a set of loudspeakers in an audio environment by optimizing a cost that is a function of: (a) a model of the perceived spatial location of an audio signal when it is played back over the set of loudspeakers in the audio environment; (b) a measure of the proximity of the intended perceived spatial position of the audio signal to the position of each loudspeaker of the set of loudspeakers; and (c) one or more additional dynamically configurable functions. Some of these embodiments will be described in detail below.
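A highly simplified sketch of such a cost function is shown below. Term (a) is approximated by the error between the gain-weighted loudspeaker centroid and the object's intended position (a crude stand-in for a perceptual model of spatial location), term (b) penalizes activating loudspeakers far from that intended position, and term (c) is a dynamically configurable penalty that discourages activating loudspeakers near the estimated talker location. The terms, weights, and brute-force search are illustrative assumptions, not the optimization actually defined by this disclosure.

    import numpy as np

    def activation_cost(gains, speaker_xy, object_xy, avoid_xy,
                        w_spatial=1.0, w_prox=0.1, w_avoid=1.0):
        """Toy cost for the relative activation of a set of loudspeakers."""
        gains = np.asarray(gains, dtype=float)
        norm = gains / (gains.sum() + 1e-12)
        centroid = norm @ speaker_xy                                                 # term (a)
        spatial = np.sum((centroid - object_xy) ** 2)
        prox = np.sum(gains ** 2 * np.sum((speaker_xy - object_xy) ** 2, axis=1))    # term (b)
        avoid = np.sum(gains ** 2 / (np.sum((speaker_xy - avoid_xy) ** 2, axis=1) + 1e-3))  # term (c)
        return w_spatial * spatial + w_prox * prox + w_avoid * avoid

    # Three loudspeakers on a line; render an object at x=0.5 while avoiding a talker at x=0.0.
    speakers = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
    candidates = [np.array(g, dtype=float) for g in ([1, 1, 0], [0, 1, 1], [1, 0, 1])]
    best = min(candidates, key=lambda g: activation_cost(
        g, speakers, np.array([0.5, 0.0]), np.array([0.0, 0.0])))
    print(best)   # the lowest-cost activation leans away from the talker location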
Fig. 1A is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the present disclosure. As with the other figures provided herein, the types and numbers of elements shown in fig. 1A are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, the apparatus 150 may be configured to perform at least some of the methods disclosed herein. In some implementations, the apparatus 150 may be or may include one or more components of an audio system. For example, in some implementations, the apparatus 150 may be an audio device, such as a smart audio device. In other examples, apparatus 150 may be a mobile device (e.g., a cellular telephone), a laptop computer, a tablet computer device, a television, or other type of device.
According to some alternative embodiments, the apparatus 150 may be or may include a server. In some such examples, the apparatus 150 may be or may include an encoder. Thus, in some examples, the apparatus 150 may be a device configured for use within an audio environment, such as a home audio environment, while in other examples, the apparatus 150 may be a device configured for use in a "cloud", e.g., a server.
In this example, the apparatus 150 includes an interface system 155 and a control system 160. In some implementations, the interface system 155 can be configured to communicate with one or more other devices in an audio environment. In some examples, the audio environment may be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so forth. In some implementations, the interface system 155 can be configured to exchange control information and associated data with an audio device of an audio environment. In some examples, the control information and associated data may relate to one or more software applications being executed by the apparatus 150.
In some implementations, the interface system 155 can be configured to receive a content stream or to provide a content stream. The content stream may include audio data. The audio data may include, but is not limited to, audio signals. In some examples, the audio data may include spatial data such as channel data and/or spatial metadata. For example, the metadata may be provided by a device that may be referred to herein as an "encoder." In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 155 may include one or more network interfaces and/or one or more external device interfaces, such as one or more Universal Serial Bus (USB) interfaces. According to some embodiments, the interface system 155 may include one or more wireless interfaces. The interface system 155 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. In some examples, interface system 155 may include one or more interfaces between control system 160 and a memory system, such as optional memory system 165 shown in fig. 1A. However, in some examples, control system 160 may include a memory system. In some implementations, the interface system 155 can be configured to receive input from one or more microphones in an environment.
For example, control system 160 may include a general purpose single or multi-chip processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 160 may reside in more than one device. For example, in some implementations, a portion of the control system 160 may reside in a device within one of the environments depicted herein, and another portion of the control system 160 may reside in a device outside of the environment, such as a server, mobile device (e.g., smart phone or tablet), or the like. In other examples, a portion of control system 160 may reside in a device within one of the environments depicted herein, and another portion of control system 160 may reside in one or more other devices of the environments. For example, the functionality of the control system may be distributed across multiple intelligent audio devices of the environment, or may be shared by orchestration devices (as may be referred to herein as devices of the intelligent home hub) and one or more other devices of the environment. In other examples, a portion of control system 160 may reside in a device (e.g., a server) implementing a cloud-based service, and another portion of control system 160 may reside in another device (e.g., another server, a memory device, etc.) implementing a cloud-based service. In some examples, the interface system 155 may also reside in more than one device.
In some embodiments, the control system 160 may be configured to at least partially perform the methods disclosed herein. According to some examples, control system 160 may be configured to determine and cause audio processing changes of two or more audio devices of an audio environment in order to increase SER at one or more microphones of the audio environment. In some examples, the audio processing variation(s) may be based at least in part on the audio device location information and the echo management system information. According to some examples, the audio processing variation(s) may be responsive to a microphone output signal corresponding to a current utterance of a person (e.g., an utterance of a wake-up word). In some examples, the audio processing variation(s) may be determined from the results of the optimization process. According to some examples, the optimization process may involve a tradeoff between objective sound capture performance goals and constraints that preserve one or more aspects of the user's listening experience. In some examples, the constraint may be a perceptual constraint, an objective constraint, or a combination thereof.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include those memory devices as described herein, including but not limited to Random Access Memory (RAM) devices, read Only Memory (ROM) devices, and the like. One or more non-transitory media may reside, for example, in the optional memory system 165 and/or the control system 160 shown in fig. 1A. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. For example, the software may include instructions for controlling at least one device to perform some or all of the methods disclosed herein. For example, the software may be executed by one or more components of a control system, such as control system 160 of FIG. 1A.
In some examples, the apparatus 150 may include an optional microphone system 170 shown in fig. 1A. The optional microphone system 170 may include one or more microphones. According to some examples, optional microphone system 170 may include a microphone array. In some examples, the microphone array may be configured to determine direction of arrival (DOA) and/or time of arrival (TOA) information, e.g., according to instructions from the control system 160. In some examples, the microphone array may be configured for receive-side beamforming, e.g., according to instructions from the control system 160. In some implementations, one or more microphones may be part of or associated with another device (e.g., a speaker of a speaker system, a smart audio device, etc.). In some examples, the apparatus 150 may not include the microphone system 170. However, in some such embodiments, the apparatus 150 may still be configured to receive microphone data from one or more microphones in an audio environment via the interface system 155. In some such embodiments, a cloud-based embodiment of the apparatus 150 may be configured to receive microphone data or data corresponding to microphone data from one or more microphones in an audio environment via the interface system 155.
According to some embodiments, the apparatus 150 may include an optional loudspeaker system 175 shown in fig. 1A. The optional loudspeaker system 175 may include one or more loudspeakers, which may also be referred to herein as "speakers" or more generally as "audio reproduction transducers." In some examples (e.g., cloud-based implementations), the apparatus 150 may not include the loudspeaker system 175.
In some embodiments, the apparatus 150 may include an optional sensor system 180 shown in fig. 1A. The optional sensor system 180 may include one or more touch sensors, gesture sensors, motion detectors, and the like. According to some embodiments, the optional sensor system 180 may include one or more cameras. In some implementations, the camera may be a standalone camera. In some examples, one or more cameras of optional sensor system 180 may reside in a smart audio device, which in some examples may be configured to at least partially implement a virtual assistant. In some such examples, one or more cameras of the optional sensor system 180 may reside in a television, mobile phone, or smart speaker. In some examples, the apparatus 150 may not include the sensor system 180. However, in some such embodiments, the apparatus 150 may still be configured to receive sensor data from one or more sensors in the audio environment via the interface system 155.
In some implementations, the apparatus 150 may include an optional display system 185 shown in fig. 1A. The optional display system 185 may include one or more displays, such as one or more Light Emitting Diode (LED) displays. In some examples, the optional display system 185 may include one or more Organic Light Emitting Diode (OLED) displays. In some examples, the optional display system 185 may include one or more displays of a smart audio device. In other examples, the optional display system 185 may include a television display, a laptop computer display, a mobile device display, or another type of display. In some examples where the apparatus 150 includes the display system 185, the sensor system 180 may include a touch sensor system and/or a gesture sensor system proximate to one or more displays of the display system 185. According to some such embodiments, the control system 160 may be configured to control the display system 185 to present one or more Graphical User Interfaces (GUIs).
According to some such examples, apparatus 150 may be or may include a smart audio device. In some such embodiments, the apparatus 150 may be or may include a wake word detector. For example, the apparatus 150 may be or may include a virtual assistant.
Fig. 1B illustrates an example of an audio environment. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 1B are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements, different arrangements of elements, and so forth.
According to this example, audio environment 100 includes audio devices 110A, 110B, 110C, 110D, and 110E. In some examples, audio devices 110A-110E may be examples of apparatus 150 of FIG. 1A. In this example, audio devices 110A-110E each include at least a respective one of microphones 120A, 120B, 120C, 120D, and 120E, and at least a respective one of loudspeakers 121A, 121B, 121C, 121D, and 121E. In this example, single instances of microphones 120A-120E and loudspeakers 121A-121E are shown. However, one or more of the audio devices 110A-110E may include a microphone system including a plurality of microphones and/or a loudspeaker system including a plurality of loudspeakers. According to some examples, audio devices 110A-110E may each be a smart audio device, such as a smart speaker.
In some examples, some or all of audio devices 110A-110E may be orchestrated audio devices that operate (at least in part) according to instructions from an orchestration device. According to some such examples, the orchestration device may be one of the audio devices 110A-110E. In other examples, the orchestration device may be another device, such as a smart home hub.
In this example, people 101A and 101B are in an audio environment. In this example, the acoustic event is caused by the speaking person 101A speaking in the vicinity of the audio device 110A. The element 102 is intended to represent speech of the person 101A. In this example, speech 102 corresponds to person 101A speaking a wake word.
Fig. 2 shows echo paths between three of the audio devices of fig. 1B. Elements of fig. 2 that are not described with reference to fig. 1B are as follows:
200AA: an echo path from device 110A to device 110A (from loudspeaker 121A to microphone 120A);
200AB: an echo path from device 110A to device 110B (from loudspeaker 121A to microphone 120B);
200AC: an echo path from device 110A to device 110C (from loudspeaker 121A to microphone 120C);
200BA: an echo path from device 110B to device 110A (from loudspeaker 121B to microphone 120A);
200BB: an echo path from device 110B to device 110B (from loudspeaker 121B to microphone 120B);
200BC: an echo path from device 110B to device 110C (from loudspeaker 121B to microphone 120C);
200CA: an echo path from device 110C to device 110A (from loudspeaker 121C to microphone 120A);
200CB: an echo path from device 110C to device 110C (from loudspeaker 121C to microphone 120B); and
200CC: an echo path from device 110C to device 110C (from loudspeaker 121C to microphone 120C).
These echo paths indicate the effect of the playback audio or "echo" of each audio device on the other audio devices. This effect may be referred to herein as the "mutual audibility" of the audio devices. The mutual audibility will depend on various factors including the location and orientation of each audio device in the audio environment, the playback level of each audio device, the loudspeaker capabilities of each audio device, etc. Some embodiments may involve constructing a more accurate representation of the mutual audibility of audio devices in an audio environment, such as an audibility matrix A representing the energy of echo paths 200AA-200CC. For example, each column of the audibility matrix may represent an audio device loudspeaker and each row of the audibility matrix may represent an audio device microphone, or vice versa. In some such audibility matrices, the diagonal of the audibility matrix may represent the echo path from the loudspeaker(s) of an audio device to the microphone(s) of the same audio device.
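Such an audibility matrix might be organized as in the sketch below, with rows indexing audio device microphones and columns indexing audio device loudspeakers, so that the diagonal holds each device's own echo path (200AA, 200BB, 200CC above). The device positions and the assumption that echo power falls off inversely with distance are illustrative; as noted elsewhere herein, the matrix could equally be populated from measurements, impulse response estimates, or echo canceller filter taps.

    import numpy as np

    # Hypothetical positions, in meters, for three of the audio devices.
    positions = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
    playback_level_db = np.array([70.0, 70.0, 70.0])   # per-device playback level

    def audibility_matrix(positions, playback_level_db, ref_distance=0.1):
        """A[i, j]: estimated echo level at the microphone of device i due to the
        loudspeaker of device j, assuming echo power inversely proportional to distance."""
        n = len(positions)
        a = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                d = max(np.linalg.norm(positions[i] - positions[j]), ref_distance)
                a[i, j] = playback_level_db[j] - 10.0 * np.log10(d / ref_distance)
        return a

    print(np.round(audibility_matrix(positions, playback_level_db), 1))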
As can be seen from fig. 2, if 200CA and 200BA, which are the echo paths from audio devices 110C and 110B to audio device 110A, respectively, have strong coupling, the echoes from audio device 110C and audio device 110B will be significant in the echo residual of audio device 110A. In such an instance, if the audio system's response to detecting speech 102 (which in this example corresponds to a wake word) is simply to turn down the nearest loudspeaker, this would involve ducking playback from only loudspeaker(s) 121A. If this is the only response of the audio system to the detection of a wake word, the potential SER improvement may be significantly limited. Thus, some disclosed examples may involve other responses to detecting wake words in such cases.
Fig. 3 is a system block diagram representing components of an audio device according to one example. In fig. 3, the block representing the audio device 110A includes a loudspeaker 121A and a microphone 120A. In some examples, loudspeaker 121A may be one of a plurality of loudspeakers in a loudspeaker system (e.g., loudspeaker system 175 of fig. 1A). Similarly, according to some embodiments, microphone 120A may be one of a plurality of microphones in a microphone system (e.g., microphone system 170 of fig. 1A).
In this example, the audio device 110A includes a renderer 201A, an Echo Management System (EMS) 203A, and a speech processor/communication block 240A. In this example, the EMS 203A may be or may include an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both AEC and AES. According to this example, the renderer 201A is configured to render audio data 301 received by the audio device 110A or stored on the audio device 110A for reproduction on the loudspeaker 121A. In some examples, the audio data may include one or more audio signals and associated spatial data. The spatial data may, for example, indicate an expected perceived spatial position corresponding to the audio signal. In some examples, the spatial data may be or may include spatial metadata corresponding to the audio object. In this example, the renderer output 220A is provided to the loudspeaker 121A for playback, and the renderer output 220A is also provided to the EMS 203A as a reference for echo cancellation.
In addition to receiving the renderer output 220A, in this example, the EMS 203A also receives a microphone signal 223A from the microphone 120A. In this example, the EMS 203A processes the microphone signal 223A and provides an echo cancellation residual 224A (which may also be referred to herein as "residual output 224A") to the speech processor/communication block 240A.
In some implementations, the speech processor/communication block 240A may be configured for speech recognition functionality. In some examples, the speech processor/communication block 240A may be configured to provide telecommunication services, such as telephone calls, video conferences, and the like. Although not shown in fig. 3, the speech processor/communication block 240A may be configured to communicate with one or more networks, the loudspeaker 121A, and/or the microphone 120A, for example, via an interface system. For example, the one or more networks may include a local Wi-Fi network, one or more types of telephone networks, and so forth.
Fig. 4 illustrates elements of a ducking module according to one example. In this embodiment, the ducking module 400 is implemented by an example of the control system 160 of fig. 1A. In this example, the elements of fig. 4 are as follows:
401: acoustic models of inter-device and intra-device echoes, which in some examples include acoustic models of user utterances. According to some examples, acoustic model 401 may be or may include a model of how playback from each audio device in an audio environment presents itself as an echo detected by the microphone of each audio device (itself and other devices) in the audio environment. In some examples, the acoustic model 401 may be based at least in part on the audio environment impulse response estimate, characteristics of the impulse response, such as peak amplitude, decay time, and the like. In some examples, the acoustic model 401 may be based at least in part on the audibility estimate. In some such examples, the audibility estimation may be based on microphone measurements. Alternatively or additionally, audibility estimates may be inferred by audio device location (e.g., based on echo power being inversely proportional to distance between audio devices). In some examples, the acoustic model 401 may be based at least in part on a long-term estimate of AEC/AES filter taps. In some examples, acoustic model 401 may be based at least in part on a waffle, which may contain information about the capabilities (loudness) of loudspeaker(s) of each audio device;
452: spatial information including information about a location of each of the plurality of audio devices in the audio environment. In some examples, the spatial information 452 may include information regarding an orientation of each of the plurality of audio devices. According to some examples, spatial information 452 may include information regarding the location of one or more persons in an audio environment. In some examples, the spatial information 452 may include information about an impulse response of at least a portion of the audio environment;
402: a model of EMS performance that may indicate the performance of EMS203A of fig. 3 or the performance of another EMS (e.g., the EMS of fig. 5 or 6). In this example, EMS performance model 402 predicts how well EMS (AEC, AES, or both) will perform. In some examples, EMS performance model 402 may predict how well EMS will perform given the current audio environment impulse response, the current noise level(s) of the audio environment, the type of algorithm(s) used to implement the EMS, the type of content played back in the audio environment, the number of echo references fed into the EMS algorithm(s), the capacity/quality of the microphone in the audio environment (nonlinearities in the microphone will set an upper limit on expected performance), or a combination thereof. According to some examples, EMS performance model 402 may be based at least in part on empirical observations, such as by observing how EMS behaves under various conditions, storing data points based on such observations, and constructing a model based on these data points (e.g., by fitting a curve). In some examples, EMS performance model 402 may be based at least in part on machine learning, such as by training a neural network based on empirical observations of EMS performance. Alternatively or additionally, EMS performance model 402 may be based at least in part on theoretical analysis of algorithm(s) used by the EMS. In some examples, EMS performance model 402 may indicate ERLE (echo return loss enhancement) caused by operation of the EMS, which is a useful indicator for evaluating EMS performance. ERLE may, for example, indicate the amount of additional signal loss imposed by the EMS between each audio device;
403: Information about one or more current listening objectives. In some examples, the listening objective information 403 may set a target to be achieved by the ducking module, such as an SER target or an SER improvement target. According to some examples, the listening objective information 403 may include both spatial and level components;
450: target related factors that may be used to determine a target, such as external triggers, acoustic events, mode indications, etc.;
404: One or more constraints to be applied during the process of determining a ducking solution, such as constraints that trade off improvements in listening performance (in other words, improvements in the ability of one or more microphones to capture audio (e.g., human speech)) against other metrics (e.g., degradation of the listening experience of a person in the audio environment). In one example, a constraint may prevent the ducking module 400 from reducing the loudspeaker reproduction level of some or all of the loudspeakers in the audio environment to an unacceptably low level, such as a level of zero (silence);
451: Metadata regarding the current audio content, which may include spatial metadata, level metadata, content type metadata, and the like. Such metadata may provide information (directly or indirectly) about the effect that ducking would have on the listening experience of a person in the audio environment. For example, if the spatial metadata indicates that multiple loudspeakers of an audio environment are reproducing "large" audio objects, ducking one of the loudspeakers may not adversely affect the listening experience. As another example, if the content metadata indicates that the content is a podcast, in some instances the monolog or conversation of the podcast may be played back by multiple loudspeakers of the audio environment, so ducking one of the loudspeakers may not adversely affect the listening experience. However, if the content metadata indicates that the audio content corresponds to a movie or television program, the dialog of such content may be played back primarily or entirely by a particular loudspeaker (e.g., a "front" loudspeaker), and ducking those loudspeakers may therefore have an adverse effect on the listening experience;
405: a model for deriving perceptual driving constraints. Some detailed examples are set forth elsewhere herein;
406: optimization algorithms, which may vary depending on the particular implementation. In some examples, the optimization algorithm 406 may be or may include a closed-form optimization algorithm. In some examples, the optimization algorithm 406 may be or may include an iterative process. Some detailed examples are set forth elsewhere herein; and
480: The ducking solution output by the ducking module 400. As described in more detail elsewhere herein (e.g., with reference to figs. 5 and 6), the ducking solution 480 may vary depending on various factors, including whether the ducking solution 480 is provided to the renderer or whether the ducking solution 480 is applied to audio data output from the renderer.
According to some examples, the EMS performance model 402 and the acoustic model 401 may provide a means for estimating or predicting the SER, e.g., as described below.
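One plausible way these two models could combine to predict SER is sketched below: the acoustic model supplies per-path echo levels (an audibility matrix), the EMS performance model supplies the expected ERLE for each path, and the residual echo remaining after ducking and echo management sets the predicted SER at each microphone. All names, array shapes, and example levels are assumptions made for illustration.

    import numpy as np

    def predicted_ser_db(speech_level_db, audibility_db, duck_db, erle_db):
        """Predicted SER at each microphone after ducking and echo management.

        speech_level_db: expected speech level at each microphone, shape (n,)
        audibility_db:   echo level of loudspeaker j at microphone i, shape (n, n)
        duck_db:         ducking applied to each loudspeaker, shape (n,)
        erle_db:         expected echo return loss enhancement per path, shape (n, n)
        """
        residual_db = audibility_db - duck_db[np.newaxis, :] - erle_db
        residual_power = np.sum(10.0 ** (residual_db / 10.0), axis=1)   # total echo power per mic
        return speech_level_db - 10.0 * np.log10(residual_power)

    speech = np.array([60.0, 55.0, 50.0])
    audibility = np.array([[70.0, 45.0, 40.0],
                           [45.0, 70.0, 42.0],
                           [40.0, 42.0, 70.0]])
    erle = np.full((3, 3), 20.0)                    # assume 20 dB of ERLE on every path
    no_ducking = predicted_ser_db(speech, audibility, np.zeros(3), erle)
    with_ducking = predicted_ser_db(speech, audibility, np.array([12.0, 6.0, 6.0]), erle)
    print(np.round(no_ducking, 1), np.round(with_ducking, 1))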
Constraint(s) 404 and the perceptual model 405 may be used to ensure that the ducking solution 480 output by the ducking module 400 is not degenerate or worthless. An example of a worthless solution is to set the playback level globally to 0. Constraint(s) 404 may be perceptual and/or objective. According to some examples, constraint(s) 404 may be based at least in part on a perceptual model, such as a model of human hearing. In some examples, constraint(s) 404 may be based at least in part on audio content energy preservation, audio spatial preservation, audio energy vectors, or one or more combinations thereof. According to some examples, constraint(s) 404 may be or may include regularization constraints. The listening objective information 403 may, for example, determine a current target SER improvement to be achieved by distributed ducking (in other words, by ducking two or more audio devices of the audio environment).
Global optimization
In some examples, when selecting the audio device(s) for ducking, the estimated SER and/or wake word information obtained when the wake word is detected may also be used to select the audio device that will listen to the next utterance. If this audio device selection is wrong, it is unlikely that the best listening device will understand the command spoken after the wake word. This is because Automatic Speech Recognition (ASR) is more difficult than Wake Word Detection (WWD), which is one of the motivating factors for ducking. If the best listening device is not ducked, ASR is likely to fail on all audio devices. Thus, in some such examples, the ducking method involves using a priori estimates (from the WWD) to optimize the ASR phase by ducking the nearest (or estimated best) audio device(s).
Thus, some ducking embodiments involve using a priori estimates in determining the ducking solution. However, in embodiments such as the one shown in fig. 4, listening objectives and constraints may be applied to achieve more robust ASR performance. In some such examples, the ducking method may involve configuring the ducking algorithm such that the SER improvement is significant at all potential user locations in the acoustic space. In this way, it can be ensured that at least one of the microphones in the room has a sufficient SER to achieve robust ASR performance. Such an embodiment may be advantageous if the talker's location is unknown, or if there is uncertainty about the talker's location. Some such examples may involve accounting for one or more of these uncertainties by spatially widening the SER improvement zone, thereby accommodating variations in the talker and/or audio device position estimates.
Some such examples may involve use of the delta (δ) parameter or similar parameters discussed below. Other examples may involve a multi-parameter model describing or corresponding to the uncertainty of the talker location and/or audio device location estimates.
In some embodiments, the ducking method may be performed in the context of one or more user zones. As will be detailed later in this document, a set of acoustic features W(j) can be used to estimate the posterior probability p(Z_k | W(j)) of a certain set of zone labels Z_k (for k = 1, ..., K) for K different user zones in the environment. The association of each audio device with each user zone may be provided by the user, as part of the training process described in this document, or may alternatively be provided by way of an application (e.g., an Alexa smartphone application or a Sonos S2 controller smartphone application). For example, some embodiments may represent the association of the nth device with the user zone having zone label Z_k as z(Z_k, n) ∈ [0, 1]. In some embodiments, both z(Z_k, n) and the posterior probability p(Z_k | W(j)) may be considered context information. Conversely, some embodiments may treat the acoustic features W(j) themselves as part of the context. In other embodiments, more than one of these quantities (z(Z_k, n), the posterior probability p(Z_k | W(j)), and the acoustic features W(j) themselves) and/or a combination of these quantities may be part of the context information.
In some examples, the evasion method may use a quantity associated with one or more user zones in selecting an audio device for evasion or other audio processing changes. In the case where both z and p are available, an example audio device selection decision may be made according to the following expression:
According to some such embodiments, the audio devices with the highest associations with the user zones most likely to contain users (speakers) will have the most audio processing (e.g., rendering) changes applied to them. In some examples, δ may be a positive number in the range of [0.5,4.0 ]. According to some such examples, δ may be used to spatially control the range of rendering variations. In such an embodiment, if δ is selected to be 0.5, more devices will receive a larger rendering change, while a value of 4.0 will limit rendering change only to devices closest to the most likely user zone.
In some embodiments, the acoustic feature set W(j) may be used directly in the evasion method. For example, if the wake word confidence score associated with utterance j is w_n(j), then the audio device selection may be made according to the following expression:
In the preceding expressions, δ has the same interpretation as the preceding example, and further has the utility of compensating for a typical distribution of wake word confidence that a particular wake word system may appear. If most audio devices tend to report high wake-up word confidence, δ may be selected as a relatively high number, such as 3.0, to increase the spatial specificity of rendering variant applications. If the wake word confidence drops off rapidly as the speaker is located farther from the device, δ may be selected to be a relatively low number, such as 1.0 or even 0.5, in order to include more devices in the rendering change application. The reader will appreciate that in some alternative embodiments, formulas similar to those above for acoustic features (such as an estimate of speech level at the microphone of the device and/or direct-to-reverberation ratio of the user utterance) may be substituted for wake-up word confidence.
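As an illustration of how δ can control how sharply rendering changes are concentrated, the following sketch computes per-device selection weights from wake word confidence scores. It assumes a simple normalized power-law weighting; the exact expressions referenced above are not reproduced in this text, so the function and its form are illustrative only.

```python
import numpy as np

def selection_weights_from_confidence(wake_confidences, delta=2.0):
    """Illustrative per-device selection weights f_n derived from wake word
    confidence scores w_n(j). The exponent delta controls spatial
    specificity; the normalized power-law form used here is an assumption,
    not the expression referenced above."""
    w = np.clip(np.asarray(wake_confidences, dtype=float), 1e-6, None)
    scores = w ** delta                 # larger delta -> sharper selection
    return scores / scores.max()        # f_n in (0, 1], 1 for the best device

# With delta = 0.5 the weights stay close together (more devices are changed);
# with delta = 4.0 they concentrate on the device that heard the wake word best.
print(selection_weights_from_confidence([0.4, 0.9, 0.6], delta=0.5))
print(selection_weights_from_confidence([0.4, 0.9, 0.6], delta=4.0))
```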
Fig. 5 is a block diagram illustrating an example of an audio device including an evasion module. The functions of the renderer 201A, the speech processor/communication block 240A, EMS 203A, the loudspeaker(s) 121A, and the microphone(s) 120A may be substantially as described with reference to fig. 3, except as described below. In this example, the renderer 201A, the speech processor/communication block 240A, EMS 203A, and the evasion module 400 are implemented by an example of the control system 160 described with reference to fig. 1A. For example, the avoidance module 400 may be an example of the avoidance module 400 described with reference to fig. 4. Accordingly, the evasion module 400 may be configured to determine one or more types of audio processing variations (indicated by or corresponding to the evasion solution 480) to apply to at least the rendered audio data of the audio device 110A (e.g., audio data that has been rendered as a loudspeaker feed). The audio processing variation may be or may include a reduction in loudspeaker reproduction level of one or more loudspeakers in the audio environment. In some examples, the evasion module 400 may be configured to determine one or more types of audio processing changes to apply to rendered audio data of two or more audio devices in an audio environment.
In this example, the renderer output 220A and the dodge solution 480 are provided to a gain multiplier 501. In some examples, the dodge solution 480 includes a gain for the gain multiplier 501 to apply to the renderer output 220A to produce the processed audio data 502. According to this example, the processed audio data 502 is provided to the EMS203A as a local reference for echo cancellation. Here, the processed audio data 502 is also provided to the loudspeaker(s) 121A for reproduction.
In some examples, the avoidance module 400 may be configured to determine a avoidance solution 480, as described below with reference to fig. 8. According to some examples, the avoidance module 400 may be configured to determine a avoidance solution 480, as described below in the "optimization for a particular device" section.
Fig. 6 is a block diagram illustrating an alternative example of an audio device including an evasion module. The functions of the renderer 201A, the speech processor/communication block 240A, EMS 203A, the loudspeaker(s) 121A, and the microphone(s) 120A may be substantially as described with reference to fig. 3, except as described below. For example, the avoidance module 400 may be an example of the avoidance module 400 described with reference to fig. 4. In this example, the renderer 201A, the speech processor/communication block 240A, EMS 203A, and the evasion module 400 are implemented by an example of the control system 160 described with reference to fig. 1A.
According to this example, the evasion module 400 is configured to provide an evasion solution 480 to the renderer 201A. In some such examples, the dodging solution 480 may cause the renderer 201A to implement one or more types of audio processing changes during the process of rendering the received audio data 301 (or during the process of rendering audio data already stored in the memory of the audio device 110A), which may include a reduction in loudspeaker reproduction level. In some examples, the evasion module 400 may be configured to determine an evasion solution 480 for implementing one or more types of audio processing changes via one or more instances of the renderer 201A in one or more other audio devices in the audio environment.
In this example, the renderer 201A outputs processed audio data 502. According to this example, the processed audio data 502 is provided to the EMS203A as a local reference for echo cancellation. Here, the processed audio data 502 is also provided to the loudspeaker(s) 121A for reproduction.
In some examples, the avoidance solution 480 may include one or more penalties implemented by a flexible rendering algorithm, e.g., as described below. In some such examples, the penalty may be a loudspeaker penalty estimated to cause the desired SER improvement. According to some examples, determining the one or more types of audio processing variations may be based on optimization of a cost function by the evasion module 400 or the renderer 201A.
Fig. 7 is a flowchart outlining one example of a method for determining a dodging solution. In some examples, method 720 may be performed by an apparatus such as the apparatus shown in fig. 1A, 5, or 6. In some examples, method 720 may be performed by a control system of an orchestration device, which in some examples may be an audio device. In some examples, the method 720 may be performed at least in part by a evasion module, such as the evasion module 400 of fig. 4, 5, or 6. According to some examples, method 720 may be performed at least in part by a renderer. As with other methods described herein, the blocks of method 720 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described.
In this example, the process begins at block 725. In some examples, block 725 may correspond to a boot process, or a time when the boot process has completed and the device configured to perform method 720 is ready to work.
According to this example, block 730 involves waiting for a wake word to be detected. If the method 720 is being performed by an audio device, block 730 may also involve playing back rendered audio data corresponding to received or stored audio content (e.g., the audio of music content, a podcast, a movie or a television program, etc.).
In this example, upon detection of a wake word (e.g., in block 730), the SER of the wake word is estimated in block 735. S(a) is an estimate of the speech-to-echo ratio at device a. By definition, the speech-to-echo ratio in dB is given by:
S(a) = Speech(a) - Echo(a) [dB]
In the foregoing expression, Speech(a) represents an estimate of the speech energy in dB and Echo(a) represents an estimate of the residual echo energy in dB after echo cancellation. Various methods for estimating these quantities are disclosed herein, such as:
(1) The speech energy and residual echo energy may be estimated by an off-line measurement process performed for a particular device, taking into account the acoustic coupling between the microphone and speaker of the device and the performance of the on-board echo cancellation circuit. In some such examples, the average speech energy level "AvgSpeech" may be determined from the average level of human speech measured by the device at a nominal distance. For example, speech from a few people standing 1 m away from the device may be recorded during production by a microphone-equipped device, and the energy may be averaged to produce AvgSpeech. According to some such examples, the average residual echo energy level "AvgEcho" may be estimated by playing music content from the device during production and running the on-board echo cancellation circuit to produce an echo residual signal. Averaging the energy of the echo residual signal over a small sample of the music content may be used to estimate AvgEcho. AvgEcho may alternatively be set to a nominal low value, such as -96.0 dB, when the device is not playing audio. In some such embodiments, the speech energy estimate may simply be set to AvgSpeech and the residual echo energy estimate to AvgEcho.
(2) According to some examples, the average speech energy may be determined by obtaining the energy of the microphone signal corresponding to the user utterance, as determined by a Voice Activity Detector (VAD). In some such examples, when the VAD indicates no speech, the average residual echo energy may be estimated from the energy of the microphone signal. If x represents device a's microphone Pulse Code Modulation (PCM) samples at a given sample rate and V represents a VAD flag (which takes a value of 1.0 for samples corresponding to voice activity, and 0.0 otherwise), then the speech energy and residual echo energy may be computed as the VAD-gated average energies of x (see the sketch following this list).
(3) Further to the previous approach, in some embodiments the energy in the microphone may be considered a random variable and may be modeled separately based on the VAD determination. Any number of statistical modeling techniques may be used to estimate statistical models Sp and E of the speech energy and echo energy, respectively. Averages (in dB) of speech and echo that approximate S(a) can then be derived from Sp and E, respectively. Common methods of achieving this are found in the statistical signal processing arts, for example:
Assuming the energy follows a Gaussian distribution and calculating (possibly biased) second-order statistics, i.e., the sample mean and variance; or
Creating a discretely binned histogram of energy values to capture a potentially multi-modal distribution, then applying expectation-maximization (EM) parameter estimation to a mixture model (e.g., a Gaussian mixture model); the largest mean belonging to any sub-distribution of the mixture can then be used.
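A minimal sketch of approaches (1) and (2) above is given below. It assumes the VAD-gated quantities are simple average energies expressed in dB; the function name, the fallback values and the small regularization constant are illustrative.

```python
import numpy as np

def estimate_ser_db(x, vad, avg_speech_db=None, avg_echo_db=-96.0):
    """Estimate the speech-to-echo ratio S(a), in dB, at one device.

    x:   microphone PCM samples (1-D array)
    vad: voice activity flags, 1.0 for samples with speech, 0.0 otherwise
    If no speech (or no echo-only) samples are available, fall back to
    offline averages in the spirit of approach (1); the defaults are
    placeholders, not measured values."""
    x = np.asarray(x, dtype=float)
    vad = np.asarray(vad, dtype=float)

    def avg_energy_db(samples):
        return (10.0 * np.log10(np.mean(samples ** 2) + 1e-12)
                if samples.size else None)

    speech_db = avg_energy_db(x[vad > 0.5])   # VAD-gated speech energy
    echo_db = avg_energy_db(x[vad <= 0.5])    # energy when the VAD reports no speech

    if speech_db is None:                     # no voice activity observed
        speech_db = -30.0 if avg_speech_db is None else avg_speech_db
    if echo_db is None:                       # no echo-only samples observed
        echo_db = avg_echo_db

    return speech_db - echo_db                # SER = speech (dB) - residual echo (dB)
```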
According to this example, block 740 involves obtaining a target SER (from block 745 in this example) and calculating a target SER improvement. In some embodiments, the desired SER improvement (SERI) may be determined as follows:
SERI = S(m) - TargetSER [dB]
In the foregoing expression, m represents the device/microphone location where the SER is being improved, and TargetSER represents a threshold, which in some examples may be set according to the application in use. For example, a wake word detection algorithm may tolerate a lower operating SER than a command detection algorithm, and a command detection algorithm may tolerate a lower operating SER than a large-vocabulary speech recognizer. Typical values for TargetSER may be on the order of -6 dB to 12 dB. If S(m) is unknown or not easily estimated in some instances, a preset value based on offline measurements of speech and echo recorded in a typical echoic room or setup may be sufficient. Some embodiments may determine which audio devices are to have their audio processing (e.g., rendering) modified by specifying f_n in the range of 0 to 1. Other embodiments may involve specifying the extent to which the audio processing (e.g., rendering) should be modified (in terms of decibels of speech-to-echo ratio improvement), s_n, which may be calculated according to the following equation:
s_n = SERI * f_n
some embodiments may calculate f_n directly from the device geometry, e.g., as follows:
In the foregoing expression, m represents the index of the device selected for maximum audio processing (e.g., rendering) modification, and H(m, i) represents the approximate physical distance between devices m and i. Other embodiments may involve other choices of attenuation or smoothing function over the device geometry.
Thus, H is a property of the physical location of the audio device in the audio environment. According to particular embodiments, H may be determined or estimated according to various methods. Various examples of methods for estimating the location of an audio device in an audio environment are described below.
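The sketch below combines the SERI expression and s_n = SERI * f_n above with a distance-based choice of f_n. Because the geometric expression for f_n is not reproduced in this text, an exponential decay with distance is used as an assumed smoothing function; the decay parameter and function name are illustrative.

```python
import numpy as np

def per_device_targets(ser_db, target_ser_db, H, m, decay=1.0):
    """Per-device modification targets s_n in dB.

    ser_db:        estimated SER S(m) at the selected device/microphone m
    target_ser_db: TargetSER for the application in use
    H:             J x J matrix of approximate inter-device distances
    m:             index of the device selected for maximum modification
    decay:         rate at which f_n falls off with distance from device m
                   (an assumed smoothing function)."""
    seri = ser_db - target_ser_db            # SERI = S(m) - TargetSER [dB]
    dist = np.asarray(H, dtype=float)[m]     # distances from device m
    f = np.exp(-decay * dist)                # f_m = 1, decreasing with distance
    return seri * (f / f.max())              # s_n = SERI * f_n, with f_n in (0, 1]
```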
In this example, block 750 involves calculating what may be referred to herein as a "dodging solution". While the dodging solution may involve determining a reduction in loudspeaker reproduction level of one or more loudspeakers in an audio environment, the dodging solution may also involve one or more other audio processing variations, such as the audio processing variations disclosed herein. The avoidance solution determined in block 750 is one example of the avoidance solution 480 of fig. 4, 5, and 6. Accordingly, block 750 may be performed by the evasion module 400.
According to this example, the evasion solution determined in block 750 is based (at least in part) on the target SER, the evasion constraint (represented by block 755), the AEC model (represented by block 765), and the acoustic model (represented by block 760). The acoustic model may be an example of the acoustic model 401 of fig. 4. For example, the acoustic model may be based at least in part on inter-device audibility, which may also be referred to herein as inter-device echo or mutual audibility. In some examples, the acoustic model may be based at least in part on intra-device echoes. In some examples, the acoustic model may be based at least in part on an acoustic model of the user utterance, such as acoustic properties of a typical human utterance, acoustic properties of a human utterance previously detected in an audio environment, and so forth. In some examples, the AEC model may be an instance of AEC model 402 of fig. 4. The AEC model may indicate the performance of EMS203A of fig. 5 or 6. In some examples, EMS performance model 402 may indicate the actual or expected ERLE (echo return loss enhancement) caused by the operation of the AEC. ERLE may, for example, indicate the amount of additional signal loss imposed by the AEC between each audio device. According to some examples, EMS performance model 402 may be based at least in part on an expected ERLE for a given number of echo references. In some examples, EMS performance model 402 may be based at least in part on an estimated ERLE calculated from the actual microphone and residual signals.
In some examples, the avoidance solution determined in block 750 may be an iterative solution, while in other examples, the avoidance solution may be a closed-form solution. Examples of both iterative and closed-form solutions are disclosed herein.
In this example, block 770 involves applying the evasion solution determined in block 750. In some examples, as shown in fig. 5, the evasion solution may be applied to the rendered audio data. In other examples, as shown in fig. 6, the evasion solution determined in block 750 may be provided to a renderer. The dodging solution may be applied as part of a process of rendering audio data input to a renderer.
In this example, block 775 involves detecting another utterance, which in some examples may be a command spoken after the wake word. According to this example, block 780 involves estimating the SER of the utterance detected in block 775. In this example, block 785 involves updating the AEC model and the acoustic model based at least in part on the SER estimated in block 780. According to this example, the process of block 785 is performed after the dodging solution has been applied. In a perfect system, the actual SER improvement and the actual SER would exactly match their targets. In real-world systems, the actual SER improvement and the actual SER may differ from the targets. In such embodiments, method 720 involves using at least the estimated SER to update the information and/or models used to compute the dodging solution. According to this example, the evasion solution is based at least in part on the acoustic model of block 760. For example, the acoustic model of block 760 may indicate a very strong acoustic coupling between audio device X and microphone Y, and the evasion solution may therefore involve heavily dodging audio device X in order to reduce the echo received at microphone Y. However, after estimating the SER of the utterance detected in block 775 while the evasion solution is applied, the control system may determine that the actual SER and/or SERI is not as intended. If so, block 785 may involve updating the acoustic model accordingly (in this example, by reducing the acoustic coupling estimate between audio device X and microphone Y). According to this example, the process then returns to block 730.
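A minimal sketch of a block-785-style model update is shown below. It assumes a simple first-order correction of the mutual audibility estimate (in dB) when the observed SER improvement differs from the target; the learning rate and the update rule are illustrative assumptions, not the specific rule of method 720.

```python
def update_coupling_estimate(audibility_db, dev_x, mic_y,
                             target_seri_db, observed_seri_db,
                             learning_rate=0.5):
    """Nudge the estimated acoustic coupling (dB) between audio device
    dev_x and microphone device mic_y so that the acoustic model better
    predicts the SER improvement actually observed after dodging."""
    error_db = target_seri_db - observed_seri_db
    # If less improvement was observed than targeted, device X was probably
    # not as strongly coupled to microphone Y as the model assumed, so the
    # coupling estimate is reduced; if more was observed, it is increased.
    audibility_db[dev_x][mic_y] -= learning_rate * error_db
    return audibility_db
```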
Fig. 8 is a flowchart outlining another example of a method for determining a dodging solution. In some examples, method 800 may be performed by an apparatus such as the apparatus shown in fig. 1A, 5, or 6. In some examples, method 800 may be performed by a control system of an orchestration device, which in some examples may be an audio device. In some examples, the method 800 may be performed at least in part by a evasion module, such as the evasion module 400 of fig. 4,5, or 6. According to some examples, method 800 may be performed at least in part by a renderer. As with other methods described herein, the blocks of method 800 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described.
In this example, the process begins at block 805. In some examples, block 805 may correspond to a boot process, or a time when the boot process has been completed and the device configured to perform method 800 is ready to operate.
According to this example, block 810 involves estimating a current echo level without applying a dodging solution. In this example, block 810 involves estimating a current echo level based (at least in part) on an acoustic model (represented by block 815) and an AEC model (represented by block 820). Block 810 may involve estimating a current echo level to be generated by a current evasion candidate solution. In some examples, the estimated current echo level may be combined with the current speech level to produce an estimated current SER improvement.
The acoustic model may be an example of the acoustic model 401 of fig. 4. For example, the acoustic model may be based at least in part on inter-device audibility, which may also be referred to herein as inter-device echo or mutual audibility. In some examples, the acoustic model may be based at least in part on intra-device echoes. In some examples, the acoustic model may be based at least in part on an acoustic model of the user utterance, such as acoustic properties of a typical human utterance, acoustic properties of a human utterance previously detected in an audio environment, and so forth.
In some examples, the AEC model may be an instance of AEC model 402 of fig. 4. The AEC model may indicate the performance of EMS203A of fig. 5 or 6. In some examples, EMS performance model 402 may indicate the actual or expected ERLE (echo return loss enhancement) caused by the operation of the AEC. ERLE may, for example, indicate the amount of additional signal loss imposed by the AEC between each audio device. According to some examples, EMS performance model 402 may be based at least in part on an expected ERLE for a given number of echo references. In some examples, EMS performance model 402 may be based at least in part on an estimated ERLE calculated from the actual microphone and residual signals.
According to this example, block 825 involves obtaining a current avoidance solution (represented by block 850) and estimating the SER based on applying the current avoidance solution. In some examples, the avoidance solution may be determined as described with reference to block 750 of fig. 7.
In this example, block 830 involves calculating a difference or "error" between the current estimate of the SER improvement and the target SER improvement (represented by block 835). In some alternative examples, block 830 may involve calculating a difference between the current estimate of SER and the target SER.
In this example, block 840 involves determining whether the difference or "error" calculated in block 830 is sufficiently small. For example, block 840 may involve determining whether the difference calculated in block 830 is equal to or less than a threshold. In some examples, the threshold may be in a range of 0.1dB to 1.0dB, such as 0.1dB, 0.2dB, 0.3dB, 0.4dB, 0.5dB, 0.6dB, 0.7dB, 0.8dB, 0.9dB, or 1.0dB. In such an example, if it is determined in block 840 that the difference calculated in block 830 is equal to or less than the threshold, then the process ends in block 845. The current dodging solution may be output and/or applied.
However, if it is determined in block 840 that the difference calculated in block 830 is not sufficiently small (e.g., not equal to or less than a threshold), then the process continues in this example to block 855. According to this example, the avoidance solution is or includes a avoidance vector. In this example, block 855 involves calculating gradients of the cost function and the constraint function relative to the evasion vector. For example, the cost function may correspond to (or describe) an error between the estimated SER improvement and the target SER improvement, as determined in block 830.
In some implementations, the constraint function may penalize the impact of the evasion vector on one or more objective functions (e.g., audio energy retention functions), one or more subjective functions (e.g., one or more perception-based functions), or a combination thereof. In some such examples, the one or more constraints may be based on a perceptual model of human hearing. According to some examples, the one or more constraints may be based on audio spatial reservations.
In some examples, block 855 may involve optimizing costs as a function of a model of perceived spatial locations of audio signals played when played back over a set of loudspeakers in an environment and a measure of the proximity of the expected perceived spatial locations of the audio signals to the locations of each of the set of loudspeakers. In some such examples, the cost may be a function of one or more additional dynamically configurable functions. In some such examples, at least one of the one or more additional dynamically configurable functions corresponds to echo canceller performance. According to some such examples, at least one of the one or more additional dynamically configurable functions corresponds to a mutual audibility of the loudspeaker in an audio environment. Detailed examples are provided below. However, other embodiments may not involve these types of cost functions.
According to this example, block 865 involves updating the current dodging solution using the gradient and one or more types of optimizers (e.g., the algorithm described below, stochastic gradient descent, or another known optimizer).
In this example, block 870 involves evaluating a change in the avoidance solution from a previous avoidance solution. According to this example, if it is determined in block 870 that the change in the avoidance solution from the previous solution is less than the threshold, the process ends in block 875. According to some examples, the threshold may be expressed in decibels. According to some such examples, the threshold may be in a range of 0.1dB to 1.0dB, such as 0.1dB, 0.2dB, 0.3dB, 0.4dB, 0.5dB, 0.6dB, 0.7dB, 0.8dB, 0.9dB, or 1.0dB. In some examples, if it is determined in block 870 that the change in the avoidance solution from the previous solution is less than or equal to the threshold, the process ends.
However, in this example, if it is determined in block 870 that the change in the avoidance solution from the previous solution is not less than the threshold, the avoidance solution of block 850 is updated to the current avoidance solution and the process returns to block 825. In some examples, method 800 may continue until block 845 or block 875 is reached. According to some examples, if block 845 or block 875 is not reached within a time interval or within multiple iterations, method 800 may terminate.
After the method 800 ends, the resulting dodging solution may be applied. In some examples, as shown in fig. 5, the evasion solution may be applied to the rendered audio data. In other examples, as shown in fig. 6, the avoidance solution determined by method 800 may be provided to a renderer. The dodging solution may be applied as part of a process of rendering audio data input to a renderer. However, as noted elsewhere herein, in some implementations, the method 800 may be performed at least in part by a renderer. According to some such implementations, the renderer may determine and apply the dodging solution.
The following algorithm is an example of obtaining a dodging solution. In some examples, the evasion solution may be, include, or indicate a gain to be applied to the rendered audio data. Thus, in some such examples, the dodging solution may be applied by a dodging module 400, such as the dodging module shown in fig. 5.
The following symbols are defined as indicated below:
A represents a mutual audibility matrix (audibility between each audio device);
p represents a nominal (non-dodged) playback level vector (across the audio device);
d represents a dodging solution vector (across devices), which may correspond to the dodging solution 480 output by the dodging module 400; and
C represents an AEC performance matrix, which in this example indicates ERLE (echo return loss enhancement), i.e. the amount of additional signal loss applied by the AEC between each audio device.
The net echo in the microphone feed of audio device i may be represented as follows:
In equation 1, J represents the number of audio devices in the room, A_j^i indicates the audibility of audio device j at audio device i, and P_j indicates the playback level of audio device j. If the effect of evading any of the audio devices is then considered, the echo in the microphone feed can be expressed as follows:
In equation 2, D_j represents the evasion gain. If a simple model of AEC is also considered, where a nominal ERLE is applied to each speaker rendering independently, the echo power in the residual can be expressed as follows:
In equation 3, C_j^i represents the ability of audio device i to cancel the echo of audio device j. In some examples, depending on whether or not the AEC has a reference available for cancellation, C_j^i will be assumed to be -20 dB or 0 dB, respectively.
If a particular device is performing cancellation against a particular non-local ("far") device, then generating C may be as trivial as setting the corresponding entry of the C matrix to the nominal cancellation performance value. More complex models may take into account noise in the audio environment, adaptation noise (any noise in the adaptive filter process), and cross-channel correlation. For example, such a model can predict how the AEC will perform in the future if a given evasion solution is applied. Some such models may be based at least in part on the echo levels and noise levels of the audio environment.
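The sketch below implements the echo model described around equations 1-3, using the symbol definitions above. The amplitude-versus-power convention of those equations is not reproduced in this text, so the sketch simply sums the per-device products C·A·D·P; the helper for building C from reference availability follows the -20 dB / 0 dB assumption mentioned above.

```python
import numpy as np

def residual_echo(A, P, D, C):
    """Residual echo estimate at each device after dodging and AEC.

    A: J x J mutual audibility matrix, A[i, j] = audibility of device j at device i
    P: length-J nominal playback level vector
    D: length-J dodging gain vector (all ones = no dodging)
    C: J x J AEC performance matrix, C[i, j] = fraction of device j's echo
       remaining after cancellation at device i
    Returns a length-J vector of per-device residual echo estimates."""
    A, P, D, C = (np.asarray(x, dtype=float) for x in (A, P, D, C))
    # Contribution of device j at device i: C[i, j] * A[i, j] * D[j] * P[j]
    return (C * A * (D * P)[None, :]).sum(axis=1)

def cancellation_matrix(has_reference):
    """Simple C matrix: -20 dB (amplitude 0.1) where a reference is
    available for cancellation, 0 dB (1.0) otherwise."""
    return np.where(np.asarray(has_reference, dtype=bool), 0.1, 1.0)
```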
In this example, the distributed evasion problem is formulated as an optimization problem that involves minimizing the total echo power in the AEC residual by changing the evasion vector, which can be expressed as follows:
Without constraints, the formulation of equation 4 would drive each audio device to be fully dodged (i.e., to silence). Thus, some examples introduce a constraint that balances the improvement in SER against the negative impact on the listening experience. Some such examples consider the magnitude of the loudspeaker rendering without changing its covariance. This constrained problem can be expressed as:
In equation 5, λ represents a Lagrange multiplier that weights the listener's experience against the improvement in SER, A_L represents the audibility of each device at the listener's position, and g() represents one possible constraint function. Various constraint functions may be used in determining the dodging solution. Another constraint function may be expressed as follows:
It gives the following simple gradient:
Iterative evasion solution
In one example, the gradient-based iterative solution of the distributed evasion optimization problem may take the form of:
In equation 8, F represents a cost function describing the distributed evasion problem, and D_n represents the evasion vector at the nth iteration. Using this form, we require D ∈ [0,1]. However, even if regularization terms are added in equation 8, D ∈ [0,1] cannot be guaranteed without adding some heuristics and/or hard constraints. Another method involves formulating a gradient-based iterative evasion solution as follows:
D_{n+1} = D_n Z (9)
In some examples, Z ∈ [0,1]. However, if it is desired to improve the SER at audio device i by reducing the echo power in the microphone feed of audio device i while also maintaining the perceived quality of the rendered audio content (at least to some extent), it can be seen that only D ≥ 0 and Z ≥ 0 need be ensured. This means that some audio devices are allowed to increase their full-band volume, at least to some extent, in order to maintain the quality of the rendered content. Thus, in some examples, Z may be defined as follows:
Z=exp{F+R} (10)
In equation 10, F represents a cost function describing the distributed evasion problem, and R is a regularization term intended to maintain the quality of the rendered audio content. In some examples, R may be an energy retention constraint. Regularization term R may also allow finding a reasonable solution for D.
If T is considered to be a target SER improvement at audio device i obtained by evading the audio device in the audio environment, F can be defined as follows:
In equation 11, E_res,i^n represents the echo in the residual of device i at the nth iteration, evaluated using equation 3, and E_res,i^0 represents the echo in the residual of device i when D is all ones. However, defining F as in equation 11 makes the step size a function of both T and the error. In some examples, F may be reformulated as follows to remove the dependence on the target SER:
It may be advantageous to adjust each element of the evasion vector in proportion to the sensitivity of E with respect to each element. Furthermore, it may be advantageous to allow the user to control the step size to some extent. With these objectives in mind, in some examples, F may be reformulated as follows:
In equation 13, M scales the individual contribution of each device to the echo in the residual of audio device i. In some examples, M may be represented as follows:
M = (A_i ⊙ C_i ⊙ P) / max(A_i ⊙ C_i ⊙ P) (14)
In equation 14, ⊙ denotes the Hadamard (element-wise) product. According to some examples, scaling by the square root of equation 14 may produce acceptable (or even improved) results. In some examples, the regularization term R may be represented as follows:
R=λ(1-D) (15)
In equation 15, λ represents a lagrange multiplier. Equation 15 allows the control system to find a concise solution for D. However, in some examples, the avoidance solution that maintains an acceptable listening experience and preserves the total energy in the audio environment may be determined by defining the regularization term R as follows:
Evasion solver algorithm
In view of the above discussion, the following evasion solver algorithm can be readily understood. The following algorithm is an example of the method 800 of fig. 8.
Input: a: audibility matrix (amplitude)
C: elimination matrix (amplitude)
P: playback level vector (amplitude)
T: target SER improvement (amplitude)
J: number of audio devices
I: audio device to optimize evasion
And (3) outputting: d: evasion vector (amplitude)
D = 1 (vector of all ones)
M′ = A_i ⊙ C_i ⊙ P
M = M′ / max(M′)
not_converged = true
while not_converged:
1. (evaluate the echo in the residual for the current D, e.g., using equation 3)
2. (evaluate the cost F, e.g., using equation 13)
3. if preserve_energy:
a. then: R is given by the energy-preserving regularization term of equation 16
b. otherwise: R = λ(1 − D)
4. D_{n+1} = D_n exp{F + R}
5. if ||D_{n+1} − D_n||_2 < THRESH, then
not_converged = false
According to some such examples, THRESH (threshold) may be 6dB, 8dB, 10dB, 12dB, 14dB, etc.
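A sketch of the solver loop above is given below. The cost F of equations 11-13 is not reproduced in this text, so the sketch uses an M-scaled error between the current and target residual echo as a stand-in, together with the R = λ(1 − D) regularizer of equation 15; the step size, convergence threshold, and target parameterization are illustrative assumptions.

```python
import numpy as np

def solve_dodging(A, C, P, T, i, lam=0.1, step=0.5,
                  thresh=1e-3, max_iter=200):
    """Iterative dodging solver sketch using the multiplicative update
    D <- D * exp{F + R}.

    A, C: J x J audibility and cancellation matrices (linear magnitudes)
    P:    length-J playback level vector
    T:    target factor by which the residual echo at device i should shrink
    i:    index of the device whose SER is being improved"""
    A, C, P = (np.asarray(x, dtype=float) for x in (A, C, P))
    D = np.ones(len(P))

    M = A[i] * C[i] * P                        # per-device contribution scaling
    M = M / M.max()

    def residual(D):                           # echo left at device i (cf. equation 3)
        return float(np.sum(C[i] * A[i] * D * P))

    E0 = residual(D)
    target_res = E0 / T                        # residual echo we are aiming for

    for _ in range(max_iter):
        err = residual(D) - target_res         # > 0 means there is still too much echo
        F = -step * M * err / max(E0, 1e-12)   # assumed stand-in for equation 13
        R = lam * (1.0 - D)                    # equation 15 regularizer
        D_next = D * np.exp(F + R)
        if np.linalg.norm(D_next - D) < thresh:
            D = D_next
            break
        D = D_next
    return np.clip(D, 0.0, None)               # only D >= 0 is required
```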
Fig. 9 is a flow chart summarizing an example of the disclosed method. In some examples, method 900 may be performed by an apparatus such as the apparatus shown in fig. 1A, 5, or 6. In some examples, method 900 may be performed by a control system of an orchestration device, which in some examples may be an audio device. In some examples, the method 900 may be performed at least in part by a evasion module, such as the evasion module 400 of fig. 4, 5, or 6. According to some examples, method 900 may be performed at least in part by a renderer. As with other methods described herein, the blocks of method 900 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described.
In this example, block 905 involves receiving, by the control system, output signals from one or more microphones in the audio environment. Here, the output signal includes a signal corresponding to a current utterance of the person. In some examples, the current utterance may be or may include a wake-up word utterance.
According to this example, block 910 involves determining, by a control system, one or more audio processing variations to apply to audio data rendered as loudspeaker feed signals for two or more audio devices in an audio environment in response to the output signals and based at least in part on the audio device location information and the echo management system information. In this example, the audio processing variation includes a reduction in loudspeaker reproduction level of one or more loudspeakers in the audio environment. Thus, the audio processing variation may include or may be indicated by what is referred to herein as a dodging solution. According to some examples, at least one of the one or more types of audio processing variations may correspond to an increased signal-to-echo ratio.
However, in some examples, the audio processing variations may include or involve variations other than a reduction in loudspeaker reproduction level. For example, the audio processing variations may involve shaping the spectrum of the output of one or more loudspeakers, which may also be referred to herein as "spectral modification" or "spectral shaping". Some such examples may involve shaping the spectrum with a substantially linear equalization (EQ) filter designed to make the output spectrally distinct from the audio that is to be detected. In some examples, if the output spectrum is shaped to favor detection of human speech, the filter may reduce frequencies in the range of approximately 500Hz to 3kHz (e.g., plus or minus 5% or 10% at each end of the frequency range). Some examples may involve shaping the loudness to emphasize low and high frequencies, leaving room in the intermediate frequency band (e.g., in the range of about 500Hz to 3 kHz).
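A sketch of the spectral shaping variation is shown below. It assumes a band-stop filter over roughly 500 Hz to 3 kHz whose output is mixed with the original signal to attenuate (rather than remove) the speech band; the filter order, band edges, cut depth and the use of SciPy are illustrative choices, and phase effects outside the band are ignored in this simple sketch.

```python
import numpy as np
from scipy.signal import butter, lfilter

def duck_speech_band(x, fs, low_hz=500.0, high_hz=3000.0, cut_db=12.0):
    """Attenuate the speech-critical band of a loudspeaker feed by mixing
    a band-stop filtered copy with the original signal."""
    b, a = butter(2, [low_hz, high_hz], btype="bandstop", fs=fs)
    stopped = lfilter(b, a, x)                # copy with the band removed
    g = 1.0 - 10.0 ** (-cut_db / 20.0)        # fraction of the band to remove
    # In-band output is roughly (1 - g) * x, i.e. reduced by about cut_db.
    return (1.0 - g) * np.asarray(x, dtype=float) + g * stopped
```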
Alternatively or additionally, in some examples, the audio processing variation may involve limiting or compressing peaks of the output, e.g., via a time-domain dynamic range compressor or a multi-band frequency-dependent compressor, to reduce peak levels and/or reduce distortion products that might otherwise degrade the performance of any echo cancellation in the overall system and thereby lower the achieved SER for audio detection. Such audio signal modification may effectively reduce the amplitude of the audio signal and may help limit the excursion of the loudspeaker.
Alternatively or additionally, in some examples, the audio processing variation may involve spatially manipulating the audio in a manner that will tend to reduce the energy or coupling of the output of one or more loudspeakers with one or more microphones of the system (e.g., the audio processing manager) that achieve a higher SER. Some such embodiments may involve the "twist" examples described herein.
Alternatively or additionally, in some examples, the audio processing variation may involve preserving energy and/or creating continuity at a particular or broad set of listening positions. In some examples, the energy removed from one loudspeaker may be compensated by providing additional energy in or to another loudspeaker. In some examples, the overall loudness may remain the same or substantially the same. This is not a fundamental feature, but may be an effective means of allowing more stringent changes to the audio processing of the 'nearest' device or nearest group of devices without losing content. However, when dealing with complex audio outputs and audio scenes, the continuity and/or retention of energy may be particularly relevant.
Alternatively or additionally, in some examples, the audio processing variation may involve time constants of activation. For example, the audio processing changes may be applied faster (e.g., over 100-200 ms) than they are returned to the normal state (e.g., over 1000-10000 ms), so that the change(s) to the audio processing (if perceptible) appear intentional, while the subsequent return from the change(s) may appear unrelated (from the perspective of the user) to any actual event or change, and in some instances may be slow enough to be almost undetectable.
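The sketch below applies different time constants when engaging versus releasing a dodging gain, in the spirit of the asymmetric attack/release behavior described above. The one-pole smoother and the specific time constants (within the ranges mentioned) are illustrative.

```python
import numpy as np

def smooth_dodging_gain(target_gain, prev_gain, block_ms,
                        attack_ms=150.0, release_ms=4000.0):
    """One-pole smoothing of a dodging gain that engages (gain going down)
    faster than it releases (gain going back up)."""
    tau = attack_ms if target_gain < prev_gain else release_ms
    alpha = 1.0 - np.exp(-block_ms / tau)     # per-block smoothing coefficient
    return prev_gain + alpha * (target_gain - prev_gain)

# Example: step down to a dodged gain, then back up, in 10 ms blocks.
gain, trace = 1.0, []
for target in [0.3] * 50 + [1.0] * 50:
    gain = smooth_dodging_gain(target, gain, block_ms=10.0)
    trace.append(gain)
```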
In this example, block 915 involves applying, by the control system, one or more types of audio processing changes. In some examples (such as the example shown in fig. 5), audio processing variations may be applied to the rendered audio data according to the evasion solution 480 from the evasion module 400. According to some examples (such as the example shown in fig. 6), audio processing variations may be applied by a renderer. In some such examples, the one or more types of audio processing variations may involve changing the rendering process to distort the rendering of the audio signal away from the estimated location of the person. However, in some such examples, such audio processing variations may still be based at least in part on the dodging solution 480 from the dodging module 400.
According to some examples, the echo management system information may include a model of echo management system performance. In some examples, the model of echo management system performance may be or may include an Acoustic Echo Canceller (AEC) performance matrix. In some examples, the model of echo management system performance may be or may include a measure of expected echo return loss enhancement provided by the echo management system.
In some examples, the one or more types of audio processing variations may be based at least in part on acoustic models of inter-device echoes and intra-device echoes. Alternatively or additionally, in some examples, the one or more types of audio processing variations may be based at least in part on a mutual audibility matrix.
Alternatively or additionally, in some examples, the one or more types of audio processing variations may be based at least in part on an estimated location of the person. In some examples, the estimated location may correspond to a point, while in other examples, the estimated location may correspond to an area, such as a user area. According to some such examples, the user zone may be part of an audio environment, such as a sofa area, a table area, a chair area, and so forth. In some examples, the estimated position may correspond to an estimated position of a head of a person. According to some examples, the estimated location of the person may be based at least in part on output signals from a plurality of microphones in the audio environment.
In some implementations, the one or more types of audio processing variations can be based at least in part on a listening objective. For example, the listening objective may include a spatial component, a frequency component, or both.
According to some examples, the one or more types of audio processing variations may be based at least in part on one or more constraints. The one or more constraints may be based on, for example, a perceptual model, such as a model of human hearing. Alternatively or additionally, the one or more constraints may relate to audio content energy preservation, audio spatial preservation, audio energy vectors, regularization constraints, or a combination thereof.
In some examples, method 900 may involve updating an acoustic model of the audio environment, a model of echo management system performance, or both after applying the one or more types of audio processing changes.
In some examples, determining the one or more types of audio processing variations may be based at least in part on optimization of a cost function. According to some such examples, the cost function may correspond to or be similar to one of the cost functions of equations 10-13. Other examples of audio processing variations based at least in part on the optimization cost function are described in detail below.
Examples of audio device location methods
As described in the description of fig. 9 and elsewhere herein, in some examples, audio processing changes (e.g., changes corresponding to a dodging solution) may be based at least in part on audio device location information. The location of an audio device in an audio environment may be determined or estimated by various methods including, but not limited to, the methods described in the following paragraphs.
Some such methods may involve receiving a direct indication of the user, for example, using a smart phone or tablet to mark or indicate the approximate location of the device on a plan view or similar graphical representation of the environment. Such digital interfaces are commonplace in managing the configuration, grouping, name, use, and identity of smart home devices. For example, such direct indication may be provided by an Amazon Alexa smart phone application, a Sonos S2 controller application, or the like.
Some examples may involve using measured signal strengths (sometimes referred to as received signal strength indications, or RSSI) of common wireless communication technologies such as Bluetooth, Wi-Fi, ZigBee, etc. to produce estimates of the physical distances between devices and solve a basic trilateration problem, for example as disclosed in J. Yang and Y. Chen, "Indoor Localization Using Improved RSS-Based Lateration Methods", GLOBECOM 2009 - 2009 IEEE Global Telecommunications Conference, Honolulu, HI, 2009, pages 1-6, doi:10.1109/glocom.2009.5425237, and/or in Mardeni, R. & Othman, Shaifull Nizam (2010), "Node Positioning in ZigBee Network Using Trilateration Method Based on the Received Signal Strength Indicator (RSSI)", both of which are hereby incorporated by reference.
In U.S. patent No. 10,779,084, entitled "Automatic Discovery and Localization of Speaker Locations in Surround Sound Systems [ automatic discovery and localization of speaker locations in a surround sound system ]," which is hereby incorporated by reference, a system is described that can automatically locate the position of a loudspeaker and microphone in a listening environment by acoustically measuring the time of arrival (TOA) between each speaker and microphone.
International publication No. WO 2021/127286 A1, entitled "Audio Device Auto-Location", which is hereby incorporated by reference, discloses methods for estimating an audio device position, a listener position, and a listener orientation in an audio environment. Some disclosed methods involve estimating audio device locations in an environment from direction of arrival (DOA) data and by determining an interior angle for each of a plurality of triangles based on the DOA data. In some examples, each triangle has vertices corresponding to audio device locations. Some disclosed methods involve determining the side length of each side of each triangle and performing a forward alignment process that aligns each of the plurality of triangles to produce a forward alignment matrix. Some disclosed methods involve performing a reverse alignment process that aligns each of the plurality of triangles in reverse order to produce a reverse alignment matrix. The final estimate of each audio device location may be based at least in part on the values of the forward alignment matrix and the values of the reverse alignment matrix.
Other disclosed methods of international publication No. WO 2021/127286 A1 involve estimating listener position and, in some examples, estimating listener orientation. Some such methods involve prompting a listener (e.g., via audio prompts from one or more loudspeakers in the environment) to speak one or more utterances and estimating listener position from the DOA data. The DOA data may correspond to microphone data obtained by a plurality of microphones in the environment. The microphone data may correspond to detection of one or more utterances by the microphone. At least some of the microphones may be co-located with a loudspeaker. According to some examples, estimating listener position may involve a triangulation process. Some such examples involve triangulating the user's voice by finding an intersection between the DOA vectors through the audio device. Some disclosed methods of determining listener orientation involve prompting a user to identify one or more loudspeaker positions. Some such examples involve prompting a user to identify one or more loudspeaker locations by moving next to the loudspeaker location(s) and speaking an utterance. Other examples relate to prompting a user to identify one or more loudspeaker locations by pointing at each of the one or more loudspeaker locations with a handheld device (e.g., a cellular telephone) that includes an inertial sensor system and a wireless interface configured for communication with a control system of an audio device (e.g., a control system of an orchestration device) that controls an audio environment. Some disclosed methods involve determining the listener orientation by causing the loudspeakers to render an audio object such that the audio object appears to rotate about the listener, and prompting the listener to speak (e.g., "stop!") when the listener perceives the audio object to be in a location such as a loudspeaker location, a television location, etc. Some disclosed methods involve determining a position and/or orientation of a listener via camera data, e.g., by determining a relative position of the listener and one or more audio devices of an audio environment from the camera data, by determining an orientation of the listener relative to the one or more audio devices of the audio environment from the camera data (e.g., from a direction in which the listener is facing), etc.
A system is described in Shi, Guangi et al., "Spatial Calibration of Surround Sound Systems including Listener Position Estimation" (AES 137th Convention, October 2014), hereby incorporated by reference, in which a single linear microphone array, associated with a component of the reproduction system whose position is predictable (e.g., a sound bar or front center speaker), measures the time difference of arrival (TDOA) for both the satellite microphones and the listener in order to locate the positions of both the microphones and the listener. In this case, the listening orientation is inherently defined as a line connecting the detected listening position with the component of the reproduction system comprising the linear microphone array, such as a sound bar co-located with the television (placed directly above or directly below the television). Since the sound bar is predictably placed directly above or below the video screen, the geometry of the measured distances and angles of incidence can be converted, using simple trigonometric principles, to an absolute position relative to any point in front of the reference sound bar position. The distance between the loudspeakers and the microphones of the linear microphone array can be estimated by playing a test signal and measuring the time of flight (TOF) between the transmitting loudspeaker and the receiving microphone. The time delay of the direct component of the measured impulse response may be used for this purpose. The impulse response between the loudspeaker and the microphone array element may be obtained by playing a test signal through the loudspeaker being analyzed. For example, a Maximum Length Sequence (MLS) or a chirp signal (also referred to as a logarithmic sine sweep) may be used as the test signal. The room impulse response may be obtained by calculating a cyclic cross-correlation between the captured signal and the MLS input. Figure 2 of this reference shows the impulse response obtained using the MLS input. The impulse response is said to be similar to measurements made in a typical office or living room. The delay of the direct component is used to estimate the distance between the loudspeaker and the microphone array element. For loudspeaker distance estimation, any loop-back delay of the audio device used to play back the test signal should be calculated and removed from the measured TOF estimate.
Examples of estimating a position and orientation of a person in an audio environment
The position and orientation of a person in an audio environment may be determined or estimated by various methods including, but not limited to, the methods described in the following paragraphs.
In Hess, Wolfgang, "Head-Tracking Techniques for Virtual Acoustic Applications" (AES 133rd Convention, October 2012), hereby incorporated by reference, a number of commercially available techniques for tracking both the position and orientation of a listener's head in the context of a spatial audio reproduction system are presented. One particular example discussed is the Microsoft Kinect. With its depth-sensing and standard cameras and publicly available software (the Windows Software Development Kit (SDK)), a combination of skeletal tracking and facial recognition can be used to simultaneously track the position and orientation of the heads of multiple listeners in space. Although Kinect for Windows has been discontinued, the Azure Kinect Development Kit (DK), implementing the next-generation Microsoft depth sensor, is currently available.
In U.S. patent No. 10,779,084, entitled "Automatic Discovery and Localization of Speaker Locations in Surround Sound Systems [ automatic discovery and localization of speaker locations in a surround sound system ]," which is hereby incorporated by reference, a system is described that can automatically locate the position of a loudspeaker and microphone in a listening environment by acoustically measuring the time of arrival (TOA) between each speaker and microphone. The listening position may be detected by placing and positioning a microphone (e.g., a microphone in a mobile phone held by a listener) at a desired listening position, and the associated listening orientation may be defined by placing another microphone at a point in the viewing direction of the listener (e.g., at the TV). Alternatively, the listening orientation may be defined by positioning a loudspeaker (e.g. on a TV) in the viewing direction.
International publication No. WO 2021/127286 A1, entitled "Audio Device Auto-Location", which is hereby incorporated by reference, discloses methods for estimating an audio device position, a listener position, and a listener orientation in an audio environment. Some disclosed methods involve estimating audio device locations in an environment from direction of arrival (DOA) data and by determining an interior angle for each of a plurality of triangles based on the DOA data. In some examples, each triangle has vertices corresponding to audio device locations. Some disclosed methods involve determining the side length of each side of each triangle and performing a forward alignment process that aligns each of the plurality of triangles to produce a forward alignment matrix. Some disclosed methods involve performing a reverse alignment process that aligns each of the plurality of triangles in reverse order to produce a reverse alignment matrix. The final estimate of each audio device location may be based at least in part on the values of the forward alignment matrix and the values of the reverse alignment matrix.
Other disclosed methods of international publication No. WO 2021/127286 A1 involve estimating listener position and, in some examples, estimating listener orientation. Some such methods involve prompting a listener (e.g., via audio prompts from one or more loudspeakers in the environment) to speak one or more utterances and estimating listener position from the DOA data. The DOA data may correspond to microphone data obtained by a plurality of microphones in the environment. The microphone data may correspond to detection of one or more utterances by the microphone. At least some of the microphones may be co-located with a loudspeaker. According to some examples, estimating listener position may involve a triangulation process. Some such examples involve triangulating the user's voice by finding an intersection between the DOA vectors through the audio device. Some disclosed methods of determining listener orientation involve prompting a user to identify one or more loudspeaker positions. Some such examples involve prompting a user to identify one or more loudspeaker locations by moving next to the loudspeaker location(s) and speaking an utterance. Other examples relate to prompting a user to identify one or more loudspeaker locations by pointing at each of the one or more loudspeaker locations with a handheld device (e.g., a cellular telephone) that includes an inertial sensor system and a wireless interface configured for communication with a control system of an audio device (e.g., a control system of an orchestration device) that controls an audio environment. Some disclosed methods involve determining the listener orientation by causing the loudspeakers to render an audio object such that the audio object appears to rotate about the listener, and prompting the listener to speak (e.g., "stop!") when the listener perceives the audio object to be in a location such as a loudspeaker location, a television location, etc. Some disclosed methods involve determining a position and/or orientation of a listener via camera data, e.g., by determining a relative position of the listener and one or more audio devices of an audio environment from the camera data, by determining an orientation of the listener relative to the one or more audio devices of the audio environment from the camera data (e.g., from a direction in which the listener is facing), etc.
A system is described in Shi, Guangi et al., "Spatial Calibration of Surround Sound Systems including Listener Position Estimation" (AES 137th Convention, October 2014), hereby incorporated by reference, in which a single linear microphone array, associated with a component of the reproduction system whose position is predictable (e.g., a sound bar or front center speaker), measures the time difference of arrival (TDOA) for both the satellite microphones and the listener in order to locate the positions of both the microphones and the listener. In this case, the listening orientation is inherently defined as a line connecting the detected listening position with the component of the reproduction system comprising the linear microphone array, such as a sound bar co-located with the television (placed directly above or directly below the television). Since the sound bar is predictably placed directly above or below the video screen, the geometry of the measured distances and angles of incidence can be converted, using simple trigonometric principles, to an absolute position relative to any point in front of the reference sound bar position. The distance between the loudspeakers and the microphones of the linear microphone array can be estimated by playing a test signal and measuring the time of flight (TOF) between the transmitting loudspeaker and the receiving microphone. The time delay of the direct component of the measured impulse response may be used for this purpose. The impulse response between the loudspeaker and the microphone array element may be obtained by playing a test signal through the loudspeaker being analyzed. For example, a Maximum Length Sequence (MLS) or a chirp signal (also referred to as a logarithmic sine sweep) may be used as the test signal. The room impulse response may be obtained by calculating a cyclic cross-correlation between the captured signal and the MLS input. Figure 2 of this reference shows the impulse response obtained using the MLS input. The impulse response is said to be similar to measurements made in a typical office or living room. The delay of the direct component is used to estimate the distance between the loudspeaker and the microphone array element. For loudspeaker distance estimation, any loop-back delay of the audio device used to play back the test signal should be calculated and removed from the measured TOF estimate.
Estimating a person's position from a user zone
In some examples, the estimated location of the person in the audio environment may correspond to a user zone. This section describes a method for estimating a user zone in which a person is located based at least in part on microphone signals.
Fig. 10 is a flowchart outlining one example of a method that may be performed by an apparatus, such as the apparatus shown in fig. 1A. As with other methods described herein, the blocks of method 1000 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In this embodiment, method 1000 involves estimating a location of a user in an environment.
In this example, block 1005 involves receiving an output signal from each of a plurality of microphones in an environment. In this example, each of the plurality of microphones resides in a microphone location of the environment. According to this example, the output signal corresponds to a current utterance of the user. In some examples, the current utterance may be or may include a wake-up word utterance. For example, block 1005 may involve a control system (e.g., control system 120 of fig. 1A) receiving an output signal from each of a plurality of microphones in an environment via an interface system (e.g., interface system 205 of fig. 1A).
In some examples, at least some microphones in an environment may provide an output signal that is asynchronous with respect to an output signal provided by one or more other microphones. For example, a first microphone of the plurality of microphones may sample audio data according to a first sampling clock and a second microphone of the plurality of microphones may sample audio data according to a second sampling clock. In some examples, at least one of the microphones in the environment may be included in or configured to communicate with the smart audio device.
According to this example, block 1010 involves determining a plurality of current acoustic features from the output signal of each microphone. In this example, the "current acoustic feature" is an acoustic feature derived from the "current utterance" of block 1005. In some implementations, block 1010 may involve receiving the plurality of current acoustic features from one or more other devices. For example, block 1010 may involve receiving at least some of the plurality of current acoustic features from one or more wake word detectors implemented by one or more other devices. Alternatively or additionally, in some implementations, block 1010 may involve determining the plurality of current acoustic features from the output signal.
Whether the acoustic signature is determined by a single device or multiple devices, the acoustic signature may be determined asynchronously. If the acoustic signature is determined by multiple devices, the acoustic signature will typically be determined asynchronously unless the devices are configured to coordinate the process of determining the acoustic signature. If the acoustic signature is determined by a single device, in some embodiments the acoustic signature may still be determined asynchronously, as the single device may receive the output signal of each microphone at different times. In some examples, the acoustic features may be determined asynchronously, as at least some microphones in the environment may provide output signals that are asynchronous with respect to output signals provided by one or more other microphones.
In some examples, the acoustic features may include a wake word confidence indicator, a wake word duration indicator, and/or at least one receive level indicator. The reception level index may indicate a reception level of sound detected by the microphone, and may correspond to a level of an output signal of the microphone.
Alternatively or additionally, the acoustic features may include one or more of the following:
Average state entropy (purity) along the 1-best (Viterbi) alignment of the wake-up word states of the acoustic model.
CTC loss (Connectionist Temporal Classification loss) of the wake-up word detector's acoustic model.
The wake word detector may be trained to provide an estimate of speaker-to-microphone distance and/or RT60 estimates in addition to wake word confidence. The distance estimate and/or the RT60 estimate may be acoustic features.
Instead of or in addition to the broadband reception level/power at the microphone, the acoustic feature may be the reception level in a plurality of log/Mel/Bark spaced apart frequency bands. The frequency bands may vary according to particular implementations (e.g., 2 frequency bands, 5 frequency bands, 20 frequency bands, 50 frequency bands, 1 octave frequency bands, or 1/3 octave frequency bands).
Cepstral representation of the previous point spectrum information, calculated by DCT (discrete cosine transform) on the logarithm of the band power.
Band power in the band that weights the human speech. For example, the acoustic features may be based on only a particular frequency band (e.g., 400Hz-1.5 kHz). In this example, the higher and lower frequencies may be ignored.
Voice activity detector confidence per band or per bin.
The acoustic features may be based at least in part on the long-term noise estimate, ignoring microphones with poor signal-to-noise ratios.
Kurtosis as a measure of speech "peakiness". Kurtosis may be an indicator of a long reverberation tail.
Estimated wake word start time. The start time and duration estimated at all microphones are expected to agree to within about one frame; outliers may provide hints about unreliable estimates. This assumes some level of synchrony, not necessarily at the sample level but, for example, at the level of frames of tens of milliseconds.
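The acoustic features listed above can be collected into a single per-microphone feature vector for the classifier. The sketch below, which assumes NumPy and uses hypothetical band edges and feature choices (wake-word confidence, wake-word duration, broadband level, and a few speech-weighted band levels), is illustrative only:

import numpy as np

def banded_levels(frame, fs, band_edges_hz):
    # Log receive level in a few coarse frequency bands (illustrative band edges).
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(frame.size, 1.0 / fs)
    levels = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = spectrum[(freqs >= lo) & (freqs < hi)]
        levels.append(10.0 * np.log10(band.sum() + 1e-12))
    return np.array(levels)

def wakeword_feature_vector(frame, fs, ww_confidence, ww_duration_s):
    # Per-microphone feature vector w_i(j) for one wake-word utterance.
    # ww_confidence and ww_duration_s are assumed to come from a wake-word detector.
    bands = banded_levels(frame, fs, band_edges_hz=[400.0, 800.0, 1500.0])
    broadband = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    return np.concatenate(([ww_confidence, ww_duration_s, broadband], bands))

fs = 16000
frame = np.random.default_rng(1).standard_normal(fs // 2)   # 0.5 s of stand-in audio
print(wakeword_feature_vector(frame, fs, ww_confidence=0.92, ww_duration_s=0.6))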
According to this example, block 1015 involves applying a classifier to the plurality of current acoustic features. In some such examples, applying the classifier may involve applying a model trained on previously determined acoustic features derived from a plurality of previous utterances spoken by the user in a plurality of user zones in the environment. Various examples are provided herein.
In some examples, the user zone may include a sink zone, a food preparation zone, a refrigerator zone, a dining zone, a sofa zone, a television zone, a bedroom zone, and/or a doorway zone. According to some examples, the one or more user zones may be predetermined user zones. In some such examples, one or more predetermined user zones may have been selected by the user during the training process.
In some implementations, applying the classifier may involve applying a gaussian mixture model trained on previous utterances. According to some such embodiments, applying the classifier may involve applying a gaussian mixture model trained on one or more of a normalized wake-up word confidence, a normalized average reception level, or a maximum reception level of the previous utterance. However, in alternative embodiments, the application classifier may be based on a different model, such as one of the other models disclosed herein. In some examples, the model may be trained using training data labeled with user zones. However, in some examples, applying the classifier involves applying a model trained using unlabeled training data that is unlabeled with the user zone.
In some examples, the previous utterance may have been or may have included a wake-up word utterance. According to some such examples, the previous utterance and the current utterance may be utterances of the same wake-up word.
In this example, block 1020 involves determining an estimate of the user zone in which the user is currently located based at least in part on the output of the classifier. In some such examples, the estimate may be determined without reference to the geometric positions of the plurality of microphones. For example, the estimate may be determined without reference to the coordinates of the respective microphones. In some examples, the estimate may be determined without estimating a geometric location of the user.
Some embodiments of method 1000 may involve selecting at least one speaker based on the estimated user zone. Some such embodiments may involve controlling at least one selected speaker to provide sound to an estimated user zone. Alternatively or additionally, some embodiments of the method 1000 may involve selecting at least one microphone based on the estimated user area. Some such embodiments may involve providing a signal output by at least one selected microphone to a smart audio device.
FIG. 11 is a block diagram of elements configured to implement one example of an embodiment of a region classifier. According to this example, the system 1100 includes a plurality of loudspeakers 1104 distributed in at least a portion of an environment (e.g., the environment shown in fig. 1A or 1B, etc.). In this example, the system 1100 includes a multi-channel loudspeaker renderer 1101. According to this embodiment, the output of the multi-channel loudspeaker renderer 1101 serves as both a loudspeaker drive signal (for driving the speaker feed of the speaker 1104) and an echo reference. In this embodiment, the echo reference is provided to the echo management subsystem 1103 via a plurality of loudspeaker reference channels 1102, which comprise at least some of the loudspeaker feed signals output from the renderer 1101.
In this embodiment, the system 1100 includes a plurality of echo management subsystems 1103. According to this example, the echo management subsystem 1103 is configured to implement one or more echo suppression procedures and/or one or more echo cancellation procedures. In this example, each echo management subsystem 1103 provides a corresponding echo management output 1103A to one of the wake-up word detectors 1106. The echo management output 1103A has attenuated echoes with respect to the input of an associated one of the echo management subsystems 1103.
According to this embodiment, the system 1100 includes N microphones 1105 (N is an integer) distributed in at least a portion of an environment (e.g., the environment shown in fig. 1A or 1B). The microphones may include array microphones and/or spot microphones. For example, one or more intelligent audio devices located in an environment may include a microphone array. In this example, the output of microphone 1105 is provided as an input to echo management subsystem 1103. According to this embodiment, each of the echo management subsystems 1103 captures the output of an individual microphone 1105 or an individual group or subset of microphones 1105.
In this example, the system 1100 includes a plurality of wake-up word detectors 1106. According to this example, each of the wake word detectors 1106 receives an audio output from one of the echo management subsystems 1103 and outputs a plurality of acoustic features 1106A. The acoustic features 1106A output from each echo management subsystem 1103 may include (but are not limited to): a measure of wake word confidence, wake word duration, and receive level. Although three arrows depicting three acoustic features 1106A are shown as being output from each echo management subsystem 1103, more or fewer acoustic features 1106A may be output in alternative embodiments. Furthermore, although these three arrows strike classifier 1107 along more or less vertical lines, this does not indicate that classifier 1107 must receive acoustic feature 1106A from all wake-up word detectors 1106 at the same time. As noted elsewhere herein, in some examples, acoustic features 1106A may be asynchronously determined and/or provided to the classifier.
According to this embodiment, the system 1100 includes a region classifier 1107, which may also be referred to as a classifier 1107. In this example, the classifier receives a plurality of features 1106A from a plurality of wake-up word detectors 1106 of a plurality (e.g., all) microphones 1105 in the environment. According to this example, the output 1108 of the zone classifier 1107 corresponds to an estimate of the user zone in which the user is currently located. According to some such examples, the output 1108 may correspond to one or more posterior probabilities. Based on bayesian statistics, the estimate of the user region in which the user is currently located may be or may correspond to the maximum posterior probability.
Example implementations of a classifier are described next, which in some examples may correspond to the zone classifier 1107 of Fig. 11. Let x_i(n) be the i-th microphone signal at discrete time n, for i = {1, ..., N} (i.e., the microphone signals x_i(n) are the outputs of the N microphones 1105). Processing the N signals x_i(n) in the echo management subsystems 1103 produces "clean" microphone signals e_i(n), i = {1, ..., N}, each generated at discrete time n. In this example, the clean signals e_i(n) (referred to as 1103A in Fig. 11) are fed to the wake-up word detectors 1106. Here, each wake-up word detector 1106 generates a feature vector w_i(j), referred to as 1106A in Fig. 11, where j = {1, ..., J} is an index corresponding to the j-th wake-up word utterance. In this example, the classifier 1107 takes the aggregate feature set W(j) = {w_i(j), i = 1, ..., N} as input.
According to some implementations, a set of zone labels C_k, for k = {1, ..., K}, may correspond to K different user zones in the environment. For example, the user zones may include a sofa zone, a kitchen zone, a reading chair zone, and so on. Some examples may define more than one zone in a kitchen or other room. For example, the kitchen area may include a sink zone, a food preparation zone, a refrigerator zone, and a dining zone. Similarly, a living room area may include a sofa zone, a television zone, a reading chair zone, one or more doorway zones, and so on. For example, during a training phase, the user may select zone labels for these zones.
In some implementations, the classifier 1107 estimates the posterior probability p(C_k | W(j)) of the feature set W(j), for example by using a Bayesian classifier. The probability p(C_k | W(j)) indicates, for the j-th utterance and the k-th zone, the probability that the user is in zone C_k, and is an example of the output 1108 of the classifier 1107.
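As a rough illustration of such a Bayesian classifier, the sketch below fits one Gaussian mixture model per user zone and evaluates the posterior p(C_k | W(j)) for a new utterance. scikit-learn, the toy feature dimensions, and the uniform priors are assumptions for illustration; the disclosure does not prescribe a particular implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def zone_posteriors(W_j, zone_gmms, zone_priors):
    # p(C_k | W(j)) for one wake-word utterance.
    # W_j: aggregated feature set for the utterance, flattened to one vector.
    x = np.asarray(W_j, dtype=float).reshape(1, -1)
    log_post = {k: gmm.score_samples(x)[0] + np.log(zone_priors[k])
                for k, gmm in zone_gmms.items()}
    m = max(log_post.values())
    unnorm = {k: np.exp(v - m) for k, v in log_post.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

# Toy usage with two zones and random stand-in training features
rng = np.random.default_rng(0)
train = {"sofa": rng.normal(0.0, 1.0, (50, 6)), "kitchen": rng.normal(3.0, 1.0, (50, 6))}
gmms = {k: GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(v)
        for k, v in train.items()}
posteriors = zone_posteriors(rng.normal(3.0, 1.0, 6), gmms, {"sofa": 0.5, "kitchen": 0.5})
print(max(posteriors, key=posteriors.get))   # estimated user zone (maximum posterior)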
According to some examples, training data may be collected (e.g., for each user zone) by prompting the user to select or define a zone (e.g., a sofa zone). The training process may involve prompting the user to speak a training utterance, such as a wake-up word, in the vicinity of the selected or defined region. In the sofa region example, the training process may involve prompting the user to speak a training utterance at the center and extreme edges of the sofa. The training process may involve prompting the user to repeat the training utterance several times at each location within the user region. The user may then be prompted to move to another user zone and continue until all of the designated user zones are covered.
FIG. 12 is a flowchart outlining one example of a method that may be performed by an apparatus such as the apparatus 200 of FIG. 1A. As with other methods described herein, the blocks of method 1200 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In this embodiment, method 1200 involves training a classifier to estimate the user's location in the environment.
In this example, block 1205 involves prompting a user to speak at least one training utterance in each of a plurality of locations within a first user region of the environment. In some examples, the training utterance(s) may be one or more instances of a wake-up word utterance. According to some embodiments, the first user zone may be any user zone selected and/or defined by a user. In some examples, the control system may create a corresponding zone tag (e.g., a corresponding example of one of the zone tags C k described above) and may associate the zone tag with training data obtained for the first user zone.
An automatic prompting system may be used to collect these training data. As described above, the interface system 205 of the apparatus 200 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. For example, during the training process the apparatus 200 may present the following prompts to the user on a screen of the display system, or may broadcast them audibly via one or more speakers:
"move to sofa".
"Say wake-up word ten times while rotating the head".
"Move to the middle position between sofa and reading chair, and speak wake-up word ten times".
"Stand in kitchen as if cooking, and speak wake-up words ten times".
In this example, block 1210 involves receiving a first output signal from each of a plurality of microphones in an environment. In some examples, block 1210 may involve receiving the first output signal from all active microphones in the environment, while in other examples, block 1210 may involve receiving the first output signal from a subset of all active microphones in the environment. In some examples, at least some microphones in an environment may provide an output signal that is asynchronous with respect to an output signal provided by one or more other microphones. For example, a first microphone of the plurality of microphones may sample audio data according to a first sampling clock and a second microphone of the plurality of microphones may sample audio data according to a second sampling clock.
In this example, each microphone of the plurality of microphones resides in a microphone location of the environment. In this example, the first output signal corresponds to an instance of the detected training utterance received from the first user region. Because block 1205 involves prompting the user to speak at least one training utterance in each of a plurality of locations within a first user region of the environment, in this example, the term "first output signal" refers to a set of all output signals corresponding to the training utterance of the first user region. In other examples, the term "first output signal" may refer to a subset of all output signals corresponding to the training utterances of the first user zone.
According to this example, block 1215 involves determining one or more first acoustic features from each first output signal. In some examples, the first acoustic feature may include a wake word confidence indicator and/or a receive level indicator. For example, the first acoustic feature may include a normalized wake word confidence indicator, an indication of a normalized average reception level, and/or an indication of a maximum reception level.
As described above, because block 1205 involves prompting the user to speak at least one training utterance in each of a plurality of locations within a first user region of the environment, the term "first output signal" refers in this example to a set of all output signals corresponding to the training utterances of the first user region. Thus, in this example, the term "first acoustic feature" refers to a set of acoustic features derived from a set of all output signals corresponding to the training utterances of the first user zone. Thus, in this example, the set of first acoustic features is at least as large as the set of first output signals. For example, if two acoustic features are determined from each output signal, the set of first acoustic features will be twice as large as the set of first output signals.
In this example, block 1220 involves training a classifier model to establish a correlation between the first user zone and the first acoustic feature. For example, the classifier model may be any of the models disclosed herein. According to this embodiment, the classifier model is trained without reference to the geometric positions of the plurality of microphones. In other words, in this example, during the training process, the classifier model is not provided with data about the geometric positions of the plurality of microphones (e.g., microphone coordinate data).
Fig. 13 is a flowchart outlining another example of a method that may be performed by an apparatus such as the apparatus 200 of fig. 1A. As with other methods described herein, the blocks of method 1300 need not be performed in the order indicated. For example, in some implementations, at least a portion of the acoustic feature determination process of block 1325 may be performed prior to block 1315 or block 1320. Moreover, such methods may include more or fewer blocks than shown and/or described. In this embodiment, method 1300 involves training a classifier to estimate a user's location in an environment. Method 1300 provides an example of expanding method 1200 to multiple user zones of an environment.
In this example, block 1305 involves prompting a user to speak at least one training utterance in a location within a user region of the environment. In some examples, block 1305 may be performed in the manner described above with reference to block 1205 of fig. 12, except that block 1305 involves a single location within the user area. In some examples, the training utterance(s) may be one or more instances of a wake-up word utterance. According to some embodiments, the user area may be any user area selected and/or defined by a user. In some examples, the control system may create a corresponding zone tag (e.g., a corresponding example of one of the zone tags C k described above) and may associate the zone tag with training data obtained for the user zone.
According to this example, block 1310 is performed substantially as described above with reference to block 1210 of fig. 12. However, in this example, the process of block 1310 is generalized to any user zone, not necessarily the first user zone for which training data is acquired. Thus, the output signal received in block 1310 is "output signal from each of a plurality of microphones in the environment, each of the plurality of microphones residing in a microphone location of the environment, the output signal corresponding to an instance of the detected training utterance received from the user zone." In this example, the term "output signal" refers to a set of all output signals corresponding to one or more training utterances in the location of the user zone. In other examples, the term "output signal" may refer to a subset of all output signals corresponding to one or more training utterances in the location of the user zone.
According to this example, block 1315 involves determining whether sufficient training data has been acquired for the current user zone. In some such examples, block 1315 may involve determining whether an output signal corresponding to a threshold number of training utterances has been obtained for the current user zone. Alternatively or additionally, block 1315 may involve determining whether an output signal corresponding to a training utterance in a threshold number of locations within the current user zone has been obtained. If not, the method 1300 returns to block 1305 in this example and prompts the user to speak at least one additional utterance at a location within the same user area.
However, if it is determined in block 1315 that sufficient training data has been acquired for the current user zone, then in this example, the process continues to block 1320. According to this example, block 1320 involves determining whether to obtain training data for the additional user zone. According to some examples, block 1320 may involve determining whether training data for each user zone previously identified by the user has been obtained. In other examples, block 1320 may involve determining whether training data for a minimum number of user zones has been obtained. The minimum number may have been selected by the user. In other examples, the minimum number may be a minimum recommended number per environment, a minimum recommended number per room in the environment, and so on.
If it is determined in block 1320 that training data for the additional user zone should be obtained, then in this example the process continues to block 1322, which involves prompting the user to move to another user zone of the environment. In some examples, the next user zone may be selected by the user. According to this example, following the prompt of block 1322, the process continues to block 1305. In some such examples, following the prompt of block 1322, the user may be prompted to confirm that the user has arrived at the new user zone. According to some such examples, the user may be required to confirm that the user has arrived at the new user zone before providing the prompt of block 1305.
If it is determined in block 1320 that training data for additional user zones should not be obtained, then in this example, the process continues to block 1325. In this example, method 1300 involves obtaining training data for K user zones. In this embodiment, block 1325 involves determining first through G acoustic features from first through H output signals corresponding to each of the first through K user zones that have acquired training data. In this example, the term "first output signal" refers to the set of all output signals corresponding to the training utterances of the first user zone, and the term "H output signal" refers to the set of all output signals corresponding to the training utterances of the K-th user zone. Similarly, the term "first acoustic feature" refers to a set of acoustic features determined from the first output signal, and the term "G-th acoustic feature" refers to a set of acoustic features determined from the H-th output signal.
According to these examples, block 1330 involves training a classifier model to establish correlations between the first through K-th user zones and the first through G-th acoustic features, respectively. For example, the classifier model may be any of the classifier models disclosed herein.
In the foregoing example, the user zones are labeled (e.g., according to a corresponding instance of one of the zone labels C_k described above). However, depending on the particular implementation, the model may be trained based on labeled or unlabeled user zones. In the labeled case, each training utterance may be paired with a label corresponding to a user zone, e.g., as pairs (W(j), C_k(j)) for j = {1, ..., J}, where C_k(j) denotes the zone label associated with the j-th training utterance.
Training the classifier model may involve determining a best fit to the labeled training data. Without loss of generality, suitable classification methods for the classifier model may include:
A bayesian classifier, for example, with each class of distribution described by a multivariate normal distribution, a full covariance gaussian mixture model, or a diagonal covariance gaussian mixture model;
Vector quantization;
Nearest neighbor (k-means);
A neural network with a SoftMax output layer, where one output corresponds to each class;
Support Vector Machine (SVM); and/or
Boosting techniques, e.g., Gradient Boosting Machines (GBM)
In one example of implementing the unlabeled case, the data may be automatically partitioned into K clusters, where K may also be unknown. Unlabeled automatic segmentation may be performed, for example, by using classical clustering techniques (e.g., k-means algorithms or gaussian mixture modeling).
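A sketch of the unlabeled case follows, using k-means from scikit-learn with the number of zones K assumed known; automatic selection of K (for example with an information criterion) and the use of Gaussian mixture modeling instead of k-means are left out for brevity. The data here are random stand-ins, not real wake-word features.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in feature sets W(j) for J unlabeled wake-word utterances from three acoustic zones
features = np.vstack([rng.normal(0.0, 0.5, (40, 6)),
                      rng.normal(3.0, 0.5, (40, 6)),
                      rng.normal(-3.0, 0.5, (40, 6))])

K = 3   # assumed known here; could instead be chosen by, e.g., BIC over candidate values
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)

# Each cluster index now plays the role of an (unnamed) user zone label C_k
print(np.bincount(kmeans.labels_))    # utterances assigned to each discovered zone
print(kmeans.predict(features[:1]))   # zone estimate for a new utterance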
To improve robustness, regularization may be applied to classifier model training, and model parameters may be updated over time as new utterances are spoken.
Further aspects of the embodiments are described next.
An example set of acoustic features (e.g., acoustic features 1106A of Fig. 11) may include the wake word confidence (likelihood), the average receive level over the estimated duration of the highest-confidence wake word, and the maximum receive level over the duration of the highest-confidence wake word. The features may be normalized with respect to the maximum value of the features for each wake-up word utterance. The training data may be labeled, and a full-covariance Gaussian Mixture Model (GMM) may be trained (e.g., via expectation maximization) on the labeled training data. The estimated zone may be the class that maximizes the posterior probability.
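One plausible reading of the per-utterance normalization mentioned above is sketched below: each feature column is divided by its maximum across microphones for that utterance, so that the classifier sees relative rather than absolute values. The arrangement of the features and the use of linear (rather than dB) levels are assumptions.

import numpy as np

def normalize_per_utterance(raw):
    # raw: (N_microphones, N_features) array of non-negative features for one utterance,
    # e.g. columns = [wake-word confidence, mean linear power, max linear power].
    col_max = raw.max(axis=0, keepdims=True)
    return raw / np.maximum(col_max, 1e-12)

raw = np.array([[0.90, 4.0e-4, 2.0e-3],    # microphone closest to the talker
                [0.60, 1.0e-4, 6.0e-4],
                [0.30, 2.0e-5, 1.0e-4]])
features = normalize_per_utterance(raw).ravel()   # flattened feature set for one utterance
print(features)

Vectors of this kind, stacked across labeled training utterances, are what a full-covariance GMM of the sort described above would be fitted to.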
The above description of some embodiments discusses learning an acoustic zone model from a training data set collected during the collection process of cues. In this model, the training time (or configuration mode) and the runtime (or normal mode) may be considered as two different modes in which the microphone system may be located. An extension of this approach is online learning, where part or all of the acoustic zone models are learned or adapted online (e.g., at run time or in a regular mode). In other words, even after applying the classifier during the "run-time" to estimate the user zone in which the user is currently located (e.g., according to method 1000 of FIG. 10), the process of training the classifier may continue in some embodiments.
Fig. 14 is a flowchart outlining another example of a method that may be performed by an apparatus such as the apparatus 200 of fig. 1A. As with other methods described herein, the blocks of method 1400 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In this embodiment, method 1400 involves continuously training the classifier during a "run-time" process that estimates the user's location in the environment. Method 1400 is an example of what is referred to herein as an online learning mode.
In this example, block 1405 of method 1400 corresponds to blocks 1005-1020 of method 1000. Here, block 1405 involves providing an estimate of the user zone in which the user is currently located based at least in part on the output of the classifier. According to this embodiment, block 1410 involves obtaining implicit or explicit feedback regarding the estimation of block 1405. In block 1415, the classifier is updated according to the feedback received in block 1410. For example, block 1415 may involve one or more reinforcement learning methods. As suggested by the dashed arrow from block 1415 to block 1405, in some implementations, method 1400 may involve returning to block 1405. For example, method 1400 may involve providing a future estimate of the user zone in which the user is located at a future time based on applying the updated model.
Explicit techniques for obtaining feedback may include:
Query the user, using a voice user interface (UI), whether the prediction is correct. (For example, the user may be provided with a sound indicating "I think you are on the sofa. Please say 'right' or 'wrong'.")
Notify the user that the voice UI can be used at any time to correct incorrect predictions. (For example, a sound may be provided to the user indicating "I can now predict where you are when you speak to me. If I get it wrong, you can say 'Amanda, I am not on the sofa, I am in the reading chair', and so on.")
Inform the user that correct predictions can be rewarded at any time using the voice UI. (For example, a sound may be provided to the user indicating "I can now predict where you are when you speak to me. If I predict correctly, you can help me improve further by saying 'Amanda, that's right, I am on the sofa.', and so on.")
Including physical buttons or other UI elements that the user can operate to give feedback (e.g., thumb up and/or thumb down buttons on a physical device or in a smartphone application).
The goal of predicting the user zone in which the user is located may be to inform a microphone selection or adaptive beamforming scheme that attempts to pick up sound more effectively from the user's acoustic zone, e.g., to better recognize the command that follows a wake-up word. In such a scenario, implicit techniques for obtaining feedback regarding the quality of the zone prediction may include:
Penalizing predictions that result in misrecognition of the command following the wake word. Signals that may indicate misrecognition include the user cutting short the voice assistant's response to the command, e.g., by speaking a counter-command such as "Amanda, stop!";
Penalizing predictions that result in low confidence that the speech recognizer has successfully recognized the command. Many automatic speech recognition systems are able to return a confidence level with their results, which can be used for this purpose;
Penalizing predictions for which a second-pass wake word detector fails to retrospectively detect the wake word with high confidence; and/or
Reinforcing predictions that lead to high-confidence recognition of the wake word and/or correct recognition of the user's command.
The following is an example of a second-pass wake word detector failing to retrospectively detect a wake word with high confidence. Assume that, after an output signal corresponding to the current utterance is obtained from microphones in the environment and after acoustic features are determined based on the output signal (e.g., via a plurality of first-pass wake word detectors configured for communication with the microphones), the acoustic features are provided to a classifier. In other words, the acoustic features are assumed to correspond to a detected wake-up word utterance. Further assume that the classifier determines that the person who spoke the current utterance is most likely in zone 3, which in this example corresponds to the reading chair. When a person is in zone 3, there may be a particular microphone, or a learned combination of microphones, known to be most suitable for listening to the person's voice, for example in order to send the person's speech to a cloud-based virtual assistant service for voice command recognition.
It is further assumed that after determining which microphone(s) will be used for speech recognition, but before actually sending the person's speech to the virtual assistant service, a second-pass wake word detector operates on the microphone signals, corresponding to the speech detected by the selected microphone(s) for zone 3, that would be submitted for command recognition. If the second-pass wake word detector does not agree with the plurality of first-pass wake word detectors that a wake word was actually uttered, this may be because the classifier mispredicted the zone. Therefore, the classifier should be penalized.
Techniques for posterior updating of a region mapping model after speaking one or more wake words may include:
Maximum A Posteriori (MAP) adaptation of Gaussian Mixture Model (GMM) or nearest neighbor model; and/or
Reinforcement learning, e.g., of neural networks, such as by associating an appropriate "one-hot" (in the case of a correct prediction) or "one-cold" (in the case of an incorrect prediction) ground truth label with the SoftMax outputs and applying online back propagation to determine new network weights.
Some examples of MAP adaptation in this context may involve adjusting the average in the GMM each time a wake word is spoken. In this way, the average value may become more like the acoustic features observed when a subsequent wake word is spoken. Alternatively or additionally, such examples may involve adjusting variance/covariance or mixing weight information in the GMM each time a wake word is spoken.
For example, the MAP adaptation scheme may be as follows:
μ_i,new = α*μ_i,old + (1-α)*x
In the foregoing equation, μ_i,old represents the mean of the i-th Gaussian in the mixture, α represents a parameter that controls how aggressively the MAP adaptation should occur (α may, for example, be in the range [0.9, 0.999]), and x represents the feature vector of the new wake-up word utterance. The index i corresponds to the mixture component that returns the highest prior probability of containing the talker's position at the time of the wake word.
Alternatively, each mixture component may be adjusted according to its prior probability of containing the wake word, e.g., as follows:
μ_i,new = β_i*μ_i,old + (1-β_i)*x
In the foregoing equation, β_i = α(1 - P(i)), where P(i) represents the prior probability that the observation x is due to mixture component i.
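The two mean-update rules above can be sketched as follows; the function and argument names are illustrative, and only the means are adapted (a complete MAP adaptation might also update weights and covariances).

import numpy as np

def map_adapt_means(means, x, responsibilities, alpha=0.99, per_component=False):
    # means: (n_components, n_features) current mixture means.
    # responsibilities: length-n_components prior probabilities P(i) that the
    # observation x is due to each mixture component (e.g. from predict_proba).
    means = np.array(means, dtype=float, copy=True)
    x = np.asarray(x, dtype=float)
    r = np.asarray(responsibilities, dtype=float)
    if per_component:
        # Second scheme: every component moves, with beta_i = alpha * (1 - P(i))
        beta = alpha * (1.0 - r)
        means = beta[:, None] * means + (1.0 - beta)[:, None] * x[None, :]
    else:
        # First scheme: only the most responsible component i is adapted
        i = int(np.argmax(r))
        means[i] = alpha * means[i] + (1.0 - alpha) * x
    return means

mu = np.array([[0.0, 0.0], [3.0, 3.0]])
print(map_adapt_means(mu, [0.5, -0.2], responsibilities=[0.9, 0.1]))
print(map_adapt_means(mu, [0.5, -0.2], responsibilities=[0.9, 0.1], per_component=True))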
In one reinforcement learning example, there may be three user zones. Assume that, for a particular wake word, the model predicts probabilities of [0.2, 0.1, 0.7] for the three user zones. If a second information source (e.g., a second-pass wake word detector) confirms that the third zone is correct, the ground truth label may be [0, 0, 1] ("one-hot"). A posteriori updating of the zone mapping model may involve back-propagating the error through the neural network, which in effect means that if the same input were shown again, the neural network would predict zone 3 more strongly. Conversely, if the second source indicates that zone 3 was incorrectly predicted, then in one example the ground truth label may be [0.5, 0.5, 0.0]. Back-propagating the error through the neural network then makes the model less likely to predict zone 3 if the same input is shown in the future.
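A deliberately minimal stand-in for that kind of online update is sketched below: a single linear layer with a SoftMax output is nudged by one step of cross-entropy backpropagation toward the supplied ground-truth vector. The layer sizes, learning rate, and target construction are assumptions, not a prescribed architecture.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def online_update(W, b, features, target, lr=0.05):
    # One backpropagation step for a linear layer + SoftMax zone predictor.
    # target: e.g. [0, 0, 1] when a second-pass detector confirms zone 3,
    # or a softened vector such as [0.5, 0.5, 0.0] when zone 3 is contradicted.
    probs = softmax(W @ features + b)
    grad_logits = probs - np.asarray(target)     # gradient of cross-entropy w.r.t. the logits
    W = W - lr * np.outer(grad_logits, features)
    b = b - lr * grad_logits
    return W, b, probs

rng = np.random.default_rng(0)
W, b = 0.1 * rng.normal(size=(3, 6)), np.zeros(3)
x = rng.normal(size=6)                           # feature set for one wake word
W, b, before = online_update(W, b, x, target=[0.0, 0.0, 1.0])
print(before, softmax(W @ x + b))                # the zone-3 probability should increase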
Further examples of audio processing variations involving cost function optimization
As described elsewhere herein, in various disclosed examples, one or more types of audio processing variations may be based on optimization of a cost function. Some such examples relate to flexible rendering.
Flexible rendering allows rendering spatial audio on any number of arbitrarily placed speakers. In view of the wide deployment of audio devices, including but not limited to smart audio devices (e.g., smart speakers) in the home, there is a need to implement flexible rendering techniques that allow consumer products to perform flexible rendering of audio and playback of the audio so rendered.
Several techniques have been developed to implement flexible rendering. They cast the rendering problem as one of cost function minimization, where the cost function consists of two terms: a first term that models the desired spatial impression the renderer is attempting to achieve, and a second term that assigns a cost to activating speakers. To date, this second term has focused on creating a sparse solution in which only speakers in close proximity to the desired spatial location of the audio being rendered are activated.
Playback of spatial audio in a consumer environment is typically associated with a specified number of loudspeakers placed at specified locations: for example, 5.1 and 7.1 surround sound. In these cases, the content is written specifically for the associated loudspeakers and encoded as discrete channels, one for each loudspeaker (e.g., dolby Digital (Dolby Digital) or Dolby digital+ (Dolby Digital Plus), etc.). Recently, an immersive, object-based spatial audio format (Dolby panoramic sound (Dolby Atmos)) was introduced, which breaks this association between content and specific loudspeaker locations. Instead, content may be described as a collection of individual audio objects, each of which has possibly time-varying metadata describing the desired perceived location of the audio object in three-dimensional space. At playback time, the content is converted to loudspeaker feeds by a renderer that adapts the number and position of loudspeakers in the playback system. However, many such renderers still limit the position of a set of loudspeakers to one of a set of specified layouts (e.g., 3.1.2, 5.1.2, 7.1.4, 9.1.6, etc. with dolby panoramas).
Beyond such limited rendering, methods have been developed that allow object-based audio to be flexibly rendered over virtually any number of loudspeakers placed at any location. These methods require the renderer to know the number and physical location of loudspeakers in the listening space. In order to make such a system practical for an average consumer, an automated method for positioning the loudspeaker is desired. One such approach relies on the use of multiple microphones that may be co-located with a loudspeaker. By playing the audio signal by means of loudspeakers and recording with microphones, the distance between each loudspeaker and the microphone is estimated. From these distances the positions of both the loudspeaker and the microphone are then derived.
While consumer space has been introducing object-based spatial audio, so-called "smart speakers" have been rapidly adopted, such as Amazon Echo (Amazon Echo) series products. The great popularity of these devices can be attributed to the simplicity and convenience they offer over wireless connections and integrated voice interfaces (e.g., alexa of amazon), but the sound capabilities of these devices are often limited, particularly in terms of spatial audio. In most cases, these devices are limited to mono or stereo playback only. However, combining the above flexible rendering and automatic localization techniques with multiple orchestrated intelligent speakers can result in a system with very complex spatial playback functionality and still be very simple for the consumer to set up. The consumer can place any number of speakers anywhere convenient as desired, no need to lay speaker cables due to the wireless connection, and the built-in microphone can be used to automatically position the speakers for the associated flexible renderer.
Conventional flexible rendering algorithms aim to achieve as close as possible to a specific desired perceived spatial impression. In orchestrated intelligent speaker systems, sometimes maintaining this spatial impression may not be the most important or desirable goal. For example, if someone simultaneously tries to speak into an integrated speech assistant, it may be desirable to temporarily change the spatial rendering in a manner that reduces the relative playback level of speakers near certain microphones to increase the signal-to-noise ratio and/or signal-to-echo ratio (SER) of the microphone signals that include the detected speech. Some embodiments described herein may be implemented as modifications to existing flexible rendering methods to allow such dynamic modifications to spatial rendering, for example, for the purpose of achieving one or more additional objectives.
Existing flexible rendering techniques include centroid amplitude panning (CMAP) and Flexible Virtualization (FV). At a high level, both techniques render a set of one or more audio signals, each audio signal having an associated desired perceived spatial location for playback on a set of two or more speakers, wherein the relative activation of the set of speakers is a function of a model of the perceived spatial location of the audio signal played back by the speakers and the proximity of the desired perceived spatial location of the audio signal to the speaker location. The model ensures that the listener hears the audio signal near its intended spatial location and the proximity item controls which speakers are used to achieve this spatial impression. In particular, the proximity item facilitates activation of speakers near a desired perceived spatial location of the audio signal. For both CMAP and FV, this functional relationship can be conveniently derived from a cost function written as the sum of two terms, one term for spatial aspects and one term for proximity:
C(g, {s̄_i}, ō) = C_spatial(g, {s̄_i}, ō) + C_proximity(g, {s̄_i}, ō)   (17)
Here, the set {s̄_i}, i = 1, ..., M, represents the positions of a set of M loudspeakers, ō represents the desired perceived spatial position of the audio signal, and g represents the M-dimensional vector of speaker activations. For CMAP, each activation in the vector represents a gain per speaker, while for FV, each activation represents a filter (in the second case, g can equivalently be considered a vector of complex values at a particular frequency, with a different g computed across multiple frequencies to form the filter). The best vector of activations is found by minimizing the cost function across activations:
g_opt = argmin_g C(g, {s̄_i}, ō)   (18a)
With some definitions of the cost function, it is difficult to control the absolute level of the optimal activations resulting from the above minimization, although the relative levels between the components of g_opt are appropriate. To deal with this problem, a subsequent normalization of g_opt may be performed so that the absolute level of the activations is controlled. For example, it may be desirable to normalize the vector to have unit length, which is in line with a commonly used constant-power panning rule:
ḡ_opt = g_opt / ||g_opt||   (18b)
The exact behavior of the flexible rendering algorithm depends on the specific construction of the two terms of the cost function, C_spatial and C_proximity. For CMAP, C_spatial is derived from a model that places the perceived spatial position of an audio signal played from a set of loudspeakers at the centroid of the positions of those loudspeakers, weighted by their associated activation gains g_i (the elements of the vector g):
ō = ( Σ_i g_i s̄_i ) / ( Σ_i g_i )   (19)
Equation 19 is then manipulated into a spatial cost representing the squared error between the desired audio position and the position produced by the activated loudspeakers:
C_spatial(g, {s̄_i}, ō) = || ō Σ_i g_i - Σ_i g_i s̄_i ||^2   (20)
For FV, the spatial term of the cost function is defined differently. Here the goal is to produce, at the listener's left and right ears, a binaural response b corresponding to the audio object position ō. Conceptually, b is a 2 x 1 vector of filters (one for each ear), but it is more convenient to treat it as a 2 x 1 vector of complex values at a particular frequency. Continuing with this representation at a particular frequency, the desired binaural response may be derived from a set of HRTFs indexed by object position:
b = HRTF{ō}   (21)
Meanwhile, the 2 x 1 binaural response e produced at the listener's ears by the loudspeakers is modeled as the 2 x M acoustic transmission matrix H multiplied by the M x 1 vector g of complex speaker activation values:
e=Hg (22)
The acoustic transmission matrix H is modeled based on the set of loudspeaker positions {s̄_i} with respect to the listener position. Finally, the spatial component of the cost function is defined as the squared error between the desired binaural response (equation 21) and that produced by the loudspeakers (equation 22):
C_spatial(g, {s̄_i}, ō) = (b - Hg)*(b - Hg)   (23)
Conveniently, the spatial terms of the cost function for both CMAP and FV, defined in equations 20 and 23, can be rearranged into a matrix quadratic as a function of the speaker activations g:
C_spatial(g) = g*Ag + Bg + C   (24)
where A is an M x M square matrix, B is a 1 x M vector, and C is a scalar. The matrix A is of rank 2, and therefore when M > 2 there exist an infinite number of speaker activations g for which the spatial error term equals zero. Introducing the second term of the cost function, C_proximity, removes this indeterminacy and results in a particular solution having perceptually beneficial properties in comparison with the other possible solutions. For both CMAP and FV, C_proximity is constructed such that activation of speakers whose positions s̄_i are distant from the desired audio signal position ō is penalized more than activation of speakers whose positions are close to the desired position. This construction yields an optimal set of speaker activations that is sparse, in which only speakers in close proximity to the desired audio signal position are significantly activated, and in practice results in a spatial reproduction of the audio signal that is perceptually more robust to listener movement around the set of speakers.
To this end, the second term of the cost function, C_proximity, may be defined as a distance-weighted sum of the absolute values squared of the speaker activations. This is expressed compactly in matrix form as:
C_proximity(g, {s̄_i}, ō) = g*Dg   (25a)
where D is a diagonal matrix of distance penalties between the desired audio position and each speaker:
D = diag{ d(ō, s̄_1), d(ō, s̄_2), ..., d(ō, s̄_M) }   (25b)
The distance penalty function can take many forms, but the following is a useful parameterization:
d(ō, s̄_i) = α ( ||ō - s̄_i|| / d_0 )^β
where ||ō - s̄_i|| is the Euclidean distance between the desired audio position and the speaker position, and α and β are adjustable parameters. The parameter α indicates the global strength of the penalty; d_0 corresponds to the spatial extent of the distance penalty (loudspeakers at a distance of about d_0 or more will be penalized), and β accounts for the abruptness of the onset of the penalty at the distance d_0.
Combining the two terms of the cost function defined in equations 24 and 25a yields the overall cost function:
C(g)=g*Ag+Bg+C+g*Dg=g*(A+D)g+Bg+C (26)
Setting the derivative of the cost function with respect to g to zero and solving for g yields the optimal speaker activation solution:
g_opt = -(1/2) (A + D)^-1 B*   (27)
In general, the optimal solution of equation 27 may yield speaker activations that are negative in value. For the CMAP construction of the flexible renderer, such negative activations may be undesirable, and the minimization may therefore instead be performed subject to the constraint that all activations remain positive.
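The quadratic formulation and closed-form solution above can be sketched end to end. The NumPy example below uses the FV construction for concreteness, with a crude free-field model standing in for HRTFs and for the acoustic transmission matrix; the speaker layout, frequency, distance-penalty parameters, and the clipping used to keep activations non-negative are all illustrative assumptions rather than a reference implementation.

import numpy as np

def free_field_response(src_positions, rcv_positions, freq, c=343.0):
    # Crude free-field transfer (1/d amplitude, propagation phase); a stand-in
    # for measured HRTFs / acoustic transmissions, for illustration only.
    src = np.atleast_2d(src_positions)
    rcv = np.atleast_2d(rcv_positions)
    d = np.linalg.norm(rcv[:, None, :] - src[None, :, :], axis=-1)
    return np.exp(-2j * np.pi * freq * d / c) / np.maximum(d, 0.1)

ears = np.array([[-0.09, 0.0], [0.09, 0.0]])                          # listener ears (m)
speakers = np.array([[-2.0, 2.0], [2.0, 2.0], [0.0, 3.0],
                     [-2.5, -1.0], [2.5, -1.0]])                      # loudspeaker layout (m)
obj = np.array([[1.0, 2.5]])                                          # desired object position
freq = 1000.0

H = free_field_response(speakers, ears, freq)      # 2 x M acoustic transmission matrix
b = free_field_response(obj, ears, freq)[:, 0]     # desired binaural response at this frequency

# Quadratic form of the spatial cost, g*Ag + Bg + C (the constant does not affect the optimum)
A = H.conj().T @ H
B = -2.0 * (b.conj() @ H)

# Diagonal distance penalty D, favoring speakers near the desired object position
alpha, beta, d0 = 5.0, 3.0, 1.5
dist = np.linalg.norm(speakers - obj, axis=1)
D = np.diag(alpha * (dist / d0) ** beta)

# Closed-form optimum, then crude non-negativity handling and unit-norm scaling
g_opt = -0.5 * np.linalg.solve(A + D, B.conj())
g_opt = np.maximum(g_opt.real, 0.0)
g_opt = g_opt / max(np.linalg.norm(g_opt), 1e-9)
print(np.round(g_opt, 3))    # speakers near the object position should dominate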
Fig. 15 and 16 are diagrams illustrating an example set of speaker activations and object rendering positions. In these examples, the speaker activations and object rendering positions correspond to speaker positions of 4, 64, 165, -87, and -4 degrees. Fig. 15 shows speaker activations 1505a, 1510a, 1515a, 1520a, and 1525a, which comprise the optimal solution of equation 27 for these particular speaker positions. Fig. 16 plots the individual speaker positions as points 1605, 1610, 1615, 1620, and 1625, which correspond to speaker activations 1505a, 1510a, 1515a, 1520a, and 1525a, respectively. Fig. 16 also shows ideal object positions (in other words, positions at which audio objects are to be rendered) for a large number of possible object angles as points 1630a, and the corresponding actual rendering positions for those objects as points 1635a, connected to the ideal object positions by dashed lines 1640a.
One class of embodiments relates to a method for rendering audio for playback by at least one (e.g., all or some) of a plurality of coordinated (orchestrated) intelligent audio devices. For example, a set of smart audio devices present in (in a system of) a user's home may be orchestrated to handle various simultaneous use cases, including flexible rendering (according to embodiments) of audio for playback by all or some of the smart audio devices (i.e., by speaker(s) of all or some of the smart audio devices). Many interactions with the system are considered, which require dynamic modification of the rendering. Such correction may, but need not, be focused on spatial fidelity.
Some embodiments are methods for rendering audio for playback by at least one (e.g., all or some) of a set of intelligent audio devices (or for playback by at least one (e.g., all or some) of another set of speakers). Rendering may include minimizing a cost function, wherein the cost function includes at least one dynamic speaker activation term. Examples of such dynamic speaker activation items include (but are not limited to):
Proximity of the speaker to one or more listeners;
Proximity of the speaker to attractive or repulsive forces;
audibility of the speaker with respect to some locations (e.g., listener location or baby room);
capacity of the speaker (e.g., frequency response and distortion);
synchronization of speakers with respect to other speakers;
wake word performance; and
Echo canceller performance.
The dynamic speaker activation term(s) may enable at least one of a variety of behaviors, including warping the spatial presentation of the audio away from a particular smart audio device so that a person speaking can be better heard by that device's microphone, or so that a secondary audio stream can be better heard from the smart audio device's speaker(s).
Some embodiments implement rendering for coordinated (orchestrated) speaker(s) playback of multiple intelligent audio devices. Other embodiments implement rendering for playback by speaker(s) of another set of speakers.
Pairing a flexible rendering method (implemented according to some embodiments) with a set of wireless smart speakers (or other smart audio devices) can result in a very capable and easy-to-use spatial audio rendering system. When considering interactions with such a system, it becomes clearly desirable to dynamically modify the spatial rendering in order to optimize for other objectives that may arise during use of the system. To achieve this objective, one class of embodiments augments the existing flexible rendering algorithms (wherein speaker activation is a function of the previously disclosed spatial and proximity terms) with one or more additional dynamically configurable functions that depend on one or more properties of the audio signals being rendered, the set of speakers, and/or other external inputs. According to some embodiments, the cost function of the existing flexible rendering given in equation 17 is augmented with these one or more additional dependent terms according to the following equation:
C(g, {s̄_i}, ō, {ô}, {ŝ_i}, {ê}) = C_spatial(g, {s̄_i}, ō) + C_proximity(g, {s̄_i}, ō) + Σ_j C_j(g, {ô}, {ŝ_i}, {ê})   (28)
In equation 28, the terms C_j(g, {ô}, {ŝ_i}, {ê}) represent additional cost terms, with {ô} representing a set of one or more properties of the audio signals being rendered (e.g., of an object-based audio program), {ŝ_i} representing a set of one or more properties of the speakers over which the audio is being rendered, and {ê} representing one or more additional external inputs. Each term C_j returns a cost as a function of the activations g in relation to a combination of one or more properties of the audio signals, speakers, and/or external inputs, represented generically by the set {ô, ŝ_i, ê}. It should be appreciated that the set {ô, ŝ_i, ê} contains, at a minimum, at least one element from any of {ô}, {ŝ_i}, or {ê}.
Examples of {ô} include, but are not limited to:
the desired perceived spatial position of the audio signal;
Level of audio signal (possibly time-varying); and/or
The spectrum of the audio signal (possibly time-varying).
Examples of {ŝ_i} include, but are not limited to:
The position of the loudspeaker in the listening space;
Frequency response of the loudspeaker;
playback level limitation of the loudspeaker;
Parameters of the dynamic processing algorithm inside the loudspeaker, such as limiter gain;
Measurement or estimation of acoustic transmissions from each speaker to the other speakers;
Measurement of echo canceller performance on speaker; and/or
The relative synchronisation of the loudspeakers with respect to each other.
Examples of {ê} include, but are not limited to:
The location of one or more listeners or speakers in the playback space;
Measurement or estimation of acoustic transmission from each loudspeaker to the listening position;
Measurement or estimation of acoustic transmission from a speaker to a set of loudspeakers;
The location of some other landmark in the playback space; and/or
Measurement or estimation of acoustic transmission from each speaker to some other landmark in the playback space;
Using the new cost function defined in equation 28, the optimal set of activations may be found through minimization and possible post-normalization with respect to g, as previously described.
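One simple way to realize such an additional term C_j within the quadratic framework is another diagonal penalty that discourages activation of loudspeakers near a position of interest, for example a microphone that should hear a talker. The sketch below is an assumption-laden illustration (the functional form, parameters, and names are not taken from this disclosure); in a closed-form optimization of the kind sketched earlier, the extra matrix simply adds to the diagonal penalty term.

import numpy as np

def repulsion_penalty(speaker_positions, focus_position, strength=10.0, radius=1.0, order=2.0):
    # Extra diagonal cost term E, with C_j(g) = g*Eg, penalizing activation of
    # speakers close to focus_position (e.g., a microphone near a talker).
    dist = np.linalg.norm(np.asarray(speaker_positions, dtype=float)
                          - np.asarray(focus_position, dtype=float)[None, :], axis=1)
    return np.diag(strength * (radius / np.maximum(dist, 1e-3)) ** order)

speakers = np.array([[-2.0, 2.0], [2.0, 2.0], [0.0, 3.0], [-2.5, -1.0], [2.5, -1.0]])
E = repulsion_penalty(speakers, focus_position=[2.0, 2.0])
print(np.round(np.diag(E), 2))   # by far the largest penalty falls on the speaker at (2.0, 2.0)

# With the earlier closed-form sketch, E adds to the other diagonal penalty:
#   g_opt = -0.5 * inv(A + D + E) @ conj(B), followed by clipping and normalization.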
In some examples, one or more of the cost terms C_j may be determined by a dodging module (e.g., the dodging module 400 of fig. 6) as a function of one or more {ô} elements, one or more {ŝ_i} elements, one or more {ê} elements, or a combination thereof. In some such examples, the dodging solution 480 provided to the renderer may include one or more such terms. In other examples, one or more of such terms may be determined by the renderer. In some such examples, one or more of such terms may be determined by the renderer in response to the dodging solution 480. According to some examples, one or more of such terms may be determined according to an iterative process (e.g., method 800 of fig. 8).
FIG. 17 is a flowchart outlining one example of a method that may be performed by an apparatus or system, such as the apparatus or system shown in FIG. 1A. As with other methods described herein, the blocks of method 1700 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. The blocks of method 1700 may be performed by one or more devices, which may be (or may include) a control system, such as control system 160 shown in fig. 1A.
In this embodiment, block 1705 relates to receiving, by the control system and via the interface system, audio data. In this example, the audio data includes one or more audio signals and associated spatial data. According to this embodiment, the spatial data indicates an expected perceived spatial position corresponding to the audio signal. In some examples, the desired perceived spatial location may be explicit, for example, as indicated by location metadata such as Dolby Atmos location metadata. In other examples, the desired perceived spatial location may be implicit, e.g., the desired perceived spatial location may be a hypothetical location associated with a channel according to Dolby 5.1, Dolby 7.1, or another channel-based audio format. In some examples, block 1705 relates to a rendering module of a control system that receives audio data via an interface system.
According to this example, block 1710 involves rendering, by the control system, the audio data for reproduction via a set of loudspeakers of an environment, to produce rendered audio signals. In this example, rendering each of the one or more audio signals included in the audio data involves determining the relative activation of the set of loudspeakers in the environment by optimizing a cost function. According to this example, the cost is a function of a model of the perceived spatial position of the audio signal when played back over the set of loudspeakers in the environment. In this example, the cost is also a function of a measure of the proximity of the intended perceived spatial position of the audio signal to the position of each loudspeaker of the set of loudspeakers. In this embodiment, the cost is also a function of one or more additional dynamically configurable functions. In this example, the dynamically configurable functions are based on one or more of the following: proximity of loudspeakers to one or more listeners; proximity of loudspeakers to an attractive force location, where an attractive force is a factor that favors relatively higher loudspeaker activation closer to the attractive force location; proximity of loudspeakers to a repulsive force location, where a repulsive force is a factor that favors relatively lower loudspeaker activation closer to the repulsive force location; the capability of each loudspeaker relative to the other loudspeakers in the environment; synchronization of the loudspeakers with respect to the other loudspeakers; wake word performance; or echo canceller performance.
In this example, block 1715 involves providing the rendered audio signal to at least some of the group of loudspeakers of the environment via an interface system.
According to some examples, the model of perceived spatial locations may produce binaural responses corresponding to audio object locations at the listener's left and right ears. Alternatively or additionally, the model of perceived spatial location may place the perceived spatial location of the audio signal played from a set of loudspeakers at the centroid of the locations of the set of loudspeakers weighted by the associated activation gains of the loudspeakers.
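As an illustration of the latter model, the following minimal sketch (Python) computes a perceived spatial position as the centroid of hypothetical loudspeaker positions weighted by their activation gains; the positions, gains, and the choice of raw gains rather than squared gains (energies) as weights are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical 2-D loudspeaker positions (metres) and activation gains.
speaker_positions = np.array([[0.0, 2.0], [2.0, 2.0], [-2.0, 0.0], [2.0, 0.0], [0.0, -2.0]])
gains = np.array([0.1, 0.7, 0.0, 0.5, 0.2])

def perceived_position(positions, g):
    """Centroid of loudspeaker positions weighted by their activation gains.

    Squared gains (energies) could be used instead of raw gains; which weighting
    applies is an assumption of this sketch.
    """
    weights = g / g.sum()
    return weights @ positions

print(perceived_position(speaker_positions, gains))
```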
In some examples, the one or more additional dynamically configurable functions may be based at least in part on a level of the one or more audio signals. In some examples, the one or more additional dynamically configurable functions may be based at least in part on a spectrum of the one or more audio signals.
Some examples of method 1700 involve receiving loudspeaker layout information. In some examples, the one or more additional dynamically configurable functions may be based at least in part on the location of each loudspeaker in the environment.
Some examples of method 1700 involve receiving loudspeaker specification information. In some examples, the one or more additional dynamically configurable functions may be based at least in part on the capabilities of each loudspeaker, which may include one or more of the following: frequency response, playback level limits, or parameters of one or more loudspeaker dynamic processing algorithms.
According to some examples, the one or more additional dynamically configurable functions may be based at least in part on measurements or estimates of acoustic transmissions from each loudspeaker to other loudspeakers. Alternatively or additionally, the one or more additional dynamically configurable functions may be based at least in part on listener or speaker positions of one or more persons in the environment. Alternatively or additionally, the one or more additional dynamically configurable functions may be based at least in part on measurements or estimates of acoustic transmissions from each loudspeaker to a listener or speaker location. The estimation of the acoustic transmission may be based, for example, at least in part, on walls, furniture, or other objects that may reside between each loudspeaker and the listener or speaker location.
Alternatively or additionally, the one or more additional dynamically configurable functions may be based at least in part on object locations of one or more non-loudspeaker objects or landmarks in the environment. In some such embodiments, the one or more additional dynamically configurable functions may be based at least in part on a measurement or estimate of acoustic transmission from each loudspeaker to the object location or landmark location.
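To make the structure of blocks 1705 through 1715 concrete, the following minimal sketch (Python with NumPy/SciPy) renders a single audio object by numerically minimizing a cost built from a simple centroid-based spatial term, a proximity term, and additional per-speaker penalty terms. The function name render_object, the centroid spatial model, the 0.1 proximity weighting, and the example layout are illustrative assumptions, not the CMAP or FV cost functions of the disclosure.

```python
import numpy as np
from scipy.optimize import minimize

def render_object(obj_pos, spk_pos, penalties):
    """Find relative loudspeaker activations for one audio object by cost minimization.

    obj_pos   : desired perceived 2-D position of the object
    spk_pos   : (M, 2) array of loudspeaker positions
    penalties : (M,) per-speaker penalty values from the dynamically configurable
                cost terms (proximity to a talker, limiter state, echo-canceller
                performance, ...)
    """
    M = len(spk_pos)

    def cost(g):
        w = g / (g.sum() + 1e-12)                     # normalised activations
        centroid = w @ spk_pos                        # simple perceived-position model
        spatial = np.sum((centroid - obj_pos) ** 2)   # spatial error
        proximity = w @ np.linalg.norm(spk_pos - obj_pos, axis=1) ** 2
        penalty = w @ penalties                       # additional configurable terms
        return spatial + 0.1 * proximity + penalty

    res = minimize(cost, x0=np.full(M, 1.0 / M), method="L-BFGS-B",
                   bounds=[(0.0, 1.0)] * M)
    return res.x / (np.max(res.x) + 1e-12)            # scale so the largest activation is 1

# Hypothetical layout: four speakers in the corners, object front-left, and a
# heavy penalty on speaker 3 (e.g. because it is the one closest to a talker).
spk = np.array([[-2.0, 2.0], [2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])
print(render_object(np.array([-1.0, 1.5]), spk, np.array([0.0, 0.0, 0.0, 5.0])))
```

In this toy example, the penalty on speaker 3 shifts activation toward the remaining speakers while the spatial term keeps the modelled perceived position near the requested one.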
Flexible rendering may be implemented by employing one or more appropriately defined additional cost terms to achieve many new and useful behaviors. All of the example behaviors listed below operate by penalizing certain loudspeakers under certain conditions deemed undesirable. The end result is that these loudspeakers are activated less in the spatial rendering of the set of audio signals. In many of these cases, one might consider simply turning down the undesirable loudspeakers without any modification of the spatial rendering, but such a strategy could significantly degrade the overall balance of the audio content; certain components of the mix could, for example, become completely inaudible. With the disclosed embodiments, on the other hand, integrating these penalties into the core optimization of the rendering allows the rendering to adapt and perform the best possible spatial rendering with the remaining, less-penalized speakers. This is a more elegant, adaptable, and efficient solution.
Example uses include, but are not limited to:
Providing a more balanced spatial presentation around the listening area
It has been found that spatial audio is best presented across loudspeakers located at approximately the same distance from the intended listening area. A cost may be constructed such that loudspeakers significantly closer to or farther from the listening area than the average loudspeaker distance are penalized, thereby reducing their activation;
Move audio away or towards listener or speaker
If a user of the system is attempting to speak to an intelligent voice assistant of, or associated with, the system, it may be beneficial to create a cost that penalizes loudspeakers closer to the speaker. These loudspeakers are then activated less, allowing the microphones associated with them to hear the speaker better;
To provide a more intimate experience for a single listener while minimizing playback levels for others in the listening space, loudspeakers located far from the listener may be heavily penalized so that only the loudspeakers closest to the listener are activated most significantly;
moving audio away from or towards a landmark, zone or area
Some locations near the listening space may be considered sensitive, such as a baby crib, an office, a reading area, a study area, etc. In such cases, a cost may be constructed that penalizes the use of speakers near that location, zone, or area;
Alternatively, for the same (or a similar) scenario, the speaker system may already have generated measurements of the acoustic transmission from each speaker into the baby room, particularly when one of the speakers (with an attached or associated microphone) resides within the baby room. In this case, rather than using the physical proximity of each speaker to the baby room, a cost may be constructed that penalizes the use of speakers with high measured acoustic transmission into the room; and/or
Optimal use of the capabilities of the speaker
The capabilities of different loudspeakers may vary significantly. For example, one popular type of smart speaker contains only a single 1.6" full-range driver with limited low-frequency capability, while another smart speaker contains a much more capable 3" woofer. These capabilities are typically reflected in the frequency response of a speaker, and as such a set of responses associated with the speakers may be utilized in a cost term. At a particular frequency, speakers that are weaker relative to the other speakers (as measured from their frequency responses) may be penalized and therefore activated to a lesser extent. In some implementations, such frequency response values may be stored on the smart loudspeakers and then reported to the computing unit responsible for optimizing the flexible rendering;
Many speakers contain more than one driver, each responsible for playing a different frequency range. For example, one popular smart speaker is a two-way design, including a woofer for lower frequencies and a tweeter for higher frequencies. Typically, such a speaker contains crossover circuitry that divides the full-range playback audio signal into the appropriate frequency ranges and routes them to the corresponding drivers. Alternatively, such a speaker may provide the flexible renderer with playback access to each individual driver, as well as information about the capabilities (e.g., frequency response) of each individual driver. By applying cost terms as described above, in some examples the flexible renderer may automatically construct a crossover between the two drivers based on their relative capabilities at different frequencies;
The example uses of frequency response described above focus on the inherent capabilities of a speaker, but may not accurately reflect the capabilities of the speaker as placed in the listening environment. In some cases, the speaker frequency response as measured at the intended listening position may be obtained through a calibration procedure. Such measurements may be used instead of pre-computed responses to better optimize the use of the speakers. For example, a speaker may be inherently very capable at a particular frequency but may produce a very limited response at the intended listening position because of its placement (e.g., behind a wall or a piece of furniture). Capturing this response and feeding the measurement into an appropriate cost term can prevent significant activation of such a speaker;
Frequency response is only one aspect of a loudspeaker's playback capability. Many smaller loudspeakers begin to distort and then reach their excursion limits as the playback level increases, especially at lower frequencies. To reduce such distortion, many loudspeakers implement dynamics processing that keeps the playback level below some limiting threshold, which may vary with frequency. In cases where a speaker is close to or at these thresholds while other speakers participating in the flexible rendering are not, it makes sense to reduce the signal level sent to the limiting speaker and divert this energy to other, less-taxed speakers. According to some embodiments, this behavior may be achieved automatically by appropriately configuring the associated cost terms. Such cost terms may relate to one or more of the following:
■ A global playback volume associated with a limiting threshold of the loudspeaker is monitored. For example, a loudspeaker whose volume level is close to its limit threshold may be subject to more penalty;
■ The dynamic signal levels, possibly varying across frequency, are monitored in relation to the loudspeaker's limiting thresholds (which may also vary with frequency). For example, a loudspeaker whose monitored signal level is close to its limiting threshold may be subject to more penalty;
■ Parameters of the dynamic processing of the loudspeaker, such as limiting gain, are monitored directly. In some such examples, the parameter indicates that a more restrictive loudspeaker may be subject to more penalty; and/or
■ The actual instantaneous voltage, current, and power delivered by the amplifier to the loudspeaker are monitored to determine whether the loudspeaker is operating in a linear range. For example, a loudspeaker operating less linearly may be subject to more penalty;
Intelligent speakers with integrated microphones and an interactive voice assistant typically employ some type of echo cancellation to reduce the level of the audio signal played by the speaker as picked up by the recording microphones. The greater this reduction, the better the chance that the device will hear and understand a person speaking in the space. If the residual of the echo canceller is consistently high, this may indicate that the speaker is being driven into a non-linear region where prediction of the echo path becomes challenging. In such cases it may make sense to divert signal energy away from that speaker, and it may therefore be beneficial to include a cost term for echo canceller performance. Such a cost term may assign a high cost to speakers whose associated echo cancellers are performing poorly;
In order to achieve predictable imaging when rendering spatial audio over multiple loudspeakers, playback over the set of loudspeakers generally needs to be reasonably synchronized in time. For wired loudspeakers this is a given, but for many wireless loudspeakers synchronization can be challenging and the end result variable. In such cases, each loudspeaker may report its relative degree of synchronization with a target, and this degree may then be fed into a synchronization cost term. In some such examples, loudspeakers with a lower degree of synchrony may be subject to more penalty and are therefore excluded from the rendering. In addition, certain types of audio signals may not need to be closely synchronized, for example components of an audio mix that are intended to be diffuse or non-directional. In some implementations, such components may be tagged with metadata accordingly, and the synchronization cost term may be relaxed so that the penalty is reduced (one way such monitored quantities might be mapped to penalties is sketched after this list).
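As a rough illustration of how several of the monitored quantities in the list above might feed a cost term, the sketch below (Python) maps per-speaker limiter gain, echo-canceller ERLE, and a synchronization error to a single non-negative penalty value per speaker. The function name, the thresholds, and the linear mappings are illustrative assumptions, not values or formulas from the disclosure.

```python
import numpy as np

def speaker_penalties(limiter_gain_db, erle_db, sync_error_ms,
                      limiter_floor_db=-6.0, erle_target_db=20.0, sync_tol_ms=5.0):
    """Map monitored per-speaker quantities to non-negative penalty values.

    limiter_gain_db : limiter attenuation currently applied (0 = not limiting)
    erle_db         : measured echo return loss enhancement of each device's AEC
    sync_error_ms   : estimated playback clock error relative to the group
    """
    limiter_gain_db = np.asarray(limiter_gain_db, dtype=float)
    erle_db = np.asarray(erle_db, dtype=float)
    sync_error_ms = np.asarray(sync_error_ms, dtype=float)

    # Speakers already limiting beyond the allowed headroom are penalised
    # in proportion to how far past that headroom they are.
    p_limit = np.maximum(0.0, (limiter_floor_db - limiter_gain_db) / abs(limiter_floor_db))

    # Speakers whose echo canceller under-performs the target ERLE are penalised.
    p_echo = np.maximum(0.0, (erle_target_db - erle_db) / erle_target_db)

    # Speakers that are poorly synchronised are penalised.
    p_sync = np.maximum(0.0, (sync_error_ms - sync_tol_ms) / sync_tol_ms)

    return p_limit + p_echo + p_sync

print(speaker_penalties(limiter_gain_db=[0.0, -9.0, -1.0],
                        erle_db=[25.0, 12.0, 22.0],
                        sync_error_ms=[1.0, 2.0, 12.0]))
```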
Additional examples of embodiments are described next. Like the proximity cost defined in equations 25a and 25b, it may be convenient to express each new cost function term as a weighted sum of the absolute squares of the speaker activations, for example as follows:
Cj(g) = g*Wjg = ∑i wij|gi|² (29a)
where Wj is a diagonal matrix of weights wij describing the cost j associated with activating speaker i:
Wj = diag(w1j, …, wMj) (29b)
Combining equations 29a and 29b with the matrix quadratic version of the CMAP and FV cost functions given in equation 26 yields a potentially advantageous implementation of the generic extended cost function (of some embodiments) given in equation 28:
C(g)=g*Ag+Bg+C+g*Dg+∑j g*Wjg=g*(A+D+∑jWj)g+Bg+C (30)
With this definition of the new cost function terms, the overall cost function remains matrix quadratic, and the optimal set of activations g opt can still be found in closed form through differentiation of equation 30.
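Because equation 30 remains matrix quadratic, adding diagonal penalty matrices Wj changes only the matrix in the linear system that must be solved. The sketch below (Python/NumPy) illustrates this with a stand-in least-squares spatial cost; the matrices O, b, D, W1, and W2 are made-up examples with the same structure, not the disclosure's A, B, and C terms.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5                                       # number of loudspeakers

# Stand-in quadratic spatial cost ||b - O g||^2, so the quadratic part is O^T O.
O = rng.normal(size=(3, M))
b = rng.normal(size=3)
A = O.T @ O

# Proximity term g^T D g and two additional diagonal penalty terms g^T W_j g.
D = np.diag(rng.uniform(0.1, 0.5, size=M))
W1 = np.diag([0.0, 0.0, 4.0, 0.0, 0.0])     # e.g. speaker 2 is near a talker
W2 = np.diag([0.0, 1.0, 0.0, 0.0, 1.0])     # e.g. speakers 1 and 4 are limiting

# Minimising ||b - O g||^2 + g^T (D + W1 + W2) g reduces to one linear solve:
# (A + D + W1 + W2) g = O^T b.
g_opt = np.linalg.solve(A + D + W1 + W2, O.T @ b)
print(g_opt)
```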
It is useful to treat each of the weight terms wij as a function of a given continuous penalty value pij for each of the loudspeakers. In one example embodiment, this penalty value is the distance from the object (to be rendered) to the loudspeaker under consideration. In another example embodiment, this penalty value indicates the inability of the given loudspeaker to reproduce certain frequencies. Based on the penalty value, the weight terms wij can be parameterized as:
wij = αj fj(pij / τj) (32)
where αj represents a pre-factor (which accounts for the global strength of the weight term), τj represents a penalty threshold (around or beyond which the weight term becomes significant), and fj(x) represents a monotonically increasing function. For example, with fj(x) = x^βj, the weight term has the following form:
wij = αj (pij / τj)^βj (33)
where αj, βj, and τj are adjustable parameters indicating, respectively, the global strength of the penalty, the abruptness of the penalty onset, and the extent of the penalty. Care should be taken in setting these adjustable values so that the relative contribution of the cost term Cj with respect to any other additional cost terms, and with respect to C spatial and C proximity, is appropriate for achieving the desired result. For example, as a rule of thumb, if one wishes a particular penalty to clearly dominate the others, it may be appropriate to set its strength αj to about ten times the next largest penalty strength.
If all loudspeakers are penalized, it is often convenient to subtract the minimum penalty from all weight terms in the post-processing so that at least one of the loudspeakers is not penalized:
wij→w′ij=wij-mini(wij) (34)
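A brief numerical sketch of this parameterization and the post-processing of equation 34 (Python); the penalty values and tunings are made up for illustration:

```python
import numpy as np

def penalty_weights(p, alpha, beta, tau):
    """w_ij = alpha * (p_ij / tau)**beta for one cost term j (equation 33)."""
    return alpha * (np.asarray(p, dtype=float) / tau) ** beta

# Hypothetical continuous penalty values, e.g. object-to-speaker distances in metres.
p = np.array([0.5, 1.0, 2.5, 4.0])
w = penalty_weights(p, alpha=20.0, beta=3.0, tau=np.max(p))

# Post-processing of equation 34: if every speaker is penalised, subtract the
# minimum so that at least one speaker carries no penalty.
w_prime = w - np.min(w)
print(w, w_prime)
```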
As described above, many possible use cases may be implemented using the new cost function terms described herein (and similar new cost function terms employed in accordance with other embodiments). Next, more specific details are described with the following three examples: moving audio toward a listener or speaker, moving audio away from a listener or speaker, and moving audio away from a landmark.
In a first example, what will be referred to herein as an "attractive force" is used to pull audio toward a location, which in some examples may be a listener or speaker location, a landmark location, a furniture location, etc. This location may be referred to herein as an "attractive force location" or "attractor location". As used herein, an "attractive force" is a factor that favors relatively higher loudspeaker activation closer to the attractive force location. According to this example, the weights wij take the form of equation 33, with the continuous penalty value pij given by the distance of the ith speaker (at position li) from a fixed attractor location lj, and the threshold τj given by the maximum of these distances over all loudspeakers:
pij = ||li − lj|| (35a)
τj = maxi pij (35b)
To illustrate the use case of "pulling" audio towards a listener or speaker, αj = 20 and βj = 3 are set, and lj is set to a vector corresponding to a listener/speaker position of 180 degrees (bottom center of the plot). These values of αj, βj, and lj are merely examples. In some embodiments, αj may be in the range of 1 to 100 and βj may be in the range of 1 to 25. FIG. 18 is a diagram of speaker activations in an example embodiment. In this example, FIG. 18 shows speaker activations 1505b, 1510b, 1515b, 1520b, and 1525b, which comprise the optimal solution of the cost function for the same speaker positions as in FIGS. 15 and 16, with the addition of the attractive force represented by wij. FIG. 19 is a diagram of object rendering positions in an example embodiment. In this example, FIG. 19 shows the ideal object positions 1630b for a large number of possible object angles and the corresponding actual rendering positions 1635b for those objects, connected to the ideal object positions 1630b by dashed lines 1640b. The skew of the actual rendering positions 1635b towards the fixed location lj illustrates the effect of the attractor weights on the optimal solution of the cost function.
In the second and third examples, a "repulsive force" is used to "push" audio away from a location, which may be a person's location (e.g., a listener location, a speaker location, etc.) or another location, such as a landmark location, a furniture location, etc. In some examples, the repulsive force may be used to push audio away from an area or zone of the listening environment, such as an office area, a reading area, a bed or bedroom area (e.g., a crib or bedroom), and so on. According to some such examples, a particular location may be used as a representation of a zone or area. For example, the location representing a crib may be an estimated location of the baby's head, an estimated sound source location corresponding to the baby, or the like. This location may be referred to herein as a "repulsive force location" or "repulsive location". As used herein, a "repulsive force" is a factor that favors relatively lower loudspeaker activation closer to the repulsive force location. According to this example, pij and τj are defined with respect to a fixed repulsive location lj, similarly to the attractive force in equations 35a and 35b, but with the penalty now largest for the loudspeakers closest to that location; one suitable choice is:
pij = maxi ||li − lj|| − ||li − lj|| (35c)
τj = maxi pij (35d)
To illustrate the use case of pushing audio away from a listener or speaker, in one example αj = 5 and βj = 2 are set, and lj is set to a vector corresponding to a listener/speaker position of 180 degrees (bottom center of the plot). These values of αj, βj, and lj are merely examples. As noted above, in some examples αj may be in the range of 1 to 100 and βj may be in the range of 1 to 25. FIG. 20 is a diagram of speaker activations in an example embodiment. According to this example, FIG. 20 shows speaker activations 1505c, 1510c, 1515c, 1520c, and 1525c, which comprise the optimal solution of the cost function for the same speaker positions as in the previous figures, with the addition of the repulsive force represented by wij. FIG. 21 is a diagram of object rendering positions in an example embodiment. In this example, FIG. 21 shows the ideal object positions 1630c for a large number of possible object angles and the corresponding actual rendering positions 1635c for those objects, connected to the ideal object positions 1630c by dashed lines 1640c. The skew of the actual rendering positions 1635c away from the fixed location lj illustrates the effect of the repulsive force weights on the optimal solution of the cost function.
A third example use case is to "push" audio away from an acoustically sensitive landmark, such as the door to the room of a sleeping infant. Similarly to the previous example, lj is set to a vector corresponding to a door position of 180 degrees (bottom center of the plot). To achieve a stronger repulsive force and skew the sound field entirely to the front of the primary listening space, αj = 20 and βj = 5 are set. FIG. 22 is a diagram of speaker activations in an example embodiment. Again, in this example, FIG. 22 shows speaker activations 1505d, 1510d, 1515d, 1520d, and 1525d, which comprise the optimal solution of the cost function for the same set of speaker positions, with the addition of the stronger repulsive force. FIG. 23 is a diagram of object rendering positions in an example embodiment. Again, in this example, FIG. 23 shows the ideal object positions 1630d for a large number of possible object angles and the corresponding actual rendering positions 1635d for those objects, connected to the ideal object positions 1630d by dashed lines 1640d. The skew of the actual rendering positions 1635d illustrates the effect of the stronger repulsive force weights on the optimal solution of the cost function.
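The sketch below (Python) computes per-speaker weights for an attractor at a listener position and for a repeller at a sensitive landmark, using distance-based penalties in the form of equation 33. The circular layout, the tunings, and in particular the inverted-distance penalty used for the repelling case are illustrative assumptions rather than the exact formulation above.

```python
import numpy as np

def force_weights(spk_pos, force_pos, alpha, beta, attract=True):
    """Per-speaker weights w_ij for an attracting or repelling location.

    For an attractor the penalty grows with distance from the location, so far
    speakers are discouraged; for a repeller the penalty is inverted so that it
    is largest for the speakers closest to the location.
    """
    d = np.linalg.norm(spk_pos - force_pos, axis=1)
    p = d if attract else np.max(d) - d
    tau = np.max(p) + 1e-12
    return alpha * (p / tau) ** beta

# Hypothetical layout: five speakers on a 2 m circle; the 180-degree position
# (bottom centre of the plots) is used both as the listener and the door location.
angles = np.deg2rad([0.0, 72.0, 144.0, 216.0, 288.0])
spk = 2.0 * np.stack([np.sin(angles), np.cos(angles)], axis=1)
position_180 = np.array([0.0, -2.0])

print(force_weights(spk, position_180, alpha=20.0, beta=3.0, attract=True))   # pull towards listener
print(force_weights(spk, position_180, alpha=20.0, beta=5.0, attract=False))  # push away from door
```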
In a further example of method 1700 of FIG. 17, a use case involves responding to the selection of two or more audio devices in an audio environment to perform ducking, or other audio processing corresponding to a ducking solution, by imposing a penalty on those two or more audio devices. Following the earlier examples, in some embodiments the selection of two or more audio devices may take the form of a value fi, a unitless parameter that controls the extent to which audio processing changes occur on audio device i. Many combinations are possible. In a simple example, the penalty assigned to audio device i for ducking purposes may be chosen directly as wij = fi. In some examples, one or more such weights may be determined by a ducking module (e.g., the ducking module 400 of FIG. 6). In some such examples, the ducking solution 480 provided to the renderer may include one or more such weights. In other examples, these weights may be determined by the renderer. In some such examples, one or more such weights may be determined by the renderer in response to the ducking solution 480. According to some examples, one or more such weights may be determined according to an iterative process (e.g., method 800 of FIG. 8).
Further to the example of previously determined weights, in some embodiments the weights may be determined by using fi as the penalty value in the parameterization of equation 33, for example:
wij = αj (fi / τj)^βj
where αj, βj, and τj are adjustable parameters indicating, respectively, the global strength of the penalty, the abruptness of the penalty onset, and the extent of the penalty, as described above with reference to equation 33.
The previous example also introduced si, expressed directly as a speech-to-echo ratio improvement in decibels at audio device i, as an alternative to the unitless parameter fi for describing the ducking. When the solution is represented in this way, the penalty can alternatively be determined by replacing fi with a term that converts the dB quantity si to a value in the range from 0 to infinity (for example, 10^(−si/20) − 1). As si becomes more negative, the penalty increases, moving more audio away from device i.
In some examples (e.g., in the two weight formulations above), one or more of the adjustable parameters αj, βj, and τj may be determined by a ducking module (e.g., the ducking module 400 of FIG. 6). In some such examples, the ducking solution 480 provided to the renderer may include one or more such adjustable parameters. In other examples, one or more adjustable parameters may be determined by the renderer. In some such examples, one or more adjustable parameters may be determined by the renderer in response to the ducking solution 480. According to some examples, one or more such adjustable parameters may be determined according to an iterative process (e.g., method 800 of FIG. 8).
The ducking penalty described above may be understood as one of several penalty terms combined from multiple simultaneous use cases. For example, the penalty described in equations 35c and 35d may be used to "push" audio away from a sensitive landmark, while the term fi or si, determined by the decision-making aspects, may simultaneously be used to "push" audio away from a microphone position where an improvement in SER is desired.
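The following sketch (Python) shows how a per-device ducking request — either the unitless fi or an SER improvement si in decibels — might be turned into a penalty and combined with a landmark repeller penalty. The dB-to-penalty conversion, the equation-33-style mapping, and all numbers are illustrative assumptions, not an exact mapping from the disclosure.

```python
import numpy as np

def ducking_penalty_from_f(f, alpha=10.0, beta=2.0, tau=1.0):
    """Penalty from the unitless ducking parameter f_i, using an equation-33-style mapping."""
    return alpha * (np.asarray(f, dtype=float) / tau) ** beta

def ducking_penalty_from_s(s_db, alpha=10.0, beta=2.0, tau=1.0):
    """Penalty from an SER improvement s_i in dB (s_i <= 0 requests ducking of device i).

    The dB value is first converted to a value in [0, inf): 0 when s_i = 0, growing
    as s_i becomes more negative (a 20*log10 amplitude convention is assumed here).
    """
    f_equiv = np.maximum(10.0 ** (-np.asarray(s_db, dtype=float) / 20.0) - 1.0, 0.0)
    return alpha * (f_equiv / tau) ** beta

# Hypothetical scene: device 1 should be ducked to improve SER at its microphone,
# and device 3 sits near a sensitive landmark (its repeller penalty computed elsewhere).
w_duck = ducking_penalty_from_s([0.0, -12.0, 0.0, 0.0])
w_landmark = np.array([0.0, 0.0, 0.0, 8.0])

# Multiple simultaneous use cases simply contribute additional penalty terms.
print(ducking_penalty_from_f([0.0, 0.7, 0.0, 0.0]))
print(w_duck + w_landmark)
```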
Aspects of some disclosed embodiments include a system or apparatus configured (e.g., programmed) to perform one or more of the disclosed methods, and a tangible computer-readable medium (e.g., disk) storing code for implementing one or more of the disclosed methods or steps thereof. For example, the system may be or include a programmable general purpose processor, digital signal processor, or microprocessor programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including one or more of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, memory, and a processing subsystem programmed (and/or otherwise configured) to perform one or more of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some disclosed embodiments are implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and/or otherwise configured) to perform the required processing on the audio signal(s), including performance of one or more of the disclosed methods. Alternatively, some embodiments (or elements thereof) may be implemented as a general-purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and memory) programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more of the disclosed methods or steps thereof. Alternatively, elements of some disclosed embodiments are implemented as a general-purpose processor or DSP configured (e.g., programmed) to perform one or more of the disclosed methods or steps thereof, and the system further includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general-purpose processor configured to perform one or more of the disclosed methods or steps thereof will typically be coupled to an input device (e.g., a mouse and/or keyboard), memory, and a display device.
Another aspect of some disclosed embodiments is a computer-readable medium (e.g., a disk or other tangible storage medium) storing code (e.g., an encoder executable to perform any embodiment of one or more disclosed methods or steps thereof) for performing any embodiment of one or more disclosed methods or steps thereof.
Although specific embodiments and applications have been described herein, it will be apparent to those of ordinary skill in the art that many more modifications than mentioned herein are possible without departing from the scope of the materials described and claimed herein. It is to be understood that while certain embodiments have been illustrated and described, the disclosure is not limited to the specific examples described and illustrated or the specific methods described.
Aspects of the invention may be understood from the enumerated example embodiments (EEEs) below:
EEE 1. An audio processing method comprising:
Receiving, by the control system, output signals from one or more microphones in the audio environment, the output signals including signals corresponding to a current utterance of a person;
determining, by the control system, one or more audio processing variations responsive to the output signals and based at least in part on audio device location information and echo management system information, to apply to audio data rendered as loudspeaker feed signals for two or more audio devices in the audio environment, the audio processing variations comprising a reduction in loudspeaker reproduction level of one or more loudspeakers in the audio environment; and
One or more types of audio processing variations are applied by the control system.
EEE 2. The method of EEE 1, wherein at least one of the one or more types of audio processing variations corresponds to an increased signal-to-echo ratio.
EEE 3. The method of EEE 1 or EEE 2, wherein the echo management system information comprises a model of echo management system performance.
EEE 4. The method of EEE 3, wherein the model of echo management system performance comprises an Acoustic Echo Canceller (AEC) performance matrix.
EEE 5. The method of EEE 3 or EEE 4, wherein the model of echo management system performance includes a measure of expected echo return loss enhancement provided by the echo management system.
EEE 6. The method of any of EEEs 1-5, wherein the one or more types of audio processing variations are based at least in part on acoustic models of inter-device echoes and intra-device echoes.
EEE 7. The method of any of EEEs 1-6, wherein the one or more types of audio processing variations are based at least in part on a mutual audibility matrix.
EEE 8. The method of any of EEEs 1-7, wherein the one or more types of audio processing variations are based at least in part on an estimated location of the person.
EEE 9. The method of EEE 8, wherein the estimated location of the person is based at least in part on output signals from a plurality of microphones in the audio environment.
EEE 10. The method of EEE 8 or EEE 9, wherein the one or more types of audio processing variations involve changing a rendering process to warp the rendering of an audio signal away from the estimated location of the person.
EEE 11. The method of any of EEEs 1-10, wherein the one or more types of audio processing variations are based at least in part on a listening objective.
EEE 12. The method of EEE 11, wherein the listening objective comprises at least one of a spatial component or a frequency component.
EEE 13. The method of any of EEEs 1-12, wherein the one or more types of audio processing variations are based at least in part on one or more constraints.
EEE 14. The method of EEE 13, wherein the one or more constraints are based on a perceptual model.
EEE 15. The method of EEE 13 or EEE 14, wherein the one or more constraints comprise one or more of audio content energy retention, audio spatial retention, an audio energy vector, or a regularization constraint.
EEE 16. The method of any of EEEs 1-15, further comprising updating at least one of an acoustic model of the audio environment or a model of echo management system performance after applying the one or more types of audio processing changes.
EEE 17. The method of any of EEEs 1-16, wherein determining the one or more types of audio processing variations is based on optimization of a cost function.
EEE 18. The method of any of EEEs 1-17, wherein the one or more types of audio processing variations involve spectral modification.
EEE 19. The method of EEE 18, wherein the spectral modification involves reducing the level of audio data in a frequency band between 500 Hz and 3 kHz.
EEE 20. The method of any of EEEs 1-19, wherein the current utterance comprises a wake-up word utterance.
EEE 21. An apparatus configured to perform the method of any one of EEEs 1 to 20.
EEE 22. A system configured to perform the method of any one of EEEs 1-20.
EEE 23. One or more non-transitory media having software stored thereon that includes instructions for controlling one or more devices to perform the method of any of EEEs 1-20.

Claims (23)

1. An audio processing method, comprising:
Receiving, by the control system, output signals from one or more microphones in the audio environment, the output signals including signals corresponding to a current utterance of a person;
determining, by the control system, one or more audio processing variations responsive to the output signals and based at least in part on audio device location information and echo management system information, to apply to audio data rendered as loudspeaker feed signals for two or more audio devices in the audio environment, the audio processing variations comprising a reduction in loudspeaker reproduction level of one or more loudspeakers in the audio environment; and
One or more types of audio processing changes are caused to be applied by the control system.
2. The audio processing method of claim 1, wherein at least one of the one or more types of audio processing variations corresponds to an increased signal-to-echo ratio.
3. The audio processing method of claim 1 or claim 2, wherein the echo management system information comprises a model of echo management system performance.
4. The audio processing method of claim 3, wherein the model of echo management system performance comprises an Acoustic Echo Canceller (AEC) performance matrix.
5. An audio processing method as claimed in claim 3 or claim 4, wherein the model of echo management system performance comprises a measure of expected echo return loss enhancement provided by an echo management system.
6. The audio processing method of any of claims 1 to 5, wherein the one or more types of audio processing variations are based at least in part on acoustic models of inter-device echoes and intra-device echoes.
7. The audio processing method of any of claims 1 to 6, wherein the one or more types of audio processing variations are based at least in part on a mutual audibility matrix, the mutual audibility matrix representing energy of echo paths between the audio devices.
8. The audio processing method of any of claims 1 to 7, wherein the one or more types of audio processing variations are based at least in part on an estimated location of the person.
9. The audio processing method of claim 8, wherein the estimated location of the person is based at least in part on output signals from a plurality of microphones in the audio environment.
10. The audio processing method of claim 8 or claim 9, wherein the one or more types of audio processing variations involve changing a rendering process to warp the rendering of an audio signal away from the estimated location of the person.
11. The audio processing method of any of claims 1 to 10, wherein the one or more types of audio processing variations are based at least in part on a listening objective.
12. The audio processing method of claim 11, wherein the listening objective comprises at least one of a spatial component or a frequency component.
13. The audio processing method of any of claims 1 to 12, wherein the one or more types of audio processing variations are based at least in part on one or more constraints.
14. The audio processing method of claim 13, wherein the one or more constraints are based on a perceptual model.
15. The audio processing method of claim 13 or claim 14, wherein the one or more constraints comprise one or more of audio content energy preservation, audio spatial preservation, audio energy vectors, or regularization constraints.
16. The audio processing method of any of claims 1 to 15, further comprising updating at least one of an acoustic model of the audio environment or a model of echo management system performance after causing the one or more types of audio processing changes to be applied.
17. The audio processing method of any of claims 1 to 16, wherein determining the one or more types of audio processing variations is based on optimization of a cost function.
18. The audio processing method of any of claims 1 to 17, wherein the one or more types of audio processing variations involve spectral modification.
19. The audio processing method of claim 18, wherein the spectral modification involves reducing the level of audio data in a frequency band between 500 Hz and 3 kHz.
20. The audio processing method of any of claims 1 to 19, wherein the current utterance comprises a wake-up word utterance.
21. An apparatus configured to perform the method of any one of claims 1 to 20.
22. A system configured to perform the method of any one of claims 1 to 20.
23. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of claims 1-20.