EP3271744B1 - Ultrasonic echo canceler-based technique to detect participant presence at a video conference endpoint - Google Patents
- Publication number
- EP3271744B1 (application EP16713210.9A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- spatial region
- ultrasonic
- people
- signal
- estimate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S15/00—Systems using the reflection or reradiation of acoustic waves, e.g. sonar systems
- G01S15/02—Systems using the reflection or reradiation of acoustic waves, e.g. sonar systems using reflection of acoustic waves
- G01S15/04—Systems determining presence of a target
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S15/00—Systems using the reflection or reradiation of acoustic waves, e.g. sonar systems
- G01S15/02—Systems using the reflection or reradiation of acoustic waves, e.g. sonar systems using reflection of acoustic waves
- G01S15/50—Systems of measurement, based on relative movement of the target
- G01S15/52—Discriminating between fixed and moving objects or between objects moving at different speeds
- G01S15/523—Discriminating between fixed and moving objects or between objects moving at different speeds for presence detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/40—Support for services or applications
- H04L65/403—Arrangements for multi-party communication, e.g. for conferences
- H04L65/4038—Arrangements for multi-party communication, e.g. for conferences with floor control
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/80—Responding to QoS
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
- H04N7/142—Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
Definitions
- the present disclosure relates to detecting the presence of people using ultrasonic sound.
- a video conference endpoint includes a camera and a microphone to capture video and audio of a participant in a meeting room, and a display to present video. While no participant is in the meeting room, the endpoint may be placed in a standby or sleep mode to conserve power. In standby, components of the endpoint, such as the camera and display, may be deactivated or turned off. When a participant initially enters the meeting room, the endpoint remains in standby until the participant manually wakes up the endpoint using a remote control or other touch device. If the participant is unfamiliar with the endpoint or if the touch device is not readily available, the simple act of manually activating the endpoint may frustrate the participant and diminish his or her experience.
- US 2010/0226487 A1 discloses a video conferencing endpoint which controls its power state using information received by environmental sensors.
- in a lower-power state, a microphone is active while a video camera is inactive.
- upon detecting sound energy above a threshold level and above a threshold frequency, the system transitions to a higher power state by applying power to the video camera. Captured video information is analysed to detect motion. If motion is detected, the system automatically transitions to a yet higher power state.
- Video conference environment 100 includes video conference endpoints 104 operated by local users/participants 106 (also referred to as "people" 106) and configured to establish audio-visual teleconference collaboration sessions with each other over a communication network 110.
- Communication network 110 may include one or more wide area networks (WANs), such as the Internet, and one or more local area networks (LANs).
- a conference server 102 may also be deployed to coordinate the routing of audio-video streams among the video conference endpoints.
- Each video conference endpoint 104 may include multiple video cameras (VC) 112, a video display 114, a loudspeaker (LDSPKR) 116, and one or more microphones (MIC) 118.
- Endpoints 104 may be wired or wireless communication devices equipped with the aforementioned components, such as, but not limited to laptop and tablet computers, smartphones, etc.
- endpoints 104 capture audio/video from their local participants 106 with microphone 118/VC 112, encode the captured audio/video into data packets, and transmit the data packets to other endpoints or to the conference server 102.
- endpoints 104 decode audio/video from data packets received from the conference server 102 or other endpoints and present the audio/video to their local participants 106 via loudspeaker 116/display 114.
- Video conference endpoint 104 includes video cameras 112A and 112B positioned proximate and centered on display 114. Cameras 112A and 112B (collectively referred to as "cameras 112") are each operated under control of endpoint 104 to capture video of participants 106 seated around a table 206 opposite from or facing (i.e., in front of) the cameras (and display 114). The combination of two center video cameras depicted in FIG. 2 is only one example of many possible camera combinations that may be used.
- microphone 118 is positioned adjacent to, and centered along, a bottom side of display 114 (i.e., below the display) so as to receive audio from participants 106 in room 204, although other positions for the microphone are possible.
- video conference endpoint 104 includes an ultrasonic echo canceler to detect whether participants are present (i.e., to detect "people presence") in room 204. Also, endpoint 104 may use people presence detection decisions from the ultrasonic echo canceler to transition the endpoint from sleep to awake or vice versa, as appropriate.
- the ultrasonic echo canceler is described below in connection with FIG. 4 .
- Controller 308 includes a network interface unit 342, a processor 344, and memory 348.
- the network interface (I/F) unit 342 is, for example, an Ethernet card or other interface device that allows the controller 308 to communicate over communication network 110.
- Network I/F unit 342 may include wired and/or wireless connection capability.
- Processor 344 may include a collection of microcontrollers and/or microprocessors, for example, each configured to execute respective software instructions stored in the memory 348.
- the collection of microcontrollers may include, for example: a video controller to receive, send, and process video signals related to display 114 and video cameras 112; an audio processor to receive, send, and process audio signals (in human audible and ultrasonic frequency ranges) related to loudspeaker 116 and microphone array 118; and a high-level controller to provide overall control.
- Portions of memory 348 (and the instructions therein) may be integrated with processor 344.
- the terms "audio" and "sound" are synonymous and interchangeable.
- Processor 344 may send pan, tilt, and zoom commands to video cameras 112 to control the cameras. Processor 344 may also send wakeup (i.e., activate) and sleep (i.e., deactivate) commands to video cameras 112.
- the camera wakeup command is used to wakeup cameras 112 to a fully powered-on operational state so they can capture video, while the camera sleep command is used to put the cameras to sleep to save power. In the sleep state, portions of cameras 112 are powered-off or deactivated and the cameras are unable to capture video.
- Processor 344 may similarly send wakeup and sleep commands to display 114 to wakeup the display or put the display to sleep. In another embodiment, processor 344 may selectively wakeup and put to sleep portions of controller 308 while the processor remains active.
- endpoint 104 When any of cameras 112, display 114, and portions of controller 308 are asleep, endpoint 104 is said to be in standby or asleep (i.e., in the sleep mode). Conversely, when all of the components of endpoint 104 are awake and fully operational, endpoint 104 is said to be awake. Operation of the aforementioned components of endpoint 104 in sleep and awake modes, and sleep and wakeup commands that processor 344 may issue to transition the components between the sleep and awake modes are known to those of ordinary skill in the relevant arts.
- the memory 348 may comprise read only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, electrical, optical, or other physical/tangible (e.g., non-transitory) memory storage devices.
- the memory 348 may comprise one or more computer readable storage media (e.g., a memory device) encoded with software comprising computer executable instructions and when the software is executed (by the processor 344) it is operable to perform the operations described herein.
- the memory 348 stores or is encoded with instructions for control logic 350 to perform operations described herein to (i) implement an ultrasonic echo canceler to detect a change in people presence, and (ii) wakeup endpoint 104 or put the endpoint to sleep based on the detected people presence.
- memory 348 stores data/information 356 used and generated by logic 350, including, but not limited to, adaptive filter coefficients, power estimate thresholds indicative of people presence, predetermined timeouts, and current operating modes of the various components of endpoint 104 (e.g., sleep and awake states), as described below.
- Ultrasonic echo canceler 400 includes loudspeaker 116, microphone 118, analysis filter banks 404 and 406, a differencer S (i.e. a subtractor S), an adaptive filter 407 associated with adaptive filter coefficients, a power estimator 408, and a people presence detector 410.
- Analysis filter banks 404 and 406, differencer S, adaptive filter 407, power estimator 408, and detector 410 represent ultrasonic sound signal processing modules that may be implemented in controller 308.
- ultrasonic echo canceler 400 detects people presence in room 204 (i.e., whether people are present in the room).
- controller 308 uses the people presence indications to selectively wakeup endpoint 104 when people are present (e.g., have entered the room) or put the endpoint to sleep when people are not present (e.g., have left the room), as indicated by the echo canceler.
- Echo canceler 400 and controller 308 perform the aforementioned operations automatically, i.e., without manual intervention. Also, echo canceler 400 and controller 308 are operational to perform the operations described herein while endpoint 104 (or components thereof) is both awake and asleep.
- Ultrasonic echo canceler 400 operates as follows. Controller 308 generates an ultrasonic signal x(n), where n is a time index that increases with time, and provides the ultrasonic signal x(n) to an input of loudspeaker 116. Loudspeaker 116 transmits ultrasonic signal x(n) into a spatial region (e.g., room 204). Ultrasonic signal x(n) has a frequency in an audio frequency range that is generally beyond the frequency range of human hearing, but which can be transmitted from most loudspeakers and picked up by most microphones.
- This frequency range is generally accepted as approximately 20 kHz and above; however, embodiments described herein may also operate at frequencies below 20 kHz (e.g., 19 kHz) that most people would not be able to hear.
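As a concrete illustration of such a probe signal, the following sketch generates a stationary band-limited noise signal confined to a near-ultrasonic band. The 48 kHz sample rate, 19-21 kHz band edges, and one-second duration are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def ultrasonic_probe(fs=48000, duration_s=1.0, f_lo=19000.0, f_hi=21000.0, seed=0):
    """Stationary band-limited noise probe; band edges and sample rate
    are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n = int(fs * duration_s)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0  # keep only the probe band
    x = np.fft.irfft(spectrum, n)
    return x / np.max(np.abs(x))  # normalize to full scale
```

A noise probe of this kind is stationary with low autocorrelation at non-zero lags within the band, which suits the convergence behavior discussed later in this description.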
- the transmitted ultrasonic signal bounces around in room 204 before it is received and thereby picked up by microphone 118 via an echo path 420.
- Microphone 118 transduces sound received at the microphone into a microphone signal y(n), comprising ultrasonic echo u(n), local sound v(n), and background noise w(n).
- Microphone 118 provides microphone signal y(n) to analysis filter bank 406, which transforms the microphone signal y(n) into a time-frequency domain including multiple ultrasonic frequency subbands Y(m,1)-Y(m,N) spanning an ultrasonic frequency range. Also, analysis filter bank 404 transforms ultrasonic signal x(n) into a time-frequency domain including multiple ultrasonic frequency subbands X(m,1)-X(m,N) spanning an ultrasonic frequency range.
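A toy version of the analysis filter banks can be sketched as a windowed FFT that keeps a handful of ultrasonic bins as the subbands X(m,1)-X(m,N). The frame length, hop size, number of subbands, and band placement below are assumptions for illustration only:

```python
import numpy as np

def to_subbands(sig, frame_len=512, hop=256, fs=48000, f_lo=19000.0, n_bands=8):
    """Toy analysis filter bank: Hann-windowed FFT frames, keeping n_bands
    adjacent bins starting near f_lo as the ultrasonic subbands X(m, k)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(sig) - frame_len) // hop
    k0 = int(round(f_lo * frame_len / fs))  # first ultrasonic bin index
    X = np.empty((n_frames, n_bands), dtype=complex)
    for m in range(n_frames):
        frame = sig[m * hop : m * hop + frame_len] * win
        X[m] = np.fft.rfft(frame)[k0 : k0 + n_bands]
    return X
```

In this sketch the same function would be applied to both the loudspeaker signal x(n) and the microphone signal y(n), producing the subband streams X(m,k) and Y(m,k).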
- In a k-th one of the ultrasonic frequency subbands X(m,k), adaptive filter 407 generates an estimate Û(m,k) of the subband echo signal U(m,k), where m denotes the time frame index. Differencer S subtracts the echo estimate Û(m,k) from the subband microphone signal Y(m,k) output by analysis filter bank 406 to form an error (signal) Z(m,k) that is fed back into adaptive filter 407. Adaptive filter coefficients of adaptive filter 407 are adjusted responsive to the fed-back error signal.
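The echo-estimate-and-feedback loop can be sketched with a per-subband complex NLMS update. The filter order, step size mu, and regularization eps below are illustrative choices; the patent does not fix these values:

```python
import numpy as np

def nlms_echo_cancel(X, Y, order=4, mu=0.2, eps=1e-6):
    """Per-subband NLMS echo canceler: predict the echo U_hat(m, k) from the
    last `order` loudspeaker subband samples X(., k), subtract it from the
    microphone subband Y(m, k) to get the error Z(m, k), and adapt the taps
    from the fed-back error."""
    n_frames, n_bands = X.shape
    H = np.zeros((n_bands, order), dtype=complex)  # adaptive taps per subband
    Z = np.zeros(Y.shape, dtype=complex)
    for m in range(order - 1, n_frames):
        for k in range(n_bands):
            xvec = X[m - order + 1 : m + 1, k][::-1]   # newest sample first
            u_hat = np.vdot(H[k], xvec)                # echo estimate U_hat(m, k)
            Z[m, k] = Y[m, k] - u_hat                  # error signal Z(m, k)
            norm = np.vdot(xvec, xvec).real + eps
            H[k] = H[k] + mu * np.conj(Z[m, k]) * xvec / norm  # NLMS update
    return Z
```

Smaller mu (or larger eps) slows adaptation but deepens convergence, which is the tradeoff discussed below in connection with detection robustness.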
- Power estimator 408 computes a running estimate of the mean squared error (power) E[|Z(m,k)|²] of the error signal, which can be decomposed into three terms: the first term is the R_X^k-norm of the divergence between the optimal and estimated filter coefficients; the second term is the power of the local sound signal v(n); and the third term is the power of the background noise w(n). It is assumed in the following that the power of the background noise is stationary and time-invariant.
- When somebody enters room 204, the acoustic room impulse response will change abruptly and adaptive filter 407 will no longer be in a converged state; it will therefore provide a poor estimate Û(m,k) of subband echo signal U(m,k). Also, as long as there is movement in the room, adaptive filter 407 attempts to track the continuously changing impulse response and may never achieve the same depth of convergence. Furthermore, movement in the room may cause Doppler shift, so that some of the energy in one frequency subband leaks over to a neighboring subband. The Doppler effect can result in both a changed impulse response for a subband and also a mismatch between audio content in the loudspeaker subband output from analysis filter bank 404 and the microphone subband output from analysis filter bank 406.
- power estimator 408 estimates the power of error signal Z(m,k).
- detector 410 receives the power estimates of the error signal and performs people presence detection based on the power estimates. As indicated above, as soon as a person enters room 204, the power estimate of the error signal will change from a relatively small level to a relatively large level.
- detection may be performed by comparing the power estimate of the error signal over time against a threshold. The threshold may be set, for example, a few dB above the steady-state power (i.e., the power corresponding to when adaptive filter 407 is in a converged state, such that the threshold is indicative of the steady-state or converged state of the adaptive filter), or, if a running estimate of the variance of the power signal is also computed, a fixed number of standard deviations above the steady-state power.
- a more sophisticated detector estimates a statistical model of the power signal and bases the decision on likelihood evaluations. It is desirable to design adaptive filter 407 for deep convergence instead of fast convergence. This is a well-known tradeoff that can be controlled in stochastic gradient descent algorithms like normalized least mean squares (NLMS) with a step size and/or a regularization parameter.
- a more robust method may be achieved by using an individual adaptive filter in each of the multiple frequency subbands X(m,1)-X(m,N) (i.e., by replicating adaptive filter 407 per subband) to produce an estimate of the echo for each frequency subband.
- an error signal is then produced for each frequency subband, based on the estimate of the echo for that subband and the corresponding one of the transformed microphone signal frequency subbands Y(m,k) from analysis filter bank 406.
- Power estimator 408 computes a power estimate of the error signal for each of the frequency subbands and then combines them all into a total power estimate across the frequency subbands.
- each power estimate may be a moving average of power estimates so that the total power estimate is a total of the moving average of power estimates.
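One way to realize the combined statistic described above is an exponentially weighted running power per subband, summed across subbands into a single total power estimate per frame. The smoothing factor alpha below is an assumed value, not one specified by the patent:

```python
import numpy as np

def total_error_power(Z, alpha=0.05):
    """Running power estimate of the error signal per subband (exponential
    moving average with assumed smoothing factor alpha), combined into a
    single total power estimate across subbands for each time frame."""
    n_frames = Z.shape[0]
    p = np.zeros(Z.shape[1])          # per-subband running power
    total = np.empty(n_frames)
    for m in range(n_frames):
        p = (1.0 - alpha) * p + alpha * np.abs(Z[m]) ** 2
        total[m] = p.sum()            # combined statistic for frame m
    return total
```

Summing after smoothing means a disturbance confined to one subband (e.g., Doppler leakage) still raises the total statistic, which is the motivation given for the per-subband design.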
- Ultrasonic signal x(n) may either be an ultrasonic signal that is dedicated to the task of detecting people presence, or an existing ultrasonic signal, such as an ultrasonic pairing signal, as long as endpoint 104 is able to generate and transmit the ultrasonic signal while the endpoint is asleep, i.e., in standby. Best performance may be achieved when ultrasonic signal x(n) is stationary and when there is minimal autocorrelation at the non-zero lags of the subband transmitted loudspeaker signal.
- the correlation matrix R X k of ultrasonic signal x(n) may be used to a certain degree to control the relative sensitivity of the people presence detection to the adaptive filter mismatch and the local sound from within the room.
- Turning to FIG. 5, there is a flowchart of an example method 500 of detecting people presence in a spatial region (e.g., room 204) using ultrasonic echo canceler 400, and using the detections to selectively wakeup or put to sleep endpoint 104.
- Echo canceler 400 and controller 308 are fully operational while endpoint 104 is asleep and awake.
- controller 308 generates an ultrasonic signal (e.g., x(n)).
- loudspeaker 116 transmits the ultrasonic signal into a spatial region (e.g., room 204).
- microphone 118 transduces sound, including ultrasonic sound that includes an echo of the transmitted ultrasonic signal, into a received ultrasonic signal (e.g., y(n)).
- analysis filter banks 404 and 406 transform the ultrasonic signal (e.g., x(n)) and the received ultrasonic microphone signal into respective time-frequency domains each having respective ultrasonic frequency subbands.
- differencer S computes an error signal, representative of an estimate of an echo-free received ultrasonic signal, based on the transformed ultrasonic signal and the transformed received ultrasonic signal. More specifically, differencer S subtracts an estimate of the echo signal in the time-frequency domain from the transformed received ultrasonic signal to produce the error signal. This is a closed-loop ultrasonic echo canceling operation performed in at least one ultrasonic frequency subband using adaptive filter 407, which produces the estimate of the echo signal, where the error signal is fed back to the adaptive filter.
- power estimator 408 computes power estimates of the error signal over time, e.g., the power estimator repetitively performs the power estimate computation as time progresses to produce a time sequence of power estimates.
- the power estimates may be a moving average of power estimates based on a current power estimate and one or more previous power estimates.
- detector 410 detects people presence in the spatial region (e.g., room 204) over time based on the power estimates of the error signal over time.
- detector 410 may detect a change in people presence in the spatial region over time based on a change in the power estimates (or a change in the moving average power estimates) of the error signal over time.
- processor 344 issues commands to selectively wakeup endpoint 104 or put the endpoint to sleep as appropriate based on the detections at 530.
- the detection of people presence as described above may activate only those components of endpoint 104, such as video cameras 112, required by the endpoint to aid in additional processing by processor 344, such as detecting faces and motion in room 204 based on video captured by the activated/awakened cameras.
- the people presence detection triggers face and motion detection by endpoint 104. If faces and/or motion are detected subsequent to people presence being detected, only then does processor 344 issue commands to fully wakeup endpoint 104.
- the face and motion detection is a confirmation that people have entered room 204, which may avoid unnecessary wakeups due to false (initial) detections of people presence. Any known or hereafter developed technique to perform face and motion detection may be used in the confirmation operation.
- detector 410 compares power estimates (or a moving average of power estimates computed using a rectangular window as in equation (6) or an exponentially decaying window as in equation (7)) to a power estimate (or moving average) threshold indicative of people presence over time.
- one way would be to set a detection threshold a few dB (e.g., 2-5 dB) above a steady-state power of the power estimates.
- the steady-state power occurs, or corresponds to, when adaptive filter 407 is in a steady state, i.e., a converged state.
- Another way would be to compute the mean and variance over time of the power estimates in steady-state, and to set the threshold automatically as a few standard deviations (e.g., 2-5) above the mean (steady-state power).
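Both threshold rules can be sketched as follows. The choice of n_std=3 (within the 2-5 range mentioned above) and the window of steady-state samples are illustrative assumptions:

```python
import numpy as np

def presence_threshold(steady_power, n_std=3.0):
    """Threshold set a few standard deviations above the mean steady-state
    (converged-filter) power, per the automatic rule described above."""
    steady_power = np.asarray(steady_power, dtype=float)
    return steady_power.mean() + n_std * steady_power.std()

def detect_presence(power, threshold):
    """People-presence decision per frame: power estimate above threshold."""
    return np.asarray(power, dtype=float) > threshold
```

With a converged filter the error power is small and tightly distributed, so even a modest multiple of its standard deviation separates the quiescent state from the jump caused by a changed room impulse response.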
- processor 344 issues commands to wakeup endpoint 104 if the endpoint was previously asleep.
- processor 344 issues commands to put endpoint 104 to sleep if the endpoint was previously awake.
- controller 308 may respectively issue wakeup and sleep commands to cameras 112, display 114, and/or portions of the controller that may be selectively awoken and put to sleep responsive to the commands. Also, timers may be used in operations 610 and 615 to ensure a certain level of hysteresis to dampen frequent switching between awake and sleep states of endpoint 104.
- operation 610 may require that the power estimate level remain above the threshold for a first predetermined time (e.g., on the order of several seconds, such as 3 or more seconds) measured from the time that the level reaches the threshold before issuing a command to wakeup endpoint 104, and operation 615 may require that the power estimate level remain below the threshold for a second predetermined time (e.g., also on the order of several seconds) measured from the time the level falls below the threshold before issuing a command to put endpoint 104 to sleep.
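The hold-time hysteresis can be sketched as a small state machine over per-frame threshold decisions. The hold lengths are counted in frames here and the values are illustrative; the text suggests holds on the order of several seconds:

```python
def wake_sleep_controller(above_threshold, wake_hold=3, sleep_hold=3):
    """Issue 'wakeup' only after wake_hold consecutive frames above the
    threshold, and 'sleep' only after sleep_hold consecutive frames below
    it, damping rapid switching between the awake and sleep states."""
    state, run, commands = "asleep", 0, []
    for hot in above_threshold:
        if state == "asleep":
            run = run + 1 if hot else 0      # count consecutive hot frames
            if run >= wake_hold:
                state, run = "awake", 0
                commands.append("wakeup")
        else:
            run = run + 1 if not hot else 0  # count consecutive cold frames
            if run >= sleep_hold:
                state, run = "asleep", 0
                commands.append("sleep")
    return commands
```

A brief blip above or below the threshold resets the run counter without issuing a command, which is exactly the damping behavior the timers are meant to provide.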
- embodiments presented herein perform the following operations: play/transmit a stationary ultrasonic signal from a loudspeaker; convert the sound picked up by a microphone (i.e., the microphone signal) and the ultrasonic signal driving the loudspeaker into the time-frequency domain; estimate an echo-free near-end signal (i.e., the error signal) at the microphone with an ultrasonic frequency subband adaptive filter (this is the ultrasonic echo canceling operation); compute an estimate of the power of the error signal (or a running estimate thereof); and detect people presence (or a change in people presence) from the estimated power, or from changes/variations in the estimated power.
- detections are used to wakeup a camera that was previously asleep, and also to cause additional processing to occur, such as detection of faces and motion using video captured by the awakened camera.
- a video conference endpoint is provided as defined in claim 7.
- a (non-transitory) processor readable medium is provided as defined in claim 9.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Physics & Mathematics (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Acoustics & Sound (AREA)
- General Physics & Mathematics (AREA)
- Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
Description
- The present disclosure relates to detecting the presence of people using ultrasonic sound.
- A video conference endpoint includes a camera and a microphone to capture video and audio of a participant in a meeting room, and a display to present video. While no participant is in the meeting room, the endpoint may be placed in a standby or sleep mode to conserve power. In standby, components of the endpoint, such as the camera and display, may be deactivated or turned off. When a participant initially enters the meeting room, the endpoint remains in standby until the participant manually wakes up the endpoint using a remote control or other touch device. If the participant is unfamiliar with the endpoint or if the touch device is not readily available, the simple act of manually activating the endpoint may frustrate the participant and diminish his or her experience.
- US 2010/0226487 A1 discloses a video conferencing endpoint which controls its power state using information received by environmental sensors. In a lower-power state, a microphone is active while a video camera is inactive. Upon detecting sound energy above a threshold level and above a threshold frequency, the system transitions to a higher power state by applying power to the video camera. Captured video information is analysed to detect motion. If motion is detected, the system automatically transitions to a yet higher power state.
- The invention is defined by the attached independent claims. Embodiments of the invention are defined by the dependent claims. Any embodiments described herein which do not fall within the scope of the claims are to be interpreted as examples.
-
-
FIG. 1 is a block diagram of a video conference (e.g., teleconference) environment in which embodiments to automatically detect the presence of people proximate a video conference endpoint in a room and selectively wakeup the video conference endpoint or put the endpoint to sleep may be implemented, according to an example embodiment. -
FIG. 2 is an illustration of a video conference endpoint deployed in a room, according to an example embodiment. -
FIG. 3 is a block diagram of a controller of the video conference endpoint, according to an example embodiment. -
FIG. 4 is a block diagram of an ultrasonic echo canceler implemented in the video conference endpoint to detect whether people are present in a room, according to an example. -
FIG. 5 is a flowchart of a method of detecting whether people are present in a room using the ultrasonic echo canceler of the video conference endpoint, and using the detections to selectively wakeup the endpoint or put the endpoint to sleep, according to an example -
FIG. 6 is a series of operations expanding on detection and wakeup/sleep control operations from the method ofFIG. 5 , according to an example. - With reference to
FIG. 1 , there is depicted a block diagram of an example video conference (e.g., teleconference)environment 100 in which embodiments to automatically detect the presence of people (i.e., "people presence") proximate a video conference endpoint (EP) and selectively wakeup the endpoint or put the endpoint to sleep may be implemented.Video conference environment 100 includesvideo conference endpoints 104 operated by local users/participants 106 (also referred to as "people" 106) and configured to establish audio-visual teleconference collaboration sessions with each other over acommunication network 110.Communication network 110 may include one or more wide area networks (WANs), such as the Internet, and one or more local area networks (LANs). Aconference server 102 may also be deployed to coordinate the routing of audio-video streams among the video conference endpoints. - Each
video conference endpoint 104 may include multiple video cameras (VC) 112, avideo display 114, a loudspeaker (LDSPKR) 116, and one or more microphones (MIC) 118.Endpoints 104 may be wired or wireless communication devices equipped with the aforementioned components, such as, but not limited to laptop and tablet computers, smartphones, etc. In a transmit direction,endpoints 104 capture audio/video from theirlocal participants 106 withmicrophone 118/VC 112, encode the captured audio/video into data packets, and transmit the data packets to other endpoints or to theconference server 102. In a receive direction,endpoints 104 decode audio/video from data packets received from theconference server 102 or other endpoints and present the audio/video to theirlocal participants 106 vialoudspeaker 116/display 114. - Referring now to
FIG. 2, there is depicted an illustration of video conference endpoint 104 deployed in a conference room 204 (depicted simplistically as an outline in FIG. 2), according to an embodiment. Video conference endpoint 104 includes video cameras and display 114. The cameras (collectively, "cameras 112") are each operated under control of endpoint 104 to capture video of participants 106 seated around a table 206 opposite from or facing (i.e., in front of) the cameras (and display 114). The combination of two center video cameras depicted in FIG. 2 is only one example of many possible camera combinations that may be used, including video cameras spaced apart from display 114, as would be appreciated by one of ordinary skill in the relevant arts having read the present description. As depicted in the example of FIG. 2, microphone 118 is positioned adjacent to, and centered along, a bottom side of display 114 (i.e., below the display) so as to receive audio from participants 106 in room 204, although other positions for the microphone are possible. - According to examples presented herein,
video conference endpoint 104 includes an ultrasonic echo canceler to detect whether participants are present (i.e., to detect "people presence") in room 204. Also, endpoint 104 may use people presence detection decisions from the ultrasonic echo canceler to transition the endpoint from sleep to awake or vice versa, as appropriate. The ultrasonic echo canceler is described below in connection with FIG. 4. - Reference is now made to
FIG. 3, which shows an example block diagram of a controller 308 of video conference endpoint 104 configured to perform techniques described herein. There are numerous possible configurations for controller 308 and FIG. 3 is meant to be an example. Controller 308 includes a network interface unit 342, a processor 344, and memory 348. The network interface (I/F) unit (NIU) 342 is, for example, an Ethernet card or other interface device that allows the controller 308 to communicate over communication network 110. Network I/F unit 342 may include wired and/or wireless connection capability. -
Processor 344 may include a collection of microcontrollers and/or microprocessors, for example, each configured to execute respective software instructions stored in the memory 348. The collection of microcontrollers may include, for example: a video controller to receive, send, and process video signals related to display 114 and video cameras 112; an audio processor to receive, send, and process audio signals (in human audible and ultrasonic frequency ranges) related to loudspeaker 116 and microphone array 118; and a high-level controller to provide overall control. Portions of memory 348 (and the instructions therein) may be integrated with processor 344. As used herein, the terms "audio" and "sound" are synonymous and interchangeable. -
Processor 344 may send pan, tilt, and zoom commands to video cameras 112 to control the cameras. Processor 344 may also send wakeup (i.e., activate) and sleep (i.e., deactivate) commands to video cameras 112. The camera wakeup command is used to wakeup cameras 112 to a fully powered-on operational state so they can capture video, while the camera sleep command is used to put the cameras to sleep to save power. In the sleep state, portions of cameras 112 are powered-off or deactivated and the cameras are unable to capture video. Processor 344 may similarly send wakeup and sleep commands to display 114 to wakeup the display or put the display to sleep. In another embodiment, processor 344 may selectively wakeup and put to sleep portions of controller 308 while the processor remains active. When any of cameras 112, display 114, and portions of controller 308 are asleep, endpoint 104 is said to be in standby or asleep (i.e., in the sleep mode). Conversely, when all of the components of endpoint 104 are awake and fully operational, endpoint 104 is said to be awake. Operation of the aforementioned components of endpoint 104 in sleep and awake modes, and the sleep and wakeup commands that processor 344 may issue to transition the components between the sleep and awake modes, are known to those of ordinary skill in the relevant arts. - The
memory 348 may comprise read only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical/tangible (e.g., non-transitory) memory storage devices. Thus, in general, the memory 348 may comprise one or more computer readable storage media (e.g., a memory device) encoded with software comprising computer executable instructions that, when executed by the processor 344, are operable to perform the operations described herein. For example, the memory 348 stores or is encoded with instructions for control logic 350 to perform operations described herein to (i) implement an ultrasonic echo canceler to detect a change in people presence, and (ii) wakeup endpoint 104 or put the endpoint to sleep based on the detected people presence. - In addition,
memory 348 stores data/information 356 used and generated by logic 350, including, but not limited to, adaptive filter coefficients, power estimate thresholds indicative of people presence, predetermined timeouts, and current operating modes of the various components of endpoint 104 (e.g., sleep and awake states), as described below. - With reference to
FIG. 4, there is depicted a block diagram of example ultrasonic echo canceler 400 implemented in endpoint 104 to detect people presence. Ultrasonic echo canceler 400 includes loudspeaker 116, microphone 118, analysis filter banks 404 and 406, an adaptive filter 407 associated with adaptive filter coefficients, a power estimator 408, and a people presence detector 410. Analysis filter banks 404 and 406, adaptive filter 407, power estimator 408, and detector 410 represent ultrasonic sound signal processing modules that may be implemented in controller 308. As will be described in detail below, ultrasonic echo canceler 400 detects people presence in room 204 (i.e., when people are and are not present in the room), and controller 308 uses the people presence indications to selectively wakeup endpoint 104 when people are present (e.g., have entered the room) or put the endpoint to sleep when people are not present (e.g., have left the room), as indicated by the echo canceler. Echo canceler 400 and controller 308 perform the aforementioned operations automatically, i.e., without manual intervention. Also, echo canceler 400 and controller 308 are operational to perform the operations described herein while endpoint 104 (or components thereof) is both awake and asleep. -
Ultrasonic echo canceler 400 operates as follows. Controller 308 generates an ultrasonic signal x(n), where n is a time index that increases with time, and provides the ultrasonic signal x(n) to an input of loudspeaker 116. Loudspeaker 116 transmits ultrasonic signal x(n) into a spatial region (e.g., room 204). Ultrasonic signal x(n) has a frequency in an audio frequency range that is generally beyond the frequency range of human hearing, but which can be transmitted from most loudspeakers and picked up by most microphones. This frequency range is generally accepted as approximately 20 kHz and above; however, embodiments described herein may also operate at frequencies below 20 kHz (e.g., 19 kHz) that most people would not be able to hear. The transmitted ultrasonic signal bounces around in room 204 before it is received, and thereby picked up, by microphone 118 via an echo path 420. Microphone 118 transduces sound received at the microphone into a microphone signal y(n), comprising ultrasonic echo u(n), local sound v(n), and background noise w(n). Microphone 118 provides microphone signal y(n) to analysis filter bank 406, which transforms the microphone signal y(n) into a time-frequency domain including multiple ultrasonic frequency subbands Y(m,1)-Y(m,N) spanning an ultrasonic frequency range. Also, analysis filter bank 404 transforms ultrasonic signal x(n) into a time-frequency domain including multiple ultrasonic frequency subbands X(m,1)-X(m,N) spanning the same ultrasonic frequency range. - In a k'th one of the ultrasonic frequency subbands X(m,k),
adaptive filter 407 generates an estimate Û(m,k) of the subband echo signal U(m,k), where m denotes the time frame index. Differencer S subtracts the echo estimate Û(m,k) from the subband microphone signal Y(m,k) output by analysis filter bank 406 to form an error (signal) Z(m,k) that is fed back into adaptive filter 407. The adaptive filter coefficients of adaptive filter 407 are adjusted responsive to the fed-back error signal. Power estimator 408 computes a running estimate of the mean squared error (power) E|Z(m,k)|² of the error signal Z(m,k), and detector 410 detects a change in people presence, e.g., when somebody walks into room 204 where nobody has been for a while, based on the mean squared power. - The following is an explanation of how the mean squared error E|Z(m,k)|² is a good indicator of whether someone enters the
room 204. Let X_k(m) = [X(m,k), X(m-1,k), ..., X(m-M+1,k)]^T denote a delay line for adaptive filter 407, where M denotes the number of adaptive filter coefficients employed in the adaptive filter. Furthermore, let Ĥ_k(m) denote the vector of the M adaptive filter coefficients. The echo estimate can then be written as equation (1):

Û(m,k) = Ĥ_k(m)^H X_k(m),     (1)

where (·)^H denotes the Hermitian operator. The time-frequency domain transformation of the microphone signal y(n) is given by equation (2):

Y(m,k) = H_k(m)^H X_k(m) + V(m,k) + W(m,k),     (2)

where H_k(m) is the unknown optimal linear filter, and where it is assumed that the error introduced by analysis filter bank 406 is negligible. The error is then given by the following equation (3):

Z(m,k) = Y(m,k) - Û(m,k) = (H_k(m) - Ĥ_k(m))^H X_k(m) + V(m,k) + W(m,k),     (3)

and the mean squared error can be written as the following equation (4):

E|Z(m,k)|² = E|(H_k(m) - Ĥ_k(m))^H X_k(m)|² + E|V(m,k)|² + E|W(m,k)|²,     (4)

where the echo, the local sound, and the background noise are assumed to be mutually uncorrelated, and where the correlation matrix of the subband loudspeaker signal is E[X_k(m) X_k(m)^H] = R_Xk. Then the following relationship applies:

E|Z(m,k)|² = ||H_k(m) - Ĥ_k(m)||²_RXk + E|V(m,k)|² + E|W(m,k)|²,     (5)

where the first term is the R_Xk-norm of the divergence between the optimal and estimated filter coefficients, the second term is the power of the local sound signal (v(n)), and the third term is the power of the background noise (w(n)). It is assumed in the following that the power of the background noise is stationary and time-invariant. - When nobody is in room 204, and nobody has been in the room for a while, the acoustic room impulse response will be approximately static (no change) and adaptive filter 407 will be in a well converged state and provide a good estimate Û(m,k) of subband echo signal U(m,k). Therefore, the first term of equation (5) is approximately zero and we have that E|Z(m,k)|² ≈ E|V(m,k)|² + E|W(m,k)|². - When somebody enters
room 204, the acoustic room impulse response will change abruptly and adaptive filter 407 will no longer be in a converged state; it will therefore provide a poor estimate Û(m,k) of subband echo signal U(m,k). Also, as long as there is movement in the room, adaptive filter 407 attempts to track the continuously changing impulse response and may never achieve the same depth of convergence. Furthermore, movement in the room may cause Doppler shift, so that some of the energy in one frequency subband leaks over to a neighboring subband. The Doppler effect can result in both a changed impulse response for a subband and also a mismatch between the audio content in the loudspeaker subband output from analysis filter bank 404 and the microphone subband output from analysis filter bank 406. Both of these effects lead to residual echo, and thus power estimator 408 estimates the power of error signal Z(m,k). To do this, either a rectangular window of length L may be used, as in equation (6):

P_Z(m,k) = (1/L) Σ_{l=0}^{L-1} |Z(m-l,k)|²,     (6)

or an exponential recursive weighting may be used, as in equation (7):

P_Z(m,k) = α P_Z(m-1,k) + (1-α) |Z(m,k)|²,     (7)

where α is a forgetting factor in the range [0, 1]. - As mentioned above,
detector 410 receives the power estimates of the error signal and performs people presence detection based on the power estimates. As indicated above, as soon as a person enters room 204, the power estimate of the error signal will change from a relatively small level to a relatively large level. Thus, detection may be performed by comparing the power estimate of the error signal over time with a threshold that may be set, for example, to a few dBs above the steady-state power (e.g., the steady-state power is the power corresponding to when adaptive filter 407 is in a converged state, such that the threshold is indicative of the steady-state or converged state of the adaptive filter), or, if a running estimate of the variance of the power signal is also computed, to a fixed number of standard deviations above the steady-state power. Another example estimates a statistical model of the power signal and bases the decision on likelihood evaluations. It is desirable to design adaptive filter 407 for deep convergence instead of fast convergence. This is a well-known tradeoff that can be controlled in stochastic gradient descent algorithms, like normalized least mean squares (NLMS), with a step size and/or a regularization parameter. - With a single adaptive filter in one narrow ultrasonic frequency subband, e.g., k, as depicted in
FIG. 4, the detection performance may be degraded due to a notch in the frequency response of loudspeaker 116, a notch in the frequency response of microphone 118, or an absorbent material in room 204 within that particular frequency subband. Therefore, according to an embodiment of the invention, a more robust method may be achieved using an individual adaptive filter in each of multiple frequency subbands X(m,1)-X(m,N) (i.e., replicating adaptive filter 407 for each frequency subband) to produce an estimate of the echo for each frequency subband. In that case, an error signal is generated corresponding to each of the frequency subbands from analysis filter bank 404, and the error signal for each frequency subband is produced based on the estimate of the echo for that frequency subband and a corresponding one of the transformed microphone signal frequency subbands Y(m,k) from analysis filter bank 406. Power estimator 408 computes a power estimate of the error signal for each of the frequency subbands and then combines them all into a total power estimate across the frequency subbands. For example, the total power estimate across the subbands for a given frame may be computed according to: P_Z(m) = α P_Z(m-1) + (1-α) Σ_k |Z(m,k)|², where Σ_k indicates the sum over all subbands k that are in use. Alternatively, the total power estimate across the subbands may be computed according to: P_Z(m) = Σ_k P_Z(m,k). In an embodiment, each power estimate may be a moving average of power estimates, so that the total power estimate is a total of the moving average power estimates. - Ultrasonic signal x(n) may either be an ultrasonic signal that is dedicated to the task of detecting people presence, or an existing ultrasonic signal, such as an ultrasonic pairing signal, as long as
endpoint 104 is able to generate and transmit the ultrasonic signal while the endpoint is asleep, i.e., in standby. Best performance may be achieved when ultrasonic signal x(n) is stationary and when there is minimal autocorrelation at the non-zero lags of the subband transmitted loudspeaker signal. The correlation matrix R_Xk of ultrasonic signal x(n) may be used to a certain degree to control the relative sensitivity of the people presence detection to the adaptive filter mismatch and to the local sound from within the room. - With reference to
FIG. 5, there is a flowchart of an example method 500 of detecting people presence in a spatial region (e.g., room 204) using ultrasonic echo canceler 400, and using the detections to selectively wakeup or put to sleep endpoint 104. Echo canceler 400 and controller 308 are fully operational while endpoint 104 is asleep and awake. - At 505,
processor 344 generates an ultrasonic signal (e.g., x(n)). - At 510,
loudspeaker 116 transmits the ultrasonic signal into a spatial region (e.g., room 204). - At 515,
microphone 118 transduces sound, including ultrasonic sound that includes an echo of the transmitted ultrasonic signal, into a received ultrasonic signal (e.g., y(n)). - At 520,
analysis filter banks 404 and 406 transform the ultrasonic signal and the received ultrasonic signal into respective time-frequency domains that cover respective ultrasonic frequency subbands. - At 525, differencer S computes an error signal, representative of an estimate of an echo-free received ultrasonic signal, based on the transformed ultrasonic signal and the transformed received ultrasonic signal. More specifically,
the differencer subtracts an estimate of the echo signal in the time-frequency domain from the transformed received ultrasonic signal to produce the error signal. This is a closed-loop ultrasonic echo canceling operation performed in at least one ultrasonic frequency subband using adaptive filter 407, which produces the estimate of the echo signal, where the error signal is fed back to the adaptive filter. - At 530,
power estimator 408 computes power estimates of the error signal over time, e.g., the power estimator repetitively performs the power estimate computation as time progresses to produce a time sequence of power estimates. The power estimates may be a moving average of power estimates based on a current power estimate and one or more previous power estimates. - At 535,
detector 410 detects people presence in the spatial region (e.g., room 204) over time based on the power estimates of the error signal over time. In an example, detector 410 may detect a change in people presence in the spatial region over time based on a change in the power estimates (or a change in the moving average power estimates) of the error signal over time. - At 540,
processor 344 issues commands to selectively wakeup endpoint 104 or put the endpoint to sleep as appropriate based on the detections at 535. - According to the invention, the detection of people presence as described above may activate only those components of
endpoint 104, comprising video cameras 112, required by the endpoint to aid in additional processing by processor 344, comprising detecting faces and motion in room 204 based on video captured by the activated/awakened cameras. In other words, the people presence detection triggers face and motion detection by endpoint 104. If faces and/or motion are detected subsequent to people presence being detected, only then does processor 344 issue commands to fully wakeup endpoint 104. Thus, the face and motion detection is a confirmation that people have entered room 204, which may avoid unnecessary wakeups due to false (initial) detections of people presence. Any known or hereafter developed technique to perform face and motion detection may be used in the confirmation operation. - With reference to
FIG. 6, there is a flowchart of operations 600, which expand on the detection and wakeup/sleep control operations of method 500. - To detect people presence (or a change in people presence), at 605,
detector 410 compares power estimates (or a moving average of power estimates computed using a rectangular window as in equation (6) or an exponentially decaying window as in equation (7)) to a power estimate (or moving average) threshold indicative of people presence over time. One way to detect people presence is to set a detection threshold a few dBs (e.g., 2-5 dB) above a steady-state power of the power estimates. The steady-state power is the power estimate observed when adaptive filter 407 is in a steady state, i.e., a converged state. Another way is to compute the mean and variance over time of the power estimates in steady state, and to set the threshold automatically as a few standard deviations (e.g., 2-5) above the mean (steady-state power). These methods for detection apply both to the case when a single subband is used and to the case when multiple subbands are used. - At 610, if the power estimates transition from a first level that is less than the power estimate threshold to a second level that is greater than or equal to the power estimate threshold,
processor 344 issues commands to wakeup endpoint 104 if the endpoint was previously asleep. - At 615, if the power estimates transition from a first level that is greater than or equal to the threshold to a second level that is less than the threshold,
processor 344 issues commands to put endpoint 104 to sleep if the endpoint was previously awake. - In
operations 610 and 615, controller 308 may respectively issue wakeup and sleep commands to cameras 112, display 114, and/or portions of the controller that may be selectively awoken and put to sleep responsive to the commands. Also, timers may be used in operations 610 and 615 to avoid spurious wakeups and sleeps of endpoint 104. For example, operation 610 may require that the power estimate level remain above the threshold for a first predetermined time (e.g., on the order of several seconds, such as 3 or more seconds), measured from the time that the level reaches the threshold, before issuing a command to wakeup endpoint 104, and operation 615 may require that the power estimate level remain below the threshold for a second predetermined time (e.g., also on the order of several seconds), measured from the time the level falls below the threshold, before issuing a command to put endpoint 104 to sleep. - In summary, embodiments presented herein perform the following operations: play/transmit a stationary ultrasonic signal from a loudspeaker; convert sound picked up by a microphone (i.e., a microphone signal) and the ultrasonic signal from the loudspeaker into the time-frequency domain; estimate an echo-free near-end signal (i.e., error signal) at the microphone with an ultrasonic frequency subband adaptive filter (this is an ultrasonic echo canceling operation); compute an estimate of the power of the error signal (or a running estimate thereof); and detect people presence (or a change in people presence) from the estimated power (or changes/variations in the estimated power).
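The summarized pipeline can be illustrated end-to-end in a short simulation. The following sketch is not the patented implementation: it models a single ultrasonic subband directly (skipping the analysis filter banks), drives a complex NLMS adaptive filter with a white subband signal, smooths the error power with the exponential recursion of equation (7), and flags presence when the smoothed power rises well above its converged steady-state level. The filter length, step size, noise level, forgetting factor, and threshold margin are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu, alpha = 8, 0.5, 0.95            # filter length, NLMS step size, forgetting factor

# Unknown subband echo path H_k(m): static while the room is empty,
# then changed abruptly when a "person" enters at frame 3000.
H_empty = rng.standard_normal(M) + 1j * rng.standard_normal(M)
H_person = H_empty + 0.5 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))

H_hat = np.zeros(M, dtype=complex)     # adaptive filter coefficients
X_line = np.zeros(M, dtype=complex)    # delay line X_k(m)
P_z, powers = 0.0, []
for m in range(4000):
    H_true = H_empty if m < 3000 else H_person
    x = rng.standard_normal() + 1j * rng.standard_normal()   # subband signal X(m,k)
    X_line = np.roll(X_line, 1)
    X_line[0] = x
    w = 1e-3 * (rng.standard_normal() + 1j * rng.standard_normal())  # noise W(m,k)
    y = np.vdot(H_true, X_line) + w    # microphone subband Y(m,k) = echo + noise
    z = y - np.vdot(H_hat, X_line)     # error Z(m,k) = Y - echo estimate
    # NLMS coefficient update driven by the fed-back error (regularized).
    H_hat += mu * np.conj(z) * X_line / (np.vdot(X_line, X_line).real + 1e-6)
    P_z = alpha * P_z + (1 - alpha) * abs(z) ** 2            # equation (7)
    powers.append(P_z)

steady = np.mean(powers[2000:3000])    # converged, empty-room error power
threshold = 100.0 * steady             # ~20 dB above steady state (illustrative margin)
print(powers[2999] < threshold, powers[3050] > threshold)
```

While the echo path is static, the filter converges deeply and the smoothed error power sits near the noise floor; the abrupt path change at frame 3000 produces a residual-echo burst that exceeds the threshold, which is the detection event the description relies on.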
- According to the invention, detections are used to wakeup a camera that was previously asleep, and also to cause additional processing to occur, comprising detection of faces and motion using video captured by the awakened camera.
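The timer-based guards of operations 610 and 615 amount to a small debouncing state machine. The sketch below is an illustrative stand-in for that logic (the frame-count hold times, state names, and function name are assumptions, not taken from the description): a wakeup or sleep command is issued only after the detector output remains on the new side of the threshold for a required number of consecutive frames.

```python
def wake_sleep_controller(above_threshold, wake_hold=3, sleep_hold=3):
    """Issue debounced wakeup/sleep commands from per-frame detector output.

    above_threshold: iterable of booleans, True when the error-power estimate
    exceeds the presence threshold in that frame. The hold counts stand in for
    the predetermined timeouts of operations 610 and 615.
    """
    state, run, commands = "asleep", 0, []
    for m, above in enumerate(above_threshold):
        wants_wakeup = bool(above)
        # Count only frames where the detector disagrees with the current state.
        if (state == "asleep") == wants_wakeup:
            run += 1
            if run >= (wake_hold if wants_wakeup else sleep_hold):
                state = "awake" if wants_wakeup else "asleep"
                commands.append((m, "wakeup" if wants_wakeup else "sleep"))
                run = 0
        else:
            run = 0
    return commands

# A brief 2-frame blip is ignored; sustained presence (3+ frames) triggers wakeup,
# and sustained absence afterwards triggers sleep.
print(wake_sleep_controller([1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]))
# → [(7, 'wakeup'), (11, 'sleep')]
```

Requiring consecutive frames on the new side of the threshold before acting is one simple way to realize the "remain above/below the threshold for a predetermined time" behavior described above, trading a few frames of latency for immunity to spurious detections.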
- In summary, in one form, a method is provided as defined in
claim 1. - In another form, a video conference endpoint is provided as defined in claim 7.
- In yet another form, a (non-transitory) processor readable medium is provided as defined in claim 9.
- The above description is intended by way of example only. Various modifications and structural changes may be made therein without departing from the scope of the concepts described herein, the invention being defined solely by the scope of the claims.
Claims (12)
- A method performed by a video conference endpoint, the method comprising:
transmitting (510) an ultrasonic signal into a spatial region;
transducing (515) ultrasonic sound, including an echo of the transmitted ultrasonic signal, received from the spatial region at a microphone into a received ultrasonic signal;
transforming (520) the ultrasonic signal and the received ultrasonic signal into respective time-frequency domains that cover respective ultrasonic frequency subbands;
computing (525) an error signal, representative of an estimate of an echo-free received ultrasonic signal, based on the transformed ultrasonic signal and the transformed received ultrasonic signal by:
adaptively filtering multiple ultrasonic frequency subbands of the transformed ultrasonic signal based on a set of adaptive filter coefficients adjusted responsive to the error signal individually to produce a respective estimate of the echo for each of the ultrasonic frequency subbands;
differencing each echo estimate and a corresponding one of multiple ultrasonic frequency subbands of the transformed received ultrasonic signal to produce the error signal for each of the ultrasonic frequency subbands; and
feeding-back the error signal to the adaptively filtering operation;
repetitively computing (530) a total power estimate based on the error signals across the ultrasonic frequency subbands over time;
detecting (535) a change in people presence in the spatial region over time based on a change in the total power estimates of the error signal computed across the ultrasonic frequency subbands over time;
if the detecting indicates that people are present in the spatial region, issuing (540) a command to wakeup a video camera that was previously asleep so that the video camera is able to capture video of at least a portion of the spatial region;
performing face and motion detection based on video of the spatial region captured by the video camera; and
issuing one or more commands to wakeup the video conference endpoint only if the face and motion detection confirms the presence of people in the spatial region.
- The method of claim 1, wherein:
repetitively computing the power estimate includes computing a moving average of power estimates over time, wherein the moving average is based on a current power estimate of the error signal and one or more previous power estimates of the error signal; and
the detecting includes detecting a change in people presence over time based on a change in the moving average over time.
- The method of claim 2, wherein the detecting includes:
comparing the moving average of power estimates to a moving average power threshold indicative of a change in people presence in the spatial region; and
if the comparing indicates that the moving average of power estimates has changed from a first level below the threshold to a second level equal to or greater than the threshold, then declaring that people are present in the spatial region.
- The method of claim 1, wherein the detecting includes:
comparing the total power estimate to a power estimate threshold indicative of a change in people presence in the spatial region; and
if the comparing indicates that the total power estimate has changed from a first level below the threshold to a second level equal to or greater than the threshold, then declaring that people are present in the spatial region.
- A video conference endpoint (104) comprising:
a loudspeaker (116) configured to transmit (510) an ultrasonic signal into a spatial region;
a video camera (112) arranged to be able to capture video of at least a portion of the spatial region;
a microphone (118) configured to transduce (515) ultrasonic sound, including an echo of the transmitted ultrasonic signal, received from the spatial region into a received ultrasonic signal; and
a processor coupled to the loudspeaker (116), the video camera (112), and the microphone (118), wherein the processor is configured to:
transform (520) the ultrasonic signal and the received ultrasonic signal into respective time-frequency domains that cover respective ultrasonic frequency subbands;
compute (525) an error signal, representative of an estimate of an echo-free received ultrasonic signal, based on the transformed ultrasonic signal and the transformed received ultrasonic signal by:
adaptively filtering multiple ultrasonic frequency subbands of the transformed ultrasonic signal based on a set of adaptive filter coefficients adjusted responsive to the error signal individually to produce a respective estimate of the echo for each of the ultrasonic frequency subbands;
differencing each echo estimate and a corresponding one of multiple ultrasonic frequency subbands of the transformed received ultrasonic signal to produce the error signal for each of the ultrasonic frequency subbands; and
feeding-back the error signal to the adaptively filtering operation;
repetitively compute (530) a total power estimate based on the error signals across the ultrasonic frequency subbands over time;
detect (535) a change in people presence in the spatial region over time based on a change in the total power estimates of the error signal computed across the ultrasonic frequency subbands over time;
if the detect operation indicates people are present in the spatial region, issue (540) a wakeup command to the video camera to wakeup if the video camera was previously asleep so that the video camera is able to capture video of at least a portion of the spatial region;
perform face and motion detection based on video of the spatial region captured by the video camera; and
issue one or more commands to wakeup the video conference endpoint only if the face and motion detection confirms the presence of people in the spatial region.
- The apparatus of claim 5, wherein the processor is further configured to repetitively compute the total power estimate by:
computing a moving average of power estimates over time, wherein the moving average is based on a current power estimate of the error signal and one or more previous power estimates of the error signal; and
the detect operation includes detecting a change in people presence over time based on a change in the moving average over time.
- The apparatus of claim 6, wherein the detecting includes:
comparing the moving average of power estimates to a moving average power threshold indicative of a change in people presence in the spatial region; and
if the comparing indicates that the moving average of power estimates has changed from a first level below the threshold to a second level equal to or greater than the threshold, then declaring that people are present in the spatial region.
- The apparatus of claim 5, wherein the detect operation includes:
comparing the total power estimate to a power estimate threshold indicative of a change in people presence in the spatial region; and
if the comparing indicates that the total power estimate has changed from a first level below the threshold to a second level equal to or greater than the threshold, then declaring that people are present in the spatial region.
- A non-transitory processor readable medium storing instructions that, when executed by a processor, cause the processor to:
cause a loudspeaker (116) to transmit (510) an ultrasonic signal into a spatial region;
access a received ultrasonic signal representative of transduced ultrasonic sound, including an echo of the transmitted ultrasonic signal, received from the spatial region at a microphone (118);
transform (520) the ultrasonic signal and the received ultrasonic signal into respective time-frequency domains that cover respective ultrasonic frequency subbands;
compute (525) an error signal, representative of an estimate of an echo-free received ultrasonic signal, based on the transformed ultrasonic signal and the transformed received ultrasonic signal by:
adaptively filtering multiple ultrasonic frequency subbands of the transformed ultrasonic signal based on a set of adaptive filter coefficients adjusted responsive to the error signal individually to produce a respective estimate of the echo for each ultrasonic frequency subband;
differencing each echo estimate and a corresponding one of multiple ultrasonic frequency subbands of the transformed received ultrasonic signal to produce the error signal for each ultrasonic frequency subband; and
feeding-back the error signal to the adaptively filtering operation;
repetitively compute (530) a total power estimate based on the error signals across the ultrasonic frequency subbands over time;
detect (535) a change in people presence in the spatial region over time based on a change in the total power estimates of the error signal computed across the ultrasonic frequency subbands over time;
if the detect operation indicates that people are present in the spatial region, issue (540) a command to wakeup a video camera that was previously asleep so that the video camera is able to capture video of at least a portion of the spatial region;
perform face and motion detection based on video of the spatial region captured by the video camera; and
issue one or more commands to wakeup the video conference endpoint only if the face and motion detection confirms the presence of people in the spatial region.
- The processor readable medium of claim 9, wherein the instructions further cause the processor to:
repetitively compute the power estimate by computing a moving average of power estimates over time, wherein the moving average is based on a current power estimate of the error signal and one or more previous power estimates of the error signal; and
detect a change in people presence over time based on a change in the moving average over time.
- The processor readable medium of claim 10, wherein the instructions further cause the processor to:
compare the moving average of power estimates to a moving average power threshold indicative of a change in people presence in the spatial region; and
if the comparing indicates that the moving average of power estimates has changed from a first level below the threshold to a second level equal to or greater than the threshold, then declare that people are present in the spatial region.
- The processor readable medium of claim 9, wherein the instructions further cause the processor to:
compare the total power estimate to a power estimate threshold indicative of a change in people presence in the spatial region; and
if the comparing indicates that the total power estimate has changed from a first level below the threshold to a second level equal to or greater than the threshold, then declare that people are present in the spatial region.
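The detection logic of claims 10-12 (total error power per frame, a moving average over current and previous power estimates, and a below-to-above threshold transition) might look like the sketch below. The class name, the exponential-moving-average form, and the `alpha` and `threshold` values are illustrative assumptions; the claims specify only the threshold-crossing behavior.

```python
import numpy as np

class PresenceDetector:
    """Declares people presence when the smoothed total error-signal
    power crosses a threshold from below (illustrative sketch)."""

    def __init__(self, threshold, alpha=0.1):
        self.threshold = threshold  # moving-average power threshold
        self.alpha = alpha          # weight of the current power estimate
        self.avg = 0.0              # moving average of power estimates
        self.present = False

    def update(self, error_frame):
        # Total power estimate across all ultrasonic subbands for this frame.
        total_power = float(np.sum(np.abs(error_frame) ** 2))
        # Moving average of current and previous power estimates (claim 10).
        self.avg = self.alpha * total_power + (1.0 - self.alpha) * self.avg
        # Transition from below to at/above threshold declares presence
        # (claims 11-12); this is where a camera wake-up would be issued.
        if not self.present and self.avg >= self.threshold:
            self.present = True
        elif self.present and self.avg < self.threshold:
            self.present = False
        return self.present
```

Smoothing over previous estimates keeps a single noisy frame from toggling the camera; the face- and motion-detection confirmation in claim 9 then gates the full endpoint wake-up.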
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/662,691 US9319633B1 (en) | 2015-03-19 | 2015-03-19 | Ultrasonic echo canceler-based technique to detect participant presence at a video conference endpoint |
PCT/US2016/022422 WO2016149245A1 (en) | 2015-03-19 | 2016-03-15 | Ultrasonic echo canceler-based technique to detect participant presence at a video conference endpoint |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3271744A1 (en) | 2018-01-24 |
EP3271744B1 (en) | 2020-08-26 |
Family
ID=55642880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP16713210.9A Active EP3271744B1 (en) | 2015-03-19 | 2016-03-15 | Ultrasonic echo canceler-based technique to detect participant presence at a video conference endpoint |
Country Status (3)
Country | Link |
---|---|
US (1) | US9319633B1 (en) |
EP (1) | EP3271744B1 (en) |
WO (1) | WO2016149245A1 (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2989986B1 (en) * | 2014-09-01 | 2019-12-18 | Samsung Medison Co., Ltd. | Ultrasound diagnosis apparatus and method of operating the same |
US9838646B2 (en) | 2015-09-24 | 2017-12-05 | Cisco Technology, Inc. | Attenuation of loudspeaker in microphone array |
US10024712B2 (en) * | 2016-04-19 | 2018-07-17 | Harman International Industries, Incorporated | Acoustic presence detector |
US10473751B2 (en) | 2017-04-25 | 2019-11-12 | Cisco Technology, Inc. | Audio based motion detection |
US10141973B1 (en) | 2017-06-23 | 2018-11-27 | Cisco Technology, Inc. | Endpoint proximity pairing using acoustic spread spectrum token exchange and ranging information |
CN109429136A (en) * | 2017-08-31 | 2019-03-05 | Tainan University of Technology | Mute ultralow frequency sound wave sleeping system and device |
CN107785027B (en) * | 2017-10-31 | 2020-02-14 | Vivo Mobile Communication Co., Ltd. | Audio processing method and electronic equipment |
CN108093350B (en) * | 2017-12-21 | 2020-12-15 | Guangdong Genius Technology Co., Ltd. | Microphone control method and microphone |
US10267912B1 (en) | 2018-05-16 | 2019-04-23 | Cisco Technology, Inc. | Audio based motion detection in shared spaces using statistical prediction |
US10297266B1 (en) | 2018-06-15 | 2019-05-21 | Cisco Technology, Inc. | Adaptive noise cancellation for multiple audio endpoints in a shared space |
GB2587231B (en) | 2019-09-20 | 2024-04-17 | Neatframe Ltd | Ultrasonic-based person detection system and method |
US11395091B2 (en) * | 2020-07-02 | 2022-07-19 | Cisco Technology, Inc. | Motion detection triggered wake-up for collaboration endpoints |
US10992905B1 (en) * | 2020-07-02 | 2021-04-27 | Cisco Technology, Inc. | Motion detection triggered wake-up for collaboration endpoints |
US12044810B2 (en) | 2021-12-28 | 2024-07-23 | Samsung Electronics Co., Ltd. | On-device user presence detection using low power acoustics in the presence of multi-path sound propagation |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE3938428C1 (en) * | 1989-11-18 | 1991-04-18 | Standard Elektrik Lorenz Ag, 7000 Stuttgart, De | |
JPH07146988A (en) * | 1993-11-24 | 1995-06-06 | Nippon Telegr & Teleph Corp <Ntt> | Body movement detecting device |
US20090046538A1 (en) * | 1995-06-07 | 2009-02-19 | Automotive Technologies International, Inc. | Apparatus and method for Determining Presence of Objects in a Vehicle |
US6108028A (en) * | 1998-11-02 | 2000-08-22 | Intel Corporation | Method of activating and deactivating a screen saver in a video conferencing system |
US6374145B1 (en) | 1998-12-14 | 2002-04-16 | Mark Lignoul | Proximity sensor for screen saver and password delay |
US20080224863A1 (en) * | 2005-10-07 | 2008-09-18 | Harry Bachmann | Method for Monitoring a Room and an Apparatus For Carrying Out the Method |
US20100226487A1 (en) * | 2009-03-09 | 2010-09-09 | Polycom, Inc. | Method & apparatus for controlling the state of a communication system |
US8842153B2 (en) * | 2010-04-27 | 2014-09-23 | Lifesize Communications, Inc. | Automatically customizing a conferencing system based on proximity of a participant |
CN102893175B (en) | 2010-05-20 | 2014-10-29 | 皇家飞利浦电子股份有限公司 | Distance estimation using sound signals |
US8907929B2 (en) | 2010-06-29 | 2014-12-09 | Qualcomm Incorporated | Touchless sensing and gesture recognition using continuous wave ultrasound signals |
US9313454B2 (en) * | 2011-06-07 | 2016-04-12 | Intel Corporation | Automated privacy adjustments to video conferencing streams |
US9363386B2 (en) | 2011-11-23 | 2016-06-07 | Qualcomm Incorporated | Acoustic echo cancellation based on ultrasound motion detection |
WO2013187869A1 (en) * | 2012-06-11 | 2013-12-19 | Intel Corporation | Providing spontaneous connection and interaction between local and remote interaction devices |
- 2015-03-19: US application US 14/662,691 filed (US9319633B1, status Active)
- 2016-03-15: PCT application PCT/US2016/022422 filed (WO2016149245A1, Application Filing)
- 2016-03-15: EP application EP16713210.9 filed (EP3271744B1, status Active)
Non-Patent Citations (1)
Title |
---|
None * |
Also Published As
Publication number | Publication date |
---|---|
US9319633B1 (en) | 2016-04-19 |
WO2016149245A1 (en) | 2016-09-22 |
EP3271744A1 (en) | 2018-01-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3271744B1 (en) | Ultrasonic echo canceler-based technique to detect participant presence at a video conference endpoint | |
EP2783504B1 (en) | Acoustic echo cancellation based on ultrasound motion detection | |
US10473751B2 (en) | Audio based motion detection | |
EP3257236B1 (en) | Nearby talker obscuring, duplicate dialogue amelioration and automatic muting of acoustically proximate participants | |
EP2692123B1 (en) | Determining the distance and/or acoustic quality between a mobile device and a base unit | |
US10267912B1 (en) | Audio based motion detection in shared spaces using statistical prediction | |
US10924872B2 (en) | Auxiliary signal for detecting microphone impairment | |
EP3061242B1 (en) | Acoustic echo control for automated speaker tracking systems | |
US9462552B1 (en) | Adaptive power control | |
US8103011B2 (en) | Signal detection using multiple detectors | |
KR102409536B1 (en) | Event detection for playback management on audio devices | |
Enzner | Bayesian inference model for applications of time-varying acoustic system identification | |
EP2700161B1 (en) | Processing audio signals | |
US20190132452A1 (en) | Acoustic echo cancellation based sub band domain active speaker detection for audio and video conferencing applications | |
JP2022542962A (en) | Acoustic Echo Cancellation Control for Distributed Audio Devices | |
KR20170017381A (en) | Terminal and method for operaing terminal | |
US9225937B2 (en) | Ultrasound pairing signal control in a teleconferencing system | |
Favrot et al. | Adaptive equalizer for acoustic feedback control | |
EP3332558B1 (en) | Event detection for playback management in an audio device | |
US20230421952A1 (en) | Subband domain acoustic echo canceller based acoustic state estimator | |
Ahgren et al. | A study of doubletalk detection performance in the presence of acoustic echo path changes | |
KR20230087525A (en) | Method and device for variable pitch echo cancellation | |
CN116156256A (en) | Equipment state control method, device, equipment and medium | |
Fozunbal et al. | A decision-making framework for acoustic echo cancellation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20170718 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
AX | Request for extension of the european patent |
Extension state: BA ME |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
17Q | First examination report despatched |
Effective date: 20190611 |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
INTG | Intention to grant announced |
Effective date: 20200320 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE PATENT HAS BEEN GRANTED |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: EP |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R096 Ref document number: 602016042695 Country of ref document: DE |
|
REG | Reference to a national code |
Ref country code: AT Ref legal event code: REF Ref document number: 1306887 Country of ref document: AT Kind code of ref document: T Effective date: 20200915 |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D |
|
REG | Reference to a national code |
Ref country code: LT Ref legal event code: MG4D |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: BG Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20201126 Ref country code: NO Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20201126 Ref country code: SE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: HR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: PT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20201228 Ref country code: LT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: GR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20201127 Ref country code: FI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
REG | Reference to a national code |
Ref country code: NL Ref legal event code: MP Effective date: 20200826 |
|
REG | Reference to a national code |
Ref country code: AT Ref legal event code: MK05 Ref document number: 1306887 Country of ref document: AT Kind code of ref document: T Effective date: 20200826 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: NL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: LV Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: RS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: PL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: IS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20201226 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: EE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: RO Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: SM Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: DK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: CZ Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R097 Ref document number: 602016042695 Country of ref document: DE |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: AL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: AT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 Ref country code: ES Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
26N | No opposition filed |
Effective date: 20210527 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: MC Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: PL |
|
REG | Reference to a national code |
Ref country code: BE Ref legal event code: MM Effective date: 20210331 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CH Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20210331 Ref country code: LU Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20210315 Ref country code: LI Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20210331 Ref country code: IE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20210315 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: BE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20210331 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CY Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
P01 | Opt-out of the competence of the unified patent court (upc) registered |
Effective date: 20230525 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: HU Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT; INVALID AB INITIO Effective date: 20160315 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: MK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE Payment date: 20240319 Year of fee payment: 9 Ref country code: GB Payment date: 20240320 Year of fee payment: 9 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: FR Payment date: 20240327 Year of fee payment: 9 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: TR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20200826 |