WO2023048632A1 - A videoconferencing method and system with automatic muting - Google Patents

A videoconferencing method and system with automatic muting

Info

Publication number
WO2023048632A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
videoconferencing
controller
terminal
videoconferencing terminal
Prior art date
Application number
PCT/SE2022/050848
Other languages
French (fr)
Inventor
Gunnar Weibull
Ola Wassvik
Thomas Craven-Bartle
Håkan Bergström
Original Assignee
Flatfrog Laboratories Ab
Priority date
Filing date
Publication date
Application filed by Flatfrog Laboratories Ab filed Critical Flatfrog Laboratories Ab
Publication of WO2023048632A1 publication Critical patent/WO2023048632A1/en

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/14Digital output to display device ; Cooperation and interconnection of the display device with other functional units
    • G06F3/1454Digital output to display device ; Cooperation and interconnection of the display device with other functional units involving copying of the display data of a local workstation or window to a remote workstation or window so that an actual copy of the data is displayed simultaneously on two or more displays, e.g. teledisplay
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/175Static expression

Definitions

  • the present disclosure relates to a method of videoconferencing and a videoconferencing system.
  • the present disclosure relates to a method of videoconferencing which automatically changes the mute status of a videoconferencing terminal.
  • Examples of the present disclosure aim to address the aforementioned problems.
  • a method of video conferencing between a plurality of users at a first videoconferencing terminal and a second videoconferencing terminal comprising: receiving one or more images of at least one first user at the first videoconferencing terminal; determining at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the received one or more images; and sending a control signal to modify a mute status of a microphone and / or a speaker at the first videoconferencing terminal based on the determined changed interaction status.
  • the determining comprises detecting facial gestures indicative that the at least one first user is about to speak.
  • the control signal is configured to unmute the first videoconferencing terminal.
  • the determining comprises detecting facial gestures indicative that the at least one first user is not speaking.
  • the determining comprises determining that the at least one first user has not opened their mouth within a predetermined period of time.
  • the determining comprises tracking a gaze of the at least one first user.
  • control signal is sent in response to detecting that the one or more first user is looking away from the first videoconferencing terminal.
  • the determining comprises detecting that the gaze of the at least one first user is directed at another user currently interacting on the video conference.
  • control signal is sent in response to detecting that the one or more first user is looking at the other user currently interacting on the video conference.
  • control signal is configured to mute the first video conference terminal.
  • the method comprises receiving a signal comprising an audio stream of the at least one first user at the first videoconferencing terminal.
  • the method comprises analysing the audio stream and detecting at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the analysed audio stream.
  • the analysing comprises detecting one or more keywords, phrases, sounds, or silence.
  • the method comprises issuing a prompt to the one or more first user at the first videoconferencing terminal when detecting that the at least one first user is speaking, and the first videoconferencing terminal is muted.
  • the method comprises issuing a prompt to the one or more first user at the first videoconferencing terminal to modify the mute status of the first videoconferencing terminal in response to receiving the control signal.
  • a video conferencing terminal comprising: at least one camera configured to capture one or more images of at least one first user at the videoconferencing terminal; a microphone configured to capture sounds of the at least one first user at the videoconferencing terminal; a speaker configured to generate sounds at the videoconferencing terminal; a controller comprising a processor, the controller configured to determine at least one change in the interaction status of the at least one first user at the videoconferencing terminal based on the received one or more images; and send a control signal to modify a mute status of the microphone and/or the speaker based on the determined changed interaction status.
  • a method of videoconferencing between a plurality of participants at a first videoconferencing terminal and a second videoconferencing terminal comprising: receiving one or more images of at least one first user at the first videoconferencing terminal; determining at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the received one or more images; and sending an invitation for a private videoconferencing session from the at least one first user to at least one second user based on the determined changed interaction status of the at least one first user.
  • the method comprising initiating the private videoconferencing session based on an acceptance to the invitation from the at least one second user.
  • the determining comprises tracking a gaze of the at least one first user.
  • the determining comprises detecting that the gaze of the at least one first user is directed at the at least one second user on the videoconference.
  • the detecting determines that the at least one first user moves towards the at least one second user.
  • the detecting determines that the at least one first user issues a gesture and the invitation is sent to the at least one second user in response to detecting the gesture.
  • a videoconferencing terminal comprising: at least one camera configured to capture one or more images of at least one first user at the videoconferencing terminal; a controller comprising a processor, the controller configured to determine at least one change in the interaction status of the at least one user at the videoconferencing terminal based on the received one or more images; and send an invitation for a private videoconferencing session from the at least one first user to at least one second user based on the determined changed interaction status of the first user.
  • Figure 1 shows a schematic representation of a videoconferencing terminal according to an example
  • Figures 2 to 5 show schematic representations of a videoconferencing system according to an example
  • Figure 6 shows a flow diagram of the videoconferencing method according to an example
  • Figure 7 shows a schematic representation of a videoconferencing system according to an example.
  • Figure 8 shows a flow diagram of the videoconferencing method according to an example. Detailed Description
  • FIG 1 shows a schematic view of a videoconferencing terminal 100 according to some examples.
  • the videoconferencing terminal 100 is a first videoconferencing terminal 100 configured to be in communication with a second videoconferencing terminal 202 (as shown in Figure 2).
  • the first videoconferencing terminal 100 comprises a camera module 102 and a first display 104.
  • the first videoconferencing terminal 100 selectively controls the activation of the camera module 102 and the first display 104.
  • the camera module 102 and the first display 104 are controlled by a camera controller 106 and a display controller 108 respectively.
  • the camera module 102 comprises one or more cameras.
  • the videoconferencing terminal 100 comprises a videoconferencing controller 110.
  • the videoconferencing controller 110, the camera controller 106 and the display controller 108 may be configured as separate units, or they may be incorporated in a single unit.
  • the videoconferencing controller 110 comprises a plurality of modules for processing the videos and images received remotely via an interface 112 and the videos and images captured locally.
  • the interface 112 and the method of transmitting and receiving videoconferencing data are known and will not be discussed any further.
  • the videoconferencing controller 110 comprises a face detection module 114 for detecting facial features and an image processing module 116 for modifying a first user display image 220 to be displayed (as shown in Figure 2) on the first display 104.
  • the videoconferencing controller 110 comprises an eye tracking module 118.
  • the eye tracking module 118 can be part of the face detection module 114 or alternatively, the eye tracking module 118 can be a separate module from the face detection module 114.
  • the face detection module 114, the image processing module 116, and the eye tracking module 118 will be discussed in further detail below.
  • the videoconferencing controller 110 comprises an audio processing module 122.
  • the audio processing module 122 is configured to detect and analyse the audio signal received from a microphone 208 (as best shown in Figure 2).
  • One or all of the videoconferencing controller 110, the camera controller 106 and the display controller 108 may be at least partially implemented by software executed by a processing unit 120.
  • the face detection module 114, the image processing module 116, the eye-tracking module 118, and the audio processing module 122 may be configured as separate units, or they may be incorporated in a single unit.
  • One or all of the face detection module 114, the image processing module 116, the eye-tracking module 118, and the audio processing module 122 may be at least partially implemented by software executed by the processing unit 120.
  • the processing unit 120 may be implemented by special-purpose software (or firmware) run on one or more general-purpose or special-purpose computing devices.
  • each "element” or “means” of such a computing device refers to a conceptual equivalent of a method step; there is not always a one-to-one correspondence between elements/means and particular pieces of hardware or software routines.
  • One piece of hardware sometimes comprises different means/elements.
  • a processing unit 120 may serve as one element/means when executing one instruction but serve as another element/means when executing another instruction.
  • one element/means may be implemented by one instruction in some cases, but by a plurality of instructions in some other cases.
  • one or more elements (means) are implemented entirely by analogue hardware components.
  • the processing unit 120 may include one or more processing units, e.g. a CPU ("Central Processing Unit"), a DSP ("Digital Signal Processor"), an ASIC ("Application-Specific Integrated Circuit"), discrete analogue and/or digital components, or some other programmable logical device, such as an FPGA ("Field Programmable Gate Array").
  • the processing unit 120 may further include a system memory and a system bus that couples various system components including the system memory to the processing unit.
  • the system bus may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • the system memory may include computer storage media in the form of volatile and/or non-volatile memory such as read only memory (ROM), random access memory (RAM) and flash memory.
  • the special-purpose software and associated control parameter values may be stored in the system memory, or on other removable/non-removable volatile/non-volatile computer storage media which are included in or accessible to the computing device, such as magnetic media, optical media, flash memory cards, digital tape, solid state RAM, solid state ROM, etc.
  • the processing unit 120 may include one or more communication interfaces, such as a serial interface, a parallel interface, a USB interface, a wireless interface, a network adapter, etc, as well as one or more data acquisition devices, such as an A/D converter.
  • the special-purpose software may be provided to the processing unit 120 on any suitable computer-readable medium.
  • the first videoconferencing terminal 100 will be used together with another remote second videoconferencing terminal 202.
  • the first videoconferencing terminal 100 can be used with a plurality of second videoconferencing terminals 202.
  • the first videoconferencing terminal 100 is a presenter videoconferencing terminal 100, but the user of the first videoconferencing terminal 100 may not necessarily be designated as a presenter in the videoconference. Indeed, any of the remote second users 206 can present material in the videoconference.
  • FIG. 2 shows a schematic representation of a videoconferencing system 200.
  • the first videoconferencing terminal 100 as shown in Figure 2 is the same as described in reference to Figure 1.
  • the second videoconferencing terminal 202 is the same as described in reference to Figure 1.
  • a first user 204 can present to one or more remote second users 206.
  • Figure 2 only shows one remote second user 206.
  • the first videoconferencing terminal 100 comprises additional functionality to the second videoconferencing terminals 202.
  • the first videoconferencing terminal 100 can be a large touch screen e.g. a first display 104 comprising a touch sensing apparatus (not shown). Touch sensing apparatuses are known and will not be discussed in any further detail.
  • the second videoconferencing terminals 202 can be a laptop, desktop computer, tablet, smartphone, or any other suitable device.
  • the first videoconferencing terminal 100 does not comprise a touch sensing apparatus 1000 and is e.g. a laptop.
  • the videoconferencing terminal 100 comprises at least one first camera 210 for capturing image data of the first user 204.
  • the camera module 102 comprises a first camera 210 and a second camera 212.
  • the videoconferencing terminal 100 comprises a first camera 210 and a second camera 212 mounted to the first display 104.
  • the first camera 210 and the second camera 212 are connected to the videoconferencing controller 110 and are configured to send image data of the first user 204 to the videoconferencing controller 110.
  • the first and second cameras 210, 212 are configured to capture images of the first user 204.
  • the captured images of the first user 204 are used for determining a gaze direction G and other facial features of the first user 204 with respect to the displayed first user display image 220 on the first display 104.
  • the captured images of the first user 204 can also be used for providing a video stream of the first user 204 to the remote second videoconferencing terminal 202 during the video conference.
  • the first and second cameras 210, 212 are cameras (e.g. RGB cameras) configured to capture colour images of the first user 204.
  • the first and second cameras 210, 212 can be near-infrared cameras.
  • the first videoconferencing terminal 100 comprises a third camera 224 for capturing images for the video stream of the first user 204.
  • the third camera 224 as shown in Figure 2 is mounted on the top of the first display 104. In other examples, the third camera 224 can be mounted in any suitable position or orientation with respect to the first display 104 and the first user 204.
  • the first and second cameras 210, 212 are solely used for determining the gaze direction G and other facial features of the first user 204 and a third camera 224 is used for capturing the video stream of the first user 204.
  • the third camera 224 is an RGB camera for capturing colour image data of the first user 204.
  • Figure 2 shows a plurality of (e.g. two) cameras 210, 212 for determining a gaze direction G and other facial features of the first user 204.
  • the videoconferencing terminal 100 can comprise a single camera 210 for determining a gaze direction G and the facial features of the first user 204.
  • first and second cameras 210, 212 are mounted on opposite sides of the first display 104.
  • Figure 2 shows the first and second cameras 210, 212 mounted on the top two corners of the first display 104, but the first and second cameras 210, 212 can be mounted at any position on the first display 104.
  • the first and second cameras 210, 212 can be mounted remote from the first display 104.
  • the first and second cameras 210, 212 can be mounted on the ceiling or the wall near the first display 104.
  • the first and second cameras 210, 212 are mounted behind the first display 104.
  • the first and second cameras 210, 212 are near-infrared cameras and the first display 104 is optically transmissive to the near-infrared light.
  • the first and second cameras 210, 212 comprise a first illumination source 222 of near-infrared light for illuminating the first user 204.
  • the first illumination source 222 is mounted on the top of the first display 104, but the first illumination source 222 can be mounted in any suitable position.
  • the source of illumination can be a near-infrared light source such as an LED mounted to the first and / or the second camera 210, 212.
  • the first illumination source 222 is mounted on the first display 104 remote from the first and second cameras 210, 212.
  • the first videoconferencing terminal 100 comprises a microphone 208 configured to transmit an audio signal to the videoconferencing controller 110 during the videoconference.
  • the microphone 208 is configured to detect sound such as the voice of the first user 204 and send the audio signal to the videoconferencing controller 110.
  • the first videoconferencing terminal 100 comprises a speaker 236 configured to generate audio in dependence of a received audio signal from the videoconferencing controller 110 during the videoconference.
  • the microphone 208 is a separate microphone component remote from the first display 104.
  • the microphone 208 can be a separate unit mounted in front of the first display 104 e.g. on a desk or worksurface.
  • the microphone 208 can be integrated into the first display 104, in the camera module 102 or in any of the first, second or third cameras 210, 212, 224.
  • Figure 2 only shows a single microphone 208, but the functionality with respect to the single microphone 208 can be applied to a plurality of microphones connected to the first videoconferencing terminal 100.
  • the speaker 236 is a separate speaker component remote from the first display 104.
  • the first display 104 as shown in Figure 2 is showing a first user display image 220.
  • the first user display image 220 comprises one or more image elements.
  • the first user display image 220 comprises a video stream application window 226 of the second user 206.
  • the video stream application window 226 can comprise any number of other second users 206 depending on the number of users participating in the videoconference.
  • the first user display image 220 may further optionally comprise other application windows 228 configured to share material e.g. a slide deck to the second video conferencing terminal 202.
  • the first user display image 220 is duplicated on a second user display 218 on the remote second videoconferencing terminal 202.
  • the second user video stream window 232 comprises a video stream of the first user 204.
  • the videoconferencing controller 110 is configured to display the first user display image 220 on the first display 104 and the second user display image 216 on the second user display 218 as shown in Figure 2.
  • the first user display image 220 and the second user display image 216 can be identical or share common elements e.g. an application window comprising a presentation.
  • the videoconferencing controller 110 detects whether the first user 204 changes their interaction status in the videoconference with the second user 206.
  • a change in the interaction status of the first user 204 can be when the first user 204 starts speaking or when the first user 204 becomes silent during the videoconference.
  • a change in the interaction status of the first user 204 means that the first user 204 increases or decreases their participation in the videoconference.
  • a decrease in the participation of the first user 204 in the videoconference means that background noise picked up by the microphone 208 is more likely to disrupt the other participants e.g. the second user 206 in the videoconference.
  • An increase in the participation of the first user 204 in the videoconference means that the first user 204 is more likely to speak which should be heard by the second user 206.
  • the change in interaction status is not limited to the first user 204 speaking or not speaking.
  • a change in interaction status can be when the first user 204 looks away from the first display 104, moves away from the first display 104, or leaves or enters the room where the first videoconferencing terminal 100 is located. This means cues of the first user 204 such as eye movement, gaze direction, head movement, head orientation, body movement and body orientation are detected and indicate how much the first user 204 is participating in the videoconference.
  • the videoconference controller 110 determining an interaction status of the first user 204 will now be discussed in further detail.
  • the videoconferencing controller 110 receives one or more images of the first user 204 from the first and/or the second cameras 210, 212 as shown in step 600 in Figure 6.
  • Figure 6 shows a flow diagram of a method of videoconferencing according to some examples.
  • the videoconferencing controller 110 then sends the one or more images to the face detection module 114.
  • the face detection module 114 determines the orientation and position of the face of the first user 204 based on feature detection.
  • the face detection module 114 detects the position of the eyes 214 of the first user 204 in a received image. In this way, the face detection module 114 determines the facial gestures and the gaze direction of the first user 204 as shown in steps 602 and 604 of Figure 6.
  • the face detection module 114 uses feature detection on an image of the first user 204 to detect where the eyes 214 and the face of the first user 204 are with respect to the first display 104. For example, the face detection module 114 may determine that only one eye or no eyes 214 of the first user 204 are observable.
  • the face detection module 114 then sends a face detection signal or face position and/or face orientation information of the first user 204 to the videoconferencing controller 110.
  • the videoconferencing controller 110 determines whether the first user 204 is looking at the first display 104 based on the received signal from the face detection module 114. If the videoconference controller 110 does not receive a face detection signal from the face detection module 114, then the videoconference controller 110 determines that the first user 204 is not looking at the first display 104.
  • the videoconferencing controller 110 is able to determine the general gaze direction G of the first user 204 based on a detection of the face of the first user 204. In other words, the videoconferencing controller 110 determines from the gaze direction G whether the first user 204 is looking at the first display 104 or not.
  • when the videoconferencing controller 110 determines that the first user gaze direction G is towards the first display 104, the videoconferencing controller 110 determines that the first user 204 is currently engaged with the videoconference. The videoconferencing controller 110 then determines that the interaction status of the first user 204 in the videoconference is “high”. This means that the videoconferencing controller 110 issues a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “unmute” as shown in step 610 of Figure 6. In other words, the microphone 208 is activated and transmits an audio signal to the videoconferencing controller 110 from the first user 204.
  • the videoconferencing controller 110 can issue another signal to image processing module 116.
  • the image processing module 116 can modify a display signal sent to the second videoconferencing terminal 202 indicating that the first user 204 is unmuted. For example, this can be represented by a microphone symbol 234 in the first user display image 220 as shown in Figure 2.
  • the videoconferencing controller 110 automatically switches on the camera module 102 e.g. the third camera 224 when the first user 204 is unmuted. This means that if the first user 204 has switched off their third camera 224, the first user 204 is shown in a video stream when they start talking. This means that the videoconference is more engaging for all users and it is clearer who has started speaking in the videoconference.
  • when the videoconferencing controller 110 determines in step 606 that the first user 204 is not looking at the first display 104, the videoconferencing controller 110 then determines that the interaction status of the first user 204 in the videoconference is “low”. The videoconferencing controller 110 can then issue a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “mute” as shown in step 610. In other words, the microphone 208 is deactivated and prevents transmission of an audio signal to the videoconferencing controller 110 from the first user 204.
  • the videoconferencing controller 110 can optionally issue a signal to image processing module 116.
  • the image processing module 116 can modify a display signal sent to the second videoconferencing terminal 202 indicating that the first user 204 is muted. For example, this can be represented by a muted microphone symbol 300 in the first user display image 220 as shown in Figure 3. This may be helpful when the first user 204 has turned away from the videoconferencing terminal 100 and is speaking directly to the local participants in the same room as the first user 204. This means that private sidebar discussions by the first user 204 during a videoconference will not disrupt the videoconference for the second user 206.
  • the videoconferencing controller 110 automatically switches off the camera module 102 e.g. the third camera 224 when the first user 204 is muted. This means that if the first user 204 wishes to turn off their third camera 224, the first user 204 is not shown in a video stream when they stop talking. This may help the first user 204 maintain some privacy during the videoconference when they have stopped speaking in the videoconference.
  • the videoconferencing controller 110 additionally or alternatively automatically switches off the speaker 236 when the first user 204 is muted.
  • the videoconferencing controller 110 can optionally issue a signal to image processing module 116 to provide a mute alert 302 on the first display 104 as shown in step 612 in Figure 6. This means that the first user 204 can be prompted to change the mute status of the microphone 208.
  • when the videoconferencing controller 110 receives a positive response from the first user 204 with respect to the mute alert 302, the videoconference controller 110 carries out steps 608 and 610 and changes the mute state of the microphone 208, e.g. muting the microphone 208.
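  • As a purely illustrative sketch (not taken from the disclosure itself), the decision flow of steps 602, 606, 608 and 610 can be summarised as follows, assuming a hypothetical FaceDetectionResult standing in for the output of the face detection module 114:

```python
from dataclasses import dataclass
from enum import Enum

class MuteCommand(Enum):
    MUTE = "mute"
    UNMUTE = "unmute"

@dataclass
class FaceDetectionResult:
    face_found: bool        # a face was detected in the captured image (step 602)
    facing_display: bool    # the detected face is oriented towards the first display 104

def interaction_status(result: FaceDetectionResult) -> str:
    # Step 606: the first user counts as engaged only if a face is found
    # and that face is oriented towards the first display.
    return "high" if (result.face_found and result.facing_display) else "low"

def mute_control_signal(status: str) -> MuteCommand:
    # Steps 608/610: a "high" interaction status unmutes the microphone,
    # a "low" interaction status mutes it.
    return MuteCommand.UNMUTE if status == "high" else MuteCommand.MUTE

# Example: the user has turned away from the display, so the terminal is muted.
result = FaceDetectionResult(face_found=True, facing_display=False)
print(mute_control_signal(interaction_status(result)))  # MuteCommand.MUTE
```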
  • the videoconferencing controller 110 determines a more precise presenter gaze direction G. This will now be discussed in further detail.
  • the videoconferencing terminal 100 comprises a first illumination source 222 of near-infrared light configured to illuminate the first user 204.
  • the infrared light is transmitted to the first user 204 and the infrared light is reflected from the presenter eyes 214.
  • the first and second cameras 210, 212 detect the reflected light from the presenter eyes 214.
  • the first and second cameras 210, 212 are configured to send one or more image signals to the videoconferencing controller 110 as shown in step 600.
  • the videoconferencing controller 110 sends the image signals to the eye tracking module 118. Since the placement of the first and second cameras 210, 212 and the first illumination source 222 are known, the eye tracking module 118 determines through trigonometry the gaze direction G of the first user 204 as shown in step 604. Determining the first user gaze direction G from detection of reflected light from the eye 214 of the presenter is known e.g. as discussed in US 6,659,661 which is incorporated by reference herein.
  • the videoconferencing controller 110 determines the direction of the face of the first user 204 based on feature detection. For example, the eye tracking module 118 determines the location of eyes 214 of the first user 204 with respect to the nose 230 from the received image signals. In this way, the eye tracking module 118 determines the first user gaze direction G as shown in step 604. Determining the first user gaze direction G from facial features is known e.g. as discussed in DETERMINING THE GAZE OF FACES IN IMAGES A. H. Gee and R. Cipolla, 1994 which is incorporated by reference herein.
  • the eye tracking module 118 determines the first user gaze direction G based on a trained neural network classifying the direction of the presenter eyes 214 processing the received one or more image signals from the first and second cameras 210, 212 as shown in step 604.
  • Classifying the first user gaze direction G from a convolutional neural network is known e.g. as discussed in Real-time Eye Gaze Direction Classification Using Convolutional Neural Network, Anjith George and Aurobinda Routray, 2016, which is incorporated herein by reference.
  • the eye tracking module 118 determines the first user gaze direction G and sends a signal to the videoconferencing controller 110 comprising information relating to the presenter gaze direction G.
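  • As a hedged illustration of the feature-based approach above (not the eye tracking module 118 itself), a rough head-yaw estimate can be derived from three 2D landmarks; the landmark coordinates, threshold and function names below are assumptions made for the example:

```python
import math

def approximate_yaw(left_eye, right_eye, nose_tip):
    """Rough head-yaw estimate (degrees) from three 2D facial landmarks.

    When the nose tip sits midway between the eyes the head is roughly frontal;
    the further it shifts towards one eye, the larger the yaw. Landmarks are
    (x, y) pixel coordinates from an assumed upstream face detector.
    """
    eye_span = right_eye[0] - left_eye[0]
    if eye_span == 0:
        return 0.0
    midpoint_x = (left_eye[0] + right_eye[0]) / 2.0
    offset = (nose_tip[0] - midpoint_x) / eye_span        # roughly -0.5 .. +0.5
    return math.degrees(math.asin(max(-1.0, min(1.0, 2.0 * offset))))

def looking_at_display(yaw_degrees, threshold=25.0):
    # Treat a small yaw as "gaze direction G towards the first display".
    return abs(yaw_degrees) < threshold

print(looking_at_display(approximate_yaw((320, 240), (400, 240), (362, 270))))  # True
```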
  • the videoconferencing controller 110 is able to better determine when the interaction status of the first user 204 changes.
  • the videoconferencing controller 110 determines which part of the first user display image 220 on the first display 104 that the first user gaze direction G intersects. For example, the videoconferencing controller 110 can determine whether the first user 204 is looking at an application window comprising a video stream of the videoconference or another application window 500 unrelated to the videoconference. Accordingly, the videoconferencing controller 110 carries out steps 606, 608 and 610 as previously mentioned.
  • the videoconferencing controller 110 determines that the first user gaze direction G is directed at the video stream application window 226 of the second user 206 e.g. as shown in Figure 5. The videoconferencing controller 110 then determines that since the first user 204 is looking at the video stream application window 226 of the second user 206, the interaction status of the first user 204 is “high”. The interaction status of the first user 204 may be determined to be high because the first user 204 may previously have been looking away from the first display 104. The videoconferencing controller 110 is then configured to send a control signal in response to detecting that the first user 204 is looking at the video stream application window 226 of the second user 206 currently interacting on the video conference. Accordingly, the videoconferencing controller 110 changes the mute status to “unmuted” and carries out steps 606, 608 and 610 as previously mentioned.
  • the videoconferencing controller 110 determines that the first user gaze direction G is directed at an unrelated application window 500 e.g. as shown in Figure 5. The videoconferencing controller 110 then determines that since the first user 204 is looking at the unrelated application window 500, the interaction status of the first user 204 is “low”. The interaction status of the first user 204 may be determined to be low because the first user 204 may previously have been looking at the video stream application window 226. The videoconferencing controller 110 is then configured to send a control signal in response to detecting that the first user 204 is not looking at the video stream application window 226 and is not currently interacting on the video conference. Accordingly, the videoconferencing controller 110 changes the mute status to “muted” and carries out steps 606, 608 and 610 as previously mentioned.
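  • A minimal sketch of mapping the on-screen gaze intersection to an interaction status is given below; the window geometry, class name and labels are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ScreenWindow:
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, point) -> bool:
        px, py = point
        return (self.x <= px < self.x + self.width and
                self.y <= py < self.y + self.height)

def interaction_from_gaze(gaze_point, video_window: ScreenWindow) -> str:
    # Looking at the video stream application window counts as "high" interaction;
    # looking anywhere else (e.g. an unrelated application window) counts as "low".
    return "high" if video_window.contains(gaze_point) else "low"

video_stream = ScreenWindow("video stream window 226", 0, 0, 960, 1080)
unrelated = ScreenWindow("unrelated window 500", 960, 0, 960, 1080)
gaze_point = (1200, 400)
assert unrelated.contains(gaze_point)                   # the gaze falls on the unrelated window
print(interaction_from_gaze(gaze_point, video_stream))  # "low" -> mute
```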
  • the videoconference controller 110 sends an instruction to the audio processing module 122 to analyse the audio signal received from the microphone 208 as shown in step 614 in Figure 6.
  • the audio processing module 122 analyses the audio signal for one or more keywords.
  • the audio processing module 122 is configured to carry out automatic speech recognition algorithms.
  • the audio processing module 122 comprises a trained neural network classifying the audio signal into significant words. Machine learning speech recognition is known (e.g. End-to-End Deep Neural Network for Automatic Speech Recognition by William Song and Jim Cai, which is incorporated by reference herein) and will not be discussed any further.
  • the audio processing module 122 can determine whether the first user 204 is using specific words which are relevant to the videoconference. In this case, the audio processing module 122 sends a signal to the videoconference controller 110.
  • the videoconferencing controller 110 determines that the interaction status of the first user 204 in the videoconference is “high”. The videoconference controller 110 then carries out steps 606, 608 and 610. In this way, if the first user 204 starts talking and engaging with the videoconference, then the microphone 208 is unmuted.
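  • The keyword-based part of this decision could look like the sketch below, assuming a transcript is already produced by some speech recognition backend; the keyword list is purely illustrative:

```python
# Illustrative keywords that would mark speech as relevant to the videoconference.
MEETING_KEYWORDS = {"agenda", "budget", "question", "next slide"}

def transcript_is_relevant(transcript: str, keywords=MEETING_KEYWORDS) -> bool:
    text = transcript.lower()
    return any(keyword in text for keyword in keywords)

def audio_interaction_status(transcript: str) -> str:
    # If the first user is using words relevant to the videoconference,
    # the audio processing module reports a "high" interaction status.
    return "high" if transcript_is_relevant(transcript) else "low"

print(audio_interaction_status("I have a question about the budget"))  # "high" -> unmute
```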
  • the face detection module 114 receives the images of the first user 204 from the first camera 210 and/or the second camera 212.
  • the face detection module 114 detects a mouth 304 of the first user 204.
  • the face detection module 114 heuristically detects the mouth 304 of the first user 204. Heuristically detecting facial features is known (e.g. HEURISTICBASED AUTOMATIC FACE DETECTION by Geovany Ramirez, Vittorio Zanella, Olac Fuentes, 2003 which is incorporated by reference herein). Accordingly, the face detection module 114 can detect the mouth 304 of the first user 204.
  • the face detection module 114 further detects the mouth orientation, mouth position and mouth movement as shown in step 602 of Figure 6. As shown in Figure 6, the other steps of detecting the first user gaze direction 604 and analysing the audio 614 are optional and may not be carried out.
  • the face detection module 114 determines that the mouth 304 of the first user 204 is closed and has remained closed for a predetermined period of time. Accordingly, the face detection module 114 sends a “mouth closed” signal to the videoconferencing controller 110. In this way, the videoconferencing controller 110 determines that the first user 204 is not talking and their interaction status in the video conference is currently “low” as shown in step 606.
  • the videoconferencing controller 110 can then issue a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “mute” as shown in step 610.
  • the microphone 208 is deactivated and prevents transmission of an audio signal to the videoconferencing controller 110 from the first user 204.
  • the face detection module 114 determines that the mouth 304 of the first user 204 is open.
  • the face detection module 114 may also detect that the mouth 304 of the first user 204 is moving e.g. an indication of speaking. Accordingly, the face detection module 114 sends a “mouth open” or a “mouth moving” signal to the videoconferencing controller 110. In this way, the videoconferencing controller 110 determines that the first user 204 is talking and their interaction status in the video conference is currently “high” as shown in step 606.
  • the videoconferencing controller 110 can then issue a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “unmute” as shown in step 610.
  • the microphone 208 is activated and allows transmission of an audio signal to the videoconferencing controller 110 from the first user 204.
  • the face detection module 114 alternatively detects the mouth 304 of the first user 204 with a neural network. It is known to use a convolutional neural network for automatic lip reading (e.g. Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory by Yuanyao Lu and Hongbo Li, 2019 which is incorporated by reference herein).
  • the face detection module 114 classifies different images of the first user 204 as showing either a mouth open or a mouth closed.
  • the face detection module 114 can determine the particular lip and mouth 304 movement associated with the first user 204 being about to speak or using relevant words for the videoconference.
  • the face detection module 114 sends a signal to the videoconferencing controller 110 indicating whether the first user’s mouth 304 is open or closed.
  • the videoconferencing controller 110 can then issue a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “mute” or “unmute” as shown in step 610.
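  • One plausible way (an assumption, not the disclosed classifier) to realise the mouth open/closed decision is a mouth aspect ratio computed from four lip landmarks, combined with a timer for the predetermined closed period:

```python
def mouth_aspect_ratio(upper_lip, lower_lip, left_corner, right_corner) -> float:
    """Ratio of vertical lip gap to mouth width from four (x, y) landmarks."""
    vertical = abs(lower_lip[1] - upper_lip[1])
    horizontal = abs(right_corner[0] - left_corner[0]) or 1
    return vertical / horizontal

class MouthStateTracker:
    """Track whether the mouth is open, or has stayed closed for a set period."""

    def __init__(self, open_threshold=0.35, closed_seconds_to_mute=10.0):
        self.open_threshold = open_threshold
        self.closed_seconds_to_mute = closed_seconds_to_mute
        self.closed_since = None

    def update(self, ratio: float, timestamp: float) -> str:
        if ratio >= self.open_threshold:
            self.closed_since = None
            return "mouth open"       # interaction status "high" -> unmute
        if self.closed_since is None:
            self.closed_since = timestamp
        if timestamp - self.closed_since >= self.closed_seconds_to_mute:
            return "mouth closed"     # closed for the predetermined period -> mute
        return "undetermined"         # keep the current mute status

tracker = MouthStateTracker()
ratio = mouth_aspect_ratio((0, 10), (0, 18), (-8, 11), (8, 11))
print(tracker.update(ratio, timestamp=0.0))  # "mouth open"
```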
  • the videoconferencing controller 110 can use the audio signal in addition to images of the first user 204 to determine the interaction status of the first user 204.
  • the videoconferencing controller 110 can simply detect that the first user 204 is talking into the microphone 208.
  • the videoconferencing controller 110 uses the audio processing module as discussed above. In this way, the videoconferencing controller 110 implements the previously discussed steps 602, 604, 614, 606, 608 and 610 sequentially or in parallel.
  • the videoconferencing controller 110 is configured to apply one or more rules for determining when to mute e.g.:
  • the videoconferencing controller 110 does not automatically mute the current speaker when all other participants are muted.
  • the videoconferencing controller 110 does not automatically unmute a person when other participants are unmuted.
  • the videoconferencing controller 110 mutes or unmutes a user in dependence on a user selection, e.g. a user may have a visible checkbox for the activation or deactivation of automatic muting and unmuting.
  • the videoconferencing controller 110 is configured to determine to mute or unmute based on the combination of gaze direction and speech.
  • the videoconferencing controller 110 is configured to use signals received from two or more microphones (arrays) to determine the current speaker’s position and configured to use sound processing to filter out sound coming from other directions.
  • the videoconferencing controller 110 is configured to use two or more microphones (arrays) to determine the speaker’s position in combination with gaze and mouth movement detection in order to increase the accuracy of automatic decisions.
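  • A sketch of estimating the current speaker's bearing from a two-microphone array via the time difference of arrival (cross-correlation peak) is shown below; the spacing, sample rate and toy signals are assumptions for illustration:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def estimate_speaker_angle(mic_a, mic_b, mic_spacing_m, sample_rate):
    """Bearing (degrees from broadside) of a source from a two-microphone array.

    mic_a and mic_b are equal-length 1-D sample arrays. The lag of the peak of
    their cross-correlation gives the time difference of arrival, which maps to
    an angle for a far-field source.
    """
    correlation = np.correlate(mic_a, mic_b, mode="full")
    lag = int(np.argmax(correlation)) - (len(mic_b) - 1)   # delay of mic_a relative to mic_b, in samples
    tdoa = lag / sample_rate
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Toy example: the same short burst reaches microphone B two samples after microphone A.
rate = 16000
burst = np.hanning(64)
mic_a = np.concatenate([np.zeros(100), burst, np.zeros(100)])
mic_b = np.concatenate([np.zeros(102), burst, np.zeros(98)])
print(round(estimate_speaker_angle(mic_a, mic_b, mic_spacing_m=0.2, sample_rate=rate), 1))
```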
  • the videoconferencing controller 110 may determine that the speaker is conducting a private conversation.
  • alternatively, the speaker may be continuing the general discussion, e.g. the first user 204 is addressing a local attendee. This may mean that the videoconferencing controller 110 determines that the first user 204 is having a "private side-conversation" and decides that the conversation of the first user 204 should be muted. This may be undesirable if the first user 204 is continuing a discussion relevant to the videoconference.
  • the videoconferencing controller 110 is configured to perform content analysis of the videoconference. In some examples, the videoconferencing controller 110 is configured to use AI techniques to determine that the decision to automatically mute has a higher likelihood of being correct. In some examples, the videoconferencing controller 110 is configured to simultaneously analyse video data from all the participants in the video conference. In this way the videoconferencing controller 110 is able to determine from the conversation, gestures, and mimicry of one or more of the participants who may want to speak and who may want to be muted. Furthermore, the videoconferencing controller 110 is configured to inform participants if technical problems are occurring on one side or the other that cannot be solved automatically by the videoconferencing controller 110. The videoconferencing controller 110 is configured to recognize one or more gestures and verbal orders issued by the participants to manage microphones, speakers, and volume etc.
  • the videoconferencing controller 110 is configured to detect if more than one microphone 208 is present in the same room.
  • a meeting room may have a large touch screen and one or more laptops sharing. This means that the videoconferencing controller 110 is able to avoid activation of audio feedback during the videoconference session by issuing a control instruction to activate only one microphone 208 in a particular location.
  • the videoconferencing controller 110 is configured to use sound algorithms to detect audio feedback in order to mute one or more microphones 208 in the videoconference session.
  • the videoconferencing controller 110 can also apply the same control to speakers 236 in the videoconference session.
  • the videoconferencing controller 110 is configured to detect if more than one speaker 236 is present in the same room (e.g. large touch screen and one or more laptops sharing) to avoid activation of audio feedback.
  • the videoconferencing controller 110 may be configured to use sound algorithms to detect audio feedback in order to issue a control instruction to mute one or more speakers 236 in the videoconference session.
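  • A crude sound algorithm of the kind referred to above could look for a dominant narrowband spectral peak that persists over consecutive frames; the thresholds and frame length below are assumptions for illustration:

```python
import numpy as np

def feedback_suspected(frames, ratio_threshold=20.0, min_consecutive=5) -> bool:
    """Return True if a dominant spectral peak stays at the same bin over several frames.

    `frames` is an iterable of equal-length 1-D sample arrays. Sustained howling
    from audio feedback typically appears as one narrowband tone dominating the
    spectrum frame after frame.
    """
    consecutive = 0
    previous_bin = None
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        peak_bin = int(np.argmax(spectrum))
        peak = spectrum[peak_bin]
        rest = (np.sum(spectrum) - peak) / max(len(spectrum) - 1, 1)
        dominant = rest > 0 and peak / rest > ratio_threshold
        same_place = previous_bin is not None and abs(peak_bin - previous_bin) <= 1
        consecutive = consecutive + 1 if dominant and (same_place or previous_bin is None) else 0
        previous_bin = peak_bin if dominant else None
        if consecutive >= min_consecutive:
            return True
    return False

# Toy example: a sustained 1 kHz tone across ten 1024-sample frames at 16 kHz.
rate, n = 16000, 1024
t = np.arange(10 * n) / rate
tone = np.sin(2 * np.pi * 1000.0 * t)
print(feedback_suspected(tone.reshape(10, n)))  # True -> mute a microphone or speaker
```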
  • Figures 5 and 7 show a schematic representation of a videoconferencing system 200.
  • Figure 8 shows a method of videoconferencing according to an example.
  • the first user 204 may wish to have a breakout session with the second user 206 separate from the original videoconference.
  • the first user 204 may want a private meeting with one or more current participants of the ongoing videoconference. If this occurs, it is possible to use the existing techniques for muting and unmuting the first user 204 to establish the private session between the first user 204 and the second user 206.
  • the videoconferencing controller 110 is configured to use high level muting. In this way the videoconferencing controller 110 is configured to send the audio stream of the videoconferencing session to all involved processing units (e.g. the first and second videoconferencing terminals 100, 202 and all the other participating videoconferencing terminals). The videoconferencing controller 110 is configured to issue control instructions to each of the participating videoconferencing terminals 100, 202 to determine and control whether or not to output the audio on the activated speakers 236 of the participating videoconferencing terminals 100, 202.
  • the videoconferencing controller 110 can selectively control and signal which of the microphones 208 and speakers 236 of the participating videoconferencing terminals 100, 202 should be activated or not. This enables the videoconferencing controller 110 to control the different simultaneous audio channels within the videoconferencing session.
  • the videoconferencing controller 110 is configured to issue control instructions such that some of the participating videoconferencing terminals 100, 202 receive audio in respect of a private channel whilst, at the same time, other participating videoconferencing terminals receive the audio for the videoconferencing session.
  • the “public” video conferencing may proceed at the same time as the private breakout session.
  • the videoconferencing system 200 and the first and second videoconferencing terminals 100, 202 are identical to the videoconferencing terminals discussed in reference to the previous Figures.
  • Figure 8 is identical to the method described in Figure 6 except that the last two steps have been modified.
  • the face detection module 114 determines the first user gaze direction G and sends a signal to the videoconferencing controller 110.
  • the videoconferencing controller 110 determines that the first user 204 is looking at the video stream application window 226 of the second user 206.
  • the videoconferencing controller 110 sends an invitation signal to the second videoconferencing terminal 202 in response to the determination that the first user 204 is looking at the video stream application window 226 of the second user 206.
  • Step 800 in Figure 8 shows the step of the videoconferencing controller 110 sending the invitation to the second videoconferencing terminal 202.
  • the videoconferencing controller 110 may issue a prompt as shown in step 806.
  • the step of issuing a prompt is similar to the previously discussed step 612 in Figure 6.
  • the first user 204 has to accept and approve before the videoconferencing controller 110 sends the invitation to the second user 206.
  • the videoconferencing controller 110 may then establish a private session between the first and second user 204, 206 as shown in step 802.
  • the videoconferencing controller 110 in some examples establishes the private session on receipt of an acceptance from the second user 206 to the invitation.
  • the videoconferencing controller 110 mutes the first and second user 204, 206 in the original videoconference.
  • the videoconferencing controller 110 does not change the mute status of the microphone 208 in order to allow the first and second user 204, 206 to talk in the private session. Instead, the videoconferencing controller 110 prevents the audio from the microphone 208 being sent to any other participants in the original videoconference. Alternatively, the videoconferencing controller 110 lowers the volume of the audio captured from the microphone 208. This means that the captured audio from the private discussion from the first and second user 204, 206 is less likely to disrupt the original videoconference.
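  • The per-terminal routing implied by this high-level muting can be sketched as below; the terminal identifiers and attenuation factor are assumptions made for the example:

```python
def route_audio(captured_audio_by_terminal, private_members, attenuation=0.2):
    """Decide, per receiving terminal, which captured audio it should play and at what gain.

    Terminals in `private_members` hear each other at full volume; the audio of the
    private participants reaches everyone else only at a reduced gain (or gain 0.0
    to withhold it entirely), while the rest of the session is unaffected.
    Returns {receiver: [(sender, gain), ...]}.
    """
    terminals = set(captured_audio_by_terminal)
    routing = {}
    for receiver in sorted(terminals):
        feeds = []
        for sender in sorted(terminals - {receiver}):
            if sender in private_members and receiver not in private_members:
                gain = attenuation
            else:
                gain = 1.0
            feeds.append((sender, gain))
        routing[receiver] = feeds
    return routing

# First and second terminals hold a private session; a third terminal only gets attenuated audio from them.
audio = {"terminal_100": b"...", "terminal_202": b"...", "terminal_300": b"..."}
print(route_audio(audio, private_members={"terminal_100", "terminal_202"}))
```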
  • the face detection module 114 determines that the first user 204 is moving their head or body (as shown by arrow B in Figure 7) towards the first display 104. For example, the face detection module 114 detects that the first user 204 is physically leaning towards the video stream application window 226 of the second user 206. The physical movement of the first user 204 is determined by the videoconferencing controller 110 as the first user’s 204 intention to establish a new private session between the first user 204 and the second user 206.
  • the videoconferencing controller 110 can apply one or more muting modes to the videoconferencing terminal 100.
  • in a first “automatic” mode, the videoconferencing controller 110 applies the functionality to the microphone 208 as described in the examples shown in Figures 1 to 8.
  • in a second mode, the videoconferencing controller 110 keeps the microphone 208 always on.
  • in a third mode, the videoconferencing controller 110 keeps the microphone 208 always off.
  • the videoconferencing controller 110 defaults to the first automatic mode.
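  • These modes can be summarised in a small dispatcher; the mode names below are assumptions based on the bullets above:

```python
from enum import Enum

class MutingMode(Enum):
    AUTOMATIC = "automatic"     # first mode: follow the detected interaction status
    ALWAYS_ON = "always_on"     # microphone kept always unmuted
    ALWAYS_OFF = "always_off"   # microphone kept always muted

def resolve_mute(mode: MutingMode, interaction_status: str) -> bool:
    """Return True if the microphone 208 should be muted under the selected mode."""
    if mode is MutingMode.ALWAYS_ON:
        return False
    if mode is MutingMode.ALWAYS_OFF:
        return True
    return interaction_status == "low"   # AUTOMATIC: defer to the interaction status

DEFAULT_MODE = MutingMode.AUTOMATIC      # the controller defaults to the automatic mode
print(resolve_mute(DEFAULT_MODE, "high"))  # False -> unmuted
```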
  • the first videoconferencing terminal 100 and the second videoconferencing terminal 202 can be used as a permanent or “always on” communication tool between two sites.
  • the first videoconferencing terminal 100 is located in a first site such as a main office and the second videoconferencing terminal 202 is located in a second site such as a home office.
  • the first user 204 is one or more office workers in the office and the second user 206 is a remote user working from home. If the first videoconferencing terminal 100 and the second videoconferencing terminal 202 are always on, this can increase the feeling of being in a team and encourage spontaneous innovation discussion.
  • the first videoconferencing terminal 100 and the second videoconferencing terminal 202 are used as described in reference to the Figures for engagement detection to selectively actuate the microphone 208 and other components of the first videoconferencing terminal 100 and the second videoconferencing terminal 202. That is, the videoconferencing controller 110 is configured to automatically turn on sound and/or increase the sound volume and voice volume only when the first user 204 is looking at the first display 104. The videoconferencing controller 110 is configured to issue a control instruction for the first display 104 to be dimmed if the first user 204 does not look at the first display 104. This may help reduce distractions for the first user 204 and the second user 206, if needed. The videoconferencing controller 110 can be configured to draw the attention of the second user 206 who is not looking at the second user display 218.
  • the videoconferencing controller 110 is configured to issue a control instruction to the second videoconferencing terminal 202 to turn on sound and/or increase the volume of the second videoconferencing terminal 202 if the videoconferencing controller 110 detects that the first user 204 says the name of the second user 206 and/or looks at the second user 206 on the first display 104.
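  • A hedged sketch of this attention cue, combining a name mention with the gaze target (the name and window label are illustrative assumptions):

```python
def should_raise_remote_volume(transcript: str, gaze_target: str,
                               remote_user_name: str = "Alex") -> bool:
    """Turn on sound / raise the volume at the remote terminal when the local user
    says the remote user's name and/or is looking at that user's window."""
    name_mentioned = remote_user_name.lower() in transcript.lower()
    looking_at_remote_user = gaze_target == "video stream window of remote user"
    return name_mentioned or looking_at_remote_user

print(should_raise_remote_volume("Alex, can you hear this?", "unrelated application window"))  # True
```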
  • the videoconferencing controller 110 is configured to illustrate if a participant is muted or not.
  • the videoconferencing controller 110 is configured to add an indication such as icons, or decorations to participant tags that show if they can hear another user. That is, the videoconferencing controller 110 is configured to indicate to the first user 204 that a second user 206 can hear the first user 204.
  • the videoconferencing controller 110 is configured to manage a videoconference whereby one or more of the users 204, 206 are using a headset and/or using the conference system simultaneously.
  • the videoconferencing controller 110 is configured to illustrate to the second user 206, when the first user 204 is speaking, whether the first user 204 is muted, or whether the audio of the first user 204 captured by their microphone reaches the remote second videoconferencing terminal 202 of the second user 206 but the speaker system of the remote second videoconferencing terminal 202 does not work.
  • the videoconference controller 110 can therefore distinguish between a user interaction, e.g. the first user 204 muting themselves, and a hardware failure with the remote second videoconferencing terminal 202 of the second user 206.
  • the videoconference controller 110 is configured to base this determination on using the first user’s 204 microphone(s) to check for the expected audio signal.
  • the videoconferencing controller 110 can receive an audio signal from the first user 204 even if they are "muted”.
  • the videoconference controller 110 then issues a control signal to provide an indication to the second user 206 at the remote second videoconferencing terminal 202 whether the sound is off or their microphone is muted.
  • the videoconference controller 110 is configured to coordinate communicating terminals 100, 202 to determine if and where the chain of sound breaks irrespective of the type of microphone being used e.g. headset or videoconference terminal.
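  • A simple sketch of how such a determination might be expressed, using assumed status inputs rather than the disclosed signalling between terminals:

```python
def diagnose_audio_chain(sender_muted: bool, audio_level_at_sender: float,
                         audio_received_at_remote: bool, remote_speaker_ok: bool) -> str:
    """Classify where the chain of sound breaks, given coarse status flags."""
    if sender_muted and audio_level_at_sender > 0.0:
        return "sender is speaking but has muted themselves"
    if not audio_received_at_remote:
        return "audio is not reaching the remote terminal (microphone or network fault)"
    if not remote_speaker_ok:
        return "audio reaches the remote terminal but its speaker is not working"
    return "audio chain OK"

print(diagnose_audio_chain(True, 0.4, False, True))  # sender is speaking but has muted themselves
```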
  • two or more examples are combined.

Abstract

A method of video conferencing between a plurality of users at a first videoconferencing terminal and a second videoconferencing terminal comprises receiving one or more images of at least one first user at the first videoconferencing terminal. The method further comprises determining at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the received one or more images. The method further comprises sending a control signal to modify a mute status of a microphone and / or a speaker at the first videoconferencing terminal based on the determined changed interaction status.

Description

A VIDEOCONFERENCING METHOD AND SYSTEM WITH AUTOMATIC MUTING
Technical Field
The present disclosure relates to a method of videoconferencing and a videoconferencing system. In particular, the present disclosure relates to a method of videoconferencing which automatically changes the mute status of a videoconferencing terminal.
Background
Remote working is becoming increasingly important to employers and employees. For example, there is an increasing demand not to travel and face to face meetings are being replaced with alternatives such as videoconferencing.
One issue with videoconferencing is that participants often need to mute their microphones to prevent unwanted background noise from being recorded, and many participants therefore mute their microphone during a videoconference to prevent disruption. A problem with this is that participants may forget to change their mute status before or after speaking, which is inconvenient and irritating.
Summary
Examples of the present disclosure aim to address the aforementioned problems.
According to an aspect of the present disclosure there is a method of video conferencing between a plurality of users at a first videoconferencing terminal and a second videoconferencing terminal comprising: receiving one or more images of at least one first user at the first videoconferencing terminal; determining at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the received one or more images; and sending a control signal to modify a mute status of a microphone and / or a speaker at the first videoconferencing terminal based on the determined changed interaction status.
Optionally, the determining comprises detecting facial gestures indicative that the at least one first user is about to speak. Optionally, the control signal is configured to unmute the first videoconferencing terminal.
Optionally, the determining comprises detecting facial gestures indicative that the at least one first user is not speaking.
Optionally, the determining comprises determining that the at least one first user has not opened their mouth within a predetermined period of time.
Optionally, the determining comprises tracking a gaze of the at least one first user.
Optionally, the control signal is sent in response to detecting that the at least one first user is looking away from the first videoconferencing terminal.
Optionally, the determining comprises detecting that the gaze of the at least one first user is directed at another user currently interacting on the video conference.
Optionally, the control signal is sent in response to detecting that the at least one first user is looking at the other user currently interacting on the video conference.
Optionally, the control signal is configured to mute the first video conference terminal.
Optionally, the method comprises receiving a signal comprising an audio stream of the at least one first user at the first videoconferencing terminal.
Optionally, the method comprises analysing the audio stream and detecting at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the analysed audio stream.
Optionally, the analysing comprises detecting one or more keywords, phrases, sounds, or silence. Optionally, the method comprises issuing a prompt to the at least one first user at the first videoconferencing terminal when detecting that the at least one first user is speaking, and the first videoconferencing terminal is muted.
Optionally, the method comprises issuing a prompt to the at least one first user at the first videoconferencing terminal to modify the mute status of the first videoconferencing terminal in response to receiving the control signal.
According to another aspect of the present disclosure there is a video conferencing terminal comprising: at least one camera configured to capture one or more images of at least one first user at the videoconferencing terminal; a microphone configured to capture sounds of the at least one first user at the videoconferencing terminal; a speaker configured to generate sounds at the videoconferencing terminal; a controller comprising a processor, the controller configured to determine at least one change in the interaction status of the at least one first user at the videoconferencing terminal based on the received one or more images; and send a control signal to modify a mute status of the microphone and / or the speaker based on the determined changed interaction status.
According to yet another aspect of the present disclosure there is a method of videoconferencing between a plurality of participants at a first videoconferencing terminal and a second videoconferencing terminal comprising: receiving one or more images of at least one first user at the first videoconferencing terminal; determining at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the received one or more images; and sending an invitation for a private videoconferencing session from the at least one first user to at least one second user based on the determined changed interaction status of the at least one first user.
Optionally, the method comprising initiating the private videoconferencing session based on an acceptance to the invitation from the at least one second user.
Optionally, the determining comprises tracking a gaze of the at least one first user. Optionally, the determining comprises detecting that the gaze of the at least one first user is directed at the at least one second user on the videoconference.
Optionally, the detecting determines that the at least one first user moves towards the at least one second user.
Optionally, the detecting determines that the at least one first user issues a gesture and the invitation is sent to the at least one second user in response to detecting the gesture.
According to a further aspect of the present disclosure there is a videoconferencing terminal comprising: at least one camera configured to capture one or more images of at least one first user at the videoconferencing terminal; a controller comprising a processor, the controller configured to determine at least one change in the interaction status of the at least one user at the videoconferencing terminal based on the received one or more images; and send an invitation for a private videoconferencing session from the at least one first user to at least one second user based on the determined changed interaction status of the first user.
Brief Description of the Drawings
Various other aspects and further examples are also described in the following detailed description and in the attached claims with reference to the accompanying drawings, in which:
Figure 1 shows a schematic representation of a videoconferencing terminal according to an example;
Figures 2 to 5 show schematic representations of a videoconferencing system according to an example;
Figure 6 shows a flow diagram of the videoconferencing method according to an example;
Figure 7 shows a schematic representation of a videoconferencing system according to an example; and
Figure 8 shows a flow diagram of the videoconferencing method according to an example.
Detailed Description
Figure 1 shows a schematic view of a videoconferencing terminal 100 according to some examples. The videoconferencing terminal 100 is a first videoconferencing terminal 100 configured to be in communication with a second videoconferencing terminal 202 (as shown in Figure 2).
The first videoconferencing terminal 100 comprises a camera module 102 and a first display 104. The first videoconferencing terminal 100 selectively controls the activation of the camera module 102 and the first display 104. As shown in Figure 1, the camera module 102 and the first display 104 are controlled by a camera controller 106 and a display controller 108 respectively. As discussed in more detail below, the camera module 102 comprises one or more cameras.
The videoconferencing terminal 100 comprises a videoconferencing controller 110. The videoconferencing controller 110, the camera controller 106 and the display controller 108 may be configured as separate units, or they may be incorporated in a single unit.
The videoconferencing controller 110 comprises a plurality of modules for processing the videos and images received remotely from an interface 112 and videos and images captured locally. The interface 112 and the method of transmitting and receiving videoconferencing data are known and will not be discussed any further. In some examples, the videoconferencing controller 110 comprises a face detection module 114 for detecting facial features and an image processing module 116 for modifying a first user display image 220 to be displayed (as shown in Figure 2) on the first display 104.
In some examples, the videoconferencing controller 110 comprises an eye tracking module 118. The eye tracking module 118 can be part of the face detection module 114 or alternatively, the eye tracking module 118 can be a separate module from the face detection module 114. The face detection module 114, the image processing module 116, and the eye tracking module 118 will be discussed in further detail below. In some examples, the videoconferencing controller 110 comprises an audio processing module 122. The audio processing module 122 is configured to detect and analyse the audio signal received from a microphone 208 (as best shown in Figure 2).
One or all of the videoconferencing controller 110, the camera controller 106 and the display controller 108 may be at least partially implemented by software executed by a processing unit 120. The face detection module 114, the image processing module 116, the eye-tracking module 118, and the audio processing module 122 may be configured as separate units, or they may be incorporated in a single unit. One or all of the face detection module 114, the image processing module 116, the eye-tracking module 118, and the audio processing module 122, may be at least partially implemented by software executed by the processing unit 120.
The processing unit 120 may be implemented by special-purpose software (or firmware) run on one or more general-purpose or special-purpose computing devices. In this context, it is to be understood that each "element" or "means" of such a computing device refers to a conceptual equivalent of a method step; there is not always a one-to-one correspondence between elements/means and particular pieces of hardware or software routines. One piece of hardware sometimes comprises different means/elements. For example, a processing unit 120 may serve as one element/means when executing one instruction but serve as another element/means when executing another instruction. In addition, one element/means may be implemented by one instruction in some cases, but by a plurality of instructions in some other cases. Naturally, it is conceivable that one or more elements (means) are implemented entirely by analogue hardware components.
The processing unit 120 may include one or more processing units, e.g. a CPU ("Central Processing Unit"), a DSP ("Digital Signal Processor"), an ASIC ("Application-Specific Integrated Circuit"), discrete analogue and/or digital components, or some other programmable logical device, such as an FPGA ("Field Programmable Gate Array"). The processing unit 120 may further include a system memory and a system bus that couples various system components including the system memory to the processing unit. The system bus may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory may include computer storage media in the form of volatile and/or non-volatile memory such as read only memory (ROM), random access memory (RAM) and flash memory. The special-purpose software and associated control parameter values may be stored in the system memory, or on other removable/non-removable volatile/non-volatile computer storage media which is included in or accessible to the computing device, such as magnetic media, optical media, flash memory cards, digital tape, solid state RAM, solid state ROM, etc. The processing unit 120 may include one or more communication interfaces, such as a serial interface, a parallel interface, a USB interface, a wireless interface, a network adapter, etc., as well as one or more data acquisition devices, such as an A/D converter. The special-purpose software may be provided to the processing unit 120 on any suitable computer-readable medium, including a record medium, and a read-only memory.
As mentioned above, the first videoconferencing terminal 100 will be used together with another remote second videoconferencing terminal 202. In some examples, the first videoconferencing terminal 100 can be used with a plurality of second videoconferencing terminals 202.
In some examples, the first videoconferencing terminal 100 is a presenter videoconferencing terminal 100, but the user of the first videoconferencing terminal 100 may not necessarily be designated as a presenter in the videoconference. Indeed, any of the remote second users 206 can present material in the videoconference.
The first videoconferencing terminal 100 will now be discussed in more detail with respect to Figure 2. Figure 2 shows a schematic representation of a videoconferencing system 200. The first videoconferencing terminal 100 as shown in Figure 2 is the same as described in reference to Figure 1. In some examples, the second videoconferencing terminal 202 is the same as described in reference to Figure 1. In this way, a first user 204 can present to one or more remote second users 206. For the purposes of clarity, Figure 2 only shows one remote second user 206. However, in some examples there can be any number of remote second users 206. In other examples, there can also be any number of remote second videoconferencing terminals 202.
Optionally, the first videoconferencing terminal 100 comprises additional functionality compared to the second videoconferencing terminals 202. For example, the first videoconferencing terminal 100 can be a large touch screen e.g. a first display 104 comprising a touch sensing apparatus (not shown). Touch sensing apparatuses are known and will not be discussed in any further detail. In some examples, the second videoconferencing terminals 202 can be a laptop, desktop computer, tablet, smartphone, or any other suitable device. In some other examples, the first videoconferencing terminal 100 does not comprise a touch sensing apparatus and is e.g. a laptop.
Turning back to Figure 2, the face detection module 114, the image processing module 116, and the eye tracking module 118 will now be discussed in further detail. The videoconferencing terminal 100 comprises at least one first camera 210 for capturing image data of the first user 204. In some examples the camera module 102 comprises a first camera 210 and a second camera 212. As shown in Figure 2, the videoconferencing terminal 100 comprises a first camera 210 and a second camera 212 mounted to the first display 104. The first camera 210 and the second camera 212 are connected to the videoconferencing controller 110 and are configured to send image data of the first user 204 to the videoconferencing controller 110.
The first and second cameras 210, 212 are configured to capture images of the first user 204. The captured images of the first user 204 are used for determining a gaze direction G and other facial features of the first user 204 with respect to the displayed first user display image 220 on the first display 104. The captured images of the first user 204 can also be used for providing a video stream of the first user 204 to the remote second videoconferencing terminal 202 during the video conference. Optionally, the first and second cameras 210, 212 are cameras (e.g. RGB cameras) configured to capture colour images of the first user 204. Alternatively, the first and second cameras 210, 212 can be near-infrared cameras. Optionally the first videoconferencing terminal 100 comprises a third camera 224 for capturing images for the video stream of the first user 204. The third camera 224 as shown in Figure 2 is mounted on the top of the first display 104. In other examples, the third camera 224 can be mounted in any suitable position or orientation with respect to the first display 104 and the first user 204.
In some examples the first and second cameras 210, 212 are solely used for determining the gaze direction G and other facial features of the first user 204 and a third camera 224 is used for capturing the video stream of the first user 204. The third camera 224 is an RGB camera for capturing colour image data of the first user 204. Whilst Figure 2 shows a plurality (e.g. two) cameras 210, 212 for determining a gaze direction G and other facial features of the first user 204, the videoconferencing terminal 100 can comprise a single camera 210 for determining a gaze direction G and the facial features of the first user 204.
In some examples the first and second cameras 210, 212 are mounted on opposite sides of the first display 104. Figure 2 shows the first and second cameras 210, 212 mounted on the top two corners of the first display 104, but the first and second cameras 210, 212 can be mounted at any position on the first display 104. In other examples the first and second cameras 210, 212 can be mounted remote from the first display 104. For example, the first and second cameras 210, 212 can be mounted on the ceiling or the wall near the first display 104. By separating the first and second cameras 210, 212 by a large distance, the determination of the first user gaze direction G is more accurate.
In some other examples, the first and second cameras 210, 212 are mounted behind the first display 104. In this case, the first and second cameras 210, 212 are near-infrared cameras and the first display 104 is optically transmissive to the near-infrared light. In some examples, the first and second cameras 210, 212 comprise a first illumination source 222 of near-infrared light for illuminating the first user 204. As shown in Figure 2 the first illumination source 222 is mounted on the top of the first display 104, but the first illumination source 222 can be mounted in any suitable position. The source of illumination can be a near-infrared light source such as an LED mounted to the first and / or the second camera 210, 212. Alternatively, the first illumination source 222 is mounted on the first display 104 remote from the first and second cameras 210, 212.
The first videoconferencing terminal 100 comprises a microphone 208 configured to transmit an audio signal to the videoconferencing controller 110 during the videoconference. In this way, the microphone 208 is configured to detect sound such as the voice of the first user 204 and send the audio signal to the videoconferencing controller 110. The first videoconferencing terminal 100 comprises a speaker 236 configured to generate audio in dependence of a received audio signal from the videoconferencing controller 110 during the videoconference.
As shown in Figure 2, the microphone 208 is a separate microphone component remote from the first display 104. The microphone 208 can be a separate unit mounted in front of the first display 104 e.g. on a desk or worksurface. In other examples, the microphone 208 can be integrated into the first display 104, in the camera module 102 or in any of the first, second or third cameras 210, 212, 224. In further examples, there can be at least two microphones 208 for recording and transmitting e.g. a stereo audio signal. Figure 2 only shows a single microphone 208, but the functionality with respect to the single microphone 208 can be applied to a plurality of microphones connected to the first videoconferencing terminal 100. Similarly, the speaker 236 is a separate speaker component remote from the first display 104.
The first display 104 as shown in Figure 2 is showing a first user display image 220. The first user display image 220 comprises one or more image elements. The first user display image 220 comprises a video stream application window 226 of the second user 206. The video stream application window 226 can comprise any number of other second users 206 depending on the number of users participating in the videoconference. The first user display image 220 may further optionally comprise other application windows 228 configured to share material e.g. a slide deck to the second video conferencing terminal 202.
The first user display image 220 is duplicated on a second user display 218 on the remote second videoconferencing terminal 202. The second user video stream window 232 comprises a video stream of the first user 204. The videoconferencing controller 110 is configured to display the first user display image 220 on the first display 104 and the second user display image 216 on the second user display 218 as shown in Figure 2. The first user display image 220 and the second user display image 216 can be identical or share common elements e.g. an application window comprising a presentation.
In some examples, the videoconferencing controller 110 detects whether the first user 204 changes their interaction status in the videoconference with the second user 206.
A change in the interaction status of the first user 204 can be when the first user 204 starts speaking or when the first user 204 becomes silent during the videoconference. A change in the interaction status of the first user 204 means that the first user 204 increases or decreases their participation in the videoconference. A decrease in the participation of the first user 204 in the videoconference means that background noise picked up by the microphone 208 is more likely to disrupt the other participants e.g. the second user 206 in the videoconference. An increase in the participation of the first user 204 in the videoconference means that the first user 204 is more likely to speak which should be heard by the second user 206.
The change in interaction status is not limited to the first user 204 speaking or not speaking. A change in interaction status can be when the first user 204 looks away from the first display 104, moves away from the first display 104, or leaves or enters the room where the first videoconferencing terminal 100 is located. This means cues of the first user 204 such as eye movement, gaze direction, head movement, head orientation, body movement, and body orientation are detected and indicate how much the first user 204 is participating in the videoconference.
The videoconference controller 110 determining an interaction status of the first user 204 will now be discussed in further detail.
The videoconferencing controller 110 receives one or more images from the first and / or the second cameras 210, 212 of the first user 204 as shown in step 600 in Figure 6. Figure 6 shows a flow diagram of a method of videoconferencing according to some examples. The videoconferencing controller 110 then sends the one or more images to the face detection module 114. In one example, the face detection module 114 determines the orientation and position of the face of the first user 204 based on feature detection.
The face detection module 114 detects the position of the eyes 214 of the first user 204 in a received image. In this way, the face detection module 114 determines the facial gestures and the gaze direction of the first user 204 as shown in steps 602 and step 604 of Figure 6. The face detection module 114 uses feature detection on an image of the first user 204 to detect where the eyes 214 and the face of the first user 204 are with respect to the first display 104. For example, the face detection module 114 may determine that only one eye or no eyes 214 of the first user 204 are observable.
The face detection module 114 then sends a face detection signal or face position and/or face orientation information of the first user 204 to the videoconferencing controller 110.
The videoconferencing controller 110 then determines whether the first user 204 is looking at the first display 104 based on the received signal from the face detection module 114. If the videoconference controller 110 does not receive a face detection signal from the face detection module 114, then the videoconference controller 110 determines that the first user 204 is not looking at the first display 104.
In this way, the videoconferencing controller 110 is able to determine the general gaze direction G of the first user 204 based on a detection of the face of the first user 204. In other words, the videoconferencing controller 110 determines the gaze direction G in the sense of whether the first user 204 is looking at the first display 104 or not.
If the videoconferencing controller 110 determines that the first user gaze direction G is towards the first display 104, the videoconferencing controller 110 determines that the first user 204 is currently engaged with the videoconference. The videoconferencing controller 110 then determines that the interaction status of the first user 204 in the videoconference is “high”. This means that the videoconferencing controller 110 issues a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “unmute” as shown in step 610 of Figure 6. In other words, the microphone 208 is activated and transmits an audio signal to the videoconferencing controller 110 from the first user 204.
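By way of illustration, a minimal sketch of this Figure 6 flow (steps 600 to 610) is given below. It assumes that a face-detection result is already available from the face detection module 114; the data structure and the control-signal format are illustrative assumptions, not the claimed implementation.

```python
# A minimal sketch of the Figure 6 control flow: a face/gaze observation (step 600-604)
# is mapped to an interaction status (step 606), which drives a mute control signal
# (steps 608-610). Names and the message format are assumptions for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class FaceObservation:
    face_detected: bool          # face detection signal from the face detection module 114
    looking_at_display: bool     # coarse gaze direction G towards the first display 104

def determine_interaction_status(obs: Optional[FaceObservation]) -> str:
    """Step 606: map the face/gaze observation to a 'high' or 'low' interaction status."""
    if obs is None or not obs.face_detected:
        # No face detection signal received: the first user 204 is not looking at the display.
        return "low"
    return "high" if obs.looking_at_display else "low"

def mute_control_signal(status: str) -> dict:
    """Steps 608-610: build the control signal that modifies the mute status of microphone 208."""
    return {"target": "microphone_208", "mute": status == "low"}

if __name__ == "__main__":
    # One pass of the loop for a single received image (step 600).
    observation = FaceObservation(face_detected=True, looking_at_display=False)
    status = determine_interaction_status(observation)
    print(mute_control_signal(status))   # -> {'target': 'microphone_208', 'mute': True}
```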
When the videoconferencing controller 110 determines that the first user 204 is looking at the first display 104, in some examples the videoconferencing controller 110 can issue another signal to image processing module 116. The image processing module 116 can modify a display signal sent to the second videoconferencing terminal 202 indicating that the first user 204 is unmuted. For example, this can be represented by a microphone symbol 234 in the first user display image 220 as shown in Figure 2.
In some examples, the videoconferencing controller 110 automatically switches on the camera module 102 e.g. the third camera 224 when the first user 204 is unmuted. This means that if the first user 204 has switched off their third camera 224, the first user 204 is shown in a video stream when they start talking. This means that the videoconference is more engaging for all users and it is clearer who has started speaking in the videoconference.
When the videoconferencing controller 110 determines that the first user 204 is not looking at the first display 104 in step 606, the videoconferencing controller 110 then determines that the interaction status of the first user 204 in the videoconference is “low”. The videoconferencing controller 110 can then issue a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “mute” as shown in step 610. In other words, the microphone 208 is deactivated and prevents transmission of an audio signal to the videoconferencing controller 110 from the first user 204.
When the videoconferencing controller 110 determines that the first user 204 is not looking at the first display 104, in some examples the videoconferencing controller 110 can optionally issue a signal to image processing module 116. The image processing module 116 can modify a display signal sent to the second videoconferencing terminal 202 indicating that the first user 204 is muted. For example, this can be represented by a muted microphone symbol 300 in the first user display image 220 as shown in Figure 3. This may be helpful when the first user 204 has turned away from the videoconferencing terminal 100 and is speaking directly to the local participants in the same room as the first user 204. This means that private sidebar discussions by the first user 204 during a videoconference will not disrupt the videoconference for the second user 206.
In some examples, the videoconferencing controller 110 automatically switches off the camera module 102 e.g. the third camera 224 when the first user 204 is muted. This means that if the first user 204 wishes to turn off their third camera 224, the first user 204 is not shown in a video stream when they stop talking. This may help the first user 204 maintain some privacy during the videoconference when they have stopped speaking in the videoconference.
In some other examples, the videoconferencing controller 110 additionally or alternatively automatically switches off the speaker 236 when the first user 204 is muted.
As shown in Figure 3, the videoconferencing controller 110 can optionally issue a signal to the image processing module 116 to provide a mute alert 302 on the first display 104 as shown in step 612 in Figure 6. This means that the first user 204 can be prompted to change the mute status of the microphone 208.
When the videoconferencing controller 110 receives a positive response from the first user 204 with respect to the mute alert 302, the videoconference controller 110 carries out the steps 608 and 610 and changes the mute status of the microphone 208 e.g. muting the microphone 208.
In some examples, the videoconferencing controller 110 determines a more precise presenter gaze direction G. This will now be discussed in further detail.
As mentioned previously, the videoconferencing terminal 100 comprises a first illumination source 222 of near-infrared light configured to illuminate the first user 204. The infrared light is transmitted to the first user 204 and is reflected from the presenter's eyes 214. The first and second cameras 210, 212 detect the reflected light from the presenter's eyes 214.
The first and second cameras 210, 212 are configured to send one or more image signals to the videoconferencing controller 110 as shown in step 600. The videoconferencing controller 110 sends the image signals to the eye tracking module 118. Since the placement of the first and second cameras 210, 212 and the first illumination source 222 are known, the eye tracking module 118 determines through trigonometry the gaze direction G of the first user 204 as shown in step 604. Determining the first user gaze direction G from detection of reflected light from the eye 214 of the presenter is known e.g. as discussed in US 6,659,661 which is incorporated by reference herein.
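For illustration only, the toy sketch below shows one crude way such a geometric estimate could be formed from the stereo pair and the pupil-glint offsets; the baseline, focal length, and calibration gain are assumptions, and this is a strong simplification of the corneal-reflection methods cited above.

```python
# A toy numerical sketch of estimating a coarse gaze direction G from two calibrated
# cameras 210, 212 and the near-infrared illumination source 222. Real corneal-reflection
# gaze estimation (e.g. as in US 6,659,661) models the eye geometry much more carefully;
# all numbers below are illustrative assumptions.

import numpy as np

def triangulate_eye_position(p_left, p_right, baseline_m, focal_px):
    """Stereo triangulation of the eye 214: pixel disparity between the two rectified
    cameras (coordinates relative to each principal point) gives depth."""
    disparity = abs(p_left[0] - p_right[0])
    depth = focal_px * baseline_m / max(disparity, 1e-6)
    x = (p_left[0] / focal_px) * depth
    y = (p_left[1] / focal_px) * depth
    return np.array([x, y, depth])

def gaze_angles(pupil_px, glint_px, gain_deg_per_px=0.3):
    """Crude glint-pupil model: the offset between pupil centre and corneal glint is
    roughly proportional to the horizontal/vertical gaze angle (gain is assumed)."""
    dx, dy = pupil_px[0] - glint_px[0], pupil_px[1] - glint_px[1]
    return gain_deg_per_px * dx, gain_deg_per_px * dy   # (yaw, pitch) in degrees

if __name__ == "__main__":
    eye_xyz = triangulate_eye_position((520.0, 40.0), (-200.0, 40.0),
                                       baseline_m=1.2, focal_px=900.0)
    yaw, pitch = gaze_angles((122.0, 41.0), (118.0, 45.0))
    looking_at_display = abs(yaw) < 15.0 and abs(pitch) < 10.0
    print(eye_xyz, round(yaw, 1), round(pitch, 1), looking_at_display)
```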
Alternatively, in some examples, the videoconferencing controller 110 determines the direction of the face of the first user 204 based on feature detection. For example, the eye tracking module 118 determines the location of eyes 214 of the first user 204 with respect to the nose 230 from the received image signals. In this way, the eye tracking module 118 determines the first user gaze direction G as shown in step 604. Determining the first user gaze direction G from facial features is known e.g. as discussed in DETERMINING THE GAZE OF FACES IN IMAGES A. H. Gee and R. Cipolla, 1994 which is incorporated by reference herein.
Alternatively, in some other examples, the eye tracking module 118 determines the first user gaze direction G based on a trained neural network classifying the direction of the presenter's eyes 214 by processing the received one or more image signals from the first and second cameras 210, 212 as shown in step 604. Classifying the first user gaze direction G with a convolutional neural network is known e.g. as discussed in Real-time Eye Gaze Direction Classification Using Convolutional Neural Network, Anjith George and Aurobinda Routray, 2016, which is incorporated herein by reference.
The eye tracking module 118 determines the first user gaze direction G and sends a signal to the videoconferencing controller 110 comprising information relating to the presenter gaze direction G. By using a more accurate detection of the first user gaze direction G, the videoconferencing controller 110 is able to better determine when the interaction status of the first user 204 changes.
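As an illustration of the neural-network alternative, the sketch below defines a small convolutional classifier for coarse gaze direction. The architecture, input size, and class labels are assumptions for illustration and are not taken from the cited work.

```python
# A minimal PyTorch sketch of a convolutional classifier that labels an eye-region crop
# with one of a few coarse gaze directions. Untrained here: it shows structure only.

import torch
import torch.nn as nn

GAZE_CLASSES = ["display", "away_left", "away_right", "down"]  # illustrative labels

class GazeDirectionCNN(nn.Module):
    def __init__(self, num_classes: int = len(GAZE_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 64x64 greyscale eye crop -> 32 channels of 16x16 after two poolings
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

if __name__ == "__main__":
    model = GazeDirectionCNN().eval()           # random weights: structure only
    eye_crop = torch.rand(1, 1, 64, 64)         # stand-in for a cropped eye image
    with torch.no_grad():
        probs = model(eye_crop).softmax(dim=1)
    print(GAZE_CLASSES[int(probs.argmax())], probs.numpy().round(3))
```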
Once the videoconferencing controller 110 receives the information relating to the presenter gaze direction G, the videoconferencing controller 110 determines which part of the first user display image 220 on the first display 104 that the first user gaze direction G intersects. For example, the videoconferencing controller 110 can determine whether the first user 204 is looking at an application window comprising a video stream of the videoconference or another application window 500 unrelated to the videoconference. Accordingly, the videoconferencing controller 110 carries out steps 606, 608 and 610 as previously mentioned.
In some examples, the videoconferencing controller 110 determines that the first user gaze direction G is directed at the video stream application window 226 of the second user 206 e.g. as shown in Figure 5. The videoconferencing controller 110 then determines that since the first user 204 is looking at the video stream application window 226 of the second user 206, the interaction status of the first user 204 is “high”. The interaction status of the first user 204 may be determined to be high because the first user 204 may previously have been looking away from the first display 104. The videoconferencing controller 110 is then configured to send a control signal in response to detecting that the first user 204 is looking at the video stream application window 226 of the second user 206 currently interacting on the video conference. Accordingly, the videoconferencing controller 110 changes the mute status to “unmuted” and carries out steps 606, 608 and 610 as previously mentioned.
In some examples, the videoconferencing controller 110 determines that the first user gaze direction G is directed to an unrelated application window 500 e.g. as shown in Figure 5. The videoconferencing controller 110 then determines that since the first user 204 is looking at the unrelated application window 500, the interaction status of the first user 204 is “low”. The interaction status of the first user 204 may be determined to be low because the first user 204 may previously have been looking at the video stream application window 226. The videoconferencing controller 110 is then configured to send a control signal in response to detecting that the first user 204 is not looking at the video stream application window 226 and is not currently interacting on the video conference. Accordingly, the videoconferencing controller 110 changes the mute status to “muted” and carries out steps 606, 608 and 610 as previously mentioned.
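A minimal sketch of this window-level gaze test is given below; the window rectangles and the mapping from window to interaction status are illustrative assumptions.

```python
# A sketch of mapping an estimated gaze point on the first display 104 to the window it
# falls in (video stream application window 226 vs. an unrelated window 500), and then
# deriving the interaction status from that window.

from typing import Dict, Optional, Tuple

Rect = Tuple[int, int, int, int]   # (x, y, width, height) in display pixels

WINDOWS: Dict[str, Rect] = {
    "video_stream_226": (0, 0, 1280, 1080),        # second user's video stream
    "unrelated_window_500": (1280, 0, 640, 1080),  # e.g. an unrelated application
}

def window_under_gaze(gaze_xy: Tuple[int, int]) -> Optional[str]:
    gx, gy = gaze_xy
    for name, (x, y, w, h) in WINDOWS.items():
        if x <= gx < x + w and y <= gy < y + h:
            return name
    return None   # gaze is off the display entirely

def interaction_status_from_window(window: Optional[str]) -> str:
    # Looking at the videoconference window -> "high"; anything else -> "low".
    return "high" if window == "video_stream_226" else "low"

if __name__ == "__main__":
    for gaze in [(400, 500), (1500, 300), (5000, 5000)]:
        w = window_under_gaze(gaze)
        print(gaze, w, interaction_status_from_window(w))
```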
In some other examples, other techniques are additionally or alternatively used to detect indications of the first user 204 changing their interaction status on the videoconference.
In one example, the videoconference controller 110 sends an instruction to the audio processing module 122 to analyse the audio signal received from the microphone 208 as shown in step 614 in Figure 6. The audio processing module 122 analyses the audio signal for one or more keywords. In some examples, the audio processing module 122 is configured to carry out automatic speech recognition algorithms. For example, the audio processing module 122 comprises a trained neural network classifying the audio signal into significant words. Machine learning speech recognition is known (e.g. End-to-End Deep Neural Network for Automatic Speech Recognition by William Song and Jim Cai, which is incorporated by reference herein) and will not be discussed any further. The audio processing module 122 can determine whether the first user 204 is using specific words which are relevant to the videoconference. In this case, the audio processing module 122 sends a signal to the videoconference controller 110.
In response to the received signal from the audio processing module 122, the videoconferencing controller 110 then determines that the interaction status of the first user 204 in the videoconference is “high”. The videoconference controller 110 then carries out steps 606, 608 and 610. In this way, if the first user 204 starts talking and engaging with the videoconference, then the microphone 208 is unmuted.
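A minimal sketch of this keyword-based path is shown below. It assumes a speech recogniser has already produced a transcript chunk; the keyword list and the status mapping are purely illustrative.

```python
# A sketch of the audio-analysis path: the audio processing module 122 checks a
# recognised transcript chunk for meeting-relevant keywords and reports back.

import re

MEETING_KEYWORDS = {"agenda", "budget", "question", "slide", "deadline"}

def transcript_mentions_keywords(transcript: str, keywords=MEETING_KEYWORDS) -> bool:
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return bool(words & keywords)

def interaction_status_from_audio(transcript: str) -> str:
    # Relevant speech detected -> controller 110 treats the interaction status as "high".
    return "high" if transcript_mentions_keywords(transcript) else "low"

if __name__ == "__main__":
    print(interaction_status_from_audio("so about the budget for next quarter"))  # high
    print(interaction_status_from_audio("uh huh"))                                # low
```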
In some other examples, the face detection module 114 receives the images of the first user 204 from the first camera 210 and / or the second camera 212. The face detection module 114 detects a mouth 304 of the first user 204. In some examples, the face detection module 114 heuristically detects the mouth 304 of the first user 204. Heuristically detecting facial features is known (e.g. HEURISTIC-BASED AUTOMATIC FACE DETECTION by Geovany Ramirez, Vittorio Zanella, Olac Fuentes, 2003 which is incorporated by reference herein). Accordingly, the face detection module 114 can detect the mouth 304 of the first user 204. The face detection module 114 further detects the mouth orientation, mouth position and mouth movement as shown in step 602 of Figure 6. As shown in Figure 6, the other steps of detecting the first user gaze direction 604 and analysing the audio 614 are optional and may not be carried out.
In a first scenario as shown in Figure 3, the face detection module 114 determines that the mouth 304 of the first user 204 is closed and has remained closed for a predetermined period of time. Accordingly, the face detection module 114 sends a “mouth closed” signal to the videoconferencing controller 110. In this way, the videoconferencing controller 110 determines that the first user 204 is not talking and their interaction status in the video conference is currently “low” as shown in step 606.
The videoconferencing controller 110 can then issue a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “mute” as shown in step 610. In other words, the microphone 208 is deactivated and prevents transmission of an audio signal to the videoconferencing controller 110 from the first user 204.
In a second scenario as shown in Figure 4, the face detection module 114 determines that the mouth 304 of the first user 204 is open. The face detection module 114 may also detect that the mouth 304 of the first user 204 is moving e.g. an indication of speaking. Accordingly, the face detection module 114 sends a “mouth open” or a “mouth moving” signal to the videoconferencing controller 110. In this way, the videoconferencing controller 110 determines that the first user 204 is talking and their interaction status in the video conference is currently “high” as shown in step 606.
The videoconferencing controller 110 can then issue a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “unmute” as shown in step 610. In other words, the microphone 208 is activated and allows transmission of an audio signal to the videoconferencing controller 110 from the first user 204. In some other examples, the face detection module 114 alternatively detects the mouth 304 of the first user 204 with a neural network. It is known to use a convolutional neural network for automatic lip reading (e.g. Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory by Yuanyao Lu and Hongbo Li, 2019 which is incorporated by reference herein). This means that the face detection module 114 classifies different images of the first user 204 with either a mouth open or a mouth closed. In addition, the face detection module 114 can determine the particular lip and mouth 304 movement associated with the first user 204 being about to speak or using relevant words for the videoconference.
Similarly, the face detection module 114 sends a signal to the videoconferencing controller 110 indicating whether the first user’s mouth 304 is open or closed. The videoconferencing controller 110 can then issue a control signal to the microphone 208 as shown in step 608 and modifies the mute status of the microphone 208 to “mute” or “unmute” as shown in step 610.
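For illustration, the sketch below estimates whether the mouth 304 is open from mouth landmark points (the landmark detector itself is out of scope here) and only signals “mute” after the mouth has stayed closed for a predetermined period; the 0.35 openness threshold and the 5-second period are assumptions.

```python
# A sketch of the mouth-state path: mouth landmarks in, a mute/unmute/keep decision out,
# with a debounce so the microphone is only muted after the mouth has stayed closed.

import time
from typing import Optional, Tuple

Point = Tuple[float, float]

def mouth_openness(upper_lip: Point, lower_lip: Point,
                   left_corner: Point, right_corner: Point) -> float:
    """Ratio of vertical lip gap to mouth width; larger means more open."""
    gap = abs(lower_lip[1] - upper_lip[1])
    width = abs(right_corner[0] - left_corner[0]) or 1e-6
    return gap / width

class MouthMuteLogic:
    def __init__(self, open_threshold: float = 0.35, closed_period_s: float = 5.0):
        self.open_threshold = open_threshold
        self.closed_period_s = closed_period_s
        self._last_open = time.monotonic()

    def update(self, openness: float, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        if openness >= self.open_threshold:
            self._last_open = now
            return "unmute"                     # "mouth open"/"mouth moving" signal
        if now - self._last_open >= self.closed_period_s:
            return "mute"                       # closed for the predetermined period
        return "keep"                           # closed, but not yet long enough

if __name__ == "__main__":
    logic = MouthMuteLogic()
    ratio = mouth_openness((0.0, 10.0), (0.0, 22.0), (-15.0, 16.0), (15.0, 16.0))
    print(round(ratio, 2), logic.update(ratio))   # open mouth -> "unmute"
```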
In addition, the videoconferencing controller 110 can use the audio signal in addition to images of the first user 204 to determine the interaction status of the first user 204. The videoconferencing controller 110 can simply detect that the first user 204 is talking into the microphone 208. Alternatively, the videoconferencing controller 110 uses the audio processing module 122 as discussed above. In this way, the videoconferencing controller 110 implements the previously discussed steps 602, 604, 614, 606, 608 and 610 sequentially or in parallel.
Any of the methods for determining the first user’s 204 interaction status discussed with respect to different examples can be combined to provide additional certainty of the current first user 204 interaction status in the videoconference.
In some non-exhaustive examples, the videoconferencing controller 110 is configured to apply one or more rules for determining when to mute, e.g. (a sketch combining such rules follows this list):
• The videoconferencing controller 110 does not automatically mute the current speaker when all other participants are muted.
• The videoconferencing controller 110 does not automatically unmute a person when other participants are unmuted.
• The videoconferencing controller 110 mutes or unmutes a user in accordance with a user selection e.g. a user may have a visible checkbox for the activation or deactivation of automatic muting and unmuting.
• The videoconferencing controller 110 is configured to determine to mute or unmute based on the combination of gaze direction and speech.
• The videoconferencing controller 110 is configured to use signals received from two or more microphones (arrays) to determine the current speaker’s position and configured to use sound processing to filter out sound coming from other directions.
• The videoconferencing controller 110 is configured to use two or more microphones (arrays) to determine the speaker’s position in combination with gaze and mouth movement detection in order to increase the accuracy of automatic decisions.
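As referenced above, a minimal sketch combining several of these rules into a single mute decision is given below; the rule priorities, field names, and fallback behaviour are assumptions for illustration and not the claimed implementation.

```python
# A sketch combining several of the listed rules into one decision for microphone 208.

from dataclasses import dataclass

@dataclass
class ConferenceState:
    auto_mute_enabled: bool        # user-visible checkbox for automatic muting
    is_current_speaker: bool
    all_others_muted: bool
    looking_at_display: bool       # gaze-based signal
    speech_detected: bool          # audio-based signal

def decide_mute(state: ConferenceState, currently_muted: bool) -> bool:
    """Return the desired mute status (True = muted) for the local microphone 208."""
    if not state.auto_mute_enabled:
        return currently_muted                      # automatic mode switched off
    if state.is_current_speaker and state.all_others_muted:
        return False                                # never auto-mute the only speaker
    if state.speech_detected and state.looking_at_display:
        return False                                # gaze + speech -> unmute
    if not state.looking_at_display and not state.speech_detected:
        return True                                 # disengaged -> mute
    return currently_muted                          # ambiguous -> leave unchanged

if __name__ == "__main__":
    s = ConferenceState(True, False, False, looking_at_display=False, speech_detected=False)
    print(decide_mute(s, currently_muted=False))    # -> True (mute)
```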
When the videoconferencing controller 110, based on gaze and video information only, identifies a speaker that turns away from the video conferencing system 100, the videoconferencing controller 110 may determine that the speaker is conducting a private conversation.
However, the speaker may be continuing the general discussion, e.g. the first user 204 is addressing a local attendee. This may mean that the videoconferencing controller 110 determines that the first user 204 is having a "private side-conversation" and the videoconferencing controller 110 decides that the first user's 204 conversation should be muted. This may be undesirable if the first user 204 is continuing a discussion relevant to the videoconference.
In some examples, the videoconferencing controller 110 is configured to perform content analysis of the videoconference. In some examples, the videoconferencing controller 110 is configured to use AI techniques to determine that the decision to automatically mute has a higher likelihood of being correct. In some examples, the videoconferencing controller 110 is configured to simultaneously analyse video data from all the participants in the video conference. In this way the videoconferencing controller 110 is able to determine from the conversation, gestures, and mimicry of one or more of the participants who may want to speak and who may want to be muted. Furthermore, the videoconferencing controller 110 is configured to inform participants if technical problems are occurring on one side or the other that cannot be solved automatically by the videoconferencing controller 110. The videoconferencing controller 110 is configured to recognize one or more gestures and verbal orders issued by the participants to manage microphones, speakers, and volume etc.
In some further examples, the videoconferencing controller 110 is configured to detect if more than one microphone 208 is present in the same room. For example, a meeting room may have a large touch screen and one or more laptops sharing. This means that the videoconferencing controller 110 is able to avoid activation of audio feedback during the videoconference session by issuing a control instruction to activate only one microphone 208 in a particular location. Furthermore, the videoconferencing controller 110 is configured to use sound algorithms to detect audio feedback in order to mute one or more microphones 208 in the videoconference session. The videoconferencing controller 110 can also apply the same control to speakers 236 in the videoconference session. In some examples, the videoconferencing controller 110 is configured to detect if more than one speaker 236 is present in the same room (e.g. large touch screen and one or more laptops sharing) to avoid activation of audio feedback. The videoconferencing controller 110 may be configured to use sound algorithms to detect audio feedback in order to issue a control instruction to mute one or more speakers 236 in the videoconference session.
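By way of illustration, the sketch below flags microphones that are probably in the same room from the correlation of their captured audio, so that only one of them stays active; the correlation threshold and the keep-first policy are assumptions, not the claimed feedback-detection algorithm.

```python
# A sketch of same-room detection: strongly correlated microphone signals suggest the
# microphones share a room, and all but one of them are muted to avoid audio feedback.

import numpy as np

def normalized_correlation(a: np.ndarray, b: np.ndarray) -> float:
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    return float(np.abs(np.correlate(a, b, mode="valid")).max() / len(a))

def microphones_to_mute(mic_buffers: dict, threshold: float = 0.8) -> set:
    """Keep the first microphone of each correlated (same-room) pair, mute the rest."""
    names = list(mic_buffers)
    muted = set()
    for i, m1 in enumerate(names):
        for m2 in names[i + 1:]:
            if m2 in muted:
                continue
            if normalized_correlation(mic_buffers[m1], mic_buffers[m2]) > threshold:
                muted.add(m2)
    return muted

if __name__ == "__main__":
    t = np.linspace(0, 1, 8000)
    room_audio = np.sin(2 * np.pi * 220 * t)
    mics = {
        "touchscreen_mic_208": room_audio,
        "laptop_mic": 0.7 * room_audio + 0.05 * np.random.randn(t.size),  # same room
        "remote_mic": np.random.randn(t.size),                            # elsewhere
    }
    print(microphones_to_mute(mics))   # -> {'laptop_mic'}
```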
Turning to Figures 5, 7 and 8, another example will now be described. Figures 5 and 7 show a schematic representation of a videoconferencing system 200. Figure 8 shows a method of videoconferencing according to an example.
In some circumstances, the first user 204 may wish to have a breakout session with the second user 206 separate from the original videoconference. Alternatively, the first user 204 may want a private meeting with one or more current participants of the ongoing videoconference. If this occurs, it is possible to use the existing techniques for muting and unmuting the first user 204 to establish the private session between the first user 204 and the second user 206.
In some examples, the videoconferencing controller 110 is configured to use high level muting. In this way the videoconferencing controller 110 is configured to send the audio stream of the videoconferencing session to all involved processing units (e.g. the first and second videoconferencing terminals 100, 202 and all the other participating videoconferencing terminals). The videoconferencing controller 110 is configured to issue control instructions to each of the participating videoconferencing terminals 100, 202 to determine and control whether or not to output the audio on the activated speakers 236 of the participating videoconferencing terminals 100, 202.
This means that the videoconferencing controller 110 can selectively control and signal which of the microphones 208 and speakers 236 of the participating videoconferencing terminals 100, 202 should be activated or not. This enables the videoconferencing controller 110 to control the different simultaneous audio channels within the videoconferencing session. For example, the videoconferencing controller 110 is configured to issue control instructions such that some of the participating videoconferencing terminals 100, 202 receive audio in respect of a private channel whilst at the same time other participating videoconferencing terminals receive the audio for the videoconferencing session. In other words, the “public” video conferencing may proceed at the same time as the private breakout session.
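A minimal sketch of this per-terminal routing is given below: every terminal receives the audio, and a control instruction tells it which channel its speaker outputs and where its microphone is forwarded; the terminal names and the message format are assumptions.

```python
# A sketch of the "high level muting" idea: audio goes everywhere, but per-terminal
# instructions from the controller decide which channel each terminal plays and feeds.

from typing import Dict, List

def build_output_instructions(terminals: List[str],
                              private_members: List[str]) -> Dict[str, dict]:
    """Every terminal receives all audio; the instruction says what its speaker outputs."""
    instructions = {}
    for term in terminals:
        in_private = term in private_members
        instructions[term] = {
            "speaker_channel": "private" if in_private else "public",
            "forward_mic_to": "private" if in_private else "public",
        }
    return instructions

if __name__ == "__main__":
    terminals = ["terminal_100", "terminal_202", "terminal_203", "terminal_204"]
    # First and second users break out into a private session; the rest continue publicly.
    for term, instr in build_output_instructions(terminals,
                                                 ["terminal_100", "terminal_202"]).items():
        print(term, instr)
```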
As shown in Figures 5 and 7, the videoconferencing system 200 and the first and second videoconferencing terminals 100, 202 are identical to the videoconferencing terminals discussed in reference to the previous Figures.
Figure 8 is identical to the method described in Figure 6 except that the last two steps have been modified.
Accordingly, the face detection module 114 determines the first user gaze direction G and sends a signal to the videoconferencing controller 110. The videoconferencing controller 110 determines that the first user 204 is looking at the video stream application window 226 of the second user 206. Depending on the setup of the videoconferencing controller 110 and / or the first video conferencing terminal 100, the videoconferencing controller 110 sends an invitation signal to the second videoconferencing terminal 202 in response to the determination that the first user 204 is looking at the video stream application window 226 of the second user 206. Step 800 in Figure 8 shows the step of the videoconferencing controller 110 sending the invitation to the second videoconferencing terminal 202. In order to prevent accidental invitations being sent to the second user 206, the videoconferencing controller 110 may issue a prompt as shown in step 806. The step of issuing a prompt is similar to the previously discussed step 612 in Figure 6. Similarly, the first user 204 has to accept and approve before the videoconferencing controller 110 sends the invitation to the second user 206.
The videoconferencing controller 110 may then establish a private session between the first and second user 204, 206 as shown in step 802. In some examples, the videoconferencing controller 110 establishes the private session on receipt of an acceptance from the second user 206 to the invitation.
Once the private session has been established between the first and second user 204, 206, the videoconferencing controller 110 mutes the first and second user 204, 206 in the original videoconference. The videoconferencing controller 110 does not change the mute status of the microphone 208 in order to allow the first and second user 204, 206 to talk in the private session. Instead, the videoconferencing controller 110 prevents the audio from the microphone 208 being sent to any other participants in the original videoconference. Alternatively, the videoconferencing controller 110 lowers the volume of the audio captured from the microphone 208. This means that the captured audio from the private discussion from the first and second user 204, 206 is less likely to disrupt the original videoconference.
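A sketch of this Figure 8 flow is shown below: sustained gaze at a participant's window triggers a prompt (step 806), an invitation (step 800) and, on acceptance, a private session (step 802); the dwell time and the callback interfaces are assumptions for illustration.

```python
# A sketch of the private-session flow: gaze dwell -> prompt -> invitation -> session.

import time
from typing import Callable, Optional

class PrivateSessionFlow:
    def __init__(self, prompt_user: Callable[[str], bool],
                 send_invitation: Callable[[str], bool],
                 dwell_s: float = 2.0):
        self.prompt_user = prompt_user            # returns True if the first user approves
        self.send_invitation = send_invitation    # returns True if the second user accepts
        self.dwell_s = dwell_s
        self._gaze_started: Optional[float] = None

    def on_gaze_sample(self, target: Optional[str],
                       now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        if target is None:
            self._gaze_started = None
            return None
        if self._gaze_started is None:
            self._gaze_started = now
        if now - self._gaze_started < self.dwell_s:
            return None                            # not looked long enough yet
        if not self.prompt_user(target):           # step 806: avoid accidental invitations
            self._gaze_started = None
            return None
        if self.send_invitation(target):           # steps 800 and 802
            return f"private session established with {target}"
        return None

if __name__ == "__main__":
    flow = PrivateSessionFlow(prompt_user=lambda t: True, send_invitation=lambda t: True)
    flow.on_gaze_sample("second_user_206", now=0.0)
    print(flow.on_gaze_sample("second_user_206", now=2.5))
```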
Additionally or alternatively, the face detection module 114 determines that the first user 204 is moving their head or body (as shown by arrow B in Figure 7) towards the first display 104. For example, the face detection module 114 detects that the first user 204 is physically leaning towards the video stream application window 226 of the second user 206. The physical movement of the first user 204 is determined by the videoconferencing controller 110 as the first user’s 204 intention to establish a new private session between the first user 204 and the second user 206.
In some examples, the videoconferencing controller 110 can apply one or more muting modes to the videoconferencing terminal 100. In a first “automatic” mode, the videoconferencing controller 110 applies the functionality to the microphone 208 as described in the examples shown in Figures 1 to 8. In a second mode, the videoconferencing controller 110 keeps the microphone 208 always on. In a third mode, the videoconferencing controller 110 keeps the microphone 208 always off. In some examples, the videoconferencing controller 110 defaults to the first automatic mode.
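A small sketch of these three modes is given below, with the automatic mode deferring to the interaction-status decision; the mode names are illustrative.

```python
# A sketch of the three muting modes, defaulting to the automatic decision.

from enum import Enum, auto

class MuteMode(Enum):
    AUTOMATIC = auto()    # default: controller decides from the interaction status
    ALWAYS_ON = auto()    # microphone never auto-muted
    ALWAYS_OFF = auto()   # microphone stays muted

def resolve_mute(mode: MuteMode, automatic_decision: bool) -> bool:
    """True means the microphone 208 should be muted."""
    if mode is MuteMode.ALWAYS_ON:
        return False
    if mode is MuteMode.ALWAYS_OFF:
        return True
    return automatic_decision     # MuteMode.AUTOMATIC

if __name__ == "__main__":
    print(resolve_mute(MuteMode.AUTOMATIC, automatic_decision=True))   # -> True
```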
In another example, the first videoconferencing terminal 100 and the second videoconferencing terminal 202 can be used as a permanent or “always on” communication tool between two sites. The first videoconferencing terminal 100 is located in a first site such as a main office and the second videoconferencing terminal 202 is located in a second site such as a home office. In this way, the first user 204 is one or more office workers in the office and the second user 206 is a remote user working from home. If the first videoconferencing terminal 100 and the second videoconferencing terminal 202 are always on, this can increase the feeling of being in a team and encourage spontaneous innovation discussion.
In some examples, the first videoconferencing terminal 100 and the second videoconferencing terminal 202 are used as described in reference to the Figures for engagement detection to selectively actuate the microphone 208 and other components of the first videoconferencing terminal 100 and the second videoconferencing terminal 202. That is, the videoconferencing controller 110 is configured to automatically turn on sound and / or increase the sound volume and voice volume only when the first user 204 is looking at the first display 104. The videoconferencing controller 110 is configured to issue a control instruction for the first display 104 to be dimmed if the first user 204 does not look at the first display 104. This may help reduce distractions for the first user 204 and the second user 206, if needed. The videoconferencing controller 110 can be configured to draw the attention of the second user 206 who is not looking at the second display 218. The videoconferencing controller 110 is configured to issue a control instruction to the second videoconferencing terminal 202 to turn on sound and / or increase the volume of the second videoconferencing terminal 202 if the videoconferencing controller 110 detects that the first user 204 says the name of the second user 206 and/or looks at the second user 206 on the first display 104.
In some examples, the videoconferencing controller 110 is configured to illustrate if a participant is muted or not. The videoconferencing controller 110 is configured to add an indication such as icons, or decorations to participant tags that show if they can hear another user. That is, the videoconferencing controller 110 is configured to indicate to the first user 204 that a second user 206 can hear the first user 204.
The videoconferencing controller 110 is configured to manage a videoconference whereby one or more of the users 204, 206 are using a headset and/or using the conference system simultaneously. The videoconferencing controller 110 is configured to illustrate to the second user 206, when the first user 204 is speaking, whether the first user 204 is muted, or whether the audio of the first user 204 captured by their microphone reaches the remote second videoconferencing terminal 202 of the second user 206 but the speaker system of the remote second videoconferencing terminal 202 of the second user 206 does not work.
The videoconference controller 110 can therefore distinguish between a user interaction, e.g. the first user 204 muting themselves, and a hardware failure of the remote second videoconferencing terminal 202 of the second user 206. The videoconference controller 110 is configured to base this determination on using the first user's 204 microphones to check for the expected audio signal. The videoconference controller 110 can receive an audio signal from the first user 204 even if they are "muted". The videoconference controller 110 then issues a control signal to provide an indication to the second user 206 at the remote second videoconferencing terminal 202 whether the sound is off, or their microphone is muted.
The videoconference controller 110 is configured to coordinate communicating terminals 100, 202 to determine if and where the chain of sound breaks irrespective of the type of microphone being used e.g. headset or videoconference terminal. In another example, two or more examples are combined. Features of one example can be combined with features of other examples. Examples of the present disclosure have been discussed with particular reference to the examples illustrated. However, it will be appreciated that variations and modifications may be made to the examples described within the scope of the disclosure.

Claims

1. A method of video conferencing between a plurality of users at a first videoconferencing terminal and a second videoconferencing terminal comprising: receiving one or more images of at least one first user at the first videoconferencing terminal; determining at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the received one or more images; and sending a control signal to modify a mute status of a microphone and / or a speaker at the first videoconferencing terminal based on the determined changed interaction status.
2. A method according to claim 1 wherein the determining comprises detecting facial gestures indicative that the at least one first user is about to speak.
3. A method according to claim 1 or 2 wherein the control signal is configured to unmute the first videoconferencing terminal.
4. A method according to any of claims 1 to 3 wherein the determining comprises detecting facial gestures indicative that the at least one first user is not speaking.
5. A method according to claim 4 wherein the determining comprises determining that the at least one first user has not opened their mouth within a predetermined period of time.
6. A method according to claim 4 or 5 wherein the determining comprises tracking a gaze of the at least one first user.
7. A method according to claim 6 wherein the control signal is sent in response to detecting that the at least one first user is looking away from the first videoconferencing terminal.
8. A method according to claim 6 wherein the determining comprises detecting that the gaze of the at least one first user is directed at another user currently interacting on the video conference.
9. A method according to claim 8 wherein the control signal is sent in response to detecting that the at least one first user is looking at the other user currently interacting on the video conference.
10. A method according to any of claims 1 to 9 wherein the control signal is configured to mute the first videoconferencing terminal.
11. A method according to any of the preceding claims wherein the method comprises receiving a signal comprising an audio stream of the at least one first user at the first videoconferencing terminal.
12. A method according to claim 11 wherein the method comprises analysing the audio stream and detecting at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the analysed audio stream.
13. A method according to claim 12 wherein the analysing comprises detecting one or more keywords, phrases, sounds, or silence.
14. A method according to claim 13 wherein the method comprises issuing a prompt to the at least one first user at the first videoconferencing terminal when detecting that the at least one first user is speaking and the first videoconferencing terminal is muted.
15. A method according to any of the preceding claims wherein the method comprises issuing a prompt to the at least one first user at the first videoconferencing terminal to modify the mute status of the first videoconferencing terminal in response to receiving the control signal.
16. A video conferencing terminal comprising: at least one camera configured to capture one or more images of at least one first user at the videoconferencing terminal; a microphone configured to capture sounds of the at least one first user at the videoconferencing terminal; a speaker configured to generate sounds at the videoconferencing terminal; a controller comprising a processor, the controller configured to determine at least one change in the interaction status of the at least one first user at the videoconferencing terminal based on the received one or more images; and send a control signal to modify a mute status of the microphone and / or the speaker based on the determined changed interaction status.
17. A method of videoconferencing between a plurality of participants at a first videoconferencing terminal and a second videoconferencing terminal comprising: receiving one or more images of at least one first user at the first videoconferencing terminal; determining at least one change in the interaction status of the at least one first user at the first videoconferencing terminal based on the received one or more images; and sending an invitation for a private videoconferencing session from the at least one first user to at least one second user based on the determined changed interaction status of the at least one first user.
18. A method according to claim 17 wherein the method comprises initiating the private videoconferencing session based on an acceptance of the invitation from the at least one second user.
19. A method according to claims 17 or 18 wherein the determining comprises tracking a gaze of the at least one first user.
20. A method according to claim 19 wherein the determining comprises detecting that the gaze of the at least one first user is directed at the at least one second user on the videoconference.
21. A method according to any of claims 17 to 20 wherein the determining comprises detecting that the at least one first user moves towards the at least one second user.
22. A method according to any of claims 17 to 21 wherein the determining comprises detecting that the at least one first user issues a gesture, and the invitation is sent to the at least one second user in response to detecting the gesture.
23. A videoconferencing terminal comprising: at least one camera configured to capture one or more images of at least one first user at the videoconferencing terminal; a controller comprising a processor, the controller configured to determine at least one change in the interaction status of the at least one first user at the videoconferencing terminal based on the received one or more images; and send an invitation for a private videoconferencing session from the at least one first user to at least one second user based on the determined changed interaction status of the at least one first user.
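By way of illustration only, and not forming part of the claims, the image-based determination recited in claims 1 to 10 can be sketched in Python as follows. The detector hooks mouth_is_open and gaze_is_on_terminal are hypothetical placeholders for, e.g., a neural-network facial-gesture detector and a gaze tracker, and the ten-second value for the predetermined period of time is an assumed figure.

import time
from dataclasses import dataclass, field


@dataclass
class TerminalState:
    muted: bool = True
    last_mouth_open_ts: float = field(default_factory=time.monotonic)


def mouth_is_open(frame) -> bool:
    # Placeholder for a facial-gesture detector (e.g. a neural network);
    # True when the detected mouth landmarks indicate an open mouth.
    return bool(frame.get("mouth_open", False))


def gaze_is_on_terminal(frame) -> bool:
    # Placeholder for a gaze tracker; True when the user looks at the display.
    return bool(frame.get("gaze_on_terminal", True))


SILENCE_TIMEOUT_S = 10.0  # assumed value for the predetermined period of time


def control_signal_for(frames, state: TerminalState):
    """Map a change in interaction status, derived from images, to a control signal."""
    frame = frames[-1]

    if mouth_is_open(frame):
        state.last_mouth_open_ts = time.monotonic()
        if state.muted:
            return "unmute"   # facial gesture indicates the user is about to speak
    else:
        idle = time.monotonic() - state.last_mouth_open_ts
        looking_away = not gaze_is_on_terminal(frame)
        if not state.muted and (idle > SILENCE_TIMEOUT_S or looking_away):
            return "mute"     # user is not interacting with the videoconference
    return None               # no change in interaction status

Under these assumptions, such a controller would emit "unmute" when a muted user appears about to speak, and "mute" when an unmuted user has been silent for the predetermined period or looks away from the terminal, which corresponds to the behaviour recited in claims 2 to 10.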
PCT/SE2022/050848 2021-09-24 2022-09-23 A videoconferencing method and system with automatic muting WO2023048632A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SE2130256 2021-09-24
SE2130256-7 2021-09-24

Publications (1)

Publication Number Publication Date
WO2023048632A1 true WO2023048632A1 (en) 2023-03-30

Family

ID=85719570

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2022/050848 WO2023048632A1 (en) 2021-09-24 2022-09-23 A videoconferencing method and system with automatic muting

Country Status (1)

Country Link
WO (1) WO2023048632A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080297589A1 (en) * 2007-05-31 2008-12-04 Kurtz Andrew F Eye gazing imaging for video communications
US20130002802A1 (en) * 2011-06-28 2013-01-03 Mock Wayne E Accessing Settings of a Videoconference Using Touch-Based Gestures
US20170264864A1 (en) * 2006-03-18 2017-09-14 Steve H MCNELLEY Advanced telepresence environments
US20200045261A1 (en) * 2018-08-06 2020-02-06 Microsoft Technology Licensing, Llc Gaze-correct video conferencing systems and methods
US20200110572A1 (en) * 2018-10-08 2020-04-09 Nuance Communications, Inc. System and method for managing a mute button setting for a conference call
WO2020153890A1 (en) * 2019-01-25 2020-07-30 Flatfrog Laboratories Ab A videoconferencing terminal and method of operating the same
WO2021011083A1 (en) * 2019-07-18 2021-01-21 Microsoft Technology Licensing, Llc Dynamic detection and correction of light field camera array miscalibration
US20220191638A1 (en) * 2020-12-16 2022-06-16 Nvidia Corporation Visually tracked spatial audio


Similar Documents

Publication Publication Date Title
US10776073B2 (en) System and method for managing a mute button setting for a conference call
US10499136B2 (en) Providing isolation from distractions
US9154730B2 (en) System and method for determining the active talkers in a video conference
US10586131B2 (en) Multimedia conferencing system for determining participant engagement
US10142483B2 (en) Technologies for dynamic audio communication adjustment
US20100085415A1 (en) Displaying dynamic caller identity during point-to-point and multipoint audio/videoconference
US20150002611A1 (en) Computer system employing speech recognition for detection of non-speech audio
US11405584B1 (en) Smart audio muting in a videoconferencing system
NO341316B1 (en) Method and system for associating an external device to a video conferencing session.
TW201543902A (en) Muting a videoconference
JP6149433B2 (en) Video conference device, video conference device control method, and program
US20240064081A1 (en) Diagnostics-Based Conferencing Endpoint Device Configuration
WO2023048632A1 (en) A videoconferencing method and system with automatic muting
US20220360635A1 (en) Intelligent configuration of personal endpoint devices
US20220308825A1 (en) Automatic toggling of a mute setting during a communication session
JP2009060220A (en) Communication system and communication program
JP7292343B2 (en) Information processing device, information processing method and information processing program
US20240094976A1 (en) Videoconference Automatic Mute Control System
US11877130B2 (en) Audio controls in online conferences
TW201906404A (en) Method of switching videoconference signals and the related videoconference system
CN118056397A (en) Automatic mute control system for video conference
Zhang et al. Fusing array microphone and stereo vision for improved computer interfaces
JP2023118335A (en) Communication terminal, communication system, and communication server
JP2019140517A (en) Information processing device and program
JP2018205470A (en) Interaction device, interaction system, interaction method and program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22873283

Country of ref document: EP

Kind code of ref document: A1