WO2024145189A1 - System and method for multi-camera control and capture processing - Google Patents

System and method for multi-camera control and capture processing

Info

Publication number
WO2024145189A1
WO2024145189A1 PCT/US2023/085537 US2023085537W
Authority
WO
WIPO (PCT)
Prior art keywords
camera
control apparatus
roi
captured
video
Prior art date
Application number
PCT/US2023/085537
Other languages
French (fr)
Inventor
Peng Sun
Xiwu Cao
Hung Khei Huang
Stephanie Ann Suzuki
Don Francis Purpura
Original Assignee
Canon U.S.A., Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon U.S.A., Inc. filed Critical Canon U.S.A., Inc.
Publication of WO2024145189A1 publication Critical patent/WO2024145189A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/63Control of cameras or camera modules by using electronic viewfinders
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/90Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working

Definitions

  • a control apparatus and method for controlling an online meeting includes one or more processors; and one or more memories storing instructions that, when executed, configure the one or more processors to receive, from a first camera, a captured video of a meeting room; determine from the captured video that a predetermined gesture is performed; control a second camera to perform image capture of a region surrounding a position in the captured video where the gesture was determined to be performed; and transmit the captured image of the region for display in a user interface.
  • an image capture message is communicated to the second camera including one or more camera parameters and the second camera is controlled to perform image capture using the one or more camera parameters.
  • the one or more image capture parameters are translated based on pre-capture calibration that identifies a positional relationship between the first camera and the second camera.
  • Figure 6 is a flowchart illustrating one or more processes shown in Figure 2.
  • Figure 8A illustrates a first presenter switching process.
  • Figure 10B illustrates a display screen generated by the control apparatus.
  • Figure 11 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
  • Figure 12 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
  • Figure 15 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
  • Figure 19 illustrates exemplary hardware configuration.
  • Figures 20A - 20C illustrate hand gestures for capturing information.
  • Figure 21 is a flow chart illustrating hand gesture capture processing.
  • Figure 22 is a flow chart illustrating hand gesture capture processing.
  • Figure 23 illustrates the system architecture according to the present disclosure.
  • Figure 24 illustrates a flow diagram according to the present disclosure.
  • Figure 26 illustrates a flow diagram according to the present disclosure.
  • the control apparatus 103 receives, from the camera 102, a captured video of the meeting room 101.
  • the camera 102 is set in the Meeting room 101, performs real-time image capturing in the Meeting room 101, and transmits the captured video to the control apparatus 103.
  • the control apparatus 103 transmits, via the second server 105, to the client computers 106 and 107, the name and position information 120 which contains name and position of each region.
  • the control apparatus 103 communicates with the second server 105 based on a second communication protocol (e.g. HTTP) on which a bit rate of a media content is not changed according to an available bandwidth of a communication path between the control apparatus 103 and the second server 105.
  • Figure 4 illustrates an example of the name and position information 120 which is provided by the control apparatus 103 to the client computers 106 and 107.
  • the control apparatus 103 may also specify the type 405 of each video stream based on face detection process, presenter designation process, ROI designation process, whiteboard detection process and the like. For example, if the face detection process detects a region, the type 405 of the region may be “Attendee”, and if the whiteboard detection process detects a region, the type 405 of the region may be “Whiteboard”.
  • the control apparatus 103 obtains the Name 406 of “Attendee” from the user recognition service 121 which performs the user recognition process using the database 122 as described above.
  • the name 406 of “Whiteboard” is determined according to the detection or designation order. For example, the name 406 of a whiteboard that is detected (or designated) first may be “Whiteboard A” and the name 406 of a whiteboard that is detected (or designated) second may be “Whiteboard B”.
  • the control apparatus 103 transmits, via the second server 105, to the client computers 106 and 107, (i) a still image 117 of the Whiteboard A 111 which is cropped from a video frame of the meeting room 101, (ii) a still image 118 of the Whiteboard B 112 which is cropped from a video frame of the meeting room 101 and (iii) a still image 119 of the ROI (Region Of Interest) which is cropped from a video frame of the meeting room 101.
  • the detailed explanation of the process for whiteboard is provided later with reference to Figure 5, and the detailed explanation of the process for ROI is provided later with reference to Figure 3.
  • the client computer 106 receives, via the second server 105, from the control apparatus 103, the name and position information 120 which contains name and position of each region.
  • Figure 4 illustrates an example of the name and position information which is provided by the control apparatus 103 to the client computers 106 in T102.
  • the client computer 106 may determine the display order of the HR image icons such that the HR image icon corresponding to the oldest HR image is located at the left most position within the menu region 917 and the HR image icon corresponding to the second oldest HR image is located at the second from the left within the menu region 917.
  • the client computer 106 determines whether to leave the online meeting. In the present exemplary embodiment, when the user of the client computer 106 clicks or taps the leave button 918 on the online meeting window 900, the client computer 106 determines to leave the online meeting. In addition, the client computer 106 may determine to leave the online meeting when the control apparatus 103 informs the client computer 106 that the meeting is over. If the client computer 106 determines not to leave the online meeting, flow returns to T101 - T103. If the client computer 106 determines to leave the online meeting, or the meeting is over, flow proceeds to END.
  • after the ROI designation in S108, the control apparatus 103 periodically crops the ROI within a video frame from the camera 102 and the ROI is provided to the client computers 106 and 107 via the second server 105.
  • the one or more camera parameters further include a reposition time value that represents an amount of time it will take the camera to move into the correct position and capture the particular ROI as determined by the detected gesture.
  • the reposition time value can be determined by calculating an X and Y reposition distance, which is possible because the position of the camera is known, as is the target position representing the ROI. This distance is then multiplied by a constant factor representing the relocation time per unit of distance (i.e., 1 ms per pixel). The result is the reposition time value that represents the estimated time it would take the camera to reposition to the new location to capture the still image of the ROI (a numerical sketch of this computation appears after this list).
  • after calculating the one or more camera parameters, the control apparatus 103 causes the live video data of the ROI being captured during the live view ROI capture processing to be sent to one or more video buffers in 2106.
  • the control apparatus 103 causes the live view ROI video data in the buffer to be output, in 2107, at a frame rate less than a current frame rate at which the live view ROI is being captured.
  • the control apparatus 103 controls the display of two views of the ROI: one is the cropped live-view video of the ROI having a first resolution, and the second is the still image capture performed by PTZ optical zoom of the ROI, which is at a second resolution higher than the first resolution.
  • the optical zoom has the capability of further digital zoom to see additional detail.
  • the synchronization is to allow the live-view ROI to track the digital zoom of the second view. This synchronization can be performed based on the one or more camera parameters that control the camera repositioning to obtain the still image of the ROI.
  • the zoom parameters of ROI-1 and the crop of the live view (ROI-2) would be nearly the same, so the digital zoom/pan of ROI-1 (still image capture of the ROI) can be mirrored by performing the same digital zoom/pan on ROI-2 (live view ROI).
  • the control apparatus 103 determines whether a whiteboard region is detected.
  • the control apparatus 103 may detect whiteboard regions based on an image recognition process and/or based on user operations. A user is able to designate the four corners of a whiteboard to designate a whiteboard region. If the control apparatus 103 determines that the whiteboard region is not detected, flow proceeds to S110. If the control apparatus 103 determines that the whiteboard region is detected, flow proceeds to B102.
  • the control apparatus 103 highlights the whiteboard region thereby a user of the control apparatus 103 is able to see which region is designated as the whiteboard region.
  • Figure 9A illustrates a state where a user designates the four corners 124, 125, 126 and 127 of a certain whiteboard region 112 in B101.
  • Figure 9B illustrates a state where the designated whiteboard region 112 is highlighted in B102.
  • the control apparatus 103 displays this information on a display screen located in the meeting room 101 so that the user in the meeting room 101 can confirm the whiteboard designation is correctly performed.
  • the control apparatus 103 adds a new video stream according to the whiteboard detection. More specifically, the control apparatus 103 adds a new video stream to periodically send the whiteboard images cropped from a video frame of the meeting room 101 to the client computers 106 and 107 via the second server 105. If the number of whiteboards already detected is larger than a threshold number, the control apparatus 103 may update the oldest whiteboard position with the new whiteboard position instead of adding the new stream. In B104, the control apparatus 103 performs keystone correction on a video frame of the video of the meeting room 101 and crops the whiteboard region from the keystone-corrected video frame to obtain the still image of the whiteboard, and the cropped whiteboard region is transmitted to the client computers 106 and 107.
  • the control apparatus 103 performs the process for the two or more whiteboard regions respectively. Unless the whiteboard region detected in B101 is deleted/released by the user, the control apparatus 103 periodically crops the whiteboard region from a video frame of the meeting room 101 and provides the whiteboard image to the client computers 106 and 107 via the second server 105. In an exemplary embodiment, the control apparatus 103 obtains the whiteboard image without optical zoom control.
  • the control apparatus 103 determines whether the presenter name has been switched after the previous determination. If the control apparatus 103 determines that the presenter name has not been changed, flow proceeds to each of S104 - S106. If the control apparatus 103 determines that the presenter name has been changed, flow proceeds to C102.
  • the presenter name is able to be switched based on user operations on the control apparatus 103.
  • Figure 7 illustrates a display region for switching the presenter of the present exemplary embodiment. In this embodiment, the presenter is set as “None” as initial settings, and a display region 701 represents that Ken Ikeda is selected as the presenter, and a display region 702 represents that the presenter is switched from Ken Ikeda to Dorothy Moore.
  • the control apparatus 103 identifies a username of a current presenter from the name and position information 120 and changes the type 405 of the identified username from “Presenter” to “Attendee”.
  • the control apparatus 103 identifies Ken Ikeda as the username of the current presenter and changes the Type 405 of Ken Ikeda from “Presenter” to “Attendee”.
  • Figure 8A illustrates the change of the Type 405 of Ken Ikeda.
  • control apparatus 103 changes the Type 405 of the new presenter Dorothy Moore from “Attendee” to “Presenter”.
  • Figure 8A illustrates the change of the Type 405 of Dorothy Moore.
  • the control apparatus 103 crops the face region of the new presenter from each video frame of the video of the meeting room 101 and transmits the cropped video to the client computers 106 and 107 via the first server 104. Until the presenter is switched, the control apparatus 103 continuously crops the face region of Dorothy Moore from a video frame of the meeting room 101 and provides the cropped video to the client computers 106 and 107 via the first server 104. After C105, flow proceeds to each of S104 - S106.
  • the type 405 may not have the type “Presenter”, and the control apparatus 103 and the client computers 106 and 107 identify the presenter by referring to presenter information 801 that is stored separately from the name and position information 120.
  • Figure 8B illustrates the name and position information 120 and the presenter information 801 of an exemplary embodiment. As illustrated in Figure 8B, all humans are labeled as “Attendee” in Type 405, and the presenter information 801 is used for identifying the presenter.
  • the control apparatus 103 may change the presenter name indicated in the presenter information 801 from Ken Ikeda to Dorothy Moore as illustrated in Figure 8B.
  • the control apparatus 103 may transmit a video of the meeting room 101 and videos of the face regions via the first server 104 and may transmit the images of whiteboards, the images of the ROI and the name and position information via the second server 105.
  • the control apparatus 103 may transmit the video of the meeting room via the first server 104 and may transmit the videos of the face regions, the images of the whiteboards, the images of the ROI and Name and Position information via the second server 105.
  • control apparatus 103 may transmit the video of the meeting room, the videos of the face regions and the images of the whiteboards and the images of the ROI cropped from the video frames of the meeting room 101 via the first server 104 and may transmit the images of the HR image of the ROI and Name and Position information via the second server 105.
  • Figure 23 illustrates an embodiment of the above disclosure whereby there are two image capture devices positioned in a meeting room as described above.
  • elements that appear with dashed lines are indicative of data and/or command messages that are generated by respective components of the control apparatus 103, wherein the respective components include executable instructions that are executed by one or more processors.
  • a first camera 2301 is provided and has a field of view of the entire room and all persons and objects in the room.
  • a second camera 2302 is provided that is dedicated to capturing one or more ROIs within the room as determined by a user performing a gesture as discussed above.
  • the first camera 2301 and second camera 2302 are in communication with the control apparatus (103 in Fig. 1) which includes the various modules and applications that control the online meeting operation as described herein.
  • a gesture is detected in the field of view of the first camera 2301 and, as a result, the control apparatus on which the meeting room module/application is executing, and which controls the online meeting, controls the second camera 2302 to optically zoom in on an area surrounding where the gesture was detected and capture a still image and/or a video image using the full image frame of the second camera, thereby eliminating the need for the first camera to pause live video recording during ROI capture processing.
  • two camera controller instances are generated for controlling the first and second cameras.
  • the first camera controller 2303 is a controller that is in communication with both the first camera 2301 and the second camera 2302.
  • the second camera controller 2304 is in communication with the second camera 2302.
  • these instances represent separate modules that include instructions which are executed by a processor.
  • these instances may be subroutines of a single camera controller which is in communication with both the first camera 2301 and the second camera 2302.
  • the first camera controller 2303 controls the first (main) camera 2301 and the second camera controller 2304 controls the second (ROI) camera.
  • the first (main) camera 2301 is the camera that is capturing the entire field of view of the room and which is responsible for capturing the image frames that are processed for detecting gestures and identifying presenters as discussed above (see. Fig. 1).
  • the second (ROI) camera 2302 is a camera that is selectively controlled to take a higher resolution image of a region of interest in the field of view being captured by the first camera 2301.
  • the embodiment described herein includes an initial calibration process being performed to calculate an offset or compensation value that is needed when controlling the second camera 2302 to capture a ROI image. This compensation value is used to compensate for the difference in perspective, due to camera position, between the first camera 2301 and the second camera 2302.
  • the improved ROI image capture processing includes, upon determining that an ROI is to be captured: computing pan/tilt/zoom values for the second camera based on the x,y position associated with the determined ROI as determined by the first camera 2301; sending a capture command to the second camera 2302 which includes compensation information to ensure that the second camera 2302 properly focuses on the determined ROI; awaiting the focus/zoom operation being performed using the compensation information based on the command sent to the second camera 2302; and capturing at least one of a still image of the ROI and/or initiating a live video stream of the ROI in the same manner (see the control sketch following this list).
  • a calibration step is performed whereby frames from each of the first and second cameras are acquired.
  • this calibration step is performed by the first (main) camera controller 2303, which can bidirectionally communicate with the second camera controller 2304.
  • the calibration process results in obtaining, for each of the cameras, camera control parameter values that are used to compute the pan/tilt/zoom control values of the ROI camera, given an x,y ROI center value that is specified based on a main camera frame capture by the first camera 2301.
  • whether this calibration process is needed depends on the position of the first camera 2301 and second camera 2302 relative to one another.
  • the first and second cameras (2301, 2302) may be positioned substantially adjacent to each other such that the fields of view being captured at a given time are substantially the same.
  • a calibration process may not be needed. Instead, the x,y values obtained from the first camera can merely be shifted a predetermined amount in one or more directions to allow for the second camera to perform PTZ processing and capture the area indicated by the first camera to be the ROI. In another embodiment, even if no shift is applied, in a case where two cameras are adjacent to one another as close as possible, the ROI processing is able to obtain the ROI with a high degree of accuracy.
  • ROI processing will now be described with components of Fig. 23 which shows the components embodied in the control apparatus 103 which is described hereinabove.
  • image frames that are captured by the first camera 2301 are provided to the various detection modules. As is relevant here, these image frames from the first camera 2301 are provided, by the first camera controller 2303, to a gesture (hands) detection module 2305. These image frames captured by the first camera 2301 are also provided to a drivers module 2306 which replicates the captured image frames into one or more virtual video driver instances as described above with respect to Fig. 1 so that these various video feeds can be communicated to a server which is facilitating the online meeting such that the video feeds are provided and accessible to one or more remote participants of the online meeting.
  • drivers module 2306 causes at least one feed driver 2309 to be created and sent to the server and the images (or content) contained in the at least one feed driver 2309 can be viewed by the one or more remote participants as described above.
  • the feed driver 2309 includes a series of images representing the entire field of view of the meeting room being captured by the first camera 2301 which is described above.
  • Drivers module 2306 further generates a MR Driver 2308 representing the entire field of view captured by the first image capture device 2301 and which is then provided as input to the meeting room module 2307 which provides those images to a user device 2312 for display thereof.
  • This user device 2312 is a user device that is used by a person in the meeting room being captured by the first camera 2301 and provides for various meeting room control functionality described hereinabove.
  • the capture message 2330 is provided to the second camera controller 2304 which receives the ROI capture message 2330 and controls the second camera 2302 to move and focus on an area having the x,y values which represent a center of the ROI/spotlight area.
  • the initial calibration parameters are used to adjust the main camera x,y values (from the detection message 2320) to new, compensated x,y values (in the capture message 2330) that correspond to the same position to be captured by the second camera 2302 to capture the desired ROI camera frame.
  • the x,y values are used to compute the pan/tilt/zoom values to be sent to the camera.
  • the second camera 2302 is controlled to move (e.g.
  • the second camera 2302 is caused to capture a still image 2350 which is provided from the second camera controller 2304 and is sent to the server 2314 which then provides the still image 2350 to the one or more remote participants. Additionally, the second camera controller 2304, when receiving images from the second camera 2302, causes the captured image frames 2340 to be provided to the drivers module 2306 to feed an ROI driver 2310. The images 2340 fed to the ROI driver 2310 are provided to the server 2314 and operate as a user-selectable feed within the online meeting which can then be viewed by the one or more remote participants as discussed above. In one embodiment, the still image 2350 is captured in 4K, as are the image frames being fed to the ROI feed 2310.
  • Figure 24 is a flow diagram detailing the algorithm for performing ROI capture based on gesture detection. This algorithm can be combined with Figure 3 as discussed above whereby steps A106-A110 are replaced with steps A200 - A210 of Figure 24.
  • A200 begins with a determination as to whether a gesture has been performed by a user and was detected in one or more image frames captured by the first camera.
  • gesture detection may also consider the pose and position of the user's head and/or face relative to the first camera.
  • gesture determination processing includes determining whether the head or face of the user is looking directly at the first camera.
  • an x,y position from within a frame captured by the first camera that is substantially the center position of the desired ROI is provided to the second camera.
  • the center position of the desired ROI is determined based on a current position of the hand of the user that was making the gesture and which had been captured by the first camera.
  • the second camera receives the x,y values obtained from the first camera and shifts the x,y coordinate values based on the camera control parameters determined during the calibration process which allows for the second camera to pan, tilt and/or zoom to the correct area being indicated as an ROI.
  • A204 merely shifts the x,y values a predetermined amount and direction.
  • the second camera uses the received x,y values and operates to capture that area.
  • Figure 26 is a flow diagram of the algorithm for controlling the first and second cameras (2301 and 2302 in Fig. 23).
  • the algorithm is embodied as a set of instructions stored in a memory and executed by a CPU.
  • the algorithm includes two modules that run in parallel: the Calibration module 2601 and the Operation module 2610.
  • the Calibration module 2601 estimates (or re-estimates) relative camera poses of all the PTZ cameras (e.g. first and second cameras) in a given environment.
  • the Operation module 2610 captures and interprets the hand gestures and computes the desired adjusted camera control parameters (e.g. new PTZ settings) based on current camera poses.
  • the operations module 2610 includes various aspects of the gesture detection module 2305 and one or both of the first and second camera controllers 2303 and 2304, respectively.
  • the adjusted camera control parameters will update camera pose information saved in the memory and shared with the Calibration module 2601.
  • the calibration module 2601 determines the coordination between the first (PTZ) camera 2301 and second (PTZ) camera 2302. As such, the calibration module 2601 needs to know a current position and field of view associated with each of the first and second cameras (2301 and 2302) so that the calibration module 2601 can estimate the relative camera poses when necessary. In one embodiment, the estimation is performed when the meeting room application is first launched and prior to an online meeting being initiated.
  • multiple cameras are calibrated using a calibration approach which requires user input and participation.
  • a user presents a checkerboard so that all cameras see the same checkerboard simultaneously from their respective points of view.
  • the checkerboard yields multiple 2D projections in each of the cameras.
  • the relative camera poses can then be estimated using 3D projective geometry.
  • M is a 4×4 matrix that translates and rotates P_r to P_t (i.e., P_t = M · P_r), representing the relative pose between the two cameras; taking the above three equations together, M is obtained as follows:
  • the essential matrix E is decomposed into a rotation matrix and a translation vector which specify the relative pose between the cameras. Because 3D information is not available in this case, the camera's intrinsic matrix cannot be estimated; rather, it has to be obtained through other means. Nevertheless, this embodiment advantageously calibrates the environment where 3D information in the scene is unavailable. Corresponding 2D features are obtained between camera views from the two cameras. For example, computer vision techniques are used to obtain 2D corresponding features (e.g. the Lucas-Kanade algorithm). This embodiment for calibrating the two-camera environment enables calibration where the separation of the two cameras is not large. For large separations, 2D correspondence features are not easy to establish. In these cases, networks trained to identify and re-identify users or objects in different scenes are used (see the calibration sketch following this list).
  • the essential matrix E can still be obtained by treating the identified users or objects in different scenes as 2D correspondence features.
  • Fig. 29 illustrates this idea.
  • Objects (O1, O2) and users (U1, U2) are identified in both camera scenes and can therefore be used as correspondence features.
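
The reposition-time and coordinate-compensation computations referenced above can be illustrated with a short sketch. This is a minimal illustration, not the patented implementation: the 1 ms-per-pixel factor comes from the example given in the text, while the Euclidean distance, the fixed offset values, and the function names are assumptions made here for the adjacent-camera case in which a full calibration is not used.

```python
import math

# Illustrative constants. The text above gives ~1 ms of relocation time per pixel
# of reposition distance; the fixed x/y offsets stand in for the "predetermined
# shift" used when the two cameras are adjacent (values are placeholders).
RELOCATION_TIME_PER_PIXEL_MS = 1.0
OFFSET_X, OFFSET_Y = 40, -15

def compensate_roi_center(main_xy):
    """Shift the ROI center reported by the main camera into the ROI camera's frame."""
    return (main_xy[0] + OFFSET_X, main_xy[1] + OFFSET_Y)

def reposition_time_ms(current_xy, target_xy):
    """Estimate how long the ROI camera needs to re-aim at the target ROI center.
    A Euclidean distance is used here; the text only specifies an X and Y
    reposition distance multiplied by a per-pixel time constant."""
    distance_px = math.hypot(target_xy[0] - current_xy[0],
                             target_xy[1] - current_xy[1])
    return distance_px * RELOCATION_TIME_PER_PIXEL_MS

# Example: a gesture detected at (1620, 420) in the main-camera frame while the
# ROI camera is currently aimed at (960, 540).
roi_center = compensate_roi_center((1620, 420))       # -> (1660, 405)
wait_ms = reposition_time_ms((960, 540), roi_center)  # -> ~712.9 ms
```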
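
The two-camera calibration described above (2D correspondences, essential matrix, relative pose) can likewise be sketched. This assumes matched feature points are already available (e.g. from Lucas-Kanade tracking or from re-identified users/objects) and that a single intrinsic matrix K has been obtained separately, since the text notes intrinsics cannot be estimated without 3D information; OpenCV is used here only as one possible implementation.

```python
import numpy as np
import cv2

def relative_pose(pts_main, pts_roi, K):
    """Estimate the rotation R and translation direction t of the ROI camera
    relative to the main camera from matched 2D features (Nx2 float arrays)."""
    E, inliers = cv2.findEssentialMat(pts_main, pts_roi, K,
                                      method=cv2.RANSAC, threshold=1.0)
    # Decompose E into a rotation matrix and a translation vector (up to scale).
    _, R, t, _ = cv2.recoverPose(E, pts_main, pts_roi, K, mask=inliers)
    return R, t

# Hypothetical usage with correspondences gathered during calibration:
# R, t = relative_pose(features_from_main_view, features_from_roi_view, K)
```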

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

An apparatus and control method for controlling an online meeting is provided for receiving, from a camera, a captured video of the meeting room, transmitting, via a first server, the captured video of the meeting room to an online meeting client, specifying an ROI (Region Of Interest) in the meeting room, controlling an optical zoom magnification of the camera for capturing a still image of the ROI in the meeting room, and transmitting, via a second server that is different from the first server processing the captured video of the meeting room, the still image that the camera captures after the control for the optical zoom magnification to the online meeting client.

Description

Title
System and Method for Multi-Camera Control and Capture Processing
Cross-Reference to Related Applications
[0001] This application claims priority from U.S. Provisional Patent Application Serial No. 63/477,770 filed on December 29, 2022, the entirety of which is incorporated herein by reference.
Background
Field of the disclosure
[0002] The present disclosure relates to a system and method for controlling an online meeting.
Related Art
[0003] Online meeting services such as Teams, Zoom, and Skype are known. Typically, during an online meeting using such services, a camera implemented in a laptop captures and provides a video to the other attendees.
[0004] In the conventional meeting service, it may be easy to see a face or a facial expression of each attendee who is located in front of a camera implemented in a laptop but it may not be easy to see the other information such as a whiteboard in a meeting room, ROIs specified in a meeting room, a face of a presenter who is not facing a laptop, or the like.
Summary
[0005] In one embodiment, a control apparatus and method for controlling an online meeting is provided. The apparatus includes one or more processors; and one or more memories storing instructions that, when executed, configure the one or more processors to receive, from a first camera, a captured video of a meeting room; determine from the captured video that a predetermined gesture is performed; control a second camera to perform image capture of a region surrounding a position in the captured video where the gesture was determined to be performed; and transmit the captured image of the region for display in a user interface.
[0006] In another embodiment, an image capture message is communicated to the second camera including one or more camera parameters and the second camera is controlled to perform image capture using the one or more camera parameters.
[0007] In a further embodiment, the one or more image capture parameters are translated based on pre-capture calibration that identifies a positional relationship between the first camera and the second camera.
[0008] In yet another embodiment, position information is obtained from a frame of the captured video captured by the first camera, adjusted position information is generated based on a predetermined positional relationship between the first and second camera and the adjusted position information is used to control the second camera to control the second camera to perform image capture.
[0009] In another embodiment, a field of view of the first camera and the second camera is calibrated using one or more common points in real space such that a starting position of each of the first and second camera are identified; and the control of the second camera to perform image capture is performed based on the calibrated starting position of the second camera.
[0010] These and other objects, features, and advantages of the present disclosure will become apparent upon reading the following detailed description of exemplary embodiments of the present disclosure, when taken in conjunction with the appended drawings, and provided claims.
Brief Description of the Drawings
[0011] Figure 1 illustrates the system architecture according to the present disclosure
[0012] Figure 2 depicts a flowchart illustrating an operation of the control apparatus according to the present disclosure.
[0013] Figure 3 depicts a flowchart illustrating one or more processes shown in Figure 2.
[0014] Figure 4 illustrates name and position information.
[0015] Figure 5 is a flowchart illustrating one or more processes shown in Figure 2.
[0016] Figure 6 is a flowchart illustrating one or more processes shown in Figure 2.
[0017] Figure 7 illustrates a display screen for presenter switching.
[0018] Figure 8A illustrates a first presenter switching process.
[0019] Figure 8B illustrates a second presenter switching process.
[0020] Figure 9A illustrates a display screen generated by the control apparatus.
[0021] Figure 9B illustrates a display screen generated by the control apparatus.
[0022] Figure 10A illustrates a display screen generated by the control apparatus.
[0023] Figure 10B illustrates a display screen generated by the control apparatus.
[0024] Figure 11 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
[0025] Figure 12 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
[0026] Figure 13 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
[0027] Figure 14 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
[0028] Figure 15 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
[0029] Figure 16 illustrates a display screen generated by the control apparatus and provided to one or more client computers.
[0030] Figure 17 illustrates information regarding still images.
[0031] Figure 18 is a flowchart illustrating an operation of Client computers 106 and 107.
[0032] Figure 19 illustrates exemplary hardware configuration.
[0033] Figures 20A - 20C illustrate hand gestures for capturing information.
[0034] Figure 21 is a flow chart illustrating hand gesture capture processing.
[0035] Figure 22 is a flow chart illustrating hand gesture capture processing.
[0036] Figure 23 illustrates the system architecture according to the present disclosure.
[0037] Figure 24 illustrates a flow diagram according to the present disclosure.
[0038] Figure 25 illustrates exemplary operation of the system according to the present disclosure.
[0039] Figure 26 illustrates a flow diagram according to the present disclosure.
[0040] Figure 27 illustrates exemplary operation of the system according to the present disclosure.
[0041] Figure 28 illustrates exemplary operation of the system according to the present disclosure.
[0042] Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject disclosure will now be described in detail with reference to the figures, it is done so in connection with the illustrative exemplary embodiments. It is intended that changes and modifications can be made to the described exemplary embodiments without departing from the true scope and spirit of the subject disclosure as defined by the appended claims.
Detailed Description
[0043] Figure 1 illustrates a system architecture according to an exemplary embodiment. The system includes a camera 102, a control apparatus 103, a first server 104, a second server 105, a client computer A 106, a client computer B 107 and a user recognition service 121. In this embodiment, the camera 102, the control apparatus 103 and the client computer B 107 may be located in a meeting room 101, but this is not seen to be limiting. Figure 1 illustrates each of the camera 102, the control apparatus 103 and the user recognition service 121 implemented in a different device, but this is not seen to be limiting. For example, the control apparatus 103 may be able to work as the user recognition service 121.
[0044] In an exemplary embodiment, the client computer A 106 and the client computer B 107 execute the same computer programs for an online meeting to work as the online meeting clients. However, the client computer A 106 and the client computer B 107 are described by different names according to whether the computer is located in the meeting room 101 or not, for explanation purposes.
[0045] Figure 2 is a flowchart illustrating an operation of the control apparatus 103 according to an exemplary embodiment. The operation of the control apparatus 103 according to an exemplary embodiment will be described in detail below with reference to Figure 1 and Figure 2. The operation described with reference to Figure 2 will be started in response to a trigger event that the control apparatus 103 detects a predetermined gesture for starting an online meeting from a video captured by the camera 102. In an exemplary embodiment, when the control apparatus 103 keeps detecting a thumbs-up gesture (see Figure 20A) within a predetermined range from a face region for a predetermined time period (e.g. 3 seconds), the control apparatus 103 outputs a predetermined sound A to notify the user that the control apparatus 103 detects the hand gesture for starting an online meeting, and if the hand gesture is maintained for a further predetermined time period (e.g. 2 seconds), the control apparatus 103 outputs a predetermined sound B to notify the user that the control apparatus 103 starts an online meeting as per the hand gesture, then flow proceeds to S101. In another embodiment, in addition to, or instead of, the sounds, the control apparatus 103 may cause a visual indicator to be operated. This may include, for example, sending a control signal to a visual indicator present on the camera capturing the images such that the visual indicator is caused to flash in a certain pattern or in different colors, thereby unobtrusively indicating to the user that the gesture has been detected.
[0046] However, this is not seen to be limiting. For example, the operation described with reference to Figure 2 may be started by other hand gestures, voice controls, keyboard operations or mouse operations by a user of the control apparatus 103. Also, each of the steps described with reference to Figure 2 is realized by one or more processors (CPU 501) of the control apparatus 103 reading and executing a pre-determined program stored in a memory (ROM 503).
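The two-stage hold timing of paragraph [0045] (3 seconds to acknowledge the thumbs-up with sound A, 2 further seconds to start the meeting with sound B) can be sketched as a small per-frame state machine. This is an illustrative sketch only; the class name, the callbacks, and the assumption that gesture detection yields a boolean per frame are not taken from the disclosure.

```python
import time

# Hold durations from the description: 3 s to acknowledge the thumbs-up
# (sound A), then 2 further seconds to actually start the meeting (sound B).
ACK_HOLD_SEC = 3.0
CONFIRM_HOLD_SEC = 2.0

class StartGestureMonitor:
    """Tracks how long a thumbs-up gesture has been held near a face region."""
    def __init__(self, play_sound, start_meeting):
        self.play_sound = play_sound          # callback, e.g. plays sound A/B
        self.start_meeting = start_meeting    # callback that launches the meeting
        self.held_since = None
        self.acknowledged = False

    def on_frame(self, thumbs_up_near_face: bool, now=None):
        now = time.monotonic() if now is None else now
        if not thumbs_up_near_face:
            self.held_since, self.acknowledged = None, False
            return
        if self.held_since is None:
            self.held_since = now
        held = now - self.held_since
        if not self.acknowledged and held >= ACK_HOLD_SEC:
            self.play_sound("A")              # gesture recognized
            self.acknowledged = True
        elif self.acknowledged and held >= ACK_HOLD_SEC + CONFIRM_HOLD_SEC:
            self.play_sound("B")              # meeting is being started
            self.start_meeting()
            self.held_since, self.acknowledged = None, False
```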
[0047] In S101, the control apparatus 103 receives, from the camera 102, a captured video of the meeting room 101. The camera 102 is set in the Meeting room 101, performs real-time image capturing in the Meeting room 101, and transmits the captured video to the control apparatus 103.
[0048] In S102, the control apparatus 103 performs a detection process for detecting one or more face regions of respective users from the captured video. More specifically, the control apparatus 103 identifies one or more video frames from among a plurality of video frames constituting the captured video, and performs face detection processing on the identified video frame(s). As illustrated in Figure 1, there are three people (108, 109 and 110) in the meeting room 101, so three face regions are detected during the processing performed in S102. The control apparatus 103 crops the three face regions from the identified video frame(s). Each of these three cropped face regions may be used as a respective video input feed that may be selectable and viewable by remote users as will be discussed below. While the detection processing is described as being performed by the control apparatus, the detection processing can be performed directly by the camera 102. In so doing, the in-camera detection may provide location information identifying pixel areas in the image being captured that contain a detected face, and the control apparatus 103 may use this information as discussed below.
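As an illustration of the face detection and cropping in S102, a minimal sketch follows. The disclosure does not specify a particular detector, so a stock OpenCV Haar cascade is used here purely as an example; as noted above, the detection could equally run in the camera 102 itself.

```python
import cv2

# Stock OpenCV Haar cascade, used here only as an illustration of S102.
_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face_regions(frame):
    """Detect faces in one video frame of the meeting room and return the
    (x, y, w, h) regions plus the cropped sub-images used as per-person feeds."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    regions = _cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = [frame[y:y + h, x:x + w] for (x, y, w, h) in regions]
    return list(regions), crops
```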
[0049] In S103, the control apparatus 103 transmits the three cropped face regions (face images) to the user recognition service 121 to obtain Usernames corresponding to the three face regions. The user recognition service 121 comprises the database 122 which stores Facial data and Username information associated with respective facial data. The user recognition service 121 is able to compare the face images received from the control apparatus 103 with Facial data in the database 122 to identify the Username corresponding to the face region detected from the video. The identified username is provided from the user recognition service 121 to the control apparatus 103.
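A minimal sketch of the matching performed by the user recognition service 121 in S103 is shown below. The disclosure does not describe how facial data is compared, so a generic embedding comparison against the database 122 is assumed; `embed_face`, the cosine-similarity scoring, and the threshold are all hypothetical.

```python
import numpy as np

def recognize(face_crop, database, embed_face, threshold=0.6):
    """Return the Username whose stored facial data is most similar to the
    cropped face region, or None if nothing is close enough. `database` maps
    usernames to stored embeddings; `embed_face` is a placeholder for whatever
    face-embedding model the service actually uses."""
    query = embed_face(face_crop)
    best_name, best_score = None, -1.0
    for username, stored in database.items():
        score = float(np.dot(query, stored) /
                      (np.linalg.norm(query) * np.linalg.norm(stored)))
        if score > best_score:
            best_name, best_score = username, score
    return best_name if best_score >= threshold else None
```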
[0050] In S104, the control apparatus 103 transmits, via the first server 104, to the client computers 106 and 107, (i) a video 113 of the meeting room 101, (ii) a video 114 of the face region of the person 108 which is cropped from the video of the meeting room 101, (iii) a video 115 of the face region of the person 109 which is cropped from the video of the meeting room 101 and (iv) a video 116 of the face region of the person 110 which is cropped from the video of the meeting room 101. The control apparatus 103 communicates with the first server 104 based on a first communication protocol (e.g. WebRTC) on which a bit rate of a media content is changed according to an available bandwidth of a communication path between the control apparatus 103 and the first server 104.
[0051] In S105, the control apparatus 103 transmits, via the second server 105, to the client computers 106 and 107, the name and position information 120 which contains name and position of each region. The control apparatus 103 communicates with the second server 105 based on a second communication protocol (e.g. HTTP) on which a bit rate of a media content is not changed according to an available bandwidth of a communication path between the control apparatus 103 and the second server 105. Figure 4 illustrates an example of the name and position information 120 which is provided by the control apparatus 103 to the client computers 106 and 107.
[0052] As shown in Figure 4, the name and position information 120 contains ID 401, the coordinates of upper left corner 402, Width of region 403, Height of region 404, Type 405 and Name 406. The control apparatus 103 assigns the ID 401 to each video stream. Also the control apparatus 103 specifies the coordinates of upper left corner 402, the width of region 403 and the height of region 404 of each region based on face detection process, presenter designation process, ROI designation process, whiteboard detection process and the like. More details of these processes will be described later.
[0053] The control apparatus 103 may also specify the type 405 of each video stream based on face detection process, presenter designation process, ROI designation process, whiteboard detection process and the like. For example, if the face detection process detects a region, the type 405 of the region may be “Attendee”, and if the whiteboard detection process detects a region, the type 405 of the region may be “Whiteboard”. The control apparatus 103 obtains the Name 406 of “Attendee” from the user recognition service 121 which performs the user recognition process using the database 122 as described above. The name 406 of “Whiteboard” is determined according to the detection or designation order. For example, the name 406 of a whiteboard that is detected (or designated) first may be “Whiteboard A” and the name 406 of a whiteboard that is detected (or designated) second may be “Whiteboard B”.
[0054] The name 406 of “ROI” is determined according to the detection or designation order. For example, the name 406 of an ROI firstly detected or designated may be “ROI-1” and the name 406 of an ROI that is secondly detected or designated may be “ROI-2”. If the number of ROIs is limited to one, and a new ROI is detected or designated while an ROI already exists as shown in ID = 07 in Figure 4, then the information 402, 403 and 404 of the row whose ID = 07 may be updated based on the newly detected or designated ROI region instead of adding a new row for “ROI-2”. In other words, the number of Whiteboards and/or ROIs may be limited to a predetermined number, and after reaching the predetermined number, the position information 402, 403 and 404 of the oldest one may be updated based on the newly detected or designated region.
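For concreteness, the name and position information 120 of Figure 4 could be serialized along the lines shown below. The field names and numeric values are hypothetical renderings of ID 401, upper-left corner 402, width 403, height 404, Type 405 and Name 406; only the Type and Name vocabulary (“Attendee”, “Presenter”, “Whiteboard”, “ROI”, “Whiteboard A”, “ROI-1”) follows the description.

```python
# Hypothetical serialization of the name and position information 120 (Figure 4).
# Coordinates, widths and heights are placeholders; types and names follow the
# vocabulary used in the description.
name_and_position_info = [
    {"id": 1, "upper_left": [412, 188], "width": 160, "height": 160,
     "type": "Presenter",  "name": "Ken Ikeda"},
    {"id": 2, "upper_left": [905, 210], "width": 150, "height": 150,
     "type": "Attendee",   "name": "Dorothy Moore"},
    {"id": 3, "upper_left": [1204, 96], "width": 420, "height": 300,
     "type": "Whiteboard", "name": "Whiteboard A"},
    {"id": 4, "upper_left": [60, 110],  "width": 400, "height": 280,
     "type": "Whiteboard", "name": "Whiteboard B"},
    {"id": 7, "upper_left": [660, 540], "width": 380, "height": 260,
     "type": "ROI",        "name": "ROI-1"},
]
```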
[0055] Returning to Figure 2, in S106, the control apparatus 103 transmits, via the second server 105, to the client computers 106 and 107, (i) a still image 117 of the Whiteboard A 111 which is cropped from a video frame of the meeting room 101, (ii) a still image 118 of the Whiteboard B 112 which is cropped from a video frame of the meeting room 101 and (iii) a still image 119 of the ROI (Region Of Interest) which is cropped from a video frame of the meeting room 101. The detailed explanation of the process for whiteboard is provided later with reference to Figure 5, and the detailed explanation of the process for ROI is provided later with reference to Figure 3.
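The whiteboard still images transmitted in S106 are produced, per B104 above, by keystone correction followed by cropping of the designated whiteboard region. A minimal OpenCV sketch is given below; the output size and function name are assumptions, and the four corners are those designated by the user in B101.

```python
import numpy as np
import cv2

def whiteboard_still(frame, corners, out_w=1280, out_h=960):
    """Apply keystone (perspective) correction to the four user-designated
    whiteboard corners and return a rectified still image of the whiteboard.
    `corners` is ordered [top-left, top-right, bottom-right, bottom-left]."""
    src = np.array(corners, dtype=np.float32)
    dst = np.array([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]],
                   dtype=np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(frame, H, (out_w, out_h))
```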
[0056] In S107, the control apparatus 103 determines whether the online meeting is closed. The online meeting will be closed in response to a trigger event that the control apparatus 103 detects a predetermined gesture for finishing an online meeting from a video captured by the camera 102. In the present exemplary embodiment, when the control apparatus 103 keeps detecting a hand gesture showing a palm (see Figure 20B) within a predetermined range from a face region for a predetermined time period (e.g. 3 seconds), the control apparatus 103 outputs a predetermined sound C to notify the user that the control apparatus 103 detects the hand gesture for finishing the online meeting, and if the hand gesture is maintained for a further predetermined time period (e.g. 2 seconds), the control apparatus 103 outputs a predetermined sound D to notify the user that the control apparatus 103 closes the online meeting as per the hand gesture, then flow proceeds to END. If the online meeting is not over, flow proceeds to S108. However, this is not seen to be limiting. For example, the online meeting may be closed by other hand gestures, voice controls, keyboard operations or mouse operations by a user of the control apparatus 103.
[0057] In S108, the control apparatus 103 performs process regarding ROI. The detailed explanation of this ROI process will be provided later with reference to Figure 3.
[0058] In S109, the control apparatus 103 performs process regarding Whiteboards. The detailed explanation of this Whiteboard process will be provided later with reference to Figure 5.
[0059] In S110, the control apparatus 103 performs process regarding Presenter. The detailed explanation of this Presenter process will be provided later with reference to Figure 6. After completion of S110, flow returns to S101.
[0060] Each of the client computers 106 and 107 is able to display an online meeting window.
[0061] Figure 18 is a flowchart illustrating an operation of the client computer 106 according to an exemplary embodiment. The operation described with reference to Figure 18 will be started in response to a trigger event that the client computer 106 detects a predetermined user operation for joining the online meeting. In the present exemplary embodiment, the client computer 106 may detect that the user clicks a join button on its display screen, and then flow proceeds to T101 - T103. The below explanation with reference to Figure 18 will mainly focus on the operations of the client computer 106 but the client computer 107 is able to perform the same steps as the client computer 106.
[0062] In T101, the client computer 106 receives, via the first server 104, from the control apparatus 103, (i) a video 113 of the meeting room 101, (ii) a video 114 of the face region of the person 108 which is cropped from the video of the meeting room 101, (iii) a video 115 of the face region of the person 109 which is cropped from the video of the meeting room 101 and (iv) a video 116 of the face region of the person 110 which is cropped from the video of the meeting room 101.
[0063] In T102, the client computer 106 receives, via the second server 105, from the control apparatus 103, the name and position information 120 which contains name and position of each region. Figure 4 illustrates an example of the name and position information which is provided by the control apparatus 103 to the client computers 106 in T102.
[0064] In T103, the client computer 106 receives, via the second server 105, from the control apparatus 103, (i) a still image 117 of the Whiteboard A 111 which is cropped by the control apparatus 103 from a video frame of the meeting room 101, (ii) a still image 118 of the Whiteboard B 112 which is cropped by the control apparatus 103 from a video frame of the meeting room 101 and (iii) a still image 119 of the ROI (Region Of Interest) which is cropped by the control apparatus 103 from a video frame of the meeting room 101.
[0066] The client computer 106 may be able to display the online meeting window 900 based on the information received in T101-T103. Figure 11 illustrates the online meeting window 900 of the present exemplary embodiment. As Figure 11 illustrates, the online meeting window 900 contains a single view indicator 901, a two view indicator 902, a three view indicator 903, an HR (High Resolution) image selector 904, a video selector 905, video icons 906-910, a display region 911 and a leave button 918. The HR image selector 904, the video selector 905 and the video icons 906 - 910 are located within a menu region 917.
[0067] When the single view indicator 901 is selected, the online meeting window 900 may contain the display region 911. In the present exemplary embodiment, if the two view indicator 902 is selected as shown in Figure 13, the online meeting window 900 may contain two display regions (911 and 912), and if the three view indicator 903 is selected as shown in Figure 14, the online meeting window 900 may contain three display regions (911, 912 and 913). The user of client computer 106 may be able to choose any indicator from the indicators 901 , 902 and 903 to layout the online meeting window 900 based on how many regions the user wants to see in parallel and the size of the display region desired.
[0068] Also, the user of the client computer 106 is able to choose one or more icons from among a meeting room icon 906, a presenter icon 907, a whiteboard icon A 908, a whiteboard icon B 909 and an ROI icon 910 as shown in Figure 11. In the present exemplary embodiment, the choice is performed by a drag-and-drop operation on the icon from the menu region 917 to the display regions 911, 912 or 913 respectively. Figure 13 illustrates the online meeting window 900 when the meeting room icon 906 has been dropped into the display region 911 and the presenter icon 907 has been dropped into the display region 912.
[0069] Figure 11 illustrates a state where the HR image selector 904 is disabled, the video selector 905 is enabled and the video icons 906 - 910 are displayed on the menu region 917. On the other hand, Figure 15 illustrates a state where the HR image selector 904 is enabled, the video selector 905 is disabled and the HR image icons 914 - 916 are displayed on the menu region 917. That is, a user can click or tap on either of the HR image selector 904 and the video selector 905 to switch the icons to be displayed on the menu region 917 between the video icons 906 - 910 and the HR image icons 914 - 916. The images corresponding to the HR image icons 914 - 916 are high resolution images which are obtained by capturing with an optical zoom control of the camera 102. According to the switching mechanism using the HR image selector 904 and the video selector 905, it may be easier for users to choose a desired icon even if many HR still images are generated.
[0070] In the present exemplary embodiment, a display order of the video icons 906 - 910 within the menu region 917 is determined based on a generation order of each media stream. For example, if the video 113 of the meeting room 101 is firstly defined among all the videos provided from the control apparatus 103 to the client computer 106, the meeting room icon 906 corresponding to the video 113 is located at the right most position within the menu region 917. Similarly, in the present exemplary embodiment, Figure 11 illustrates the online meeting window 900 when the presenter region corresponding to the presenter icon 907 is secondly defined/designated and the whiteboard region A corresponding to the whiteboard icon A 908 is thirdly defined/designated among the regions in the captured video. The generation order of the video streams is represented by ID 401 of the name and position information 120 explained with reference to Figure 4.
[0071] In the present exemplary embodiment, the display order of the HR images within the menu region 917 is also determined based on a generation order of each HR image. In other words, as shown in Figure 15, the HR image icon 914 corresponding to a high resolution still image (Captured image A) captured earlier than the other high resolution still images (Captured images B and C) is displayed at the right most position within the menu region 917. The generation order information is provided by the control apparatus 103 to the client computer 106 in S105. Figure 17 illustrates HR image information in the present exemplary embodiment. The HR image information contains a Room ID 1001, a Meeting ID 1002, a Shooting date/time 1003, a Shooting ID 1004 and an Image data location 1005. The client computer 106 may identify the generation order of the HR still images based on the shooting date/time 1003 to determine the display order of the HR image icons within the menu region 917. However, this is not seen to be limiting. For example, the display order of the HR image icons may be determined based on the Room ID 1001 and/or the Meeting ID 1002. As another example, the client computer 106 may determine the display order of the HR image icons such that the HR image icon corresponding to the oldest HR image is located at the left most position within the menu region 917 and the HR image icon corresponding to the second oldest HR image is located at the second from the left within the menu region 917.
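A brief sketch of the icon-ordering rule in [0071] follows: HR images are sorted by the Shooting date/time 1003, with the earliest capture placed right-most by default, and an alternative oldest-first (left-most) ordering also supported. The field names and sample values are hypothetical.

```python
from datetime import datetime

# Illustrative HR image information records (Figure 17); field names are
# hypothetical renderings of Room ID 1001, Meeting ID 1002, Shooting date/time
# 1003 and Image data location 1005, and the values are placeholders.
hr_images = [
    {"room_id": "R01", "meeting_id": "M7", "shot_at": "2023-12-05T10:41:02",
     "location": "/captures/B.jpg"},
    {"room_id": "R01", "meeting_id": "M7", "shot_at": "2023-12-05T10:22:15",
     "location": "/captures/A.jpg"},
    {"room_id": "R01", "meeting_id": "M7", "shot_at": "2023-12-05T10:55:47",
     "location": "/captures/C.jpg"},
]

def icon_display_order(images, oldest_first=False):
    """Return images ordered left-to-right for the menu region 917: by default
    the earliest capture ends up right-most; oldest_first=True puts it left-most."""
    ordered = sorted(images, key=lambda i: datetime.fromisoformat(i["shot_at"]))
    return ordered if oldest_first else list(reversed(ordered))
```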
[0072] Note that the client computer 106 may be able to remove any of the icons 906 - 910 and 914 - 916 as per user operations. In the present exemplary embodiment, when a mouse cursor 919 moves onto an arbitrary icon (e.g. icon 914), a removal button 920 for a remove instruction is displayed, as shown in Figure 15. If the user clicks or taps on the removal button 920, the corresponding icon may be removed from the online meeting window 900.
[0073] Figure 11 illustrates the online meeting window 900 in a case where one meeting room 101 is connected to the control apparatus 103. However, two or more meeting rooms may be connected via respective control apparatuses 103 that each connect to the servers, thereby granting each meeting room access to the other. If two meeting rooms are connected, two meeting room icons may be displayed on the menu region 917, and the name and position information described with reference to Figure 4 may contain name and position information for the second meeting room.
[0074] Returning to Figure 18, in T104, the client computer 106 identifies one or more videos/images to be displayed on the online meeting window 900. As described above, a user is able to give instructions based on drag-and-drop operations on the online meeting window 900, and the number of videos/images which can be displayed on the window 900 depends on which of the indicators 901, 902 and 903 is selected. [0075] In T105, the client computer 106 determines whether to display the username/position on the online meeting window 900. Figure 12 illustrates the online meeting window 900 which contains the usernames and the position of each face region, while Figure 11 illustrates an example of the online meeting window 900 which does not contain the usernames and position information. The user of the client computer 106 may switch between a state in which the username/position is displayed and a state in which it is not. If the user instructs to display the username/position, the client computer 106 refers to the name and position information 120 received in T102 from the control apparatus 103 to obtain the usernames and positions and superimposes them onto a video displayed within the online meeting window 900. [0076] In T106, the client computer 106 updates the display contents on the online meeting window 900 based on the processing in T101 - T105.
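A minimal sketch of the username/position overlay performed in T105 is given below. It assumes OpenCV is available and that each entry of the name and position information carries a name and a face-region bounding box in frame pixel coordinates; the field names are illustrative assumptions, not the actual message format.

```python
import cv2

def overlay_names(frame, name_position_info):
    """Superimpose usernames and face-region positions onto a video frame,
    as when the username/position display is enabled in T105.
    `name_position_info` is assumed to be a list of dicts with a 'name'
    and a bounding box 'x', 'y', 'w', 'h' in frame pixel coordinates."""
    for entry in name_position_info:
        x, y, w, h = entry['x'], entry['y'], entry['w'], entry['h']
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, entry['name'], (x, max(y - 8, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```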
[0077] In T107, the client computer 106 determines whether to leave the online meeting. In the present exemplary embodiment, when the user of the client computer 106 clicks or taps the leave button 918 on the online meeting window 900, the client computer 106 determines to leave the online meeting. In addition, the client computer 106 may determine to leave the online meeting when the control apparatus 103 informs the client computer 106 that the meeting is over. If the client computer 106 determines not to leave the online meeting, flow returns to T101 - T103. If the client computer 106 determines to leave the online meeting or that the meeting is over, flow proceeds to END.
[0078] The ROI process described in S108 of Figure 2 according to an exemplary embodiment will be described in detail below with reference to Figure 3. S108 may be skipped until a first predefined hand gesture for an ROI designation is detected. If the first predefined hand gesture is detected, flow proceeds to A101.
[0079] In A101, the control apparatus 103 determines whether the first predefined hand gesture is being detected for a first predetermined time period (e.g. 2 seconds). In the present exemplary embodiment, the control apparatus 103 detects an open-hand gesture (see Figure 20C) as the first predefined hand gesture. If the control apparatus 103 determines that it continuously detects the first predefined hand gesture for the first predetermined time period, flow proceeds to A102.
[0080] In A102, the control apparatus 103 performs control to output a predetermined sound E for notifying the user that the first predefined hand gesture has been detected for the first predetermined time period and that it is time to change the hand gesture to a second predefined hand gesture. After outputting the predetermined sound E, flow proceeds to A103. [0081] In A103, the control apparatus 103 determines whether the second predefined hand gesture is detected within a second predetermined time period (e.g. 3 seconds) after outputting the predetermined sound E. In the present exemplary embodiment, the control apparatus 103 detects a closed-hand gesture (see Figure 20B) as the second predefined hand gesture. If the control apparatus 103 determines that the second predefined hand gesture is detected within the second predetermined time period, flow proceeds to A104.
[0082] In A104, the control apparatus 103 determines whether the second predefined hand gesture is being detected for a third predetermined time period (e.g. 2 seconds). If the control apparatus 103 determines that it continuously detects the second predefined hand gesture for the third predetermined time period, flow proceeds to A105. When the control apparatus 103 determines “No” in any of A101, A103 and A104, flow proceeds to A111 and the control apparatus 103 notifies the user of an error during the ROI designation process.
[0083] In A105, the control apparatus 103 performs control to output a predetermined sound F for notifying the user that the second predefined hand gesture has been detected for the third predetermined time period and that the ROI designation process is successfully completed. After outputting the predetermined sound F, flow proceeds to A106.
[0084] In A106, the control apparatus 103 adds a new media stream according to the ROI designation. More specifically, the control apparatus 103 adds a new media stream 119 to periodically transmit the ROI images cropped from a video frame of the meeting room 101 to the client computers 106 and 107 via the second server 105. If the number of ROIs already designated by the user is larger than a threshold number, the control apparatus 103 may update the oldest ROI position with the new ROI position instead of adding the new media stream.
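The following is a minimal sketch of the ROI media stream bookkeeping described for A106, in which the oldest ROI position is replaced once a threshold number of ROIs has been designated. The threshold value and the data structure are assumptions made for illustration.

```python
from collections import deque

MAX_ROIS = 3  # assumed threshold number of concurrently designated ROIs

class RoiStreamManager:
    """Keeps the list of designated ROI positions used to crop the ROI media
    streams. When the threshold is exceeded, the oldest ROI position is
    replaced instead of adding a new media stream (A106)."""
    def __init__(self):
        self.rois = deque()  # each entry: (center_x, center_y, width, height)

    def designate(self, roi):
        if len(self.rois) >= MAX_ROIS:
            self.rois.popleft()   # drop the oldest ROI position
        self.rois.append(roi)     # newest ROI becomes an active media stream
```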
[0085] In A107, the control apparatus 103 suspends transmitting the video streams 113 - 116 and transmits image data which includes a message indicating that ROI capturing is in progress. Figure 16 illustrates an online meeting window 900 which contains the message and which is displayed during the optical zoom magnification control performed in A108.
[0086] In A108, the control apparatus 103 controls an optical zoom magnification of the camera 102 according to the ROI position. In an exemplary embodiment, the center of the ROI is identical to the center of the second hand gesture detected in A104, and the dimension of the ROI is 20% of the field of view of the camera 102. For example, if an original captured image is 1280 [pixel] * 960 [pixel], the ROI is a 256 [pixel] * 192 [pixel] range within the captured image. In A108, the camera 102 performs a zoom-in process into the ROI to improve the resolution of the ROI. [0087] In A109, the control apparatus 103 causes the camera 102 to perform a capturing process to obtain an HR (High Resolution) still image of the ROI. When the control apparatus 103 obtains the HR still image of the ROI from the camera 102, the control apparatus 103 transmits a URL to the client computers 106 and 107 via the second server 105. The client computers 106 and 107 are able to obtain the HR still image of the ROI via the second server 105 by accessing the URL. Also, the control apparatus 103 periodically crops the ROI from a video frame of the meeting room 101 and provides the ROI image to the client computers 106 and 107 via the second server 105 unless the ROI detected in A104 is deleted by the user.
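A short sketch of the ROI geometry described for A108 follows: the ROI is centered on the detected hand gesture, spans 20% of the field of view in each dimension, and is kept inside the frame. The clamping behavior is an assumption; the description above only fixes the center and the 20% dimension.

```python
def compute_roi(frame_w, frame_h, gesture_x, gesture_y, fraction=0.2):
    """Return (x, y, w, h) of the ROI centered on the detected hand gesture.
    The ROI spans `fraction` of the field of view in each dimension and is
    clamped so that it stays inside the captured frame."""
    roi_w, roi_h = int(frame_w * fraction), int(frame_h * fraction)
    x = min(max(gesture_x - roi_w // 2, 0), frame_w - roi_w)
    y = min(max(gesture_y - roi_h // 2, 0), frame_h - roi_h)
    return x, y, roi_w, roi_h

# For a 1280 x 960 capture, compute_roi(1280, 960, 640, 480) yields a
# 256 x 192 region, matching the example given above.
```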
[0088] In A110, the control apparatus 103 controls the optical zoom magnification of the camera 102 to return to its original value. In other words, in A110, the optical zoom parameters of the camera 102 return to the parameters used before the optical zoom control in A108. After this returning process, the control apparatus 103 resumes transmitting the video streams to the client computers 106 and 107, and flow proceeds to S109 in Figure 2.
[0089] After the ROI designation in S108, the control apparatus 103 periodically crops the ROI within a video frame from the camera 102 and the ROI is provided to the client computers 106 and 107 via the second server 105.
[0090] The processing performed in A106 - A109 is further described in Figures 21 and 22. The processing described therein remedies the drawbacks associated with a hybrid meeting environment where some participants are in-person at a first location (e.g. a meeting room) and others are remotely located and are connected to the meeting room using an online collaboration tool. These drawbacks are particularly pronounced when a single camera is being used to capture the full meeting room view being transmitted to the remote users. In these single-camera environments, the camera field of view is often a compromise between an angle wide enough to capture the entire view of the room and one narrow enough to identify relevant information and people in the room. If the view is too wide, it is difficult for the remote viewers to identify/review objects within the meeting room. If the view is too narrow, the remote user is unable to view the context of the entire meeting.
[0091] The processing performed in A106 - A109 advantageously provides a combination view: a digital zoom and crop of the ROI is used for live-view imaging, while a static optical zoom and capture is used for a high quality view of a particular area within the live-view frame that was captured using the digital zoom of the image capture device.
[0092] In order to take the high quality static image, the system will take over the room camera and pan/zoom to the region-of-interest which is identified in the manner described throughout the present disclosure and capture a high-quality image of the identified ROI. Upon completing the capture, the camera will be controlled to return back to the room view position as defined immediately preceding the static image capture. In doing so, a reposition time value is determined that represents a length of time required to reposition the camera (e.g. X seconds), and a buffering process that buffers the live video is started. The output frame rate of the live video is reduced to a predetermined frame rate less than the present frame rate. In one embodiment, the predetermined frame rate is substantially 50% of the normal frame rate. At the expiration of the reposition time value (e.g. when X seconds have elapsed), the control apparatus will send a control signal for controlling the PTZ camera to reposition such that the PTZ camera can optically zoom in on a predetermined region in the wide angle view of the room, as identified by the detected gesture, and take the high quality image. In one embodiment, the high quality image is captured at a maximum resolution of the image capture device. For example, the image may be captured at 4K resolution. The control apparatus will continue sending the buffered video at the predetermined frame rate to the remote computing devices while the repositioning of the camera is occurring. When the reposition and reset is complete, normal frame rate video will resume.
[0093] The algorithm for generating these dual views using the single camera is shown in Fig. 21. An online meeting is started in 2101 and the control apparatus 103 causes video data to be captured by the camera 102. During the video capturing process, the control apparatus 103, in 2102, detects a gesture, such as a transition from the gesture in Fig. 20B to the gesture in Fig. 20C, from one or more users indicating that a region of interest within a frame is desired. In actual operation, the user positions their hand in front of an object or area in the room that is in the field of view being captured by the camera 102 and performs the predetermined gesture, which is detected by the control apparatus 103 from the video data being captured by the camera 102. The manner in which the gesture is recognized and causes an ROI to be identified is described throughout this disclosure and is incorporated herein by reference. In response to detecting a predetermined gesture of the one or more users in the meeting room within the frame of video, the control apparatus 103 determines, in 2103, the coordinates of the ROI based on the position of the detected gesture. This process is similarly described herein and is incorporated herein by reference. Upon determining the coordinates of the identified ROI, the control apparatus 103 digitally crops the ROI in 2104 from the video data based on the determined coordinates. This cropped ROI represents a digital zoom of the ROI and is communicated to the remote users as a live view ROI and provided as 2201 in Fig. 22 described below. It should be noted that the wide angle view of the room being captured full frame by the camera is still also being captured and is caused to be communicated to the remote computing devices as an individual view different from the cropped ROI live view. That process is described throughout and need not be repeated here, as the presently described algorithm focuses on the dual capture of live view ROI regions from within a video frame and a still image having a higher image quality than the captured live view ROI.
[0094] In an instance when a user not only wants to present the live view ROI to the remote user but also wants a higher quality view of the particular ROI, the control apparatus 103 can control the camera 102 to capture a still image having an image quality higher than the image quality being captured via the live-view capture. In one embodiment, the gesture indicating that an ROI within a frame should be captured can initiate the still image capture process that follows. In another embodiment, a further gesture may be required to initiate the still image capture of the ROI at the higher image quality and, upon detection thereof in accordance with the manner described herein, still image capture can be initiated.
[0095] In response to control by the control apparatus 103 to capture a still image, the control apparatus 103 determines, in 2105, one or more camera control parameters that will be used to physically control the position of the camera so as to capture a still image of the identified ROI. In one embodiment, the one or more camera parameters include a pan direction, a pan speed, a tilt direction, a tilt speed and an optical zoom amount required to capture a still image of the area within the ROI. The one or more camera parameters are obtained based on the pixel-by-pixel dimensions of the ROI that corresponds to the region surrounding a position within the frame that is identified by the detected gesture. The one or more camera parameters further include a reposition time value that represents an amount of time it will take the camera to move into the correct position and capture the particular ROI as determined by the detected gesture. In one embodiment, the reposition time value can be determined by calculating an X and Y reposition distance, which is possible because the position of the camera is known, as is the target position representing the ROI. This value is then multiplied by a constant factor representing the relocation time per unit of distance (i.e., 1 ms per pixel). The result is the reposition time value that represents the estimated time it would take the camera to reposition to the new location to capture the still image of the ROI.
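A minimal sketch of the reposition time value calculation in 2105 is shown below. How the X and Y distances are combined is not fixed by the description above; here they are simply summed, and the 1 ms-per-pixel constant from the example is used as a default.

```python
def reposition_time_ms(cam_x, cam_y, roi_x, roi_y, ms_per_pixel=1.0):
    """Estimate the reposition time value: the X/Y distance (in pixels)
    between the camera's current aim point and the ROI center, multiplied
    by a constant relocation time per unit of distance (1 ms per pixel in
    the example above). Summing the axis distances is an assumption."""
    dx, dy = abs(roi_x - cam_x), abs(roi_y - cam_y)
    return (dx + dy) * ms_per_pixel
```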
[0096] The control apparatus 103, after calculating the one or more camera parameters, causes the live video data of the ROI being captured during the live view ROI capture processing to be sent to one or more video buffers in 2106. The control apparatus 103 causes the live view ROI video data in the buffer to be output, in 2107, at a frame rate less than the current frame rate at which the live view ROI is being captured. At a point in time substantially equal to half the reposition time value in 2108, the determined one or more camera parameters are provided, in 2109, by the control apparatus 103 to the camera 102, which causes the camera 102 to be controlled to reposition in 2110 based on the one or more camera parameters, and an image capture command is communicated in 2111 that causes the camera 102 to capture a still image having a higher image quality than the live view ROI video image that is being output by the buffer at the lower frame rate. The control apparatus receives, in 2112, the captured still image having a higher image quality than an individual frame of the live view ROI image and communicates this captured image to the remote computing devices. In one embodiment, the captured still image is transmitted in 2210 to the remote computing devices via a communication channel different from the live view ROI video stream. In another embodiment, this still image is stored in a memory that is specific to a particular user or organization that controls the online meeting. After the high resolution still image capture described above, the video capture rate is caused to return to the rate being captured prior to 2107 such that the live view of the meeting room can be captured and provided as described herein.
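The following sketch illustrates the buffered, reduced-frame-rate output of the live view ROI (2106 - 2107) while the camera is being repositioned. The queue-based buffer, the sentinel used to signal that repositioning is complete, and the 50% rate are illustrative assumptions.

```python
import queue
import time

def stream_buffered(buffer: "queue.Queue", send_frame, normal_fps=30):
    """Drain the live-view ROI buffer at roughly half the normal frame rate
    while the camera is being repositioned for the high-resolution still
    capture; a None entry signals that repositioning is complete."""
    reduced_fps = normal_fps * 0.5        # predetermined reduced rate (~50%)
    interval = 1.0 / reduced_fps
    while True:
        frame = buffer.get()
        if frame is None:                 # sentinel: resume normal-rate video
            break
        send_frame(frame)                 # forward to remote computing devices
        time.sleep(interval)              # pace output at the reduced rate
```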
[0097] In a further embodiment, shown in Fig. 22, the live view ROI video data 2201 from Fig. 21 is communicated as a data stream 2202 for display in 2203 within a window on a user interface of the remote computing device and is displayed concurrently with the higher resolution still image capture 2210 from Fig. 21, which is shown in a separate, different window within the user interface. The high quality image 2210 is provided as a second, different data stream in 2211 and displayed in 2212 in a window different from the window used for the display in 2203. In this embodiment, the live view ROI image at the lower quality can be synchronized in 2216 with the higher quality still image, which is received as a stream whereby the control apparatus continually sends individual higher resolution images captured during the ROI still image capture process. In this embodiment, the control apparatus 103 controls the display of two views of the ROI: one is the cropped live-view video of the ROI having a first resolution, and the second is the still image capture performed by PTZ optical zoom of the ROI, which is at a second resolution higher than the first resolution. The optical zoom has the capability of further digital zoom to see additional detail. The synchronization allows the live-view ROI to track the digital zoom of the second view. This synchronization can be performed based on the one or more camera parameters that control the camera repositioning to obtain the still image of the ROI. In this instance, the zoom parameters of ROI-1 and the crop of the live view (ROI-2) would be nearly the same, and the synchronization can mirror the digital zoom/pan of ROI-1 (the still image capture of the ROI) by performing the same digital zoom/pan on ROI-2 (the live view ROI).
[0098] Turning back to Figure 2 and the whiteboard processing in S109, an exemplary embodiment will be described in detail below with reference to Figure 5. [0099] In B101, the control apparatus 103 determines whether a whiteboard region is detected. The control apparatus 103 may detect whiteboard regions based on an image recognition process and/or based on user operations. A user is able to designate the four corners of a whiteboard to designate a whiteboard region. If the control apparatus 103 determines that the whiteboard region is not detected, flow proceeds to S110. If the control apparatus 103 determines that the whiteboard region is detected, flow proceeds to B102.
[00100] In B102, the control apparatus 103 highlights the whiteboard region so that a user of the control apparatus 103 is able to see which region is designated as the whiteboard region. Figure 9A illustrates a state where a user designates the four corners 124, 125, 126 and 127 of a certain whiteboard region 112 in B101, and Figure 9B illustrates a state where the designated whiteboard region 112 is highlighted in B102. The control apparatus 103 displays this information on a display screen located in the meeting room 101, and the user in the meeting room 101 confirms that the whiteboard designation is correctly performed.
[00101] In B103, the control apparatus 103 adds a new video stream according to the whiteboard detection. More specifically, the control apparatus 103 adds a new video stream to periodically send the whiteboard images cropped from a video frame of the meeting room 101 to the client computers 106 and 107 via the second server 105. If the number of whiteboards already detected is larger than a threshold number, the control apparatus 103 may update the oldest whiteboard position with the new whiteboard position instead of adding the new stream. [00102] In B104, the control apparatus 103 performs keystone correction on a video frame of the video of the meeting room 101 and crops the whiteboard region from the keystone-corrected video frame to obtain the still image of the whiteboard, and the cropped whiteboard region is transmitted to the client computers 106 and 107. As illustrated in Figure 1, if two or more whiteboard regions are detected, the control apparatus 103 performs the process for each of the two or more whiteboard regions. Unless the whiteboard region detected in B101 is deleted/released by the user, the control apparatus 103 periodically crops the whiteboard region from a video frame of the meeting room 101 and provides the whiteboard image to the client computers 106 and 107 via the second server 105. In an exemplary embodiment, the control apparatus 103 obtains the whiteboard image without optical zoom control.
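A minimal sketch of the keystone correction and whiteboard crop performed in B104 is given below, using OpenCV's perspective transform. The corner ordering and the output resolution are assumptions made for illustration.

```python
import cv2
import numpy as np

def crop_whiteboard(frame, corners, out_w=1280, out_h=960):
    """Keystone-correct the designated whiteboard region and return it as a
    rectangular image. `corners` are the four designated corner points in
    clockwise order starting at the top-left (as designated in B101); the
    output size used here is an assumed value."""
    src = np.float32(corners)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(frame, H, (out_w, out_h))
```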
[00103] The presenter processing of S110 of Figure 2 according to an exemplary embodiment will be described in detail below with reference to Figure 6.
[00104] In C101, the control apparatus 103 determines whether the presenter name has been switched after the previous determination. If the control apparatus 103 determines that the presenter name has not been changed, flow proceeds to each of S104 - S106. If the control apparatus 103 determines that the presenter name has been changed, flow proceeds to C102. In an exemplary embodiment, the presenter name is able to be switched based on user operations on the control apparatus 103. Figure 7 illustrates a display region for switching the presenter of the present exemplary embodiment. In this embodiment, the presenter is set as “None” in the initial settings, a display region 701 represents that Ken Ikeda is selected as the presenter, and a display region 702 represents that the presenter is switched from Ken Ikeda to Dorothy Moore.
[00105] In C102, the control apparatus 103 identifies a username of the current presenter from the name and position information 120 and changes the Type 405 of the identified username from “Presenter” to “Attendee”. In an exemplary embodiment as illustrated in Figure 7, the control apparatus 103 identifies Ken Ikeda as the username of the current presenter and changes the Type 405 of Ken Ikeda from “Presenter” to “Attendee”. Figure 8A illustrates the change of the Type 405 of Ken Ikeda.
[00106] In C103, the control apparatus 103 searches for a username of the new presenter from the name and position information as illustrated in Figure 8A. The control apparatus 103 may find Dorothy Moore as the new presenter, and flow proceeds to C104.
[00107] In C104, the control apparatus 103 changes the Type 405 of the new presenter Dorothy Moore from “Attendee” to “Presenter”. Figure 8A illustrates the change of the Type 405 of Dorothy Moore.
[00108] In C105, the control apparatus 103 crops the face region of the new presenter from each video frame of the video of the meeting room 101 and transmits the cropped video to the client computers 106 and 107 via the first server 104. Until the presenter is switched again, the control apparatus 103 continuously crops the face region of Dorothy Moore from a video frame of the meeting room 101 and provides the cropped video to the client computers 106 and 107 via the first server 104. After C105, flow proceeds to each of S104 - S106.
[00109] As another exemplary embodiment of C105, the Type 405 may not include the type “Presenter”, and the control apparatus 103 and the client computers 106 and 107 identify the presenter by referring to presenter information 801 that is stored separately from the name and position information 120. Figure 8B illustrates the name and position information 120 and the presenter information 801 of an exemplary embodiment. As illustrated in Figure 8B, all participants are labeled as “Attendee” in the Type 405, and the presenter information 801 is used for identifying the presenter. In this exemplary embodiment, the control apparatus 103 may change the presenter name indicated in the presenter information 801 from Ken Ikeda to Dorothy Moore as illustrated in Figure 8B. [00110] As described above, the control apparatus 103 may transmit a video of the meeting room 101 and videos of the face regions via the first server 104 and may transmit the images of the whiteboards, the images of the ROI and the name and position information via the second server 105. However, this is not seen to be limiting. In another exemplary embodiment, the control apparatus 103 may transmit the video of the meeting room via the first server 104 and may transmit the videos of the face regions, the images of the whiteboards, the images of the ROI and the name and position information via the second server 105. As another example, the control apparatus 103 may transmit the video of the meeting room, the videos of the face regions, the images of the whiteboards and the images of the ROI cropped from the video frames of the meeting room 101 via the first server 104 and may transmit the HR images of the ROI and the name and position information via the second server 105.
[00111] Figure 23 illustrates an embodiment of the above disclosure whereby there are two image capture devices positioned in a meeting room as described above. With respect to Fig. 23, elements that appear with dashed lines are indicative of data and/or command messages that are generated by respective components of the control apparatus 103, wherein the respective components include executable instructions that are executed by one or more processors.
[00112] Overall control of the meeting room application and processing is the same, with the only difference being the logic control of the ROI processing. In this instance, a first camera 2301 is provided and has a field of view of the entire room and all persons and objects in the room. In addition, a second camera 2302 is provided that is dedicated to capturing one or more ROIs within the room as determined by a user performing a gesture as discussed above. The first camera 2301 and second camera 2302 are in communication with the control apparatus (103 in Fig. 1) which includes the various modules and applications that control the online meeting operation as described herein. In exemplary operation, a gesture is detected in the field of view of the first camera 2301 and, as a result, the control apparatus on which the meeting room module/application is executing, and which controls the online meeting, controls the second camera 2302 to optically zoom in on an area surrounding where the gesture was detected and capture a still image and/or a video image using the full image frame of the second camera, thereby eliminating the need for the first camera to pause live video recording during ROI capture processing. This operation will now be described.
[00113] In exemplary operation, two camera controller instances are generated for controlling the first and second cameras. As shown herein, the first camera controller 2303 is a controller that is in communication with both the first camera 2301 and the second camera 2302. In one embodiment, the second camera controller 2304 is in communication with the second camera 2302. As described herein, these instances represent separate modules that include instructions which are executed by a processor. In another embodiment, these instances may be subroutines of a single camera controller which is in communication with both the first camera 2301 and the second camera 2302. The first camera controller 2303 controls the first (main) camera 2301 and the second camera controller 2304 controls the second (ROI) camera 2302. As used herein, the first (main) camera 2301 is the camera that is capturing the entire field of view of the room and which is responsible for capturing the image frames that are processed for detecting gestures and identifying presenters as discussed above (see Fig. 1). Further, the second (ROI) camera 2302 is a camera that is selectively controlled to take a higher resolution image of a region of interest in the field of view being captured by the first camera 2301. In summary operation, the embodiment described herein includes an initial calibration process being performed to calculate an offset or compensation value that is needed when controlling the second camera 2302 to capture an ROI image. This compensation value is used to compensate for the different perspective, due to camera position, between the first camera 2301 and the second camera 2302. Further, the improved ROI image capture processing includes, upon determining that an ROI is to be captured, computing pan/tilt/zoom values for the second camera based on the x,y coordinates associated with the determined ROI as determined by the first camera 2301, sending a capture command to the second camera 2302 which includes compensation information to ensure that the second camera 2302 properly focuses on the determined ROI, awaiting the focus/zoom operation being performed using the compensation information based on the command sent to the second camera 2302, and capturing at least one of a still image of the ROI and/or initiating a live video stream of the ROI in the same manner.
[00114] A calibration step is performed whereby frames from each of the first and second cameras are acquired. In one embodiment, this calibration step is performed by the first (main) camera controller 2303, which can bidirectionally communicate with both cameras. The calibration process results in obtaining, for each of the cameras, camera control parameter values that are used to compute the pan/tilt/zoom control values of the ROI camera, given an x,y ROI center value that is specified based on a main camera frame captured by the first camera 2301. This calibration process is needed depending on the position of the first camera 2301 and second camera 2302 relative to one another. In another embodiment, the first and second cameras (2301, 2302) may be positioned substantially adjacent to each other such that the fields of view being captured at a given time are substantially the same. In this embodiment, a calibration process may not be needed. Instead, the x,y values obtained from the first camera can merely be shifted a predetermined amount in one or more directions to allow the second camera to perform PTZ processing and capture the area indicated by the first camera to be the ROI. In another embodiment, even if no shift is applied, in a case where the two cameras are placed as close to one another as possible, the ROI processing is able to obtain the ROI with a high degree of accuracy.
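The following sketch illustrates the compensation of the main-camera x,y value described above: if a calibration result is available (represented here, as an assumption, by a 3x3 homography), it is applied; otherwise the point is shifted by a predetermined offset, which may be zero when the cameras are mounted as close together as possible.

```python
import numpy as np

def compensate_xy(x, y, H=None, offset=(0, 0)):
    """Map an ROI center (x, y) in the first camera's frame to the second
    camera's frame. If a calibration mapping H (3x3 homography, one possible
    representation of the calibration result) is available it is applied;
    otherwise the point is shifted by a predetermined offset."""
    if H is not None:
        p = H @ np.array([x, y, 1.0])
        return p[0] / p[2], p[1] / p[2]
    return x + offset[0], y + offset[1]
```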
[00115] ROI processing according to this embodiment will now be described with reference to the components of Fig. 23, which shows the components embodied in the control apparatus 103 described hereinabove. During ROI processing, image frames that are captured by the first camera 2301 are provided to the various detection modules. As is relevant here, these image frames from the first camera 2301 are provided, by the first camera controller 2303, to a gesture (hands) detection module 2305. These image frames captured by the first camera 2301 are also provided to a drivers module 2306 which replicates the captured image frames into one or more virtual video driver instances as described above with respect to Fig. 1 so that these various video feeds can be communicated to a server which is facilitating the online meeting such that the video feeds are provided and accessible to one or more remote participants of the online meeting. As shown herein, the drivers module 2306 causes at least one feed driver 2309 to be created and sent to the server, and the images (or content) contained in the at least one feed driver 2309 can be viewed by the one or more remote participants as described above. In one embodiment, the feed driver 2309 includes a series of images representing the entire field of view of the meeting room being captured by the first camera 2301 as described above. The drivers module 2306 further generates an MR driver 2308 representing the entire field of view captured by the first image capture device 2301, which is then provided as input to the meeting room module 2307 which provides those images to a user device 2312 for display thereof. This user device 2312 is a user device that is used by a person in the meeting room being captured by the first camera 2301 and provides for the various meeting room control functionality described hereinabove.
[00116] In a case where the gesture detection module 2305 receives the captured image frames from the first camera controller 2303 and determines that the hand positions in one or more successively captured image frames indicate that a spotlight gesture is being performed, a detection message 2320 including ROI coordinates is generated. The detection message indicates that an ROI at the coordinates contained in the detection message is to be captured. In another embodiment, the meeting room module 2307 may receive input from a user, via the user I/O device 2312, that designates an area in the field of view captured by the first camera 2301 as a spotlight center, to also cause the ROI capture message 2320 to be generated. [00117] The detection message 2320 is provided to the first camera controller 2303 which uses the ROI information in the detection message 2320 to generate a capture message 2330 including the ROI (x,y) information. The capture message 2330 is provided to the second camera controller 2304, which receives the ROI capture message 2330 and controls the second camera 2302 to move and focus on an area having the x,y values which represent a center of the ROI/spotlight area. In the case where the calibration step was performed, the initial calibration parameters are used to adjust the main camera x,y value (from the detection message 2320) to new, compensated x,y values (in the capture message 2330) that correspond to the same position to be captured by the second camera 2302 to capture the desired ROI camera frame. The x,y values are used to compute the pan/tilt/zoom values to be sent to the camera. Upon the second camera 2302 receiving the capture command 2330, the second camera 2302 is controlled to move (e.g. pan/tilt/zoom) to the specified ROI position and perform auto focus on the desired ROI. Once focus is completed, the second camera 2302 is caused to capture a still image 2350 which is provided from the second camera controller 2304 and is sent to the server 2314, which then provides the still image 2350 to the one or more remote participants. Additionally, the second camera controller 2304, when receiving images from the second camera 2302, causes the captured image frames 2340 to be provided to the drivers module 2306 to feed an ROI driver 2310. The images 2340 fed to the ROI driver 2310 are provided to the server 2314 and operate as a user selectable feed within the online meeting which can then be viewed by the one or more remote participants as discussed above. In one embodiment, the still image 2350 is captured in 4K, as are the image frames being fed to the ROI feed 2310.
[00118] Figure 24 is a flow diagram detailing the algorithm for performing ROI capturing based on gesture detection. This algorithm can be combined with Figure 3 as discussed above, whereby steps A106 - A110 are replaced with steps A200 - A210 of Figure 24. A200 begins with a determination as to whether a gesture has been performed by a user and was detected in one or more image frames captured by the first camera. In addition to the manner in which gesture detection is determined as described hereinabove, gesture detection may also consider the pose and position of the user's head and/or face relative to the first camera. As such, in addition to the correct hand motions being performed, gesture determination processing includes determining whether the head or face of the user is looking directly at the first camera. Upon those conditions being met, the determination in A200 that a gesture has been made results in proceeding to A202, where the system outputs an indicator that a successful gesture has been detected. In one embodiment, the output indicator is a sound output by a computing device. In another embodiment, the indicator is a visual indicator such as a light that is disposed either in the room or on the first image capture device. In this instance, the light is controlled to blink in a predetermined pattern to visually indicate to the user that a gesture has been correctly detected.
[00119] In A203, an x,y position from within a frame captured by the first camera, which is substantially the center position of the desired ROI, is provided to the second camera. In this operation, the center position of the desired ROI is determined based on a current position of the hand of the user that was making the gesture and which had been captured by the first camera. In A204, the second camera receives the x,y values obtained from the first camera and shifts the x,y coordinate values based on the camera control parameters determined during the calibration process, which allows the second camera to pan, tilt and/or zoom to the correct area being indicated as an ROI. In another embodiment, A204 merely shifts the x,y values by a predetermined amount and direction. In a further embodiment, the second camera uses the received x,y values and operates to capture that area. The calibration processing and associated shift processing referenced here will be discussed hereinbelow.
[00120] In A205, the second camera is controlled using the adjusted x,y parameter values to perform pan, tilt and zoom operations to capture the desired region surrounding the position of the hand of the user as detected during the gesture detection processing. Once the second camera has been moved to capture the region surrounding the adjusted x,y parameter, image capture is performed to obtain a still image representing the ROI in A206. The captured ROI image is stored, a location indicator (URL) is associated with the ROI image, and the location indicator is provided to the meeting room control application which adds or updates a video stream that is available to all remote users in A207. Based on this, a full frame capture of the ROI is available to the remote users for selective display in the remote user interface described above. [00121] There are certain problems associated with ensuring that two-camera control for ROI processing is performed properly. For example, while a user can manipulate a PTZ camera's pan, yaw and zoom with hand gestures as discussed above, it is not uncommon that the PTZ camera may lose the view of the user for certain combinations of the pan, yaw and zoom settings. In this case, the PTZ camera cannot respond until the user reenters the scene. The second camera used to capture the ROI advantageously ensures that the first camera can continue to capture the images of the room and the user as discussed above. Thus, the first camera can send ROI capture messages with desired camera control parameters (PTZ settings) to the second camera. These desired PTZ settings for the second camera need to be derived based on the position of the second camera relative to the position of the first camera and the position of the user in the frame being captured by the first camera. Part of the problem can be thought of as a multi-camera calibration problem, but a high quality multi-camera calibration done with as little user involvement as possible remains a challenging problem which may be resolved according to the disclosure below.
[00122] Figure 25 shows a two camera setting in which a first camera (Cam A) (2301 in Fig. 23) attends to the presenter and responds to gestures being performed by a user and captured by Cam A. The second camera (Cam B) adjusts its pan, yaw or zoom to focus on the desired region of interest in the scene. Cam A and Cam B communicate with one another and know their relative positions and poses using the first and second camera controllers 2303 and 2304, respectively. The control solution described herein estimates the pose of each PTZ camera relative to one another using a camera calibration process that minimizes user input. In other embodiments, during runtime, recalibration is conducted, if necessary, when it is detected that the two cameras are repositioned in space. In this embodiment, the control apparatus optimally assigns roles to the first and second PTZ cameras (Cam A and Cam B) so that at least one is attending to the presenter (or other users in the room), and the other cameras may focus on regions of interest in the scene, as instructed by the presenter (or other user). Figure 25 illustrates a two camera scene. Note that this setting can be extended to include more than two cameras. In exemplary operation, Cam A captures the presenter's hand gesture and sends live streams to the control apparatus 103 (illustrated herein as a PC) that is running the meeting room application discussed above. The meeting room application interprets the hand gesture to determine that a predetermined gesture is made and sends a control signal to Cam B to adjust its PTZ setting. The relative poses of Cam A and Cam B are estimated by the meeting room application based on their respective live streams as discussed below.
[00123] Figure 26 is a flow diagram of the algorithm for controlling the first and second cameras (2301 and 2302 in Fig. 23). The algorithm is embodied as a set of instructions stored in a memory and executed by a CPU. The algorithm includes two modules that run in parallel: the Calibration module 2601 and the Operation module 2610. The Calibration module 2601 estimates (or re-estimates) the relative camera poses of all the PTZ cameras (e.g. the first and second cameras) in a given environment. The Operation module 2610 captures and interprets the hand gestures and computes the desired adjusted camera control parameters (e.g. new PTZ settings) based on the current camera poses. The Operation module 2610 includes various aspects of the gesture detection module 2305 and one or both of the first and second camera controllers 2303 and 2304, respectively. The adjusted camera control parameters will update the camera pose information saved in the memory and shared with the Calibration module 2601. [00124] The Calibration module 2601 determines the coordination between the first (PTZ) camera 2301 and second (PTZ) camera 2302. As such, the Calibration module 2601 needs to know a current position and field of view associated with each of the first and second cameras (2301 and 2302) so that the Calibration module 2601 can estimate the relative camera poses when necessary. In one embodiment, the estimation is performed when the meeting room application is first launched and prior to an online meeting being initiated.
[00125] In one embodiment, multiple cameras are calibrated using a calibration approach which requires user input and participation. In this approach, a user presents a checkerboard so that all cameras see the same checkerboard simultaneously from their respective points of view. With the known planar 3D structure, the checkerboard yields multiple 2D projections in each of the cameras. The relative camera poses can then be estimated using 3D projective geometry.
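A minimal sketch of the per-camera pose estimation underlying this checkerboard approach is shown below, using OpenCV. Running it on a simultaneous frame from each camera and chaining the resulting poses gives the relative camera poses. The board size, square size, and the availability of intrinsics K and distortion coefficients are assumptions.

```python
import cv2
import numpy as np

def camera_pose_from_checkerboard(gray, K, dist, board_size=(9, 6), square=0.025):
    """Estimate one camera's 4x4 pose relative to a checkerboard seen
    simultaneously by all cameras. K and dist are the camera's intrinsic
    matrix and distortion coefficients (assumed known)."""
    found, corners = cv2.findChessboardCorners(gray, board_size)
    if not found:
        return None
    # Planar 3D board points (Z = 0) in board coordinates.
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square
    ok, rvec, tvec = cv2.solvePnP(objp, corners, K, dist)
    R, _ = cv2.Rodrigues(rvec)
    M = np.eye(4)
    M[:3, :3], M[:3, 3] = R, tvec.ravel()
    return M  # board-to-camera transform
```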
[00126] In another embodiment, automatic calibration is performed by estimating the cameras' relative poses without direct user inputs. To achieve this, a predetermined region that has been identified and which is viewable by both cameras is needed. In operation, the whiteboard (as shown in Figs. 1 - 3) is predefined in the meeting room application using the four corner points of the whiteboard. Based on whiteboard correction algorithms performed in order to obtain a substantially rectangular shape of the whiteboard from the predefined location information, an accurate estimation of the aspect ratio of the whiteboard's rectangular shape is obtained. With the known aspect ratio and planar 3D structure, together with the four corner points showing up in each camera view, a calibration can be performed.
[00127] More specifically, this process is performed according to the following algorithmic steps. P represents the homogeneous coordinates of a point in real space, C is the intrinsic camera parameter matrix, M is a rotation and translation matrix, and P_l is the projection of the point on the image plane of the first camera. Then

P_l = M_l P

Similarly, for the point's projection on the second camera we have the following equation

P_r = M_r P

With four points that each provide two constraining equations (both x and y in the image coordinates), the six-degree-of-freedom rotation and translation matrices M_l and M_r are obtained. Note that the knowledge of the aspect ratio of the whiteboard yields enough information to solve for M_l or M_r. On the other hand, P_l and P_r are related through the same point P. Therefore a matrix M exists such that

P_l = M P_r

Here, M is a 4x4 matrix that translates and rotates P_r to P_l, i.e., the relative pose between the two cameras, M = [R t; 0 1], where R is the rotation matrix and t is the translation vector. From the above three equations M is obtained as follows:

M = M_l M_r^(-1)
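A short numerical sketch of the relative pose computation above, assuming the per-camera 4x4 pose matrices M_l and M_r have already been solved for (for example, from the whiteboard corner points or the checkerboard sketch given earlier):

```python
import numpy as np

def relative_pose(M_l, M_r):
    """Given 4x4 pose matrices that project the same physical points into the
    left and right cameras (P_l = M_l P, P_r = M_r P), return the 4x4 matrix
    M with P_l = M P_r, i.e. M = M_l @ inv(M_r)."""
    return M_l @ np.linalg.inv(M_r)
```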
[00128] In another embodiment, automatic calibration is performed using registered 2D features of a user's body such that their corresponding 3D information can be inferred when the 2D image is provided as input to a trained machine learning model that has been trained to relate image pixels to 3D human body models. Examples of such networks include DensePose, which, given images of a human body, returns a UV map for each body part, or OpenPose, which returns 2D points representing a human body's joints. Fig. 27 illustrates the idea of multi-camera calibration with estimated human body joints. The above equations are applied to find the rotation and translation matrices as discussed above, with the exception that the points are now projected human joints instead of the four corners of a whiteboard.
[00129] Human body joint estimations tend to be noisy, even ones obtained through state of the art deep networks. Large noise can have a large impact on calibration accuracy. To reduce the impact of noise, a sequence of image frames (e.g. a short video clip of a human figure in action) is processed, and further constraints are applied on the estimated joints while obtaining M_r, M_l and M. That is, the rotation and translation matrices are obtained using, for example, RANSAC or similar techniques. Further constraints are applied such that some additional computational cost terms are minimized. Such terms may include fixed limb lengths or smooth motion between frames. Note that solving for rotation and translation for a given sequence of image frames is possible with standard structure from motion (SFM) techniques that tend to yield more robust results than when dealing with a single static image. [00130] For human body joints to be used as features for the purpose of multi-camera calibration, the user must appear at a reasonable distance from the camera so that the whole body is visible to the camera. In situations where the human figure is too close to the camera, e.g., only the upper half of the body is visible, some image frames may be more useful for calibration than others. In those frames, the body joints contain richer information in relative depth. For example, an image frame in which the user is extending out their arms is very useful for the calibration purpose. In yet more extreme cases where only the human face can be seen, similar ideas and techniques apply too. For example, one can obtain 3D facial landmarks for the face and use those landmarks to solve for camera poses. Therefore, in some embodiments where only part of the body joints or only the face is available, those features that are more reliable are used and the same techniques are applied to solve for camera poses.
[00131] In yet another embodiment, when no known or estimated 3D structures are available, the algorithm obtains as many 2D correspondence features as possible from the images being captured and estimates the relative camera poses based on epipolar geometry. More specifically, let E be the essential matrix that relates the two projections P_l and P_r on the left and right image planes for a point P in space, which yields the following
P_r^T E P_l = 0
The essential matrix E is decomposed into a rotation matrix and a translation vector which specify the relative poses between the cameras. Because 3D information is not available in this case, the camera's intrinsic matrix cannot be estimated. Rather, it has to be obtained through other means. Nevertheless, this embodiment advantageously calibrates the environment where 3D information in the scene is unavailable. Corresponding 2D features are obtained between the camera views from the two cameras. For example, computer vision techniques are used to obtain 2D corresponding features (e.g. the Lucas-Kanade algorithm). This embodiment for calibrating the two camera environment enables calibration where the separation of the two cameras is not large. For large separations, 2D correspondence features are not easy to establish. In these cases, networks trained to identify and re-identify users or objects in different scenes are used. The essential matrix E can still be obtained by treating the identified users or objects in different scenes as 2D correspondence features. Fig. 29 illustrates this idea. Objects (O1, O2) and users (U1, U2) are identified in both camera scenes and can therefore be used as correspondence features. For example, one may obtain relative camera poses based on techniques described in at least one of the embodiments above. While
one may obtain camera poses based on the whiteboard or on human joints or other features, a more robust calibration can be performed based on a combination of all the features, optimally weighted according to their respective reliabilities in different situations. Although explained in terms of two PTZ cameras, the ideas and techniques mentioned above can easily be extended to the multiple camera case. For example, camera calibration may be conducted between a pair of cameras first, and then a bundle optimization may be conducted for all cameras globally.
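The sketch below illustrates the essential-matrix approach with OpenCV: matched 2D features (e.g. Lucas-Kanade tracks, or re-identified users/objects treated as correspondences) are used to estimate E and decompose it into a rotation and a unit-scale translation. The intrinsic matrix K is assumed to be obtained by other means, as noted above.

```python
import cv2
import numpy as np

def pose_from_correspondences(pts_a, pts_b, K):
    """Estimate the relative pose between two camera views from matched 2D
    feature points. K is an intrinsic matrix obtained by other means."""
    pts_a = np.asarray(pts_a, np.float32)
    pts_b = np.asarray(pts_b, np.float32)
    E, inliers = cv2.findEssentialMat(pts_a, pts_b, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=inliers)
    return R, t  # rotation matrix and unit-scale translation vector
```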
[00132] Recalibration of the cameras may also be performed during runtime operation, particularly when it is determined that one or more of the cameras have been moved to a different position. Recalibration processing may be determined to be necessary based on user input or by checking for the presence of a predetermined object or other invariant geometric features such as a vanishing point that corresponds to the level ground. In certain instances, the predetermined object may be a whiteboard. In others, the predetermined object may be a table or other furniture in the room. If no new calibration is needed, the current camera poses as determined from the calibration processing described above are read from the memory.
[00133] While the camera setup described herein refers to the first and second cameras, it should be understood that, at startup of the meeting room application, the assignment of which camera is the first camera and which is the second camera can be performed dynamically based on a determination that one of the cameras has a better view of a particular user, such as the presenter. This can either be set manually by the user or set automatically. The presenter camera may adjust its settings to maintain a good view of the presenter through tracking.
[00134] In addition to the calibration module, the system includes an operation module which executes concurrently. In this module, the camera designated as the first camera captures a live view of the entire room and provides, as input to the hand detection module, captured image frames that are used to detect and interpret the hand gestures captured therein. If a hand gesture suggests a change in the region of interest in the scene, the first camera controls the second camera to adjust one or more camera control parameters (PTZ settings) to refocus the field of view of the second camera onto the region surrounding the detected gesture captured by the first camera.
[00135] A gesture detection and interpretation algorithm is continually running and analyzes image frames captured by the first camera. The algorithm should run on frames only from the first camera so that the computational load remains the same in the multi-camera case as in the single camera case. The desired new region of interest is measured in pixel values in the image coordinates of the first camera. This is denoted by p'_m. To generate adjusted camera settings representing the new coordinates which the second camera is to capture so that it may refocus onto the desired region, the corresponding new region in pixels in the second camera image, denoted by p'_s, is obtained. Using the camera settings from the calibration module, C_m and C_s are the intrinsic matrices of the two cameras, respectively. M = [R t; 0 1] is the transformation matrix that relates the two cameras in Eq. 4, where R and t represent rotation and translation respectively. E is the essential matrix either obtained by solving Eq. 5, or derived from M based on (3). As such, the relationship between p'_m and p'_s is given by

p'_m^T F p'_s = 0

Here F = C_m^(-T) E C_s^(-1) is the fundamental matrix that directly relates p'_s to p'_m. With p'_s computed given p'_m, the meeting room application can then notify the second camera to refocus on p'_s in its own image pixel coordinates. In some embodiments, the derived p'_s might be outside of the second camera's current view. This may occur when the second camera happens to be zooming in on a region in the scene by adjusting down its original focal length. Depending on the user experience, one embodiment may automatically control the second camera to revert back to the saved calibrated settings, thereby zooming out and restoring the original focal length to bring the desired region back into the scene. Refocusing may then follow. Alternatively, the desired angle change may be computed based on p'_s and the second camera's pose may be adjusted without changing the focal length. The two alternatives are mathematically equivalent. Which one should be used depends on which solution yields the better visual experience. Once a refocus is conducted successfully, the new PTZ setting will be updated in the memory to reflect the current states of all the cameras at work. These settings are shared with the Calibration module.
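The sketch below computes the fundamental matrix F from the intrinsics and the essential matrix as reconstructed above. Note that, strictly, the epipolar constraint only confines p'_s to a line in the second camera image; selecting the actual point on that line (e.g. from a depth or scene prior) is an additional step that this sketch leaves open.

```python
import numpy as np

def epipolar_line_in_roi_camera(p_m, E, C_m, C_s):
    """Compute F = inv(C_m).T @ E @ inv(C_s), consistent with the constraint
    p'_m^T F p'_s = 0, and return the epipolar line in the second (ROI)
    camera image on which the desired point p'_s must lie."""
    F = np.linalg.inv(C_m).T @ E @ np.linalg.inv(C_s)
    p_m_h = np.array([p_m[0], p_m[1], 1.0])
    line = F.T @ p_m_h   # line coefficients (a, b, c): a*x + b*y + c = 0
    return F, line
```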
[00136] Figure 19 illustrates the hardware that represents any of the camera 102, the control apparatus 103, the first server 104, the second server 105, the client computers 106/107 and the user recognition service 121 that can be used in implementing the above described disclosure. The apparatus includes a CPU 501, a RAM 502, a ROM 503, an input unit, an external interface, and an output unit. The CPU 501 controls the apparatus by using a computer program (one or more series of stored instructions executable by the CPU 501) and data stored in the RAM 502 or ROM 503. Here, the apparatus may include one or more items of dedicated hardware or a graphics processing unit (GPU), which is different from the CPU 501, and the GPU or the dedicated hardware may perform a part of the processes performed by the CPU 501. Examples of the dedicated hardware include an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), and the like. The RAM 502 temporarily stores the computer program or data read from the ROM 503, data supplied from outside via the external interface, and the like. The ROM 503 stores the computer program and data which do not need to be modified and which can control the base operation of the apparatus. The input unit is composed of, for example, a joystick, a jog dial, a touch panel, a keyboard, a mouse, or the like, and receives user operations and inputs various instructions to the CPU 501. The external interface communicates with external devices such as a PC, a smartphone, a camera and the like. The communication with the external devices may be performed by wire using a local area network (LAN) cable, a serial digital interface (SDI) cable, a Wi-Fi connection or the like, or may be performed wirelessly via an antenna. The output unit is composed of, for example, a display unit such as a display and a sound output unit such as a speaker, and displays a graphical user interface (GUI) and outputs a guiding sound so that the user can operate the apparatus as needed.
[00137] The scope of the present disclosure includes a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform one or more embodiments of the invention described herein. Examples of a computer-readable medium include a hard disk, a floppy disk, a magneto-optical disk (MO), a compact-disk read-only memory (CD-ROM), a compact disk recordable (CD-R), a CD-Rewritable (CD-RW), a digital versatile disk ROM (DVD-ROM), a DVD-RAM, a DVD-RW, a DVD+RW, magnetic tape, a nonvolatile memory card, and a ROM. Computer-executable instructions can also be supplied to the computer-readable storage medium by being downloaded via a network.
[00138] The use of the terms “a” and “an” and “the” and similar referents in the context of this disclosure describing one or more aspects of the invention (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the subject matter disclosed herein and does not pose a limitation on the scope of any invention derived from the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential.
[00139] It will be appreciated that the instant disclosure can be incorporated in the form of a variety of embodiments, only a few of which are disclosed herein. Variations of those embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. Accordingly, this disclosure and any invention derived therefrom include all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

Claims

We claim:
1. A control apparatus for controlling an online meeting, the apparatus comprising: one or more processors; and one or more memories storing instructions that, when executed, configure the one or more processors to: receive, from a first camera, a captured video of a meeting room; determine from the captured video that a predetermined gesture is performed; control a second camera to perform image capture of a region surrounding a position in the captured video where the gesture was determined to be performed; and transmit the captured image of the region for display in a user interface.
2. The control apparatus according to claim 1, wherein execution of the stored instructions further configures the one or more processors to: communicate an image capture message to the second camera including one or more camera parameters; and control the second camera to perform image capture using the one or more camera parameters.
3. The control apparatus according to claim 2, wherein the one or more image capture parameters are translated based on pre-capture calibration that identifies a positional relationship between the first camera and the second camera.
4. The control apparatus according to claim 1, wherein execution of the stored instructions further configures the one or more processors to: obtain position information from a frame of the captured video captured by the first camera; generate adjusted position information based on a predetermined positional relationship between the first and second cameras; and provide the adjusted position information to the second camera to control the second camera to perform image capture.
5. The control apparatus according to claim 1, wherein execution of the stored instructions further configures the one or more processors to: calibrate a field of view of the first camera and the second camera using one or more common points in real space such that a starting position of each of the first and second cameras is identified; and wherein the control of the second camera to perform image capture is performed based on the calibrated starting position of the second camera.
6. A method of controlling an online meeting, comprising: receiving, from a first camera, a captured video of a meeting room; determining from the captured video that a predetermined gesture is performed; controlling a second camera to perform image capture of a region surrounding a position in the captured video where the gesture was determined to be performed; and transmitting the captured image of the region for display in a user interface.
7. The control method according to claim 6, further comprising communicating an image capture message to the second camera including one or more camera parameters; and controlling the second camera to perform image capture using the one or more camera parameters.
8. The control method according to claim 7, wherein the one or more image capture parameters are translated based on pre-capture calibration that identifies a positional relationship between the first camera and the second camera.
9. The control method according to claim 6, further comprising obtaining position information from a frame of the captured video captured by the first camera; generating adjusted position information based on a predetermined positional relationship between the first and second cameras; and providing the adjusted position information to the second camera to control the second camera to perform image capture.
10. The control method according to claim 6, further comprising calibrating a field of view of the first camera and the second camera using one or more common points in real space such that a starting position of each of the first and second cameras is identified; and wherein the control of the second camera to perform image capture is performed based on the calibrated starting position of the second camera.
11. A non-transitory computer readable storage medium storing instructions that, when executed by at least one processor, configure a control apparatus that is in communication with at least two cameras to perform a control method, the control method comprising: receiving, from a first camera of the at least two cameras, a captured video of a meeting room; determining from the captured video that a predetermined gesture is performed; controlling a second camera of the at least two cameras to perform image capture of a region surrounding a position in the captured video where the gesture was determined to be performed; and transmitting the captured image of the region for display in a user interface.
PCT/US2023/085537 2022-12-29 2023-12-21 System and method for multi-camera control and capture processing WO2024145189A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263477770P 2022-12-29 2022-12-29
US63/477,770 2022-12-29

Publications (1)

Publication Number Publication Date
WO2024145189A1 true WO2024145189A1 (en) 2024-07-04

Family

ID=91719131

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/085537 WO2024145189A1 (en) 2022-12-29 2023-12-21 System and method for multi-camera control and capture processing

Country Status (1)

Country Link
WO (1) WO2024145189A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020088721A (en) * 2018-11-29 2020-06-04 シャープ株式会社 Information processing apparatus, control method thereof, control program, and recording medium
KR20210006888A (en) * 2018-05-08 2021-01-19 소니 세미컨덕터 솔루션즈 가부시키가이샤 Image processing device, mobile device and method, and program
US20210294427A1 (en) * 2020-03-20 2021-09-23 Wei Zhou Methods and systems for controlling a device using hand gestures in multi-user environment
US20220141379A1 (en) * 2019-05-31 2022-05-05 Microsoft Technology Licensing, Llc Techniques to set focus in camera in a mixed-reality environment with hand gesture interaction
US20220377177A1 (en) * 2021-05-24 2022-11-24 Konica Minolta, Inc. Conferencing System, Server, Information Processing Device and Non-Transitory Recording Medium

Similar Documents

Publication Publication Date Title
US10192284B2 (en) Method for managing surveillance system with aid of panoramic map, and associated apparatus
US20150116502A1 (en) Apparatus and method for dynamically selecting multiple cameras to track target object
KR101665229B1 (en) Control of enhanced communication between remote participants using augmented and virtual reality
US8502857B2 (en) System and method for combining a plurality of video stream generated in a videoconference
US9531938B2 (en) Image-capturing apparatus
CN106911889B (en) Image blur correction apparatus and tilt correction apparatus, and control methods thereof
US9832362B2 (en) Image-capturing apparatus
JP6249248B2 (en) Projection device
US20170316582A1 (en) Robust Head Pose Estimation with a Depth Camera
WO2016119301A1 (en) Terminal, and image capturing method and device
JP2010533416A (en) Automatic camera control method and system
JP2008011497A (en) Camera apparatus
KR20110108265A (en) Control device, camera system and program
KR20120030403A (en) Control device, camera system, and program
JP2011166305A (en) Image processing apparatus and imaging apparatus
WO2019104569A1 (en) Focusing method and device, and readable storage medium
JP2021105694A (en) Imaging apparatus and method for controlling the same
JP2016127571A (en) Camera system, display control device, display control method, and program
JP2021124669A (en) Electronic apparatus
US9300860B2 (en) Image-capturing apparatus
JP2020042665A (en) Information processing apparatus, control method thereof, and program
US8810641B2 (en) Television, control method and control device for the television
US9420161B2 (en) Image-capturing apparatus
US9392223B2 (en) Method for controlling visual light source, terminal, and video conference system
EP3376758A1 (en) Dynamic videotelephony systems and methods of using the same