CN117294945A

CN117294945A - Intelligent conference method capable of automatically aligning face of speaker through guide rail camera

Info

Publication number: CN117294945A
Application number: CN202311215403.3A
Authority: CN
Inventors: 曾泳豪; 朱正辉; 明德; 余吉昌; 池旺钊; 区文焯
Original assignee: Guangdong Baolun Electronics Co ltd
Current assignee: Guangdong Baolun Electronics Co ltd
Priority date: 2023-09-19
Filing date: 2023-09-19
Publication date: 2023-12-26

Abstract

The invention relates to the field of video conferencing and discloses an intelligent conference method for automatically aligning a guide rail camera with the face of a speaker. The method comprises: establishing a virtual model of a conference table and the guide rail camera, and marking the position coordinates of each conference seat on the virtual model; identifying the conference seat of the speaker, controlling a moving end to move to the guide rail position closest to the position coordinates of that seat, and controlling a steering mechanism so that the camera always faces the position coordinates; recognizing the face in the camera image, and the plane in which the face lies, using a face detection algorithm; calculating the center coordinates of the face and controlling the steering mechanism to turn the camera toward the center coordinates; and controlling a guide rail movement assembly to move the camera to a position perpendicular to that plane. The invention reduces the cost of camera equipment, allows the camera to move to face any participant, and automatically tracks the participant's face to keep it in the center of the image, thereby improving the shooting quality of the video conference and the participants' experience.

Description

Intelligent conference method capable of automatically aligning face of speaker through guide rail camera

Technical Field

The invention relates to the field of video conferences, in particular to an intelligent conference method for automatically aligning the face of a speaker through a guide rail camera.

Background

The development of network technology has made real-time video communication and video conferencing feasible. Video conference equipment has gradually matured and continues to advance toward higher resolution and lower transmission delay. To improve video quality, the professional camera equipment now in use offers ever higher pixel counts and frame rates, and its cost rises accordingly.

In existing video conference systems the camera position is usually fixed, so the shooting angle is fixed and the camera cannot adjust its orientation automatically. If the person moves even slightly, he or she drifts away from the center of the image, so a speaker must sit upright for long periods or stand directly opposite the camera in order to stay centered. The speaker finds it difficult to relax, which adds to the burden of the meeting. Moreover, unless a separate camera is provided for each conference seat, some speakers cannot face a camera when speaking from their seats. The commonly used panoramic camera captures only the side profile of most participants, and speaking while turned sideways adds further burden.

To address these problems, existing schemes shoot the speaker with multiple cameras. Although this adds a few viewing angles, multiple camera devices are very costly, and both the switching between images and the positions at which the cameras can be aimed remain inflexible. In addition, when several people around the conference table speak in turn, too few cameras cannot face every speaker, while providing a separate camera device for each speaker's seat is too expensive.

Disclosure of Invention

To solve the problems that, in existing video conference systems, some speakers cannot face a camera when speaking from their seats and that providing multiple cameras is too costly, the invention provides an intelligent conference method for automatically aligning the face of a speaker through a guide rail camera.

The guide rail camera comprises a steering mechanism, a camera and a guide rail movement assembly; the guide rail movement assembly comprises a guide rail and a moving end; the guide rail is fixed on the conference table, the moving end is slidably connected to the guide rail, and the camera is rotatably mounted on the moving end through the steering mechanism;

the intelligent conference method for automatically aligning the face of a speaker through the guide rail camera comprises the following steps:

establishing virtual models of a conference table and a guide rail camera, and marking position coordinates of a conference seat on the virtual models;

identifying the conference seat of the speaker, controlling the moving end to move to the guide rail position closest to the position coordinates of that conference seat, and simultaneously controlling the steering mechanism so that the camera always faces the position coordinates;

recognizing a face in the camera by using a face detection algorithm;

calculating the center coordinates of the face, and controlling a steering mechanism to drive the camera to face the center coordinates according to the position relation of the center coordinates relative to the camera;

recognizing the plane of a face in the camera by using a face detection algorithm;

and calculating the normal vector of the face plane at the center coordinates of the face, calculating the guide rail coordinate at which the vertical plane containing the normal vector intersects the guide rail, and controlling the guide rail movement assembly to drive the camera to that guide rail coordinate.

Preferably, recognizing the face in the camera using the face detection algorithm specifically comprises:

connecting to the camera and capturing a video stream using the Python programming language and a library, and acquiring a plurality of frames from the video stream;

for each acquired frame, detecting whether a face exists in the image through a Haar classifier;

if so, determining the position and size of the face in the image by applying a cascade classifier to the image;

otherwise, continuing to detect the next frame.

Preferably, determining the position and size of the face in the image by applying a cascade classifier comprises the following specific steps:

converting the image into a grayscale image: converting the image into a grayscale image using the cv2.cvtColor() function of OpenCV;

face detection using a Haar classifier: detecting the face position on the grayscale image with a Haar classifier through the detectMultiScale() function, which returns a list of rectangles containing the detected face position information;

drawing a face frame: for each detected face position, drawing a rectangular box on the original image using the cv2.rectangle() function of OpenCV, as the position and size information of the face.

Preferably, identifying the conference seat of the speaker specifically comprises:

providing a talk button on the conference table opposite each conference seat;

when one of the talk buttons is triggered, identifying the conference seat opposite that talk button as the conference seat where the speaker is located.

Preferably, the guide rail camera further comprises an omnidirectional pickup microphone fixedly connected to the moving end, and identifying the conference seat of the speaker specifically comprises:

locating the sound source direction of the speaker relative to the moving end through the omnidirectional pickup microphone, using a sound source positioning algorithm;

and calculating the conference seat where the speaker is located from the sound source direction and the current position of the moving end in the virtual model.

Preferably, identifying the conference seat of the speaker specifically comprises:

fixedly arranging at least one panoramic camera in the conference room, and shooting the faces of all participants through the panoramic camera;

marking the position coordinates of each panoramic camera in the virtual model;

recognizing the faces of all participants in the image shot by the panoramic camera through a face detection algorithm;

identifying the current speaker from the participants through a mouth shape identification intelligent algorithm;

and calculating the conference seat where the speaker is located through the position of the speaker in the shot image and the position coordinates of the panoramic camera in the virtual model.

Preferably, the method further comprises the steps of:

when the moving end is detected to be in a moving state, the image of the video conference is switched to the image shot by the panoramic camera.

Preferably, recognizing the face in the camera using the face detection algorithm further comprises the following step:

when the image shot by the camera is recognized as containing a plurality of faces, determining by calculation which face corresponds to the speaker, according to the position of the camera in the virtual model and the conference seat where the speaker is located.

Preferably, the method further comprises the steps of:

recognizing the face recognition data of the speaker through a face recognition algorithm, querying the matching identity information using the face recognition data, and displaying the identity information in the video image.

The beneficial effects of the invention are as follows:

(1) By mounting the camera on a slide rail that runs along the surface of the conference table, and combining this with rotation of the camera, a single camera can directly face any participant at the table, so that every participant can speak facing the camera. This saves the cost of the camera equipment needed for multi-camera shooting, and also saves the labor of manually moving camera equipment and spares participants the awkwardness of facing a camera operator.

(2) Through the face recognition algorithm, however the speaker moves his or her body and face while speaking, the camera can be controlled to move and track the face position in real time so that the face stays in the center of the image at all times. This greatly improves the participants' practical experience of the video conference; the speaker can relax and need not constantly check whether he or she has drifted away from the camera, so the user experience is more comfortable and the conference video is more intelligent.

Preferably, the direction of the speaker at the conference table is automatically identified through the omnidirectional pickup microphone and the sound source positioning algorithm, so that the camera automatically turns to face the participant who is speaking.

Preferably, the video is switched to the image of the panoramic camera while the camera is moving, and switched back to the image of the camera once it has moved to the position facing the speaker. This avoids image shake and the capture of non-speaking participants while the camera moves, which would make the video conference look disjointed and cause dizziness.

Drawings

The invention will be further described with reference to the accompanying drawings, in which:

FIG. 1 is a flow chart of a method according to a first embodiment of the invention;

FIG. 2 is a schematic view of a conference table according to a second embodiment of the present invention.

In the figures: 1. a conference table; 2. an omnidirectional pickup microphone; 3. a steering mechanism; 4. a camera; 5. a guide rail movement assembly.

Detailed Description

The embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.

In face recognition, once the face has been detected and its key feature points located, the main face region is cropped, preprocessed and fed to a back-end recognition algorithm. Face recognition is a mature prior-art technology; many existing modules on the market can be called directly and many implementations are possible, so it is not described again in the following embodiments. Note that the face detection algorithm of this scheme needs to detect the plane and the center point of the face.

Referring to FIG. 1, a first embodiment of the present invention discloses an intelligent conference method for automatically aligning the face of a speaker through a guide rail camera. The guide rail camera includes a steering mechanism 3, a camera 4 and a guide rail movement assembly 5. The guide rail movement assembly 5 includes a guide rail and a moving end; the guide rail is fixed on the conference table 1, the moving end is slidably connected to the guide rail, and the camera 4 is rotatably mounted on the moving end through the steering mechanism 3.

Meanwhile, the intelligent conference all-in-one machine of the scheme further comprises display equipment and sound equipment, wherein the display equipment is arranged in a conference room and can be a display screen, a desktop display or a projector.

The intelligent conference method for automatically aligning the face of the speaker through the guide rail camera comprises the following steps:

S1, establishing a virtual model (a plane model or a three-dimensional model) of the conference table 1 and the guide rail camera, and marking the position coordinates of each conference seat on the virtual model;

S2, identifying the conference seat of the speaker, controlling the moving end to move to the guide rail position closest to the position coordinates of that conference seat, and simultaneously controlling the steering mechanism 3 so that the camera 4 always faces the position coordinates;

S3, recognizing the face in the camera 4 using a face detection algorithm;

S4, calculating the center coordinates of the face, calculating the positional relation of the center coordinates relative to the camera 4, and controlling the steering mechanism 3 to turn the camera 4 toward the center coordinates according to that positional relation;

S5, recognizing the plane in which the face in the camera 4 lies using the face detection algorithm;

S6, calculating the normal vector of the face plane at the center coordinates of the face, calculating the guide rail coordinate at which the vertical plane containing the normal vector intersects the guide rail, and controlling the guide rail movement assembly 5 to drive the camera 4 to that guide rail coordinate.
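Step S6 can be illustrated with a minimal plan-view sketch. It assumes, beyond what the text states, that the virtual model is a 2D plane, that the guide rail is a straight segment, and that the face normal has already been projected onto the table plane; the function name `rail_target` and its return convention are hypothetical.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def rail_target(face_center: Point, face_normal: Point,
                rail_start: Point, rail_end: Point) -> Optional[float]:
    """Return the rail parameter s in [0, 1] where the face normal meets the rail.

    Solves face_center + t * face_normal == rail_start + s * (rail_end - rail_start)
    in the table plane. Returns None if the normal is parallel to the rail or
    points away from it. s is clamped to the physical ends of the rail.
    """
    cx, cy = face_center
    nx, ny = face_normal
    ax, ay = rail_start
    dx, dy = rail_end[0] - ax, rail_end[1] - ay

    denom = nx * dy - ny * dx          # 2D cross product of normal and rail direction
    if abs(denom) < 1e-9:
        return None                    # normal is parallel to the rail

    rx, ry = ax - cx, ay - cy          # vector from the face center to the rail start
    t = (rx * dy - ry * dx) / denom    # distance along the normal
    s = (rx * ny - nx * ry) / denom    # position along the rail
    if t <= 0:
        return None                    # the face is looking away from the rail
    return max(0.0, min(1.0, s))
```

With a rail from (0, 0) to (2, 0) and a face at (0.5, 1.0) looking straight at the rail, the sketch yields s = 0.25, i.e. the rail point directly in front of the face.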

Preferably, in step S2 of this embodiment, the conference seat of the speaker is identified as follows:

S211, providing a talk button on the conference table 1 opposite each conference seat;

S212, when one of the talk buttons is triggered, identifying the conference seat opposite that talk button as the conference seat where the speaker is located.
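The chain from a talk button to a camera pose (S211, S212 and the movement in S2) might be sketched as follows. The seat coordinates, button numbering and straight-rail geometry are illustrative assumptions, and `on_button_pressed` is a hypothetical handler name rather than part of the disclosed system.

```python
import math

# Hypothetical virtual model: seat coordinates in metres and one talk button per seat.
SEATS = {"A": (0.5, 1.0), "B": (1.5, 1.0), "C": (1.0, -1.0)}
BUTTON_TO_SEAT = {1: "A", 2: "B", 3: "C"}
RAIL_START, RAIL_END = (0.0, 0.0), (2.0, 0.0)   # assumed straight guide rail


def nearest_rail_point(seat, rail_start, rail_end):
    """Project the seat onto the rail segment and clamp to the rail ends."""
    ax, ay = rail_start
    dx, dy = rail_end[0] - ax, rail_end[1] - ay
    s = ((seat[0] - ax) * dx + (seat[1] - ay) * dy) / (dx * dx + dy * dy)
    s = max(0.0, min(1.0, s))
    return (ax + s * dx, ay + s * dy)


def pan_angle_deg(camera_xy, target_xy):
    """Bearing from the camera to the target, counter-clockwise from the +x axis."""
    return math.degrees(math.atan2(target_xy[1] - camera_xy[1],
                                   target_xy[0] - camera_xy[0]))


def on_button_pressed(button_id):
    """Return (rail position, pan angle) for the seat opposite the pressed button."""
    seat_xy = SEATS[BUTTON_TO_SEAT[button_id]]
    rail_xy = nearest_rail_point(seat_xy, RAIL_START, RAIL_END)
    return rail_xy, pan_angle_deg(rail_xy, seat_xy)
```

Pressing button 1 in this toy layout sends the moving end to (0.5, 0.0) and points the camera at a bearing of 90°, directly toward seat A.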

Preferably, step S3 of this embodiment specifically includes the following sub-steps:

S31, connecting to the camera 4 and capturing a video stream using the Python programming language and a library (such as OpenCV), and acquiring a plurality of frames (every frame, or frames sampled at intervals) from the video stream;

S32, for each acquired frame, detecting whether a face exists in the image through a Haar classifier;

S33, if so, determining the position and size of the face in the image by applying a cascade classifier to the image;

S34, if not, continuing to detect the next frame.

Preferably, in step S33, the position and size of the face in the image are determined by applying a cascade classifier to the image, with the following specific steps:

S331, converting the image into a grayscale image: converting the image into a grayscale image using the cv2.cvtColor() function of OpenCV;

S332, performing face detection using a Haar classifier: detecting the face position on the grayscale image with a pre-trained Haar classifier through the detectMultiScale() function, which returns a list of rectangles containing the detected face position information;

S333, drawing a face frame: for each detected face position (represented by (x, y, w, h)), drawing a rectangular box on the original image using the cv2.rectangle() function of OpenCV, as the position and size information of the face.

An example of code for the above sub-steps is as follows:

import cv2
import camera_control_library  # placeholder name for the camera control library

# Load the pre-trained Haar classifier
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

# Initialize the camera and the control of camera 4
camera = cv2.VideoCapture(0)
camera_control = camera_control_library

while True:
    ret, frame = camera.read()
    if not ret:
        break  # stop when no frame can be read
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Face detection using the Haar classifier
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    # Continuously calculate the offset and control the camera
    for (x, y, w, h) in faces:
        # Draw the face frame (sub-step S333)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        # Calculate the offset between the face center and the screen center
        face_center_x = x + w // 2
        screen_center_x = frame.shape[1] // 2
        offset = face_center_x - screen_center_x
        # Steer camera 4 to keep the face centered
        camera_control.adjust_position(offset)

    cv2.imshow('Face Detection', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

camera.release()
cv2.destroyAllWindows()

In this embodiment, through the face recognition algorithm, however the speaker moves his or her body and face while speaking, the camera 4 can be controlled to move and track the face position in real time. This greatly improves the participants' practical experience of the video conference: the speaker can relax and need not keep checking whether he or she has drifted away from the camera 4, so the user experience is more comfortable, the conference video is more intelligent, and the image captured of the speaker is clearer and meets user expectations.
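The code example above passes a raw pixel offset to the control library. One possible way to turn such an offset into a steering angle for step S4 is sketched below, assuming an ideal pinhole camera with a known field of view; the field-of-view value and both function names are assumptions, not taken from the source.

```python
import math


def face_center_offset(box, frame_shape):
    """Offset (dx, dy) in pixels of the face-box center from the image center.

    box is (x, y, w, h) as returned by detectMultiScale; frame_shape is
    (height, width, channels) as in a NumPy image array.
    """
    x, y, w, h = box
    height, width = frame_shape[0], frame_shape[1]
    return (x + w / 2 - width / 2, y + h / 2 - height / 2)


def pixel_offset_to_angle_deg(offset_px, image_size_px, fov_deg):
    """Angle between the optical axis and a pixel offset, for a pinhole camera.

    The focal length in pixels follows from the field of view:
    f = (size / 2) / tan(fov / 2).
    """
    focal_px = (image_size_px / 2) / math.tan(math.radians(fov_deg) / 2)
    return math.degrees(math.atan2(offset_px, focal_px))
```

A face at the very edge of a 1920-pixel-wide image taken with an assumed 90° horizontal field of view corresponds to a 45° pan correction, while a centered face needs none.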

Referring to fig. 2, as a second embodiment of the present invention, this embodiment differs from the first embodiment in that it includes: the steering mechanism 3, the camera 4, the guide rail movement assembly 5 and the control terminal;

the guide rail movement assembly 5 comprises a guide rail, a driving motor and a movement end, wherein the guide rail is fixedly connected with the upper surface of the conference table 1, the movement end is slidably connected with the guide rail, the driving motor is used for driving the movement end to move along the guide rail, and the driving motor is in transmission connection with the movement end;

the steering mechanism 3 comprises a steering base, a horizontal steering frame, a vertical steering frame, a horizontal driving motor and a vertical driving motor, wherein the horizontal driving motor is used for driving the horizontal steering frame to rotate about a vertical axis relative to the steering base, the vertical driving motor is used for driving the vertical steering frame to rotate about a horizontal axis relative to the horizontal steering frame, and the camera 4 is fixedly connected with the vertical steering frame;

the driving motor, the horizontal driving motor, the vertical driving motor and the camera 4 are respectively and electrically connected with the control terminal.

The control terminal of the present embodiment is a server for operation of the videoconferencing system, and controls movement and operation of the steering mechanism 3, the camera 4, and the rail movement assembly 5.

Preferably, the control terminal is connected to the camera 4 by wireless signal communication, for example Wi-Fi or 5G signals. This reduces the amount of wiring and avoids the signal transmission and output problems caused by poor contact or broken wires when long, frequently moved cables bend and wear.

Preferably, the guide rail is any one of a linear guide rail, a rolling guide rail or a sliding guide rail. Correspondingly, the moving end can be any component that matches the guide rail, such as a sliding block or a small rail car, and the manner of movement is not limited to traction transmission or wheel transmission.

The guide rail camera of this embodiment includes an omnidirectional pickup microphone 2, a steering mechanism 3, a camera 4 and a guide rail movement assembly 5. The guide rail movement assembly 5 includes a guide rail and a moving end; the guide rail is fixed on the conference table 1, the moving end is slidably connected to the guide rail, the omnidirectional pickup microphone 2 is fixedly connected to the moving end, and the camera 4 is rotatably mounted on the omnidirectional pickup microphone 2 through the steering mechanism 3.

The omnidirectional pickup microphone 2 of this embodiment has a columnar structure. The steering mechanism 3 provides 360° horizontal rotation and 30° of pitch rotation in the vertical direction, driven by two independent servo motors, stepping motors or gear motors; the rotation angle is feedback-controlled according to the included angle between the image of the camera 4 and the recognized face position.

The guide rail movement assembly 5 of this embodiment further includes a traction rope and a fixed pulley. The driving motor is fixedly connected to one end of the guide rail, a driving pulley is fixed on the shaft of the driving motor, the fixed pulley is rotatably connected to the other end of the guide rail, the traction rope is looped around the driving pulley and the fixed pulley, and the moving end is fixedly connected to the traction rope.
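The feedback control of the rotation angle could take many forms; the sketch below shows a simple proportional update with a deadband. The gain, the deadband and the reading of the 30° pitch range as ±30° are all assumptions made for illustration, and `next_angles` is a hypothetical name.

```python
def next_angles(pan_deg, tilt_deg, pan_err_deg, tilt_err_deg,
                gain=0.5, deadband_deg=1.0, tilt_limit_deg=30.0):
    """One proportional feedback step for the pan and tilt motors.

    pan_err_deg and tilt_err_deg are the angular offsets of the face from
    the image center. Errors inside the deadband are ignored to avoid jitter.
    Pan wraps around the full 360 degrees; tilt is clamped to the assumed range.
    """
    if abs(pan_err_deg) > deadband_deg:
        pan_deg = (pan_deg + gain * pan_err_deg) % 360.0
    if abs(tilt_err_deg) > deadband_deg:
        tilt_deg = tilt_deg + gain * tilt_err_deg
        tilt_deg = max(-tilt_limit_deg, min(tilt_limit_deg, tilt_deg))
    return pan_deg, tilt_deg
```

A gain below 1 moves the camera only part of the way toward the face on each frame, which trades a slightly slower response for smoother motion.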

In step S2 of this embodiment, the conference seat of the speaker is identified through the following specific steps:

S221, locating the sound source direction of the speaker relative to the moving end through the omnidirectional pickup microphone 2, using a sound source positioning algorithm;

S222, calculating the conference seat where the speaker is located from the sound source direction and the current position of the moving end in the virtual model.

In this embodiment, the direction and position of the speaker are preliminarily determined from sound, so that the camera can automatically identify the speaker and move in front of him or her, making the conference system more intelligent.
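Step S222 can be sketched as a nearest-bearing lookup in the virtual model. The sketch assumes a 2D model, that the sound source positioning algorithm already reports a bearing in table coordinates (counter-clockwise from the +x axis), and a hypothetical seat dictionary; the sound source positioning algorithm itself is not shown.

```python
import math


def angular_diff_deg(a, b):
    """Smallest absolute difference between two bearings, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def seat_from_bearing(mic_xy, bearing_deg, seats):
    """Return the name of the seat whose bearing from the microphone best
    matches the measured sound source direction."""
    def error(item):
        _, (sx, sy) = item
        seat_bearing = math.degrees(math.atan2(sy - mic_xy[1], sx - mic_xy[0]))
        return angular_diff_deg(seat_bearing, bearing_deg)

    return min(seats.items(), key=error)[0]
```

For example, with the microphone at (1.0, 0.0) and seats at (0.5, 1.0), (1.5, 1.0) and (1.0, -1.0), a measured bearing of 45° selects the seat at (1.5, 1.0).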

In step S3 of this embodiment, recognizing the face in the camera 4 using the face detection algorithm further comprises the following step:

S311, when the image shot by the camera 4 is recognized as containing a plurality of faces, determining by calculation which face corresponds to the speaker, according to the position of the camera 4 in the virtual model and the conference seat of the speaker.

This prevents the camera 4 from mistakenly aiming at other participants in a back row when many participants are present. The speaker correction process should either run in real time or lock onto the facial features of the current speaker after one confirmation.
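One way S311 might be realized is to predict where the speaker's seat should appear in the image and pick the detected face closest to that column. The sketch assumes a 2D virtual model, a pinhole camera with a known horizontal field of view, a pan angle measured counter-clockwise from the +x axis, and a seat that lies within the field of view; all names and default values are hypothetical.

```python
import math


def expected_column(camera_xy, pan_deg, seat_xy, image_width_px, hfov_deg):
    """Image column where the seat should appear for the given camera pose."""
    bearing = math.degrees(math.atan2(seat_xy[1] - camera_xy[1],
                                      seat_xy[0] - camera_xy[0]))
    rel = (bearing - pan_deg + 180.0) % 360.0 - 180.0   # seat angle off the optical axis
    focal_px = (image_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    # A target to the left of the axis (positive rel) appears at a smaller column.
    return image_width_px / 2 - focal_px * math.tan(math.radians(rel))


def pick_speaker_face(faces, camera_xy, pan_deg, seat_xy,
                      image_width_px=1920, hfov_deg=90.0):
    """Return the (x, y, w, h) box whose center column is nearest the expected one."""
    target = expected_column(camera_xy, pan_deg, seat_xy, image_width_px, hfov_deg)
    return min(faces, key=lambda f: abs(f[0] + f[2] / 2 - target))
```

This only disambiguates faces horizontally; separating a front-row speaker from a back-row participant at the same bearing would need an additional cue such as face size, which is not covered here.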

The intelligent conference method for automatically aligning the face of the speaker through the guide rail camera in the embodiment further comprises the following steps:

S7, recognizing the face recognition data of the speaker through a face recognition algorithm, querying the matching identity information using the face recognition data, and displaying the identity information in the video image.
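The identity query in S7 might look like the following sketch, under the assumption that the "face recognition data" is a numeric feature vector and that known participants are stored in a simple dictionary. Cosine similarity and the 0.8 threshold are illustrative choices, not something the source specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lookup_identity(embedding, known, threshold=0.8):
    """Return the best-matching name, or None if nothing reaches the threshold.

    known maps a participant name to a reference feature vector (hypothetical
    enrollment database).
    """
    best_name, best_score = None, threshold
    for name, reference in known.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Returning None for a weak match lets the system show no label rather than a wrong name.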

In this embodiment, the camera 4 and the omnidirectional pickup microphone 2 are mounted on the slide rail and slide along the surface of the conference table 1. Combined with rotation of the camera 4, a single camera 4 can directly face any participant at the conference table 1, and the omnidirectional pickup microphone 2 is brought closer to the speaker. Every participant at the conference table 1 can therefore speak facing the camera 4, which saves the cost of the camera equipment needed for multi-camera shooting, as well as the labor of manually moving camera equipment and the awkwardness participants feel when facing a camera operator. Under the same video conference system budget, the total price of the several camera devices of a multi-camera system can instead be spent on a single, higher-specification camera device, thereby improving the image quality of the video conference.

The following is a third embodiment of the present solution, which is different from the first embodiment in that at least one panoramic camera is further provided in a conference room or a conference site.

The guide rail movement assembly 5 of this embodiment further includes a synchronous belt, a synchronous wheel and an idler. The driving motor is fixedly connected to one end of the guide rail, the synchronous wheel is fixed on the shaft of the driving motor, the idler is rotatably connected to the other end of the guide rail, the synchronous belt is fitted over the synchronous wheel and the idler, and the moving end is fixedly connected to the synchronous belt.
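For a synchronous belt drive, the travel of the moving end per motor revolution equals the number of teeth on the synchronous wheel multiplied by the belt pitch. The sketch below converts a desired travel into motor steps, assuming the driving motor is a stepper; the tooth count, belt pitch and steps per revolution are hypothetical example values.

```python
def steps_for_travel(distance_mm, pulley_teeth=20, belt_pitch_mm=2.0,
                     steps_per_rev=3200):
    """Motor steps needed to move the moving end by distance_mm along the rail.

    A negative distance yields a negative step count (reverse direction).
    """
    travel_per_rev_mm = pulley_teeth * belt_pitch_mm   # belt advance per revolution
    return round(distance_mm / travel_per_rev_mm * steps_per_rev)
```

With these example values one revolution moves the camera 40 mm, so a 100 mm move takes 8000 steps.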

The upper surface of the conference table 1 of this embodiment is fixedly provided with a plurality of display screens, and the plurality of display screens are all used for displaying the picture of the intelligent conference.

In step S2 of this embodiment, the conference seat of the speaker is identified through the following specific steps:

S231, fixedly arranging at least one panoramic camera in the conference room, and shooting the faces of all participants through the panoramic camera;

S232, marking the position coordinates of each panoramic camera in the virtual model;

S233, recognizing the faces of all participants in the image shot by the panoramic camera through a face detection algorithm;

S234, identifying the current speaker among the participants through a mouth shape recognition algorithm;

S235, calculating the conference seat where the speaker is located from the position of the speaker in the shot image and the position coordinates of the panoramic camera in the virtual model.
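Step S235 can be sketched by converting the speaker's image column into a bearing from the panoramic camera and choosing the seat closest to that bearing. The sketch assumes a full 360° panorama whose center column looks along a known heading, with columns increasing clockwise when viewed from above, and a 2D virtual model; these conventions and all names are assumptions.

```python
import math


def column_to_bearing_deg(column_px, image_width_px, heading_deg):
    """Bearing (counter-clockwise from +x) of an image column in a 360-degree panorama."""
    return heading_deg - (column_px - image_width_px / 2) / image_width_px * 360.0


def seat_from_panorama(column_px, image_width_px, camera_xy, heading_deg, seats):
    """Return the seat whose bearing from the panoramic camera best matches the column."""
    bearing = column_to_bearing_deg(column_px, image_width_px, heading_deg)

    def error(item):
        _, (sx, sy) = item
        seat_bearing = math.degrees(math.atan2(sy - camera_xy[1], sx - camera_xy[0]))
        return abs((seat_bearing - bearing + 180.0) % 360.0 - 180.0)

    return min(seats.items(), key=error)[0]
```

A bearing alone cannot separate two seats that lie on the same line of sight from the panoramic camera; a second panoramic camera or the apparent face size would be needed in that case.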

In this embodiment, when the moving end is detected to be moving, the image of the video conference is switched to the image shot by the panoramic camera. That is, the video shows the panoramic camera while the camera 4 is moving and switches back to the camera 4 once it has reached the position facing the speaker. This avoids image shake and the capture of non-speaking participants while the camera 4 moves, which would make the video conference look disjointed and cause dizziness.
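The source switching can be sketched as a small state holder that is updated once per frame. The settle delay, which waits a few still frames before returning to the rail camera, is an added assumption rather than something the source specifies, and the class and source names are hypothetical.

```python
class SourceSwitcher:
    """Chooses which camera feeds the conference video on each frame."""

    def __init__(self, settle_frames=5):
        self.settle_frames = settle_frames
        self._still_frames = settle_frames   # start in the settled state

    def update(self, rail_moving):
        """Return "panoramic" while the moving end moves or settles, else "rail"."""
        if rail_moving:
            self._still_frames = 0
        elif self._still_frames < self.settle_frames:
            self._still_frames += 1
        if self._still_frames >= self.settle_frames:
            return "rail"
        return "panoramic"
```

Setting `settle_frames=0` removes the delay and reproduces the plain rule of switching purely on whether the moving end is in motion.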

It should be noted that the embodiments of the apparatus and device described above are only schematic, where the units described as separate units may or may not be physically separated, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Claims

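1. An intelligent conference method for automatically aligning the face of a speaker through a guide rail camera, wherein the guide rail camera comprises a steering mechanism, a camera and a guide rail movement assembly; the guide rail movement assembly comprises a guide rail and a moving end; the guide rail is fixed on the conference table, the moving end is slidably connected to the guide rail, and the camera is rotatably mounted on the moving end through the steering mechanism;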

recognizing a face in the camera by using a face detection algorithm;

calculating the center coordinates of the face, calculating the position relation of the center coordinates relative to the camera, and controlling a steering mechanism to drive the camera to face the center coordinates according to the position relation;

2. The intelligent conference method for automatically aligning the face of a speaker through a guide rail camera according to claim 1, wherein the face detection algorithm is used for recognizing the face in the camera, specifically:

otherwise, continuing to detect the next frame of image.

3. The intelligent conference method for automatically aligning the face of a speaker through a guide rail camera according to claim 2, wherein the step of determining the position and the size of the face in the image by using a cascade classifier on the image is specifically implemented as follows:

4. The intelligent conference method for automatically aligning the face of a speaker through a guide rail camera according to claim 1, wherein the conference seat for identifying the speaker is specifically:

5. The intelligent conference method according to claim 1, wherein said guideway camera further comprises an omnidirectional pickup microphone, said omnidirectional pickup microphone being fixedly coupled to said moving end.

6. The intelligent conference method for automatically aligning the face of a speaker through a guide rail camera according to claim 5, wherein the conference seat for identifying the speaker is specifically:

7. The intelligent conference method for automatically aligning the face of a speaker through a guide rail camera according to claim 1, wherein the conference seat for identifying the speaker is specifically:

marking the position coordinates of each panoramic camera in the virtual model;

8. The intelligent conference method for automatically aligning the face of a speaker through a guideway camera according to claim 1, further comprising the steps of:

at least one panoramic camera is fixedly arranged in the conference room;

9. The intelligent conference method for automatically aligning the face of a speaker through a guideway camera according to claim 1, wherein the face recognition algorithm is used to recognize the face in the camera, further comprising the sub-steps of:

10. The intelligent conference method for automatically aligning the face of a speaker through a guideway camera according to claim 1, further comprising the steps of: