CN116567349A - Video display method and device based on multiple cameras and storage medium - Google Patents

Video display method and device based on multiple cameras and storage medium

Info

Publication number
CN116567349A
Authority
CN
China
Prior art keywords
face
image
video
frame
portrait
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310557214.8A
Other languages
Chinese (zh)
Inventor
高方奇
杨海军
马睿
郭锦文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kandao Technology Co Ltd
Original Assignee
Kandao Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kandao Technology Co Ltd filed Critical Kandao Technology Co Ltd
Priority to CN202310557214.8A
Publication of CN116567349A
Legal status: Pending

Classifications

    • H04N21/44016 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N21/4415 Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N21/45457 Input to filtering algorithms, e.g. filtering a region of the image, applied to a time segment
    • H04N21/8547 Content authoring involving timestamps for synchronizing content
    • H04N7/15 Conference systems
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multi-camera-based video display method, device, and storage medium. Acquisition instructions are issued to a plurality of slave devices and the video data they return is received; the difference between the local timestamp and the video timestamp of each piece of video data is calculated; only video data whose difference is less than or equal to a preset value is retained; and a video to be displayed is generated from the retained data and displayed. This improves the display effect of the video.

Description

Video display method and device based on multiple cameras and storage medium
Technical Field
The present invention relates to the field of video technology, and in particular to a multi-camera-based video display method, apparatus, and storage medium.
Background
With the development of technology, people can hold online conferences, such as voice and video conferences, on a variety of terminal devices. Online conferencing removes the need for participants to travel to an appointed place at an appointed time, making meetings more efficient and convenient.
For a larger conference room, several slave devices must be deployed to collect conference data, which the master device then composites into a video image shown on the viewing party's screen. However, because of clock differences between the master and the slaves, the composited video may suffer from poor picture continuity, degrading the display effect.
Disclosure of Invention
The invention provides a multi-camera-based video display method, device, and storage medium that can improve the display effect of videos.
In a first aspect, the present invention provides a multi-camera-based video display method, including:
issuing acquisition instructions to a plurality of slaves and receiving the video data returned by the slaves;
calculating the difference between the local timestamp and the video timestamp of each piece of video data;
retaining the video data whose difference between the local timestamp and the video timestamp is less than or equal to a preset value;
and generating a video to be displayed from the retained video data, and displaying it.
Optionally, in some embodiments of the present invention, generating the video to be displayed from the retained video data includes:
extracting a scene image and scene sound from the retained video data;
performing face detection on the scene image;
cropping a portrait image corresponding to each face from the scene image based on the detection result;
and generating the video to be displayed from the scene sound and the portrait images.
Optionally, in some embodiments of the present invention, cropping a portrait image corresponding to each face from the scene image based on the detection result includes:
constructing, based on the detection result, a face frame covering each face in the scene image;
and segmenting a portrait image corresponding to each face from the scene image according to the size information and the coordinate information of the face frame.
Optionally, in some embodiments of the present invention, segmenting a portrait image corresponding to each face from the scene image according to the size information and the coordinate information of the face frame includes:
determining the position in the scene image of the face in the face frame according to the coordinate information of the face frame;
determining the size of the face in the face frame based on the size information of the face frame;
and segmenting a portrait image corresponding to each face from the scene image according to the position and the face size.
Optionally, some embodiments of the present invention further include:
detecting whether the same face appears in more than one segmented portrait image;
when the same face is detected, acquiring preset three-dimensional coordinates of the facial features;
determining the two-dimensional coordinates of the face in the face frame according to the coordinate information of the face frame;
identifying, based on the two-dimensional and three-dimensional coordinates, the orientation of the face in the portrait images containing the same face;
and determining a target portrait image from the portrait images containing the same face according to the orientation and size information of the face.
Optionally, in some embodiments of the present invention, identifying the orientation of the face in the portrait images containing the same face based on the two-dimensional and three-dimensional coordinates includes:
constructing a conversion relation between the two-dimensional coordinates and the three-dimensional coordinates;
determining the face region corresponding to the face in the face frame according to the conversion relation;
normalizing the face region and extracting image features from the normalized image;
and convolving the image features to obtain a value, and determining from that value the orientation of the face in the portrait images containing the same face.
Optionally, in some embodiments of the present invention, detecting whether the same face appears in the segmented portrait images includes:
normalizing the pixels of the segmented portrait images and extracting the feature vector of the face in each normalized portrait image;
converting each feature vector into a low-dimensional feature vector;
determining the array corresponding to each converted feature vector;
and determining, from the arrays corresponding to the candidate portrait images, whether the same face appears in the segmented portrait images.
Optionally, some embodiments of the present invention further include:
performing pedestrian detection on the scene image;
constructing, according to the pedestrian detection result, a pedestrian frame covering each pedestrian in the scene image;
in which case segmenting a portrait image corresponding to each face from the scene image according to the size information and coordinate information of the face frame includes: segmenting a portrait image corresponding to each face from the scene image according to the size information and coordinate information of the pedestrian frame together with the size information and coordinate information of the face frame.
In a second aspect, the present invention further provides a multi-camera-based video display device, including:
a receiving module for issuing acquisition instructions to a plurality of slaves and receiving the video data returned by the slaves;
a calculation module for calculating the difference between the local timestamp and the video timestamp of each piece of video data;
a reservation module for retaining the video data whose difference between the local timestamp and the video timestamp is less than or equal to a preset value;
a generation module for generating a video to be displayed from the retained video data;
and a display module for displaying the video to be displayed.
In a third aspect, the present invention further provides a computer storage medium storing a computer program which, when executed, implements any of the multi-camera-based video display methods described above.
The invention discloses a multi-camera-based video display method, device, and storage medium. After acquisition instructions are issued to a plurality of slaves and the returned video data is received, the difference between the local timestamp and the video timestamp of each piece of video data is calculated; video data whose difference is less than or equal to a preset value is retained; finally, a video to be displayed is generated from the retained data and displayed. Because video data whose timestamp difference exceeds the preset value is filtered out, the delay between the retained video streams stays within a small range, avoiding poor picture continuity in the composited video and thereby improving the display effect.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly described below. The drawings described here correspond only to some embodiments of the invention; a person skilled in the art can derive drawings of other embodiments from them without inventive effort.
Fig. 1 is a schematic flow chart of a video display method based on multiple cameras according to an embodiment of the present invention;
fig. 2 is a schematic diagram showing a video to be displayed in a video display method based on multiple cameras according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an optimization model in a video display method based on multiple cameras according to an embodiment of the present invention;
fig. 4 is a schematic diagram of extracting a portrait image in a video display method based on multiple cameras according to an embodiment of the present invention;
fig. 5 is a schematic view of a scenario of a video display method based on multiple cameras according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a video display device based on multiple cameras according to an embodiment of the present invention;
fig. 7 is another schematic structural diagram of a video display device based on multiple cameras according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the possible embodiments. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Referring to fig. 1, the video display method based on multiple cameras of the present invention includes:
101. and issuing acquisition instructions to a plurality of slaves and receiving video data returned by the slaves.
For example, the master machine and the slave machine (such as a camera) are connected into the same local area network, the master machine searches for and accesses all internet protocol (Internet protocol, IP) addresses in the local area network, the slave machine responds to the access, and the master machine judges whether to search for the slave machine according to whether the slave machine responds or not and establishes communication with the slave machine. Then, responding to a conference opening operation (such as a user clicking a control to trigger video conference software to be opened), generating an acquisition instruction according to the conference opening operation, then, issuing the acquisition instruction to a plurality of slaves by a main control machine, and receiving video data returned by the slaves according to the acquisition instruction, wherein the acquisition instruction carries acquisition parameters (such as acquisition duration, acquisition volume and the like) for the conference.
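As a minimal sketch of this discovery-and-instruction flow (the port number, probe mechanism, and JSON message fields are assumptions for illustration; the patent does not specify a wire protocol):

import json
import socket

SLAVE_PORT = 9000  # hypothetical control port opened by each slave camera

def discover_slaves(subnet="192.168.1", timeout=0.2):
    """Probe every address in the LAN subnet; any host that answers on the
    control port is taken to be a slave, mirroring the response check above."""
    slaves = []
    for host in range(1, 255):
        ip = f"{subnet}.{host}"
        try:
            with socket.create_connection((ip, SLAVE_PORT), timeout=timeout):
                slaves.append(ip)  # the slave responded, so it was found
        except OSError:
            pass  # no response: not a slave
    return slaves

def issue_acquisition(ip, duration_s=3600, volume=0.8):
    """Send an acquisition instruction carrying the acquisition parameters."""
    instruction = json.dumps({"cmd": "acquire",
                              "duration": duration_s,
                              "volume": volume}).encode()
    with socket.create_connection((ip, SLAVE_PORT)) as conn:
        conn.sendall(instruction)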
102. Calculating the difference between the local timestamp and the video timestamp of each piece of video data.
It can be understood that, because of the clock difference between the master control machine and the slaves, and to facilitate the subsequent video display, the invention calculates the difference between the local timestamp (i.e. the master's timestamp) and the video timestamp of each piece of video data: video data whose difference is less than or equal to a preset value is retained, while video data whose difference exceeds the preset value is filtered out.
103. Retaining the video data whose difference between the local timestamp and the video timestamp is less than or equal to a preset value.
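A minimal sketch of the filtering in steps 102-103, assuming each received frame carries its slave's video timestamp in a "video_ts" field and taking 0.1 s as an illustrative preset value:

import time

def filter_by_timestamp(video_frames, preset_s=0.1):
    """Keep only frames whose video timestamp lies within preset_s seconds
    of the master's local clock; the rest are filtered out."""
    local_ts = time.time()  # local timestamp of the master control machine
    return [f for f in video_frames
            if abs(local_ts - f["video_ts"]) <= preset_s]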
104. Generating a video to be displayed from the retained video data, and displaying it.
For example, a scene image and scene sound may be extracted from the retained video data and used to generate the video to be displayed. Note that, because participants sit at different distances from the slaves, some figures in the collected video data may appear particularly large and others particularly small, so that if displayed directly the size difference between participants' portraits would be too great and the display effect poor. The invention therefore generates the video to be displayed from the face detection result together with the retained video data; the step of generating a video to be displayed from the retained video data may specifically include:
(11) Extracting a scene image and scene sound from the retained video data;
(12) Performing face detection on the scene image;
(13) Cropping a portrait image corresponding to each face from the scene image based on the detection result;
(14) Generating the video to be displayed from the scene sound and the portrait images.
For example, after the scene image and scene sound are extracted from the retained video data, face detection is performed on the scene image. Specifically, a face frame may be generated for each face based on the detection result, and a portrait image for each face is then cropped from the scene image according to the coordinates and size of its face frame. That is, optionally, in some embodiments, the step of cropping a portrait image corresponding to each face from the scene image based on the detection result may specifically include:
(21) Constructing, based on the detection result, a face frame covering each face in the scene image;
(22) Segmenting a portrait image corresponding to each face from the scene image according to the size information and the coordinate information of the face frame.
It should be noted that the original image obtained by a single slave (such as a conference camera) is a panorama. Performing face detection on the panorama yields the two-dimensional coordinates and size information of every face in it. Optionally, in some embodiments, the coordinate information and size information of a face are determined from the face detection result: the detection result is a face frame covering the face, so the face's coordinates follow from the frame's position and the face's size from the frame's size. The portrait image for each face is then segmented from the scene image according to that position and size. That is, the step of segmenting a portrait image corresponding to each face from the scene image according to the size information and coordinate information of the face frame may specifically include:
(31) Determining the position in the scene image of the face in the face frame according to the coordinate information of the face frame;
(32) Determining the size of the face in the face frame based on the size information of the face frame;
(33) Segmenting a portrait image corresponding to each face from the scene image according to the position and the face size.
In this embodiment, the portrait images are segmented from the scene image based on face size because each person sits at a different distance from the slave (conference camera); segmenting based on face size ensures that the faces in the final displayed picture end up close in size. Specifically, the portrait images may be segmented in either of the following ways. Mode 1: preset a reference face size, then scale every face to that reference size to obtain the portrait image for each face. Mode 2: scale every face to approximately the size of one chosen face; for example, if face a measures 8x10 inches, face b 4x6 inches, and face c 16x20 inches, then face b can be enlarged to 8x10 inches to match face a, and face c reduced to 8x10 inches.
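A sketch of the crop-and-equalize step, assuming face frames arrive as (x, y, w, h) pixel tuples; the margin factor and the reference face height are illustrative values, not the patent's:

import cv2

REF_FACE_H = 112  # assumed reference face height in pixels (mode 1)

def crop_portrait(scene, face_box, margin=0.5):
    """Cut a portrait patch around a detected face frame, then rescale it
    so every face ends up roughly REF_FACE_H pixels tall."""
    x, y, w, h = face_box
    mx, my = int(w * margin), int(h * margin)
    H, W = scene.shape[:2]
    patch = scene[max(0, y - my):min(H, y + h + my),
                  max(0, x - mx):min(W, x + w + mx)]
    scale = REF_FACE_H / h  # equalizes face sizes across participants
    return cv2.resize(patch, None, fx=scale, fy=scale)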
The embodiment above covers a single camera at work. When several cameras work together, the same face may appear in more than one segmented portrait image; when this is detected, the duplicates can be removed using the generic three-dimensional coordinates of the facial features together with the two-dimensional coordinates of the face. That is, optionally, in some embodiments, the method may specifically further include:
(41) Detecting whether the same face appears in more than one segmented portrait image;
(42) When the same face is detected, acquiring the preset three-dimensional coordinates of the facial features;
(43) Determining the two-dimensional coordinates of the face in the face frame according to the coordinate information of the face frame;
(44) Identifying, based on the two-dimensional and three-dimensional coordinates, the orientation of the face in the portrait images containing the same face;
(45) Determining a target portrait image from the portrait images containing the same face according to the orientation and size information of the face.
The invention adopts a new duplicate-face detection method: each portrait image is converted by a preset algorithm into a 128-dimensional array, and the arrays of different portrait images are compared; the more similar the arrays, the more likely they depict the same person. That is, optionally, in some embodiments, the step of detecting whether the same face appears in the segmented portrait images may specifically include:
(51) Normalizing the pixels of the segmented portrait images and extracting the feature vector of the face in each normalized portrait image;
(52) Converting each feature vector into a low-dimensional feature vector;
(53) Determining the array corresponding to each converted feature vector;
(54) Determining, from the arrays corresponding to the candidate portrait images, whether the same face appears in the segmented portrait images.
It should be noted that an array converted from only one picture of a person does not characterize that person well. Therefore, in some embodiments of the invention, 10 pictures are collected per person, each picture is converted into a 128-dimensional array carrying face information, and the average of these arrays is used as the person's face feature array; duplicate faces are then judged against this feature array.
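For instance, the per-person feature array could be formed as follows (NumPy is our choice of tool here, not the patent's):

import numpy as np

def face_feature_array(embeddings_128d):
    """Average several 128-dimensional arrays of one person (the patent
    collects 10 pictures) into a single, more stable face feature array."""
    return np.mean(np.stack(embeddings_128d), axis=0)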
After face detection, the two-dimensional coordinates of the face in a face frame are obtained with a face recognition algorithm; specifically, these may be the two-dimensional coordinates of the facial landmarks (the eye centres, the nose tip, and the two mouth corners). The face is then modelled in three dimensions using generic three-dimensional coordinates of these landmarks, and the face's orientation is obtained by combining the three-dimensional and two-dimensional landmark coordinates through the PnP (Perspective-n-Point) algorithm. Finally, the most frontal and largest face is chosen from the portrait images containing the same face according to the orientation and size information. In a conference-room scene the faces are roughly level with the slave, so judging the face's orientation in the horizontal direction suffices to distinguish a side view from a frontal view.
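One standard way to realize the PnP step is OpenCV's solvePnP; the generic landmark coordinates and the focal-length guess below are illustrative stand-ins, not values from the patent:

import cv2
import numpy as np

# Generic 3D coordinates (in mm) of the five facial landmarks; the values
# are illustrative, with the nose tip as origin.
MODEL_3D = np.array([[-30.0,  40.0, -30.0],   # left eye centre
                     [ 30.0,  40.0, -30.0],   # right eye centre
                     [  0.0,   0.0,   0.0],   # nose tip
                     [-25.0, -40.0, -30.0],   # left mouth corner
                     [ 25.0, -40.0, -30.0]],  # right mouth corner
                    dtype=np.float64)

def face_yaw(landmarks_2d, img_w, img_h):
    """Estimate the horizontal orientation (yaw, degrees) of a face from
    its five 2D landmarks, a (5, 2) float array in image coordinates."""
    f = float(img_w)  # crude focal-length guess for an uncalibrated camera
    cam = np.array([[f, 0, img_w / 2],
                    [0, f, img_h / 2],
                    [0, 0, 1]], dtype=np.float64)
    ok, rvec, _ = cv2.solvePnP(MODEL_3D, landmarks_2d, cam, None)
    if not ok:
        return None
    rot, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    # rotation about the vertical axis; 0 degrees means a fully frontal face
    return float(np.degrees(np.arctan2(-rot[2, 0],
                                       np.hypot(rot[2, 1], rot[2, 2]))))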
Further, in some embodiments, the face region corresponding to a face may be delimited using the two-dimensional landmark coordinates and the preset three-dimensional coordinates of the facial features; a value is then computed for that region, and the orientation is determined from it. That is, optionally, in some embodiments, the step of identifying the orientation of the face in the portrait images containing the same face based on the two-dimensional and three-dimensional coordinates may specifically include:
(61) Constructing a conversion relation between the two-dimensional coordinates and the three-dimensional coordinates;
(62) Determining the face region corresponding to the face in the face frame according to the conversion relation;
(63) Normalizing the face region and extracting image features from the normalized image;
(64) Convolving the image features to obtain a value, and determining from that value the orientation of the face in the portrait images containing the same face.
For example, specifically, the face region is resized to a preset resolution (e.g. 224x224) and its pixels are normalized into the range 0-1; the image is then standardized towards a standard normal distribution using the per-channel means [0.485, 0.456, 0.406] and standard deviations [0.229, 0.224, 0.225]. Features of the face-region patch are then extracted, for example with a standard ResNet-50: after its successive convolutions, ResNet-50 yields a 7x7x2048 feature matrix. Finally the features are mapped into the face-orientation space: the 7x7x2048 matrix undergoes a simple matrix multiplication and is converted into a final 3x3 rotation matrix, and the entry of the rotation matrix representing the horizontal direction is converted into a concrete value between -90 and 90 degrees, used to judge frontal versus side faces.
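A sketch of that preprocessing, with OpenCV and NumPy standing in for whatever pipeline the implementation actually uses:

import cv2
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406])  # per-channel means
STD = np.array([0.229, 0.224, 0.225])   # per-channel standard deviations

def preprocess_face_region(region_bgr):
    """Resize to 224x224, scale pixels into 0-1, then standardize each
    channel towards a standard normal distribution."""
    img = cv2.resize(region_bgr, (224, 224))
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return (rgb - MEAN) / STD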
In addition, for scenes where people walk around during the meeting, some embodiments of the invention perform pedestrian detection on the scene image and construct, from the pedestrian detection result, a pedestrian frame covering each pedestrian. In that case, the step of segmenting a portrait image corresponding to each face from the scene image according to the size information and coordinate information of the face frame may specifically include: segmenting a portrait image corresponding to each face from the scene image according to the size information and coordinate information of the pedestrian frame together with the size information and coordinate information of the face frame.
After deduplication and best-portrait selection, the video to be displayed can be generated from the recognized participants and the scene sound. Optionally, the speaker may be identified from the scene sound; the speaker's portrait is placed at a preset position in the video at a size larger than the other portraits, and the remaining participants are arranged at the other positions, yielding the video to be displayed. When the scene sound indicates that the speaker finished speaking at time t, the speaker's portrait is shrunk so that it becomes similar or equal in size to the other portraits, as shown in fig. 2.
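One way this speaker-first layout could be realized (the base height and scale factor are assumptions for illustration; calling the function again with speaker_scale=1.0 models the shrink after time t):

import cv2

def layout_portraits(portraits, speaker_idx, base_h=180, speaker_scale=1.6):
    """Place the speaker's portrait first and larger; the remaining
    portraits follow at the common size, aspect ratios preserved."""
    order = [speaker_idx] + [i for i in range(len(portraits))
                             if i != speaker_idx]
    out = []
    for pos, i in enumerate(order):
        h = int(base_h * (speaker_scale if pos == 0 else 1.0))
        p = portraits[i]
        w = int(p.shape[1] * h / p.shape[0])  # keep the aspect ratio
        out.append(cv2.resize(p, (w, h)))
    return out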
To further illustrate the face deduplication scheme of the invention, a conference scenario with multiple conference machines is described below as an example.
Each conference machine sends its picture to the processor over WIFI or network cable. After receiving the portrait pictures returned by the conference machines, the processor can combine the portraits from every machine into the final picture. Two problems arise: different conference machines may return the same portrait, which must be deduplicated; and when duplicates occur, the best portrait must be selected for display. Two core algorithms in the processor address them: a deduplication algorithm and a selection algorithm for duplicated portraits. The invention provides an optimization model, shown in fig. 3, comprising a face recognition neural network D1, a face orientation recognition neural network D2, and a pedestrian recognition neural network D3, as follows:
The face recognition neural network D1 (which converts a face into a mathematically comparable array) takes as input a face box at 112x112 resolution. Given the box, the network returns a 128-dimensional array (each number ranging from 0 to 1); by comparing the arrays generated from different faces, the more similar the arrays, the more likely the faces are duplicates.
Specifically, the pixels of the face image are normalized: the face image has a resolution of 112x112, and each pixel of the RGB picture has 127.5 subtracted and is then divided by 128.
Extracting the face feature information: the normalized face is converted into a feature vector by a CNN; with a standard MobileNetV2 as the feature-extraction CNN, a 7x7x1280 feature matrix is produced. This can be understood simply as dividing the face image into 7x7 = 49 small blocks, each block having a corresponding 1280-dimensional feature vector.
Converting the face features into a low-dimensional feature vector: the CNN produces a 7x7x1280 matrix for each face image, which amounts to digitizing the face information, and faces can be compared through the digitized information they generate. But this matrix holds 7x7x1280 = 62720 dimensions of data, so the high-dimensional matrix must be converted into a low-dimensional array:
1. For each of the 1280 layers of 7x7 data, take a weighted average, with cells closer to the centre of the 7x7 grid receiving higher weight.
2. The weighted averaging yields a 1x1280-dimensional array, which is then convolved with a 1x1x128 kernel to obtain the final 128-dimensional array.
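A sketch of this two-step reduction; the exact centre-weighting scheme and the learned 1x1x128 convolution weights are not given in the patent, so both are stand-ins here:

import numpy as np

def pool_to_128(features_7x7x1280, proj_1280x128):
    """Centre-weighted average over the 7x7 grid, then projection of the
    1280-d vector down to the final 128-d array; proj_1280x128 plays the
    role of the learned 1x1x128 convolution."""
    ys, xs = np.mgrid[0:7, 0:7]
    w = 1.0 / (1.0 + np.hypot(ys - 3, xs - 3))  # higher weight near centre
    w /= w.sum()
    pooled = np.tensordot(w, features_7x7x1280, axes=([0, 1], [0, 1]))  # (1280,)
    emb = pooled @ proj_1280x128  # (128,)
    return 1.0 / (1.0 + np.exp(-emb))  # squash into 0-1 like the patent's arrays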
After the 128-dimensional face recognition arrays are obtained (each number in the range 0-1), the cosine distance between two arrays is calculated as in formula 1, where x and y are the arrays corresponding to two portrait images:

d(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)    (formula 1)

The cosine distance lies between 0 and 1; when it is smaller than 0.36 the two arrays are considered similar, i.e. the two faces are judged to be duplicates.
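The duplicate test of formula 1 as a sketch:

import numpy as np

def is_same_face(x, y, threshold=0.36):
    """Two 128-d face arrays are judged to be the same face when their
    cosine distance falls below the 0.36 threshold given above."""
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return (1.0 - cos_sim) < threshold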
For example, specifically, after the scene image is input to the face recognition network D1, D1 performs face detection on it and constructs a face frame covering each face (one frame per face); the face size is then adjusted based on the frame's size and coordinate information, as shown in fig. 4. The adjustment is similar to the earlier embodiment and is not repeated here. In addition, D1 outputs the 128-dimensional array for each face, which is used to judge whether faces are similar.
The face orientation recognition neural network D2 (which converts a face into the face's horizontal angle): when a duplicate face occurs, the better face must be selected, and a frontal face is preferable to a side face. In the scenes where the conference machine is used, the faces are roughly level with the machine, so judging the face's horizontal orientation suffices to distinguish side from frontal views. The input matches that of the face recognition network, i.e. a face box at 112x112 resolution; given the box, the network returns a number between -90 and +90 degrees, where 0 degrees indicates the most frontal face.
Likewise, the input to the face orientation recognition network D2 is the scene image, and it outputs a value indicating each portrait's horizontal orientation, by which the portraits are screened. For example, if the same face corresponds to portrait images a, b, and c with horizontal-orientation values 6, 16, and 3 respectively, then portraits a and b can be deleted and portrait c retained.
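The screening rule amounts to keeping the candidate whose angle is closest to zero, for example:

def pick_best_portrait(candidates):
    """From duplicates of one face, keep the portrait whose horizontal
    orientation is closest to 0 degrees (the most frontal view).
    candidates is a list of (portrait_image, yaw_degrees) pairs."""
    return min(candidates, key=lambda c: abs(c[1]))[0]

With the values above, portraits a (6), b (16), and c (3) would indeed leave c as the survivor.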
The pedestrian recognition neural network D3 (which converts a portrait into a mathematically comparable array): when several people meet through a conference machine they may wear masks, and face recognition accuracy degrades with masks on. A pedestrian recognition algorithm is therefore trained whose input is a 128x256 portrait image (not just the face) and whose output is a 256-dimensional array; comparing the arrays generated from different portrait images, the more similar the arrays, the more likely they depict the same person.
It should be noted that pedestrian recognition resembles face recognition, the input simply changing from a face to a whole-body picture, but some details differ. In face recognition the cropped face essentially fills the picture frame, whereas a pedestrian may occupy less than half of the picture's resolution. Optionally, the pedestrian image is therefore converted into several resolutions, such as 128x256, 64x128, and 32x64, and features are extracted at each, reducing the influence of pedestrian-image size differences. The features are extracted with the OmniScaleNet network structure, and duplicated pedestrian images are confirmed by comparing the features of the different pedestrian-image conversions.
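A sketch of the multi-resolution comparison; extract_features stands in for the trained network (OmniScaleNet in the patent), whose exact interface is not specified:

import cv2
import numpy as np

SCALES = [(128, 256), (64, 128), (32, 64)]  # (width, height) pairs from above

def pedestrian_descriptor(crop, extract_features):
    """Extract features at three resolutions and average them, reducing
    the influence of pedestrian-image size differences; the result is
    unit-normalized for cosine comparison against other pedestrians."""
    feats = [extract_features(cv2.resize(crop, s)) for s in SCALES]
    f = np.mean(np.stack(feats), axis=0)
    return f / np.linalg.norm(f)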
After the scene image is input to the pedestrian recognition neural network D3, D3 performs pedestrian detection on it and constructs, from the pedestrian detection result, a pedestrian frame covering each pedestrian. It can be understood that in some embodiments D3 targets only moving persons, which prevents its output from clashing with that of the face recognition network D1 and affecting the subsequent video display.
The host acquires all the video streams and video information, analyzes them, composites the video image, and sends it to the conference counterpart. For example, the local conference party acquires, through conference machine a, a scene image p1 containing participants A and B, while conference machine b acquires a scene image p2 also containing participants A and B. The host processes p1 and p2 with its built-in optimization model; after deduplication and best-portrait selection it obtains one portrait of participant A and one of participant B, splices them together with the microphone and sound data into composited video data, and sends the composited video to the far-end conference party, where it is finally displayed on screen, as shown in fig. 5.
According to the multi-camera-based video display method above, after acquisition instructions are issued to a plurality of slaves and the returned video data is received, the difference between the local timestamp and the video timestamp of each piece of video data is calculated; video data whose difference is less than or equal to a preset value is retained; and a video to be displayed is generated from the retained data and displayed. Because video data whose timestamp difference exceeds the preset value is filtered out, the delay between the retained video streams stays within a small range, avoiding poor picture continuity in the composited video and thereby improving the display effect.
Accordingly, referring to fig. 6, an embodiment of the present invention provides a multi-camera-based video display apparatus (hereinafter the display apparatus), which includes:
and the receiving module 201 is used for issuing acquisition instructions to the plurality of slaves and receiving video data returned by the slaves.
A calculation module 202, configured to calculate a difference between the local timestamp and the video timestamp of each video data.
And the retaining module 203 is configured to retain video data in which a difference between the local timestamp and the video timestamp is less than or equal to a preset value.
The generating module 204 is configured to generate a video to be displayed according to the reserved video data.
Optionally, in some embodiments, the generating module 204 may specifically include:
an extraction unit for extracting a scene image and a scene sound from the retained video data;
the detection unit is used for carrying out face detection on the scene image;
the clipping unit is used for clipping a portrait image corresponding to each face from the scene image based on the detection result;
and the generating unit is used for generating the video to be displayed according to the scene sound and the portrait image.
Optionally, in some embodiments, the clipping unit may specifically include:
a construction subunit, configured to construct a face frame covering a face in the scene image based on a detection result;
the segmentation subunit is used for segmenting the portrait image corresponding to each face in the scene image according to the size information of the face frame and the coordinate information of the face frame.
Alternatively, in some embodiments, the segmentation subunit may be specifically configured to: determining the position of a face in the face frame in the scene image according to the coordinate information of the face frame; determining the face size of a face in the face frame based on the size information of the face frame; and dividing a portrait image corresponding to each face from the scene image according to the position and the face size.
Optionally, in some embodiments, referring to fig. 7, the display device of the present invention may specifically further include a detection module 206, where the detection module 206 may specifically include:
the detection unit is used for detecting whether the same face exists in the segmented portrait image;
the acquisition unit is used for acquiring three-dimensional coordinates of the preset face five sense organs when the same face exists in the segmented portrait images;
the first determining unit is used for determining two-dimensional coordinates corresponding to the face in the face frame according to the coordinate information of the face frame;
the recognition unit is used for recognizing the orientation of the face in the portrait image with the same face based on the two-dimensional coordinates and the three-dimensional coordinates;
and a second determining unit for determining a target portrait image from portrait images with the same face according to the orientation and size information of the face.
Optionally, in some embodiments, the detection unit may specifically be configured to: normalize the pixels of the segmented portrait images and extract the feature vector of the face in each normalized portrait image; convert each feature vector into a low-dimensional feature vector; determine the array corresponding to each converted feature vector; and determine, from the arrays corresponding to the candidate portrait images, whether the same face appears in the segmented portrait images.
Optionally, in some embodiments, the recognition unit may specifically be configured to: construct a conversion relation between the two-dimensional coordinates and the three-dimensional coordinates; determine the face region corresponding to the face in the face frame according to the conversion relation; normalize the face region and extract image features from the normalized image; and convolve the image features to obtain a value, determining from that value the orientation of the face in the portrait images containing the same face.
And the display module 205 is configured to display the video to be displayed.
As can be seen from the above, in the multi-camera-based video display device of this embodiment, after the receiving module 201 issues acquisition instructions to a plurality of slaves and receives the video data they return, the calculation module 202 calculates the difference between the local timestamp and the video timestamp of each piece of video data; the reservation module 203 retains the video data whose difference is less than or equal to a preset value; the generation module 204 then generates the video to be displayed from the retained data; and the display module 205 displays it. Because video data whose timestamp difference exceeds the preset value is filtered out, the delay between the retained video streams stays within a small range, avoiding poor picture continuity in the composited video and thereby improving the display effect.
An embodiment of the invention further provides a computer device comprising a processor and a memory, the memory storing a computer program which, when loaded and executed by the processor, implements the steps of any of the method embodiments above.
The embodiment of the invention also provides a computer storage medium, wherein a computer program is stored in the computer storage medium, and when the computer program is executed, the method steps of any one of the above method embodiments are realized.
In the foregoing embodiments of the present invention, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units. The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium.
Based on this understanding, the technical solution of the present application, in essence or in the part contributing over the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a mobile terminal, a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In summary, although the present invention has been described with reference to the preferred embodiments, the scope of the invention is not limited to them; any person skilled in the art may make equivalent substitutions or modifications within the technical scheme of the invention, and these fall within its scope.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.

Claims (10)

1. A video presentation method based on multiple cameras, comprising:
issuing acquisition instructions to the slaves and receiving video data returned by the slaves;
calculating a difference between the local time stamp and the video time stamp of each video data;
reserving video data of which the difference value between the local time stamp and the video time stamp is smaller than or equal to a preset value;
and generating a video to be displayed according to the reserved video data, and displaying the video to be displayed.
2. The method of claim 1, wherein the generating a video to be displayed according to the reserved video data comprises:
extracting a scene image and a scene sound from the reserved video data;
face detection is carried out on the scene image;
cutting out a portrait image corresponding to each face from the scene image based on the detection result;
and generating a video to be displayed according to the scene sound and the portrait image.
3. The method according to claim 2, wherein the cropping the portrait image corresponding to each face from the scene image based on the detection result includes:
constructing a face frame covering a face in the scene image based on a detection result;
and dividing a portrait image corresponding to each face in the scene image according to the size information of the face frame and the coordinate information of the face frame.
4. A method according to claim 3, wherein the segmenting the portrait image corresponding to each face in the scene image according to the size information of the face frame and the coordinate information of the face frame includes:
determining the position of a face in the face frame in the scene image according to the coordinate information of the face frame;
determining the face size of a face in the face frame based on the size information of the face frame;
and dividing a portrait image corresponding to each face in the scene image according to the position and the face size.
5. The method as recited in claim 4, further comprising:
detecting whether the same face exists in the segmented portrait image;
when the same face is detected in the segmented portrait images, acquiring preset three-dimensional coordinates of the facial features;
determining two-dimensional coordinates corresponding to a face in the face frame according to the coordinate information of the face frame;
based on the two-dimensional coordinates and the three-dimensional coordinates, recognizing the orientation of the face in the portrait image with the same face;
and determining a target portrait image from the portrait images with the same face according to the orientation and size information of the face.
6. The method of claim 5, wherein the identifying the orientation of the face in the portrait image with the same face based on the two-dimensional coordinates and the three-dimensional coordinates comprises:
constructing a conversion relation between the two-dimensional coordinates and the three-dimensional coordinates;
determining a face area corresponding to a face in the face frame according to the conversion relation;
carrying out normalization processing on the face region, and extracting image features of a normalized image;
and carrying out convolution processing on the image features to obtain values corresponding to the image features, and determining the orientation of the face in the portrait image with the same face based on the values.
7. The method of claim 5, wherein detecting whether the same face is present in the segmented portrait image comprises:
carrying out normalization processing on pixels of the segmented portrait image, and extracting feature vectors of faces in the normalized portrait image;
converting the feature vector into a low-dimensional feature vector;
determining an array corresponding to the converted feature vector;
and determining whether the same face exists in the segmented portrait images according to the arrays corresponding to the candidate portrait images.
8. A method according to claim 3, further comprising:
pedestrian detection is carried out on the scene image;
constructing a pedestrian frame covering pedestrians in the scene image according to the pedestrian detection result;
the step of dividing the portrait image corresponding to each face in the scene image according to the size information of the face frame and the coordinate information of the face frame comprises the following steps: and dividing a portrait image corresponding to each face in the scene image according to the size information of the pedestrian frame, the coordinate information of the pedestrian frame, the size information of the face frame and the coordinate information of the face frame.
9. A video display device based on multiple cameras, comprising:
the receiving module is used for issuing acquisition instructions to the slaves and receiving video data returned by the slaves;
a calculation module for calculating a difference between the local time stamp and the video time stamp of each video data;
the reservation module is used for reserving video data of which the difference value between the local time stamp and the video time stamp is smaller than or equal to a preset value;
the generation module is used for generating a video to be displayed according to the reserved video data;
and the display module is used for displaying the video to be displayed.
10. A storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the video presentation method of any of claims 1 to 8.
CN202310557214.8A 2023-05-16 2023-05-16 Video display method and device based on multiple cameras and storage medium Pending CN116567349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310557214.8A CN116567349A (en) 2023-05-16 2023-05-16 Video display method and device based on multiple cameras and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310557214.8A CN116567349A (en) 2023-05-16 2023-05-16 Video display method and device based on multiple cameras and storage medium

Publications (1)

Publication Number Publication Date
CN116567349A true CN116567349A (en) 2023-08-08

Family

ID=87499843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310557214.8A Pending CN116567349A (en) 2023-05-16 2023-05-16 Video display method and device based on multiple cameras and storage medium

Country Status (1)

Country Link
CN (1) CN116567349A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422617A (en) * 2023-10-12 2024-01-19 华能澜沧江水电股份有限公司 Method and system for realizing image stitching of video conference system
CN117422617B (en) * 2023-10-12 2024-04-09 华能澜沧江水电股份有限公司 Method and system for realizing image stitching of video conference system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination