CN112218157A - System and method for intelligently focusing video - Google Patents
- Publication number
- Publication number: CN112218157A
- Application number: CN202011079550.9A (CN202011079550A)
- Authority
- CN
- China
- Prior art keywords
- video
- video frame
- focusing
- generating
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44012—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4882—Data services, e.g. news ticker for displaying messages, e.g. warnings, reminders
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/67—Focus control based on electronic image sensor signals
Abstract
The invention discloses a system and a method for intelligently focusing video. The system comprises a first video frame data acquisition unit, a video detection unit, a focusing magnification generation unit, a second video frame generation unit, and a second video frame rendering unit. The method comprises the steps of acquiring first video frame data; video detection; generating a focusing magnification; generating a second video frame; and rendering the second video frame. The method is characterized by more direct acquisition of content information, clear focusing, good detail fidelity, and a better interactive user experience.
Description
Technical Field
The invention relates to a video processing technology, in particular to a system and a method for intelligently focusing videos.
Background
In existing video conference, video-on-demand, and live video broadcast systems, video acquisition devices such as cameras and video capture cards are generally used to acquire video content in a user scene, which is then sent to other users through a video processing system. The installation and debugging cost of such acquisition devices is high, and if the installation position is improper, the acquired video contains a large amount of irrelevant background and other redundant content, failing to highlight the user as the main subject. Existing video acquisition systems capture content that is flat and undifferentiated: information of interest is weak, detail fidelity is mediocre, the user's focus of attention is not presented directly or strongly, and the interaction experience is poor. They also place high demands on the positional precision of the acquisition equipment, making installation and debugging inconvenient, time-consuming, and labor-intensive; accurately focused video can only be obtained after repeated adjustment.
Disclosure of Invention
The invention aims to provide a system and a method for intelligently focusing video, characterized by more direct acquisition of content information, clear focusing, good detail fidelity, a better interactive user experience, and more convenient equipment installation.
The technical scheme of the invention is as follows: a video intelligent focusing system comprises a first video frame data acquisition unit, a video detection unit, a focusing magnification generation unit, a second video frame generation unit and a second video frame rendering unit;
the unit for acquiring first video frame data is used for acquiring first video frame data from a target terminal or equipment as an input target processing object;
the video detection unit is used for detecting and generating, according to the first video frame data, indication information of a target object contained in the video content;
the focusing magnification generating unit is used for calculating and generating a focusing magnification and a focusing area according to the indication information and the resolution of the second video target;
the second video generation unit is used for calculating and generating a second video frame according to the focusing multiplying power, the focusing area and the first video frame data;
the second video frame rendering unit is used for rendering the second video frame.
In the foregoing system for intelligently focusing videos, the target processing object includes a first video frame acquired by a video capture device or a first video frame sent by a server or other terminals and received by a local terminal.
In the foregoing system for intelligently focusing a video, the video detection unit includes a scene content variation detection module and a scene content detection module, where the scene content variation detection module is configured to calculate a scene variation through a video content variation according to an input of at least one first video frame or a processed video frame, and reduce a calculation pressure of the scene content detection module; the scene content detection module is used for detecting the indication information of all target objects contained in the video content.
In the foregoing system for intelligently focusing a video, the video detection unit further includes a video frame detection module and a video frame conversion module, where the video frame detection module is used to detect a format of an input video frame, and the video frame conversion module is used to convert the format of the video frame into a video format required by the video detection unit.
In the foregoing system for intelligently focusing a video, the unit for generating a focusing magnification includes an information merging processing unit and a sampling processing unit, and the information merging processing unit is configured to merge and process indication information of at least two target objects; the sampling processing unit is used for carrying out format conversion and resolution size conversion processing on the grid position and the area coordinate of the target object.
A method for intelligently focusing videos comprises the following steps:
A. acquiring first video frame data: acquiring first video frame data to obtain a video format indication of a first video frame, and restoring and outputting an original video frame signal according to the video format indication;
B. video detection: detecting the variable quantity, the variable content and the variable area of the video content according to the first video frame data or the restored original video frame signal, and generating indication information of a target object contained in the video content;
C. generating a focusing magnification: calculating and generating a focusing multiplying factor and a focusing area according to the indication information and the second video target resolution;
D. generating a second video frame: calculating to generate a second video frame according to the focusing multiplying power, the focusing area and the first video frame data;
E. and rendering the second video frame.
In the foregoing method for intelligently focusing a video, the step C specifically includes the following steps:
a. acquiring indication information of a target object and a second video frame resolution;
b. merging indication information of the target objects: performing a union operation on the grid positions and area coordinates of two or more target objects;
c. sampling operation merged target object: sampling the grid position and the area coordinate of the target object;
d. a focusing magnification and a first focusing area are generated.
In the foregoing method for intelligently focusing a video, the step D specifically includes the following steps:
a. calculating and generating second video data from the first video frame data as indicated by the first focusing area;
b. sampling the video according to the second video data and the focusing multiplying power to generate a second video frame;
c. performing image enhancement processing on the generated second video frame to mitigate the loss of detail texture caused by video up-sampling.
In the foregoing method for intelligently focusing a video, the step C further includes performing corresponding resolution and edge preserving refinement processing on the first focus area based on the selected focusing magnification, so as to generate a second focus area.
In the foregoing method for intelligently focusing a video, the step D specifically includes the following steps:
a. calculating and generating second video data in the first video frame data according to the indication of the second focus area;
b. sampling the video according to the second video data and the focusing multiplying power to generate a second video frame;
c. performing image enhancement processing on the generated second video frame to mitigate the loss of detail texture caused by video sampling.
Compared with the prior art, the invention allows the user to acquire content information through the video picture more directly and realistically, closer to first-hand experience: the user's attention stays concentrated, focusing is clear, detail fidelity is good, and no user interaction is required, giving a better interactive experience. Focusing with this system can also correct and/or ignore the focusing difficulties and poor user experience caused by an improperly fixed installation position of the video acquisition equipment. The constraints on the installation position and angle of the equipment are therefore weak, repeated adjustment is unnecessary, and time and labor are saved, making installation of the acquisition equipment more convenient and debugging cheaper.
Therefore, the method and the device have the characteristics of more direct content information acquisition, clear focusing, good detail fidelity effect, better user interaction experience and more convenient equipment installation.
Drawings
FIG. 1 is a block diagram of a video smart focus system;
FIG. 2 is a flow chart of a video smart focusing method;
fig. 3 is a schematic flow chart of outputting an original video frame signal;
fig. 4 is a schematic flow chart of generating a second video frame.
Detailed Description
The present invention is further illustrated by the following examples, which are not to be construed as limiting the invention.
Example.
As shown in fig. 1, a system for video intelligent focusing includes a unit for acquiring first video frame data, a video detection unit, a unit for generating focusing magnification, a unit for generating second video frame, and a unit for rendering second video frame;
the unit for acquiring first video frame data is used for acquiring first video frame data from a target terminal or equipment as an input target processing object;
the target processing object comprises a first video frame acquired by video acquisition equipment or a first video frame which is received by a local terminal and issued by a server or sent by other terminals. The first video frame acquired by the video acquisition equipment is an original video frame signal or a first video frame signal generated by compression of a video coding specification. The original video signal contains format data of original video data, such as image luminance, chrominance YCbCr format data, or image color R, G, B format data. The first video frame signal generated by the compression of the video coding specification is a video compressed byte stream, such as a video compressed byte stream via ITU/ISOMPEG-2/MPEG-4/h.263/h.264/h.265.
The local terminal receives a first video frame sent by a server or sent by other terminals, wherein the first video frame comprises a first video frame signal generated by video coding specification compression or a data packet signal of media encapsulation containing a video coding sub-stream. The media encapsulated packet signal comprising the video encoded sub-stream is in the packet format of a video compressed byte stream, such as the packet format of RTP, TS, PS, MOV, MP4 or WebM comprising a video compressed byte stream such as ITU/ISO MPEG-2/MPEG-4/H.263/H.264/H.265.
The video detection unit is used for detecting and generating indication information of a target object contained in video content according to first video frame data, wherein the indication information of the target object comprises a target object name and position coordinates of the target object; the position coordinates of the target object represent the grid position and the area coordinates of the target object in the video content relative to the video;
the video detection unit comprises a scene content variation detection module and a scene content detection module. The scene content variable quantity detection module is used for calculating scene change according to the input of at least one first video frame or a video frame obtained by the first video frame after the video frame is subjected to scaling, stretching or image enhancement and the like, and reducing the calculation pressure of the scene content detection module;
specifically, the scene change includes the scene content, the amount of the scene content change, and the area where the scene content changes.
The scene content detection module is configured to detect indication information of all target objects included in the video content, for example: the indication information of the target object is detected and generated through processing or reasoning calculation by using the existing traditional image processing method or a neural network model based on deep learning of vision.
In addition, the scene content detection module uses the scene change to guide target-object tracking, which reduces computational load. Specifically, the scene content variation detection module measures the amount of scene content change; if it is less than or equal to a threshold, the current scene is considered unchanged from the previous one, so the current video frame need not be fed to the scene content detection module at all, further reducing computational load.
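A minimal sketch of this gating logic, assuming mean absolute luma difference as the variation metric (the patent does not fix a metric, threshold, or the `maybe_detect` helper; all are illustrative):

```python
import numpy as np

def scene_changed(prev_luma, curr_luma, threshold=2.0):
    # Mean absolute luma difference as the scene-content variation metric.
    # The metric and threshold are assumptions: the patent only states that
    # a variation at or below a threshold lets the frame skip the detector.
    diff = np.abs(curr_luma.astype(np.int16) - prev_luma.astype(np.int16))
    return float(diff.mean()) > threshold

def maybe_detect(prev_luma, curr_luma, detector, cached_result):
    # Only run the (expensive) scene content detector when the scene changed;
    # otherwise reuse the indication information from the previous frame.
    if scene_changed(prev_luma, curr_luma):
        return detector(curr_luma)
    return cached_result
```

In a real pipeline `detector` would be the scene content detection module; here it is any callable.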
The video detection unit also comprises a video frame detection module and a video frame conversion module. The video frame detection module detects the format of an input video frame; the video frame conversion module converts that format into the video format required by the video detection unit. If the format of the input video frame does not match the format required by the video detection unit, the frame must be converted to the required format and corresponding resolution; for example, a 1920x1080 progressive-scan YCbCr input is converted to 224x224 RGB.
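For illustration, a per-pixel YCbCr-to-RGB conversion, under the assumption of BT.601 full-range coefficients (the patent does not name a colorimetry standard, only that the frame is converted to the detector's format):

```python
def ycbcr_to_rgb_pixel(y, cb, cr):
    # BT.601 full-range conversion for one pixel (assumed variant).
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: max(0, min(255, round(v)))  # keep values in 8-bit range
    return clamp(r), clamp(g), clamp(b)
```

A full converter would apply this over the whole frame (vectorized) and then resize to the detector's input resolution, e.g. 224x224.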
The video detection unit further comprises a preset first target object list, i.e. a system-preset list of target objects of interest to the user; the preset first target object list is used to further guide generation of the target-object indication information.
The target object comprises one or more of: a human face, an upper-body figure, a whole-body figure, a computer, a monitor, a keyboard, a notebook, a book, a pen, a conference table, a chair, a cup, and the like.
The focusing magnification generating unit is used for calculating and generating a focusing magnification and a focusing area according to the indication information and the second video target resolution; the second video target resolution is the video resolution requested by the user, or the design resolution preset by the system, or the resolution equal to the first video.
The generating focusing magnification unit comprises an information merging processing unit used for merging and processing the indication information of at least two target objects. That is, if the result generated by the video detection unit includes two or more target objects, the information merging processing unit is configured to perform merging processing on the target object indication information, including a union operation on the grid position and the area coordinates of the target object.
The unit for generating focusing multiplying power also comprises a sampling processing unit which is used for carrying out video format conversion processing and resolution size conversion processing on the grid position and the area coordinate of the target object.
The focusing area is an area containing all target objects output by the video detection unit, and comprises a first focusing area and/or a second focusing area. The merged target-object area is the first focusing area; the second focusing area is obtained by applying resolution-preserving and edge-preserving refinement to the first focusing area based on the selected focusing magnification.
The corresponding resolution size of the focusing area can be consistent with or inconsistent with the resolution size of the second video frame, and when the two sizes are consistent, a better visual retention effect can be obtained.
The focusing magnification represents a ratio: the width or height of the second video frame resolution is divided, respectively, by the width or height of the video resolution in which the focusing area is located, correspondingly outputting R_fh and R_fv. Either R_fh or R_fv can be selected as the final focusing magnification, depending on the situation.
The focusing multiplying power generating unit also comprises a multi-stage multiplying power factor skip list preset in proportion. The multi-level magnification factor jumplist includes at least more than one focus magnification factor.
The multi-level multiplying factor skip list with preset proportion comprises the following steps: 8-fold, 6-fold, 4-fold, 3-fold, 2-fold, 1.5-fold, 1.2-fold, and 1-fold magnification factors.
The second video generation unit is used for calculating and generating a second video frame according to the focusing multiplying power, the focusing area and the first video frame data; the focusing area is a first focusing area or a second focusing area.
The render second video unit is to render a second video using conventional video rendering techniques.
As shown in fig. 2-4, a method for video smart focusing includes the following steps:
A. the method for acquiring the first video frame data specifically comprises the following steps:
a. acquiring first video frame data;
b. obtaining a video format indication of the first video frame or performing discrimination calculation to generate the video format indication of the first video frame;
c. and judging the type of the video format indication, performing corresponding processing according to the format indication, and restoring and outputting the original video frame signal.
If the input first video frame data is a media encapsulated data packet signal containing video coding sub-stream, recovering and outputting an original video frame signal through video format de-encapsulation and video decoding processing; if the input first video frame data is a first video frame signal generated by video coding standard compression, restoring and outputting an original video frame signal through video decoding processing; if the input first video frame data is the original video frame signal, no processing is performed.
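The three-way dispatch above can be sketched as follows; `demux` and `decode` are hypothetical stand-ins for a real de-encapsulator and decoder (e.g. FFmpeg bindings), with identity functions as defaults:

```python
def restore_original_frame(data, fmt, demux=lambda d: d, decode=lambda d: d):
    # Dispatch on the video format indication obtained in step A.
    if fmt == "packetized":    # RTP/TS/PS/MOV/MP4/WebM carrying a coded stream
        return decode(demux(data))   # de-encapsulate, then decode
    if fmt == "compressed":    # H.264/H.265/MPEG elementary stream
        return decode(data)          # decode only
    return data                # already a raw original video frame signal
```

The format-indication strings are illustrative; a real implementation would key off container/codec probing.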
B. Video detection:
detecting the variable quantity, the variable content and the variable area of the video content according to the first video frame data or the restored original video frame signal, and detecting and generating the indication information of all target objects contained in the video content through processing or reasoning calculation by using the conventional image processing method or a visual-based deep learning neural network model;
if the format of the input video frame does not match the video format required by the video detection unit, the frame must be converted, that is, its format is converted into the video format and corresponding resolution required by the video detection unit; for example, a 1920x1080 progressive-scan YCbCr input is converted to 224x224 RGB.
When the indication information is calculated, the amount of change of the video content is detected, and if the amount of change of the video content is smaller than or equal to a threshold value, the current scene and the previous scene are represented to be unchanged.
The video detection is computed on a central processing unit (CPU), graphics processor (GPU), video processor (VPU), or neural network processor (NPU); preferably, the NPU's built-in support is used to accelerate neural-network inference, further reducing power consumption and device heating.
C. Generating a focusing magnification: and calculating and generating a focusing multiplying factor and a focusing area according to the indication information and the second video target resolution. The second video target resolution is the video resolution requested by the user, or the design resolution preset by the system, or the resolution equal to the first video.
The method comprises the following steps:
a. acquiring indication information of a target object and a second video frame resolution;
b. merging the indication information of the target object; that is, when the target object includes two or more, the indication information of the target object needs to be merged; the merging process is to perform union operation on the grid position and the area coordinates of the target object.
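The union operation can be sketched as follows, assuming axis-aligned area coordinates of the form (x1, y1, x2, y2):

```python
def merge_boxes(boxes):
    # Union over target-object area coordinates: the smallest rectangle
    # enclosing every detected object. The merged area is the basis for
    # the first focusing area.
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))
```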
c. Sampling the merged target object;
The grid position and area coordinates of the target object are sampled; the sampling comprises format conversion and resolution size conversion of the video target object's grid position and area coordinates.
When the resolution size conversion is performed: the grid position and area coordinates output by the video detection unit are expressed at the resolution of the frame input to the video detection module, which differs from the second video resolution, so the position and area coordinates must be resampled to correspond to the target second video resolution.
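A sketch of that coordinate resampling, assuming simple linear scaling per axis (box format and rounding policy are illustrative):

```python
def rescale_box(box, detector_res, target_res):
    # Resample area coordinates from the detector's input resolution
    # (e.g. 224x224) to the target second video resolution.
    sx = target_res[0] / detector_res[0]
    sy = target_res[1] / detector_res[1]
    x1, y1, x2, y2 = box
    return (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))
```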
d. Generating a focusing multiplying factor and a first focusing area, wherein the first focusing area is all the target object areas merged in the step c;
or after the focusing multiplying power and the first focusing area are generated, the corresponding resolution and edge protection fine processing is carried out on the first focusing area based on the selected focusing multiplying power, and a second focusing area is generated and output.
The resolution-preserving and edge-preserving refinement comprises deleting pixel pairs in the left-right or up-down direction, or inserting pixel pairs in the left-right or up-down direction;
and calculating and generating a second target object area concerned by the user according to the target object indication information and a preset first target object list.
The insertion is performed subject to the boundary-protection condition of the input video frame: when the target object lies near the edge of the video frame, i.e. one side cannot absorb its full share of the pixel pairs to be inserted, the corresponding pixel columns or rows must be constructed at the edge of the focusing area where the insertion falls short, filling the padding needed to match the target proportion.
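A one-axis sketch of this boundary-protected insertion (the even split between sides and the shortfall accounting are assumptions; the vertical case is symmetric):

```python
def pad_within_frame(x1, x2, pad, frame_w):
    # Grow the span [x1, x2) by `pad` pixels split across left and right.
    # Near the frame edge, a side that cannot absorb its share hands the
    # remainder to the other side (boundary protection); any shortfall left
    # when the frame itself is too small is the number of padding columns
    # that must be synthesized.
    left = pad // 2
    right = pad - left
    new_x1, new_x2 = x1 - left, x2 + right
    if new_x1 < 0:
        new_x2 = min(frame_w, new_x2 - new_x1)  # push overflow rightward
        new_x1 = 0
    if new_x2 > frame_w:
        new_x1 = max(0, new_x1 - (new_x2 - frame_w))  # push overflow leftward
        new_x2 = frame_w
    shortfall = pad - ((new_x2 - new_x1) - (x2 - x1))
    return new_x1, new_x2, max(0, shortfall)
```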
Specifically, the focus magnification may be generated by:
The width and height of the second video frame resolution are divided, respectively, by the width and height of the video resolution in which the first focusing area is located, correspondingly outputting the ratios R_fh and R_fv.
If R_fh = R_fv, the aspect ratio of the first focusing area matches that of the second video frame, and the focusing magnification is R_fh (equivalently R_fv). Further, a second focusing area may be generated and output based on this magnification; it is equal to the first focusing area.
If R_fh ≠ R_fv, that is, the aspect ratio of the first focusing area does not match that of the second video frame, the focusing magnification is the larger of R_fh and R_fv.
Further, a second focusing area under that magnification may be generated by deleting a certain number of pixels in the corresponding horizontal or vertical direction (i.e. left-right or up-down, respectively), so that the resolution of the second focusing area matches the resolution of the target second video frame.
Alternatively, if R_fh ≠ R_fv and the target object region must not be clipped, the focusing magnification is the smaller of R_fh and R_fv.
Further, a certain number of pixels may be inserted in the horizontal or vertical direction corresponding to the selected focusing magnification (i.e. left-right or up-down, respectively) to generate the second focusing area under that magnification, whose resolution likewise matches the resolution of the target second video frame.
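The two R_fh ≠ R_fv branches can be sketched together as one helper (return convention and rounding are assumptions):

```python
def fit_focus_area(area_w, area_h, target_w, target_h, allow_crop=True):
    # R_fh and R_fv: per-axis ratio of the second frame resolution to the
    # first focusing area resolution. When they differ, take the larger
    # ratio and crop (delete pixels), or the smaller ratio and pad (insert
    # pixels) so the target object is never clipped. Returns the second
    # focusing area size in first-frame pixels plus the chosen magnification.
    r_fh = target_w / area_w
    r_fv = target_h / area_h
    r = max(r_fh, r_fv) if allow_crop else min(r_fh, r_fv)
    return round(target_w / r), round(target_h / r), r
```

For example, a 400x300 area targeted at 1280x720 gives R_fh = 3.2 and R_fv = 2.4: cropping keeps 400x225 of the area, padding extends it to 533x300.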
In addition, the focusing magnification can also be generated through a preset multi-stage magnification factor skip list. A multi-stage multiplying factor skip list is preset, and the multi-stage multiplying factor skip list comprises at least more than one focusing multiplying factor. The multi-level magnification factor jumplist comprises: 8-fold, 6-fold, 4-fold, 3-fold, 2-fold, 1.5-fold, 1.2-fold, and 1-fold magnification factors.
The coordinates of the first focusing area are then compared in turn against the resolution that each magnification factor in the multi-level jump list yields for the target second video frame, and the best-matching one is selected to compute the corresponding factor, i.e. the focusing magnification.
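One plausible "best match" rule, snapping the computed ratio down to the nearest preset factor (the patent leaves the exact matching criterion unspecified, so this is an assumption):

```python
PRESET_JUMP_LIST = [8, 6, 4, 3, 2, 1.5, 1.2, 1]  # ordered high to low

def snap_magnification(raw_ratio, jump_list=PRESET_JUMP_LIST):
    # Take the largest preset factor not exceeding the computed ratio, so
    # the focusing area is never magnified beyond what the detected region
    # supports; ratios below 1 fall back to 1x (no magnification).
    for factor in jump_list:
        if factor <= raw_ratio:
            return factor
    return jump_list[-1]
```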
Further, based on the selected focusing magnification, the first focusing area is subjected to resolution preserving and edge preserving refinement processing, and a second focusing area under the focusing magnification is generated.
D. Generating a second video frame:
and calculating to generate a second video frame according to the focusing magnification, the focusing area and the first video frame data.
The method specifically comprises the following steps:
a. calculating and generating, from the region of the first video frame data indicated by the first focusing area or the second focusing area, the corresponding second video data;
b. up-sampling the video according to the second video data and the focusing magnification to generate a second video frame;
c. applying image enhancement processing such as brightness, contrast, and sharpness adjustment to the generated second video frame, to mitigate the loss of detail texture caused by the video up-sampling.
E. Rendering a second video frame: the second video frame is rendered using conventional video rendering techniques.
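Steps D(a) through D(c) above can be sketched end to end as follows (a toy grayscale implementation over nested lists; the integer magnification, nearest-neighbour up-sampling, and the simple contrast stretch standing in for the enhancement step are all assumptions, since the patent does not fix these choices):

```python
def crop(frame, region):
    """Step a: extract the focus-region pixels (the second video data).
    frame: 2-D list of grayscale values; region: (x, y, w, h)."""
    x, y, w, h = region
    return [row[x:x + w] for row in frame[y:y + h]]

def upscale_nearest(patch, factor):
    """Step b: nearest-neighbour up-sampling by an integer magnification."""
    out = []
    for row in patch:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))  # fresh copy per row
    return out

def enhance_contrast(patch, gain=1.2):
    """Step c: a simple contrast stretch around the mean, standing in
    for the brightness/contrast/sharpness enhancement."""
    flat = [p for row in patch for p in row]
    mean = sum(flat) / len(flat)
    return [[min(255, max(0, round(mean + gain * (p - mean))))
             for p in row] for row in patch]

def second_video_frame(frame, region, magnification):
    """Chain the three sub-steps: crop, up-sample, enhance."""
    return enhance_contrast(upscale_nearest(crop(frame, region), magnification))
```

A production path would of course operate on packed pixel buffers with a real resampling filter (bilinear, bicubic) rather than Python lists.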
Claims (10)
1. A video intelligent focusing system, characterized by comprising: a first video frame data acquisition unit, a video detection unit, a focusing magnification generation unit, a second video frame generation unit, and a second video frame rendering unit;
the first video frame data acquisition unit is used for acquiring first video frame data from a target terminal or device as the input target processing object;
the video detection unit is used for detecting and generating, from the first video frame data, indication information of target objects contained in the video content;
the focusing magnification generation unit is used for calculating and generating a focusing magnification and a focusing area according to the indication information and the second video target resolution;
the second video frame generation unit is used for calculating and generating a second video frame according to the focusing magnification, the focusing area, and the first video frame data;
the second video frame rendering unit is used for rendering the second video frame.
2. The video intelligent focusing system according to claim 1, wherein: the target processing object comprises a first video frame captured by a video acquisition device, or a first video frame received by the local terminal and issued by a server or sent by another terminal.
3. The video intelligent focusing system according to claim 1, wherein: the video detection unit comprises a scene content change detection module and a scene content detection module; the scene content change detection module is used for calculating scene changes from at least one input first video frame or processed video frame and the amount of change in the video content, thereby reducing the computational load of the scene content detection module; the scene content detection module is used for detecting the indication information of all target objects contained in the video content.
4. The video intelligent focusing system according to claim 1, wherein: the video detection unit further comprises a video frame detection module and a video frame conversion module; the video frame detection module is used for detecting the format of an input video frame, and the video frame conversion module is used for converting the video frame into the video format required by the video detection unit.
5. The video intelligent focusing system according to claim 1, wherein: the focusing magnification generation unit comprises an information merging processing unit and a sampling processing unit; the information merging processing unit is used for merging the indication information of at least two target objects; the sampling processing unit is used for performing format conversion and resolution size conversion on the grid positions and area coordinates of the target objects.
6. A video intelligent focusing method, characterized by comprising the following steps:
A. acquiring first video frame data: acquiring first video frame data to obtain a video format indication of a first video frame, and restoring and outputting an original video frame signal according to the video format indication;
B. video detection: detecting the amount of change, the changed content, and the changed area of the video content according to the first video frame data or the restored original video frame signal, and generating indication information of the target objects contained in the video content;
C. generating a focusing magnification: calculating and generating a focusing magnification and a focusing area according to the indication information and the second video target resolution;
D. generating a second video frame: calculating and generating a second video frame according to the focusing magnification, the focusing area, and the first video frame data;
E. rendering the second video frame.
7. The method of claim 6, wherein step C specifically comprises:
a. acquiring the indication information of the target objects and the second video frame resolution;
b. merging the indication information of the target objects: performing a union operation on the grid positions and area coordinates of two or more target objects;
c. sampling the merged target objects: performing sampling on the grid positions and area coordinates of the target objects;
d. generating a focusing magnification and a first focusing area.
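The union operation in sub-step b might look like the following sketch (representing each target object as (x1, y1, x2, y2) corner coordinates is an assumption; the patent's "grid position" representation is not specified):

```python
def merge_boxes(boxes):
    """Union of two or more target-object boxes (x1, y1, x2, y2):
    the smallest axis-aligned box that contains them all."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)
```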
8. The method of claim 7, wherein step D specifically comprises:
a. calculating and generating second video data in the first video frame data according to the indication of the first focusing area;
b. up-sampling the video according to the second video data and the focusing magnification to generate a second video frame;
c. applying image enhancement processing to the generated second video frame to mitigate the loss of detail texture caused by the video up-sampling.
9. The method of claim 7, wherein: in step C, resolution-preserving and edge-preserving refinement processing is further performed on the first focusing area based on the selected focusing magnification to generate a second focusing area.
10. The method of claim 9, wherein step D specifically comprises:
a. calculating and generating second video data in the first video frame data according to the indication of the second focusing area;
b. up-sampling the video according to the second video data and the focusing magnification to generate a second video frame;
c. applying image enhancement processing to the generated second video frame to mitigate the loss of detail texture caused by the video sampling.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011079550.9A CN112218157A (en) | 2020-10-10 | 2020-10-10 | System and method for intelligently focusing video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112218157A true CN112218157A (en) | 2021-01-12 |
Family
ID=74053076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011079550.9A Pending CN112218157A (en) | 2020-10-10 | 2020-10-10 | System and method for intelligently focusing video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112218157A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104822088A (en) * | 2015-04-16 | 2015-08-05 | 腾讯科技(北京)有限公司 | Video image zooming method and device |
US20150268822A1 (en) * | 2014-03-21 | 2015-09-24 | Amazon Technologies, Inc. | Object tracking in zoomed video |
US20170302719A1 (en) * | 2016-04-18 | 2017-10-19 | Qualcomm Incorporated | Methods and systems for auto-zoom based adaptive video streaming |
US20190073747A1 (en) * | 2017-09-05 | 2019-03-07 | Microsoft Technology Licensing, Llc | Scaling render targets to a higher rendering resolution to display higher quality video frames |
Similar Documents
Publication | Title
---|---
CN108513131B | Free viewpoint video depth map region-of-interest coding method
WO2018103243A1 | Bandwidth conserving method, system, live-streaming terminal, and readable storage medium
CN105100579A | Image data acquisition processing method and related device
CN102572502B | Selecting method of keyframe for video quality evaluation
CN104618803A | Information push method, information push device, terminal and server
KR20100040236A | Two dimensional image to three dimensional image converter and conversion method using visual attention analysis
Bosc et al. | Visual quality assessment of synthesized views in the context of 3D-TV
CN111429357B | Training data determining method, video processing method, device, equipment and medium
CN108806638B | Image display method and device
EP2486732A1 | Method and device for calculating blur in video images
CN102457724A | Image motion detecting system and method
CN112153240B | Method and device for adjusting image quality and readable storage medium
US20140294307A1 | Content-based aspect ratio detection
CN112218157A | System and method for intelligently focusing video
CN114387440A | Video clipping method and device and storage medium
CN110619362B | Video content comparison method and device based on perception and aberration
CN115278189A | Image tone mapping method and apparatus, computer readable medium and electronic device
TWM535848U | Apparatus for combining with wavelet transformer and edge detector to generate a depth map from a single image
CN108108794B | Two-dimensional code image hiding-based visual information enhancement method and system
JP2007514362A | Method and apparatus for spatial scalable compression techniques
CN115834952A | Video frame rate detection method and device based on visual perception
JP2018147019A | Object extraction device, object recognition system and meta-data creating system
KR20090063101A | Method for generating distances representative of the edge orientations in a video picture, corresponding device and use of the method for deinterlacing or format conversion
KR101382227B1 | Method for classifying input image into window image and method and electronic device for converting window image into 3d image
CN110784716B | Media data processing method, device and medium
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210112