CN118212137A - Video processing method, device, equipment and storage medium - Google Patents


Info

Publication number: CN118212137A
Application number: CN202410158607.6A
Authority: CN (China)
Prior art keywords: image, frame, portrait, video, constraint
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventor: 袁振威
Current and original assignee: Black Sesame Intelligent Technology Co., Ltd. (listed assignees may be inaccurate)
Application filed by: Black Sesame Intelligent Technology Co., Ltd.


Abstract

The application provides a video processing method, apparatus, device and storage medium. The main technical scheme comprises the following steps: acquiring the original pose information of the portrait contained in each frame of a video to form a portrait original pose track; filtering the portrait original pose track to obtain a portrait virtual pose track, where the portrait virtual pose track comprises the portrait virtual pose information corresponding to each frame; performing image transformation on each frame of the video based on preset constraint conditions to obtain corrected frames, where the constraint conditions comprise minimizing a target constraint and the target constraint comprises at least a first constraint term representing, within a time window of preset duration, the difference between the portrait pose information contained in each corrected frame and the portrait virtual pose information corresponding to that frame; and obtaining a result image from each corrected frame to form the processed video. The application realizes portrait-based anti-shake processing and improves the stability of the portrait in the processed video.

Description

Video processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of multimedia processing technologies, and in particular, to a video processing method, apparatus, device, and storage medium.
Background
In daily production and life, when video is shot with mobile devices such as handheld, wearable or vehicle-mounted devices, picture shake is inevitably introduced, which degrades the appearance of the footage. A number of video anti-shake techniques, such as optical anti-shake and electronic anti-shake, have been proposed to improve video quality.
Some shooting scenarios, such as selfie video and sports video, focus mainly on a portrait, and for such videos the stability of the portrait must be ensured. At present, conventional electronic anti-shake technology is applied directly to videos containing portraits. It achieves a good effect on the overall stability of the image, but it cannot guarantee the anti-shake effect on the portrait itself; stabilizing the background may even make the portrait shake more.
Disclosure of Invention
In view of the above, the present application provides a video processing method, apparatus, device and storage medium for performing anti-shake processing on a video, so as to improve the stability of a portrait in the video.
In a first aspect, there is provided a video processing method, the video including a plurality of frame images, the method comprising:
acquiring the original pose information of the portrait contained in each frame of the video to form a portrait original pose track;
filtering the portrait original pose track to obtain a portrait virtual pose track, where the portrait virtual pose track comprises the portrait virtual pose information corresponding to each frame;
performing image transformation on each frame of the video based on preset constraint conditions to obtain corrected frames, where the constraint conditions comprise minimizing a target constraint, and the target constraint comprises at least a first constraint term representing the difference, within a time window of preset duration, between the portrait pose information contained in each corrected frame and the portrait virtual pose information corresponding to that frame;
and obtaining a result image from each corrected frame to form the processed video.
According to an implementation manner of the embodiment of the present application, filtering the portrait original pose track to obtain a portrait virtual pose track includes:
performing, for each frame in the video, respectively: weighting the portrait original pose information of the current frame and a preset number of preceding frames to obtain the portrait virtual pose information of the current frame;
and forming the portrait virtual pose track from the portrait virtual pose information of each frame in the video.
According to an implementation manner of the embodiment of the present application, obtaining a result image from each corrected frame includes:
cropping, from each corrected frame, the image within a region of interest and taking the cropped image as the result image of that frame, where the region of interest is a region of preset shape, size and position in the image coordinate system.
According to an implementation manner of the embodiment of the present application, the method further includes: acquiring the original pose information of each frame of the video;
the target constraint further comprises a second constraint term representing the difference, within the time window of preset duration, between the image pose information of each corrected frame and the original pose information of the corresponding frame.
According to an implementation manner of the embodiment of the present application, the target constraint further includes a third constraint term and/or a fourth constraint term;
the third constraint term represents the difference, within the time window of preset duration, between the image pose information of each corrected frame and the image pose information of the previous corrected frame;
the fourth constraint term represents the difference, within the time window of preset duration, between the gradient corresponding to each frame on the image virtual pose track and the gradient corresponding to the previous frame on the image virtual pose track, where the image virtual pose track is formed by the image pose information of the corrected frames.
According to an implementation manner of the embodiment of the present application, the constraint conditions further include: each corrected frame covers the region of interest.
According to an implementation manner of the embodiment of the present application, performing image transformation on each frame of the video based on preset constraint conditions includes performing, for each frame of the video, respectively:
predicting, for the current frame and based on the preset constraint conditions, the image pose information of the corresponding corrected frame;
determining the rotation matrix corresponding to the current frame using the image pose information of the corrected frame and the original pose information of the current frame;
and performing image transformation on the current frame using the rotation matrix corresponding to the current frame to obtain the corrected frame corresponding to the current frame.
According to an implementation manner of the embodiment of the present application, the time window of preset duration in the constraint conditions on which the current frame is based is: a time window of preset duration centered on the current frame, or a time window of preset duration with the current frame as an endpoint.
According to an implementation manner of the embodiment of the present application, the target constraint is obtained by weighting the constraint terms it includes.
In a second aspect, there is provided a video processing apparatus, the video including a plurality of frame images, the apparatus comprising:
a pose acquisition unit configured to acquire the original pose information of the portrait contained in each frame of the video to form a portrait original pose track;
a track filtering unit configured to filter the portrait original pose track to obtain a portrait virtual pose track, where the portrait virtual pose track comprises the portrait virtual pose information corresponding to each frame;
an image transformation unit configured to perform image transformation on each frame of the video based on preset constraint conditions to obtain corrected frames, where the constraint conditions comprise minimizing a target constraint, the target constraint comprises at least a first constraint term, and the first constraint term represents the difference, within a time window of preset duration, between the portrait pose information contained in each corrected frame and the portrait virtual pose information corresponding to that frame;
and an image post-processing unit configured to obtain a result image from each corrected frame to form the processed video.
In a third aspect, an electronic device is provided, the electronic device comprising:
one or more processors, and
A memory storing a program comprising instructions which, when executed by a processor, cause the processor to perform the method of the first or second aspect described above.
In a fourth aspect, there is provided a computer readable storage medium storing a program comprising instructions which, when executed by one or more processors of a computing device, cause the computing device to perform the method of the first or second aspects described above.
According to the specific embodiment provided by the application, the application discloses the following technical effects:
1) The application considers the track of the portrait pose in the video and adds portrait-based constraint conditions to the image transformation, so that the corrected frames obtained after image transformation stay as close as possible to the portrait virtual pose obtained by filtering the portrait original pose, improving the stability of the portrait in the processed video.
2) The application can further introduce a second constraint term that keeps the image pose information of the corrected frames as consistent as possible with the original pose information of each frame, thereby ensuring background following while performing portrait anti-shake and further improving the anti-shake effect of the video.
3) The application can further introduce a third constraint term and/or a fourth constraint term, where the third constraint term keeps the image pose information of corrected frames of adjacent frames as close as possible, and the fourth constraint term keeps the gradients of adjacent frames on the image virtual pose track as close as possible, thereby ensuring smoothness between adjacent frames while achieving the portrait anti-shake effect and further improving the anti-shake effect of the video.
4) The application obtains the target constraint by weighting the constraint terms, so that the anti-shake strength, background following and smoothness of the processed video can be controlled by adjusting the weight parameters to achieve the best video effect.
Of course, it is not necessary for any product embodying the application to achieve all of the advantages described above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a system architecture to which embodiments of the present application are applicable;
Fig. 2 is a flowchart of a video processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of one implementation of step 205 in FIG. 2 provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of clipping a corrected image according to an embodiment of the present application;
FIG. 5a is a graph illustrating an x-coordinate value of an experimental video according to an embodiment of the present application;
FIG. 5b is a graph illustrating the y-coordinate values of an experimental video according to an embodiment of the present application;
Fig. 6 is a schematic block diagram of a video processing apparatus according to an embodiment of the present application;
Fig. 7 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if determined" or "if (a stated condition or event) is detected" may be interpreted as "when determined" or "in response to determining" or "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)", depending on the context.
Electronic anti-shake refers to restoring and enhancing images through algorithmic post-processing to approach the effect of optical anti-shake. In concrete implementations, the stabilization effect is generally achieved by dynamically cropping the edges of the video picture in combination with the device pose changes detected by an IMU (Inertial Measurement Unit) sensor such as a gyroscope in the device. However, conventional electronic anti-shake technology focuses on the pose change of the whole image, and the portrait may become shakier precisely because the background is stabilized.
In view of this, the present application adopts a new idea to solve the problem of shake in videos containing portraits. For ease of understanding, a brief description of a system architecture to which the present application is applicable is first provided. As shown in fig. 1, the system architecture may include: a video capture device 101, a storage device 102 and a video processing apparatus 103.
The video capture device 101 implements the video capture function and may be, for example, a video camera, a video recorder, a camera or a video capture card. The video acquired by the video capture device 101 includes multiple frames of images, which may be stored in the storage device 102.
As one possible implementation, the video processing apparatus 103 may directly perform anti-shake processing on the video acquired by the video capture device 101 and store the processed video in the storage device 102, i.e., a real-time processing manner, which is shown in fig. 1.
As another implementation, the video processing apparatus 103 may perform anti-shake processing on the original video stored in the storage device 102 (i.e., video acquired by the video capture device 101 that has not undergone anti-shake processing) and then store the processed video in the storage device 102 or another storage device.
The specific manner in which the video processing apparatus 103 performs the anti-shake processing will be described in detail in the following embodiments.
The video capture device 101, the storage device 102 and the video processing apparatus 103 may be provided as separate devices, or may be provided in the same electronic device. The electronic device may be, for example, a smart phone, a digital camera, a video recorder, a personal computer, a notebook computer or a tablet computer.
For example, the electronic device may capture video in real time through its own camera (i.e., the video capture device), store previously captured video, or store video data acquired from other external devices via wired or wireless communication. The electronic device can perform anti-shake processing on video shot in real time or stored locally using the method provided by the embodiments of the present application. In this case, the video processing apparatus in the electronic device may be an application installed and running on the electronic device, or a functional unit such as a plug-in or software development kit (SDK) within an application. In addition, the electronic device may be provided with a display device so that the anti-shake-processed video can be displayed for the user to view.
The video processing apparatus 103 may also be deployed on a server side, with the video capture device 101 on the user side. The user uploads the video captured by the video capture device 101 to the server, the video processing apparatus 103 on the server performs anti-shake processing using the method provided by the embodiments of the present application, and the processed video is returned to the user. The server may be a single server or a server cluster composed of multiple servers. It may also be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and virtual private server (VPS) services.
It should be understood that the number of video acquisition devices 101, storage devices 102, and video processing apparatus 103 in fig. 1 is merely illustrative. There may be any number of video capture devices 101, storage devices 102, and video processing means 103, as desired for implementation.
Fig. 2 is a flowchart of a video processing method according to an embodiment of the present application, which may be performed by the video processing apparatus in the system architecture shown in fig. 1. As shown in fig. 2, the method may include the steps of:
Step 201: acquiring the original pose information of the portrait contained in each frame of the video to form a portrait original pose track.
Step 203: filtering the portrait original pose track to obtain a portrait virtual pose track, where the portrait virtual pose track comprises the portrait virtual pose information corresponding to each frame.
Step 205: performing image transformation on each frame of the video based on preset constraint conditions to obtain corrected frames, where the constraint conditions comprise minimizing a target constraint, the target constraint comprises at least a first constraint term, and the first constraint term represents the difference, within a time window of preset duration, between the portrait pose information contained in each corrected frame and the portrait virtual pose information corresponding to that frame.
Step 207: obtaining a result image from each corrected frame to form the processed video.
As can be seen from the above flow, the application considers the track of the portrait pose in the video and adds portrait-based constraint conditions to the image transformation, so that the corrected frames obtained after image transformation stay as close as possible to the portrait virtual pose obtained by filtering the portrait original pose, thereby improving the stability of the portrait in the processed video.
Each step in the above flow is described in detail below with reference to embodiments. It should be noted that the terms "first", "second" and the like in the embodiments of the present application do not imply size, order or quantity; they are used only to distinguish items by name. For example, "first constraint term", "second constraint term", "third constraint term" and "fourth constraint term" merely distinguish the constraint terms by name.
First, step 201, "acquiring the original pose information of the portrait contained in each frame of the video to form a portrait original pose track", is described in detail with reference to embodiments.
The video acquired by the video capture device comprises multiple frames of images, and the embodiments of the present application are directed at videos containing portraits, i.e., the video comprises multiple frames of images that contain a portrait. In the embodiments of the present application, the portrait pose in each frame before the video undergoes anti-shake processing is called the portrait original pose information.
Let the portrait original pose information in the i-th frame of the video be represented as F_i{x, y}, abbreviated F_i, where x and y may be the coordinates of the center of the face box in the frame, or the coordinates of the center of the whole-person box, expressed in the image coordinate system. The portrait original pose information F_i{x, y} contained in each frame of the video forms a track, namely the portrait original pose track.
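As an illustrative sketch (not part of the patent text), the portrait original pose track could be collected per frame with a few lines of Python; detect_face_box below is a hypothetical stand-in for any face or person detector:

```python
import numpy as np

def portrait_pose_track(frames, detect_face_box):
    """Collect F_i{x, y} for every frame: the face-box center in image coordinates.

    detect_face_box is a hypothetical detector returning (x0, y0, x1, y1)
    for the dominant portrait in a frame.
    """
    track = []
    for frame in frames:
        x0, y0, x1, y1 = detect_face_box(frame)
        track.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))  # center point F_i
    return np.asarray(track)  # shape (num_frames, 2)
```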
In addition, the original pose information of each frame of the video can be acquired; for example, the original pose information of the i-th frame can be represented as P_i{α, β, γ}, abbreviated P_i, where α, β and γ denote the three attitude angles of the frame, specifically the pitch angle (pitch), yaw angle (yaw) and roll angle (roll).
The original pose information of an image may be obtained from an inertial sensor in the video capture device. The inertial sensors may include, but are not limited to, acceleration sensors, angular rate sensors, gyroscopes, IMUs (Inertial Measurement Units), AHRS (Attitude and Heading Reference Systems), etc.
The original pose information of the image is expressed in the world coordinate system and needs to be converted into a representation in the image coordinate system so as to share a unified coordinate system with the portrait original pose information. How to convert a representation in the world coordinate system into the image coordinate system is by now a relatively mature technology and is not described in detail here.
Step 203 in the flow shown in fig. 2, "filtering the portrait original pose track to obtain a portrait virtual pose track", is described in detail below with reference to embodiments.
Since the motion of the photographed person causes the portrait in the captured video to shake, and the shake appears as jumps in the portrait pose, this step filters the portrait original pose track, which in effect smooths it. The track obtained after smoothing is called the "portrait virtual pose track". It can be regarded as a target track, i.e., a track formed by the portrait poses (the portrait virtual pose information) that the anti-shake processing is expected to achieve; the portrait poses in this track are relatively stable while respecting the motion trend of the original portrait.
The portrait virtual pose track comprises the portrait virtual pose information corresponding to each frame. Assume the portrait virtual pose information of the i-th frame after filtering is represented as S_i{x, y}, abbreviated S_i.
As one possible implementation, Gaussian filtering is used in this step. Gaussian filtering is a linear smoothing filter, usually implemented as a weighting process. Specifically, the following filtering may be performed for each frame: weight (e.g., by weighted summation) the portrait original pose information of the current frame and the preceding n frames (i.e., several historical frames) to obtain the portrait virtual pose information of the current frame. For example, the following formula may be employed:

$$S_i = \sum_{k=i-n}^{i} w_k F_k$$

where F_k represents the portrait original pose information of the k-th frame, and w_k is the weight coefficient corresponding to the k-th frame, which can be determined by a Gaussian distribution function; generally, the closer a frame is to the current frame (the i-th frame), the larger its weight coefficient, and vice versa. n is a preset positive integer, i.e., the formula obtains the portrait virtual pose information of the i-th frame by weighted summation over the portrait original pose information of the i-th frame and the preceding n frames.
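The following minimal Python sketch illustrates this causal Gaussian weighting; the window length n and the spread sigma are illustrative values, not taken from the patent:

```python
import numpy as np

def gaussian_smooth_track(track, n=15, sigma=5.0):
    """Causal Gaussian filter over the portrait pose track.

    S_i is the weighted sum of F_{i-n}..F_i; the weights follow a Gaussian
    that peaks at the current frame, matching the formula above.
    """
    smoothed = np.empty_like(track, dtype=float)
    for i in range(len(track)):
        lo = max(0, i - n)
        offsets = np.arange(lo, i + 1) - i          # 0 at the current frame
        w = np.exp(-(offsets ** 2) / (2 * sigma ** 2))
        w /= w.sum()                                # normalize the weights
        smoothed[i] = (w[:, None] * track[lo:i + 1]).sum(axis=0)
    return smoothed
```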
Through this filtering, the portrait virtual pose information obtained for each frame preserves the motion of the original portrait while the shake within that motion is filtered out.
In addition, the filtering performed in this step may take other forms, such as Kalman filtering. Kalman filtering is a method that, based on a state-space representation, processes a noisy input signal online to estimate the true signal.
Step 205 in the flow shown in fig. 2, "performing image transformation on each frame of the video based on preset constraint conditions to obtain corrected frames", is described in detail below with reference to embodiments.
In this step, the portrait virtual pose information corresponding to each frame in the portrait virtual pose track is in fact used as prior information to construct the optimization problem of portrait anti-shake. As shown in fig. 3, this step may be performed for each frame of the video, respectively:
Step 301: predicting, for the current frame and based on the preset constraint conditions, the image pose information of the corresponding corrected frame.
This in fact solves for the virtual pose information of the current frame based on the preset constraint conditions, to be used as the transformation target when the current frame undergoes image transformation; the virtual pose information of the current frame is therefore the image pose information of the corrected frame corresponding to the current frame. In the embodiments of the present application, the virtual pose information of the i-th frame is represented as V_i{α, β, γ}, abbreviated V_i.
The constraint conditions involved in the embodiments of the present application include minimizing the target constraint, and may further include requiring that the corrected frame cover the region of interest. These two constraint conditions are described in detail below.
The target constraint in the embodiments of the present application comprises at least a first constraint term, and may further comprise at least one of a second constraint term, a third constraint term and a fourth constraint term.
The first constraint term represents the difference, within the time window of preset duration, between the portrait pose information contained in each corrected frame and the portrait virtual pose information corresponding to that frame. As one possible implementation, the first constraint term may be determined by the sum or mean of those differences within the time window of preset duration.
For example, with the first constraint term corresponding to the i-th frame of the video denoted E_f(V_i), E_f(V_i) may be determined using the following formula:

$$E_f(V_i) = \sum_{j \in L_i} \left\| \mathrm{Proj}(F_j, P_j, V_j) - S_j \right\|_2^2$$

where L_i is the time window in which the i-th frame lies, j indexes the frames within the window, $\|\cdot\|_2$ denotes the 2-norm and $\|\cdot\|_2^2$ its square. Proj(F_j, P_j, V_j) denotes the homography transformation of F_j using P_j and V_j. The meaning of this first constraint term is therefore the sum, over L_i, of the differences between the portrait pose information obtained after transforming each frame and the portrait virtual pose information S_j, and the objective is to minimize it. The first constraint term is used to achieve the portrait anti-shake effect.
As one possible implementation, the time window of preset duration may be the duration of the whole video, in which case the constraint condition is to minimize the target constraint over the entire video duration.
As another, more preferred implementation, the time window of preset duration referred to in the embodiments of the present application may be shorter than the video duration, for example a window of 1 second. The time window may take the form of a sliding window: for the i-th frame, L_i may be a time window of preset duration centered on the i-th frame, which corresponds to minimizing the target constraint within several frames before and after the i-th frame. Alternatively, L_i may be a time window of preset duration with the i-th frame as an endpoint (the window start or the window end), which corresponds to minimizing the target constraint within several frames after, or several frames before, the i-th frame.
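As a small illustrative sketch (an assumption for illustration; the patent does not prescribe an implementation), such windows might be indexed as follows:

```python
def window_indices(i, half, num_frames):
    """Centered window L_i: up to 2*half + 1 frames around frame i.

    For a trailing window with frame i as the endpoint, use
    range(max(0, i - 2 * half), i + 1) instead.
    """
    return range(max(0, i - half), min(num_frames, i + half + 1))
```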
The second constraint term represents the difference, within the time window of preset duration, between the image pose information of each corrected frame and the original pose information of the corresponding frame. As one possible implementation, the second constraint term may be determined by the sum or mean of those differences within the time window of preset duration.
For example, with the second constraint term corresponding to the i-th frame of the video denoted E_o(V_i), E_o(V_i) may be determined using the following formula:

$$E_o(V_i) = \sum_{j \in L_i} \left\| V_j - P_j \right\|_2^2$$

This formula is effectively the sum of the differences between the image virtual pose information of each frame within L_i (i.e., the image pose information of the corrected frame) and the original pose information. The second constraint term is mainly used to achieve background following, i.e., keeping the background as consistent as possible with the original pose.
The third constraint term represents the difference, within the time window of preset duration, between the image pose information of each corrected frame and the image pose information of the previous corrected frame.
For example, with the third constraint term corresponding to the i-th frame of the video denoted E_c0(V_i), E_c0(V_i) may be determined using the following formula:

$$E_{c0}(V_i) = \sum_{j \in L_i} \left\| V_j - V_{j-1} \right\|_2^2$$

This formula in fact constrains the difference between the image virtual pose information of each frame within L_i and that of the previous frame, so that the image virtual poses of adjacent frames are as close as possible; the third constraint term therefore controls the smoothness of the image virtual pose.
The fourth constraint term represents the difference, within the time window of preset duration, between the gradient corresponding to each frame on the image virtual pose track and the gradient corresponding to the previous frame on the image virtual pose track, where the image virtual pose track is formed by the image pose information of the corrected frames.
The track formed by the virtual pose information of each frame (i.e., the image pose information of each corrected frame) is called the image virtual pose track, and the purpose of the fourth constraint term is to make the gradients of the virtual pose information of adjacent frames as close as possible.
For example, with the fourth constraint term corresponding to the i-th frame of the video denoted E_c1(V_i), E_c1(V_i) may be determined using the following formula:

$$E_{c1}(V_i) = \sum_{j \in L_i} \left\| (V_j - V_{j-1}) - (V_{j-1} - V_{j-2}) \right\|_2^2$$

This formula in fact constrains the difference between the gradient of the image virtual pose track at each frame within L_i and the gradient at the previous frame, so that the gradients of adjacent frames are as close as possible; the fourth constraint term therefore controls the smoothness of the image virtual pose track.
As one possible implementation, the target constraint may be obtained by weighting the constraint terms it includes, for example by weighted summation or weighted averaging. Taking a target constraint that includes the first constraint term E_f(V_i), the second constraint term E_o(V_i), the third constraint term E_c0(V_i) and the fourth constraint term E_c1(V_i) as an example, the target constraint can be expressed as:

$$E(V_i) = w_f E_f(V_i) + w_o E_o(V_i) + w_{c0} E_{c0}(V_i) + w_{c1} E_{c1}(V_i)$$

where w_f, w_o, w_c0 and w_c1 are the weight parameters of the constraint terms. The portrait anti-shake strength, background following and smoothness of the final video can be controlled by adjusting these weight parameters; for example, the weight parameters may be set to 1, 20, 2 and 2, respectively. They may be empirical or experimental values, or may be exposed for the user to set.
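Putting the reconstructed terms together, a minimal Python sketch of the weighted objective might look as follows; proj stands in for the homography re-projection Proj(F_j, P_j, V_j) and is an assumption, since the patent does not define its implementation:

```python
import numpy as np

def target_constraint(V, P, S, F, window, proj,
                      w_f=1.0, w_o=20.0, w_c0=2.0, w_c1=2.0):
    """Weighted objective E(V_i) over one time window.

    V: candidate virtual poses, P: original image poses, S: portrait virtual
    poses, F: portrait original poses; weights follow the example values
    1, 20, 2, 2 given above.
    """
    E_f = sum(np.sum((proj(F[j], P[j], V[j]) - S[j]) ** 2) for j in window)
    E_o = sum(np.sum((V[j] - P[j]) ** 2) for j in window)
    E_c0 = sum(np.sum((V[j] - V[j - 1]) ** 2) for j in window if j >= 1)
    E_c1 = sum(np.sum(((V[j] - V[j - 1]) - (V[j - 1] - V[j - 2])) ** 2)
               for j in window if j >= 2)
    return w_f * E_f + w_o * E_o + w_c0 * E_c0 + w_c1 * E_c1
```

In an actual solver, the poses V over the window would be the optimization variables, e.g., minimized by a nonlinear least-squares routine; the patent text does not specify a particular solver.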
The constraint conditions may further include, in addition to minimizing the target constraint: each corrected frame covers the region of interest. The region of interest is typically a region of preset shape, size and position, for example a rectangle of preset height H and width W. The idea of electronic anti-shake is to sacrifice the picture edges, cropping them dynamically, to achieve the stabilization effect. The subsequent steps involve image transformation, and the corrected frame obtained after the transformation is effectively a rotation of the original image. As shown in fig. 4, the large rectangular frame is the original image, and the dashed frame is the corrected frame obtained after the transformation. The corrected frame is cropped according to the region of interest (the small rectangular frame in the figure) to obtain the final processed image. Therefore, to prevent the cropped image from containing invalid regions (i.e., regions without content), a further constraint may be imposed so that the solved image virtual pose ensures the corrected frame covers the region of interest.
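One way to test this coverage constraint for a candidate pose, sketched under the assumption that the homography M_T = M_K M_R M_K^(-1) introduced in step 305 below is available, is to map the ROI corners back into the original frame and check that they land inside it:

```python
import numpy as np

def corrected_frame_covers_roi(M_T, roi_corners, width, height):
    """Coverage check: every ROI corner, mapped back into the original frame
    through the inverse homography, must land inside that frame."""
    corners = np.asarray(roi_corners, dtype=float)          # shape (4, 2)
    pts = np.hstack([corners, np.ones((len(corners), 1))])  # homogeneous coords
    pts = pts @ np.linalg.inv(M_T).T                        # back-projection
    pts = pts[:, :2] / pts[:, 2:3]                          # dehomogenize
    return bool(np.all((pts[:, 0] >= 0) & (pts[:, 0] <= width - 1) &
                       (pts[:, 1] >= 0) & (pts[:, 1] <= height - 1)))
```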
Based on the above constraint conditions, the virtual pose information of the current frame can be solved, i.e., the predicted image pose information V_i of the corrected frame corresponding to the current frame.
Step 303: determining the rotation matrix corresponding to the current frame using the image pose information of the corrected frame and the original pose information of the current frame.
A change of object pose is generally represented as rotations of the object about the x, y and z axes; once the original pose information and the target pose information (i.e., the solved image pose information of the corrected frame) are known, the rotation matrix can be determined. The specific method for determining the rotation matrix is well established and is not described in detail here.
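A sketch of this step in Python; the Euler-angle axis order and multiplication convention are assumptions, since the patent does not fix a convention:

```python
import numpy as np

def euler_to_matrix(alpha, beta, gamma):
    """Rotation matrix from pitch/yaw/roll angles (axis order assumed)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])   # pitch
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])   # yaw
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])   # roll
    return Rz @ Ry @ Rx

def rotation_between(P_i, V_i):
    """M_R takes the original pose P_i to the solved virtual pose V_i."""
    return euler_to_matrix(*V_i) @ euler_to_matrix(*P_i).T  # R_target · R_orig^-1
```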
Step 305: performing image transformation on the current frame using the rotation matrix corresponding to the current frame to obtain the corrected frame corresponding to the current frame.
The image transformation in this step in fact transforms the current frame according to the rotation matrix in the image coordinate system, so that the original pose of the current frame is converted into the predicted virtual pose.
Assuming the rotation matrix corresponding to the current frame is denoted M_R, the rotation matrix M_R and the camera intrinsic matrix M_K may be used to obtain the homography matrix M_T of the current frame:

$$M_T = M_K \, M_R \, M_K^{-1}$$
The camera intrinsic matrix M_K can be obtained by calibrating the camera of the video capture device that shot the video.
Then, coordinate transformation is performed on the current frame with the homography matrix M_T, i.e., the original pixel coordinates (u, v) are transformed into pixel coordinates (u', v'), which can be represented by the following formula:

$$s \begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} = M_T \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

where s is a scale factor. However, performing the coordinate transformation for every pixel is computationally expensive. It is therefore possible to transform only part of the pixel coordinates in the current frame and interpolate the rest from the transformed coordinates to obtain the corrected frame. Other ways of improving computational performance may also be employed and are not enumerated here.
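A compact sketch of steps 303-305 using OpenCV; the dense per-pixel warp below is the straightforward version, and the sparse-grid speedup mentioned above would instead transform a grid of coordinates and interpolate the rest (e.g., with cv2.remap):

```python
import cv2
import numpy as np

def correct_frame(frame, M_R, M_K):
    """Warp the current frame by the homography M_T = M_K · M_R · M_K^(-1).

    cv2.warpPerspective applies the coordinate transform to every pixel of
    the frame, producing the corrected frame.
    """
    M_T = M_K @ M_R @ np.linalg.inv(M_K)
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, M_T, (w, h))
```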
Step 207 in the flow shown in fig. 2, "obtaining a result image from each corrected frame to form the processed video", is described in detail below with reference to embodiments.
After each corrected frame is obtained, it can be cropped by capturing the image within the region of interest, yielding the result image of each frame. The region of interest (ROI) is a preset region, typically of preset shape, size and position. For example, a rectangle of preset height and width may be used, placed at a predetermined position in the image coordinate system. The region of interest may be set by default by the electronic device in which the video processing apparatus resides, or may be set by the user. For example, as shown in fig. 4, the region of interest may be placed at the center of the region occupied by the original image, with a size of 70%, 80% or some other proportion of the original image size. Besides the rectangle illustrated in fig. 4, the region of interest may be circular, elliptical, or even irregular.
Cropping each corrected frame based on the region of interest yields result images of identical shape, size and position, which form the final processed video. Besides cropping based on the region of interest, the corrected frames may be processed in other ways to obtain the result images, for example by compensating the edge pixels of the corrected frame.
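A minimal cropping sketch (the ROI tuple layout is an assumption for illustration):

```python
def crop_to_roi(corrected_frame, roi):
    """Cut the fixed region of interest out of a corrected frame.

    roi = (x, y, w, h) in image coordinates; the same rectangle is applied to
    every frame so all result images share shape, size and position.
    """
    x, y, w, h = roi
    return corrected_frame[y:y + h, x:x + w]
```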
To make it easier to distinguish and understand the method provided by the application in comparison with conventional electronic anti-shake, an experimental comparison was performed on a specific video. The horizontal axis in fig. 5a and 5b indicates the frame number of each image, e.g., the 1st frame, the 16th frame, the 31st frame, and so on; the vertical axis represents pixel values. The blue track is the portrait original pose track: in fig. 5a, each of its points is the x coordinate (in pixels) of the face-box center (x, y) in each frame, and in fig. 5b the y coordinate. The orange track is the portrait virtual pose track obtained by filtering the portrait original pose track: in fig. 5a, each of its points is the filtered x coordinate of the face-box center in each frame, and in fig. 5b the filtered y coordinate. The yellow track is the track of the image center of each frame: in fig. 5a, each of its points is the x coordinate of the image center, and in fig. 5b the y coordinate.
The approach of conventional electronic anti-shake essentially constrains the portrait original pose track (the blue track) toward the image center (the yellow track) during image transformation. The method provided by the application essentially constrains the portrait original pose track (the blue track) toward the filtered portrait virtual pose track (the orange track). Some frames are marked in fig. 5a and 5b (the parts boxed in red), and it can be seen that constraining the blue track to the yellow track requires stretching the image to a greater extent than constraining it to the orange track, loses more of the background, and gives worse background following. The scheme of the application constrains the blue track to the orange track, which filters out the shake and preserves the original motion track while ensuring a certain stability; the image is stretched to a smaller extent, so the image background is largely preserved and background following is greatly enhanced.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Fig. 6 shows a schematic block diagram of a video processing apparatus according to an embodiment. As shown in fig. 6, the apparatus 600 includes: a pose acquisition unit 601, a track filtering unit 602, an image transformation unit 603 and an image post-processing unit 604. The main functions of each unit are as follows:
The pose acquisition unit 601 is configured to acquire the original pose information of the portrait contained in each frame of the video to form a portrait original pose track.
The track filtering unit 602 is configured to filter the portrait original pose track to obtain a portrait virtual pose track, where the portrait virtual pose track comprises the portrait virtual pose information corresponding to each frame.
The image transformation unit 603 is configured to perform image transformation on each frame of the video based on preset constraint conditions to obtain corrected frames, where the constraint conditions comprise minimizing a target constraint, the target constraint comprises at least a first constraint term, and the first constraint term represents the difference, within a time window of preset duration, between the portrait pose information contained in each corrected frame and the portrait virtual pose information corresponding to that frame.
The image post-processing unit 604 is configured to obtain a result image from each corrected frame to form the processed video.
As one possible implementation, the image post-processing unit 604 may be specifically configured to crop, from each corrected frame, the image within a region of interest to obtain the result image of that frame, where the region of interest is a region of preset shape, size and position in the image coordinate system.
In the embodiments of the present application, the filtering performed by the track filtering unit 602 may include, but is not limited to, Gaussian filtering, Kalman filtering, and the like.
Still further, the pose acquisition unit 601 may be further configured to acquire the original pose information of each frame of the video. Correspondingly, the target constraint may further include a second constraint term representing the difference, within a time window of preset duration, between the image pose information of each corrected frame and the original pose information of the corresponding frame.
Still further, the target constraint may further include a third constraint term and/or a fourth constraint term.
The third constraint term represents the difference, within the time window of preset duration, between the image pose information of each corrected frame and the image pose information of the previous corrected frame.
The fourth constraint term represents the difference, within the time window of preset duration, between the gradient corresponding to each frame on the image virtual pose track and the gradient corresponding to the previous frame on the image virtual pose track, where the image virtual pose track is formed by the image pose information of the corrected frames.
Still further, the constraint conditions may further include: each corrected frame covers the region of interest.
As one possible implementation, the image transformation unit 603 may be specifically configured to perform, for each frame of the video, respectively:
predicting, for the current frame and based on the preset constraint conditions, the image pose information of the corresponding corrected frame; determining the rotation matrix corresponding to the current frame using the image pose information of the corrected frame and the original pose information of the current frame; and performing image transformation on the current frame using the rotation matrix corresponding to the current frame to obtain the corrected frame corresponding to the current frame.
As one possible implementation, the time window of preset duration in the constraint conditions on which the current frame is based is: a time window of preset duration centered on the current frame, or a time window of preset duration with the current frame as an endpoint.
As one possible implementation, the target constraint may be obtained by weighting the constraint terms it includes, where the weighting may include, for example, weighted summation or weighted averaging.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for a system or system embodiment, since it is substantially similar to a method embodiment, the description is relatively simple, with reference to the description of the method embodiment being made in part.
The method and apparatus provided by the embodiments of the present application are applicable to various scenarios; one example is given here.
For example, in a front-camera scenario, i.e., when video is shot with the front-facing camera of a mobile phone, the camera is typically used for self-portraits. The video processing apparatus provided by the embodiments of the present application may be built into a video processing application, for example in the form of a plug-in or a functional unit. If a user launches the video processing application and calls the front-facing camera to shoot video, the video shot by the front-facing camera is processed using the method provided by the embodiments of the present application, realizing the portrait anti-shake function.
As one possible implementation, the method provided by the embodiments of the present application may process the video shot by the front-facing camera in real time, and the processed video is displayed to the user on the display screen.
As another possible implementation, the method provided by the embodiments of the present application may process the video shot by the front-facing camera in real time, and the processed video is stored in the storage space of the mobile phone. In response to the user's request to view the video, the processed video is displayed to the user on the display screen.
As yet another implementation, the video shot by the front-facing camera is stored in the storage space of the mobile phone. If the user triggers anti-shake processing on the video through the video processing application, for example by selecting the video from the storage space through a control provided by the video processing application and triggering its anti-shake processing function through the control, the method provided by the embodiments of the present application can be used to process the video, and the processed video is displayed to the user on the display screen.
In addition, embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, the program comprising instructions which, when executed by one or more processors of a computing device, perform the steps of the method according to any of the preceding method embodiments.
The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the preceding method embodiments.
The electronic device provided in the embodiments of the present application may include one or more processors 701 as shown in fig. 7, and further include a memory 702 storing one or more programs, which are executed by the one or more processors 701 to implement the method flows and/or the apparatus operations shown in the above embodiments of the present application.
The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor 701 may process instructions for execution within the electronic device. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories.
The processor 701 may include one or more single-core or multi-core processors. The processor 701 may include any combination of general-purpose and special-purpose processors (e.g., image processors, application processors, baseband processors, etc.).
The memory 702 is a computer readable storage medium provided by the present application, and can be used for storing a non-transitory software program, a non-transitory computer executable program, and units, such as program instructions/units corresponding to the video processing method shown in fig. 2 in the embodiment of the present application. The processor 701 executes a video processing method such as that shown in fig. 2 in the above-described method embodiment by running non-transitory software programs, instructions, and units stored in the memory 702.
The electronic device may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or otherwise; connection by bus is taken as the example in fig. 7.
The input device 703 may receive input digital or character information and generate signal inputs related to user settings and function control of the video processing device. The output device 704 may output the result generated by the video processing method shown in fig. 2 to other devices connected to the electronic apparatus.
The above-described programs (also referred to as software, software applications, or code) include machine instructions of a programmable processor, and these computing programs may be implemented in an object-oriented programming language, assembly, or machine language.
With the development of time and technology, media has taken on a broader meaning, and the distribution path of a computer program is no longer limited to tangible media; it may, for example, be downloaded directly from a network. Any combination of one or more computer-readable storage media may be employed. The computer-readable storage medium can be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The foregoing embodiments are merely intended to illustrate the technical solutions of the application and to help understand its method and core idea. Meanwhile, a person of ordinary skill in the art may, in light of the idea of the application, make changes to the specific embodiments and the application scope. In view of the foregoing, this description should not be construed as limiting the application.
The foregoing description of the preferred embodiments of the application is not intended to limit the application; any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within the protection scope of the application.

Claims (12)

1. A method of video processing, the video comprising a plurality of frames of images, the method comprising:
acquiring original pose information of a portrait contained in each frame of image of the video to form an original pose track of the portrait;
filtering the original pose track of the portrait to obtain a virtual pose track of the portrait, wherein the virtual pose track of the portrait comprises portrait virtual pose information corresponding to each frame of image;
performing image transformation on each frame of image of the video based on a preset constraint condition to obtain each frame of corrected image, wherein the constraint condition comprises minimizing a target constraint, and the target constraint at least comprises a first constraint term which represents, within a time window of a preset duration, the difference between the pose information of the portrait contained in each frame of corrected image and the portrait virtual pose information corresponding to each frame of image; and
obtaining each frame of result image based on each frame of corrected image, so as to form the processed video.
2. The method of claim 1, wherein filtering the original pose track of the portrait to obtain the virtual pose track of the portrait comprises:
performing, for each frame of image in the video respectively: weighting the original pose information of the portrait in the current frame image and in a preset number of preceding frame images, to obtain the portrait virtual pose information of the current frame image; and
forming the virtual pose track of the portrait from the portrait virtual pose information of each frame of image in the video.
3. The method of claim 1, wherein obtaining each frame of result image based on each frame of corrected image comprises:
cropping an image within a region of interest from each frame of corrected image respectively, and taking the cropped images as the result images of the frames, wherein the region of interest is a region of preset shape, size and position in the image coordinate system.
4. The method according to claim 1, further comprising: acquiring original image pose information of each frame of image of the video;
wherein the target constraint further comprises a second constraint term which represents, within the time window of the preset duration, the difference between the image pose information of each frame of corrected image and the original image pose information of each frame of image.
5. The method according to claim 1, wherein the target constraint further comprises a third constraint term and/or a fourth constraint term;
the third constraint term represents, within the time window of the preset duration, the difference between the image pose information of each frame of corrected image and the image pose information of the preceding frame of corrected image;
the fourth constraint term represents, within the time window of the preset duration, the difference between the gradient corresponding to each frame on an image virtual pose track and the gradient corresponding to the preceding frame on the image virtual pose track, wherein the image virtual pose track is formed by the image pose information of the corrected images of the frames.
6. The method according to claim 3, wherein the constraint condition further comprises: each frame of corrected image covers the region of interest.
7. The method according to any one of claims 1 to 6, wherein performing image transformation on each frame of image of the video based on the preset constraint condition comprises: performing, for each frame of image of the video respectively:
predicting, based on the preset constraint condition, the image pose information of the corrected image corresponding to the current frame image;
determining a rotation matrix corresponding to the current frame by using the image pose information of the corrected image and the original image pose information of the current frame image; and
performing image transformation on the current frame image by using the rotation matrix corresponding to the current frame, to obtain the corrected image corresponding to the current frame image.
8. The method of claim 7, wherein the time window of the preset duration in the constraint condition on which the current frame is based is: a time window of the preset duration centered on the current frame, or a time window of the preset duration ending at the current frame.
9. The method according to any one of claims 1 to 6, wherein the target constraint is obtained by weighted summation of the constraint terms included in the target constraint.
10. A video processing apparatus, the video comprising a plurality of frames of images, the apparatus comprising:
a pose acquisition unit configured to acquire original pose information of a portrait contained in each frame of image of the video, to form an original pose track of the portrait;
a track filtering unit configured to filter the original pose track of the portrait to obtain a virtual pose track of the portrait, wherein the virtual pose track of the portrait comprises portrait virtual pose information corresponding to each frame of image;
an image transformation unit configured to perform image transformation on each frame of image of the video based on a preset constraint condition to obtain each frame of corrected image, wherein the constraint condition comprises minimizing a target constraint, the target constraint at least comprises a first constraint term, and the first constraint term represents, within a time window of a preset duration, the difference between the pose information of the portrait contained in each frame of corrected image and the portrait virtual pose information corresponding to each frame of image; and
an image post-processing unit configured to obtain each frame of result image based on each frame of corrected image, so as to form the processed video.
11. An electronic device, comprising:
one or more processors; and
a memory storing a program comprising instructions that, when executed by the one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 9.
12. A computer-readable storage medium storing a program comprising instructions that, when executed by one or more processors of a computing device, cause the computing device to perform the method of any one of claims 1 to 9.
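
For understanding only (not part of the claims): a minimal Python sketch of the weighted filtering described in claim 2, assuming each portrait pose is a 3-component vector (for example yaw, pitch and roll in radians) and assuming a linearly increasing weighting over the current frame and the preceding frames; the window size, the weights, and the function names are illustrative assumptions, not taken from the application.

# Illustrative sketch only; pose representation and weights are assumed.
import numpy as np

def filter_pose_track(raw_poses, window=5):
    # raw_poses: (T, 3) array of per-frame portrait poses.
    # Each virtual pose is a weighted average of the current pose and the
    # poses of up to window-1 preceding frames, with linearly increasing
    # weight toward the current frame.
    raw = np.asarray(raw_poses, dtype=float)
    weights = np.arange(1.0, window + 1.0)   # oldest .. newest
    virtual = np.empty_like(raw)
    for t in range(len(raw)):
        lo = max(0, t - window + 1)
        w = weights[-(t - lo + 1):]          # truncated near the video start
        virtual[t] = np.average(raw[lo:t + 1], axis=0, weights=w)
    return virtual

# Usage: a jittery track becomes measurably smoother.
rng = np.random.default_rng(0)
raw = np.cumsum(rng.normal(0.0, 0.02, size=(100, 3)), axis=0)
smooth = filter_pose_track(raw)
assert np.abs(np.diff(smooth, axis=0)).mean() < np.abs(np.diff(raw, axis=0)).mean()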
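
Also for understanding only: a sketch of the target constraint of claims 1, 4, 5 and 9 as a weighted sum of the four constraint terms, minimized jointly over the frames of one time window. The term weights, the pose parameterization, and the simplifying assumption that the portrait pose shifts rigidly with the image pose are hypothetical choices made for illustration.

# Illustrative sketch only; the lambdas and the rigid-offset approximation
# are assumptions.
import numpy as np
from scipy.optimize import minimize

def solve_corrected_poses(portrait_orig, portrait_virtual, image_orig,
                          lam=(1.0, 0.1, 0.5, 0.5)):
    # All inputs are (T, 3) arrays over one time window.
    T = len(image_orig)

    def cost(flat):
        p = flat.reshape(T, 3)               # candidate corrected-image poses
        # 1st term: portrait pose carried into the corrected frame
        # (portrait_orig shifted by the image-pose change) vs. virtual pose.
        c1 = np.sum((portrait_orig + (p - image_orig) - portrait_virtual) ** 2)
        # 2nd term: stay close to the original image pose.
        c2 = np.sum((p - image_orig) ** 2)
        # 3rd term: difference between consecutive corrected-image poses.
        d = np.diff(p, axis=0)
        c3 = np.sum(d ** 2)
        # 4th term: difference between consecutive gradients of the
        # image virtual pose track.
        c4 = np.sum(np.diff(d, axis=0) ** 2)
        return lam[0] * c1 + lam[1] * c2 + lam[2] * c3 + lam[3] * c4

    res = minimize(cost, image_orig.ravel(), method="L-BFGS-B")
    return res.x.reshape(T, 3)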
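
Finally, a sketch of claims 3 and 7 under a purely rotational camera model: the rotation between the original pose and the predicted corrected pose induces a homography through the camera intrinsics, the frame is warped by that homography, and a region of interest of fixed shape, size and position is cropped as the result image. The intrinsics K, the ROI geometry, and the use of OpenCV are assumptions made for illustration.

# Illustrative sketch only; K and the ROI are hypothetical.
import cv2
import numpy as np

def correct_frame(frame, pose_orig, pose_corr, K, roi):
    # pose_orig / pose_corr: Rodrigues rotation vectors of shape (3,).
    # roi: (x, y, w, h) in the image coordinate system.
    R_orig, _ = cv2.Rodrigues(np.asarray(pose_orig, dtype=float))
    R_corr, _ = cv2.Rodrigues(np.asarray(pose_corr, dtype=float))
    R = R_corr @ R_orig.T                    # rotation from original to corrected view
    H = K @ R @ np.linalg.inv(K)             # induced homography (pure rotation)
    h, w = frame.shape[:2]
    corrected = cv2.warpPerspective(frame, H, (w, h))
    x, y, rw, rh = roi
    return corrected[y:y + rh, x:x + rw]     # result image: fixed-position crop

# Usage with a synthetic frame and a small yaw correction.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
frame = np.full((480, 640, 3), 128, dtype=np.uint8)
result = correct_frame(frame, [0.0, 0.02, 0.0], [0.0, 0.0, 0.0], K, (40, 30, 560, 420))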
CN202410158607.6A 2024-02-02 2024-02-02 Video processing method, device, equipment and storage medium Pending CN118212137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410158607.6A CN118212137A (en) 2024-02-02 2024-02-02 Video processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN118212137A true CN118212137A (en) 2024-06-18

Family

ID=91451570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410158607.6A Pending CN118212137A (en) 2024-02-02 2024-02-02 Video processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118212137A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination