CN112788236B - Video frame processing method and device, electronic equipment and readable storage medium - Google Patents

Video frame processing method and device, electronic equipment and readable storage medium

Info

Publication number
CN112788236B
CN112788236B (application CN202011643661.8A)
Authority
CN
China
Prior art keywords
video frame
component matrix
image
data
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011643661.8A
Other languages
Chinese (zh)
Other versions
CN112788236A (en)
Inventor
李仕康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202011643661.8A priority Critical patent/CN112788236B/en
Publication of CN112788236A publication Critical patent/CN112788236A/en
Application granted granted Critical
Publication of CN112788236B publication Critical patent/CN112788236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/68 Control of cameras or camera modules for stable pick-up of the scene, e.g. compensating for camera body vibrations
    • H04N23/682 Vibration or motion blur correction
    • H04N23/683 Vibration or motion blur correction performed by a processor, e.g. controlling the readout of an image memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a video frame processing method and device, electronic equipment and a readable storage medium, and belongs to the field of image processing. The method comprises the following steps: acquiring an initial video frame and a current video frame; acquiring first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame; and inputting the initial video frame, the current video frame, the first multimode data and the second multimode data into a pre-trained image processing model to obtain a target video frame corresponding to the current video frame. The method and device solve the prior-art problem that the image enhancement effect is poor because electronic anti-shake during camera shooting requires excessive manual intervention and adjustment.

Description

Video frame processing method and device, electronic equipment and readable storage medium
Technical Field
The application belongs to the field of image processing, and particularly relates to a video frame processing method and device, electronic equipment and a readable storage medium.
Background
Existing EIS (Electronic Image Stabilization) anti-shake technology in mobile phones generally requires a large amount of preliminary calibration work to determine parameters such as the focal length of the camera and the drift and delay of the gyroscope, and, in combination with the data output by the gyroscope, computes a transformation matrix from the pixel points of a standard frame to the current frame. That is, EIS left-multiplies the current pixel position by the transformation matrix to obtain the corresponding pixel position in the standard frame, and crops certain image edges to complete the final anti-shake effect.
In the process of implementing the present application, the inventor finds that at least the following problems exist in the prior art:
Cropping of the image is necessary with current EIS techniques, which narrows the image's field of view, and translation compensation is less than ideal without the aid of OIS (Optical Image Stabilization). Meanwhile, the rolling shutter's exposure time is not fixed, which negatively affects the anti-shake effect; the whole EIS process is complicated, too many links require manual intervention and adjustment, and maintenance, troubleshooting, and effect improvement are somewhat tedious and difficult.
In view of the above problems, no effective solution has been proposed.
Summary of the application
An object of the embodiments of the present application is to provide a video frame processing method, an apparatus, an electronic device, and a readable storage medium, which can solve the problem in the prior art that an image enhancement effect is poor due to excessive human intervention and adjustment required for electronic anti-shake in a camera shooting process.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a method for processing a video frame, where the method includes: acquiring an initial video frame and a current video frame; acquiring first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame; inputting the initial video frame, the current video frame, the first multimode data and the second multimode data into an image processing model obtained by pre-training so as to obtain a target video frame corresponding to the current video frame.
In a second aspect, an embodiment of the present application provides a video frame processing apparatus, including: the first image acquisition unit is used for acquiring an initial video frame and a current video frame; a first obtaining unit, configured to obtain first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame; the first processing unit is configured to input the initial video frame, the current video frame, the first multimode data, and the second multimode data into an image processing model obtained through pre-training, so as to obtain a target video frame corresponding to the current video frame.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiment of the application, an initial video frame and a current video frame are obtained; first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame are obtained; and the initial video frame, the current video frame, the first multimode data and the second multimode data are input into a pre-trained image processing model to obtain a target video frame corresponding to the current video frame. Image processing of the video frame is realized through the multimode data, various factors in the image shooting process are integrated, and effective enhancement of the video frame is achieved. This solves the prior-art problem that the image enhancement effect is poor because electronic anti-shake during camera shooting requires excessive manual intervention and adjustment.
The foregoing description is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly and implemented in accordance with the specification, and in order that the above and other objects, features, and advantages of the present application may be more readily apparent, a detailed description of specific embodiments is given below.
Drawings
Fig. 1 is a schematic flow chart of an alternative video frame processing method in an embodiment of the present application;
FIG. 2a is a schematic diagram of the first four channels of an alternative component matrix in an embodiment of the present application;
FIG. 2b is a diagram of the last four channels of an alternative component matrix in an embodiment of the present application;
FIG. 2c is a schematic diagram of an alternative component matrix in an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative image processing model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an alternative encoding module in an embodiment of the present application;
FIG. 5 is a diagram illustrating an alternative biasing process for component matrices according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an alternative structure of a code convolution layer in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an alternative decoding module in an embodiment of the present application;
FIG. 8 is a schematic diagram of a structure of yet another alternative image processing model in an embodiment of the present application;
FIG. 9 is a schematic diagram of an image processing model training scenario in an embodiment of the present application;
fig. 10 is a schematic structural diagram of an alternative video frame processing apparatus in an embodiment of the present application;
fig. 11 is a schematic structural diagram of an alternative electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and the like in the description and in the claims of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or described herein. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The video frame processing method provided by the embodiments of the present application is described in detail below with reference to the accompanying drawings, through specific embodiments and application scenarios thereof.
An embodiment of the present application provides a video frame processing method, and with reference to fig. 1, a flow diagram of the video frame processing method of the present application is shown, where the method specifically includes the following steps:
s102, acquiring an initial video frame and a current video frame;
specifically, a video frame is acquired through a camera or other image acquisition components of the electronic terminal, and in this embodiment, the video frame is acquired in a manner including, but not limited to, video shooting and photo shooting. In one example, when a video is shot through a mobile phone, each video frame is acquired or video frames are collected according to a preset interval so as to obtain an initial video frame and a current video frame; in another example, in the case of taking a picture through a mobile phone, a preview video of a shooting target is displayed on a shooting preview interface of a camera of the mobile phone, and a video frame of the preview video is acquired to obtain an initial video frame and a current video frame.
S104, acquiring first multimode data corresponding to an initial video frame and second multimode data corresponding to a current video frame;
specifically, the multi-mode data includes, but is not limited to, image data of a video frame, line exposure time information of a rolling shutter, gyroscope information, and shooting time node, etc. multi-dimensional data. The method is suitable for different exposure time of each row and reduces the influence of the dynamic blurring motion blur, and the uncertain variability of the pixel space of the video frame along with the image jitter is fully considered.
In a practical application scenario, a processor of the electronic terminal acquires data from each sensor, for example, image data and line exposure time data from an optical sensor in the camera, orientation offset data from the gyroscope, or the time data corresponding to each video frame. In another example, the multimode data may be associated with the image data of a video frame, and the processor of the electronic terminal may directly acquire the multimode data corresponding to the video frame while acquiring the image data.
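For illustration only, the per-frame multimode data described above could be bundled into a simple record like the following sketch. The names FrameMultimodeData and collect_multimode_data are hypothetical and not taken from the application; the field layout merely mirrors the data items listed in this embodiment.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameMultimodeData:
    """Hypothetical container for the multimode data of one video frame."""
    image: np.ndarray          # H x W x 3 pixel data of the video frame
    row_exposure: np.ndarray   # per-row exposure time of the rolling shutter, shape (H,)
    gyro_xyz: np.ndarray       # gyroscope angular velocities about the X/Y/Z axes, shape (3,)
    timestamp: float           # acquisition time of the frame

def collect_multimode_data(image, row_exposure, gyro_xyz, timestamp):
    """Bundle the sensor readings that arrive with a frame into one record."""
    return FrameMultimodeData(
        image=np.asarray(image),
        row_exposure=np.asarray(row_exposure, dtype=np.float32),
        gyro_xyz=np.asarray(gyro_xyz, dtype=np.float32),
        timestamp=float(timestamp),
    )
```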
And S106, inputting the initial video frame, the current video frame, the first multimode data and the second multimode data into an image processing model obtained by pre-training so as to obtain a target video frame corresponding to the current video frame.
Specifically, in this embodiment, the image data of the initial video frame and the current video frame and the corresponding multi-mode data are input into the image processing model obtained by pre-training, and the image processing model performs image enhancement on the current video frame according to the image data and the multi-mode data corresponding to the initial video frame and the current video frame, so as to obtain the target video frame.
In the implementation of the present application, an image processing model needs to be trained first.
In some embodiments of the present application, a training sample set is constructed from video frames generated during photographing or shooting with the camera, where each training sample in the training sample set includes: an initial video frame, a current video frame, first multimode data corresponding to the initial video frame, second multimode data corresponding to the current video frame, and a target video frame. In one example, the training samples include the image pixel information, line exposure time, gyroscope bias data, and image acquisition time corresponding to the initial video frame and the current video frame, respectively.
Optionally, in this embodiment, the acquiring the first multimode data corresponding to the initial video frame and the second multimode data corresponding to the current video frame includes, but is not limited to: acquiring image data, line exposure time, image offset data and acquisition time which respectively correspond to an initial video frame and a current video frame; and determining a first component matrix corresponding to the initial video frame and a second component matrix corresponding to the current video frame according to the image data, the line exposure time, the image offset data and the acquisition time.
In a specific application scenario, for example, in a mobile phone, the image pixel information and the row exposure time of the rolling shutter corresponding to the initial video frame and the current video frame are acquired from the sensor, and the gyroscope data corresponding to the current video frame, that is, the angular velocities in the X, Y, and Z axis directions of the spatial coordinate system and the acquisition time, are acquired from the gyroscope.
With this embodiment, the first component matrix corresponding to the initial video frame and the second component matrix corresponding to the current video frame are determined according to the image data, the line exposure time, the image offset data and the acquisition time corresponding to each of them. The target video frame is produced from component matrices obtained by fusing the multimode data, so that the processing of the video frame provides anti-shake in every respect, such as rotation and translation.
Optionally, in this embodiment, the first component matrix corresponding to the initial video frame and the second component matrix corresponding to the current video frame are determined according to the image data, the line exposure time, the image offset data, and the acquisition time in a manner including, but not limited to: performing BGR conversion on the image data to obtain an image matrix corresponding to the image data; determining a first matrix according to the image matrix and the row exposure time; determining a second matrix according to the image offset data and the acquisition time; and combining the first matrix and the second matrix to obtain the first component matrix and the second component matrix.
In one example, as shown in the schematic diagram of the first four channels of the component matrix in FIG. 2a, the image pixel information is converted into the BGR space and the row exposure information is taken as the fourth channel of the BGR image, forming the matrix BGR-E; as shown in the diagram of the last four channels of the component matrix in FIG. 2b, the gyroscope X, Y and Z axis data and the timestamp data of the acquisition time are expanded into the last four channels, of the same size as BGR-E, the timestamp data being the timestamp of the current frame minus the timestamp of the standard frame; finally, the two parts are combined into the component matrix shown in FIG. 2c.
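A minimal sketch, under stated assumptions, of how such an eight-channel component matrix could be assembled: the image is converted to BGR, the per-row exposure is broadcast into a fourth channel, and the gyroscope X/Y/Z values and the frame-minus-standard timestamp difference are expanded into four more full-size channels. The function name build_component_matrix and the NumPy-based layout are illustrative, not part of the application.

```python
import numpy as np

def build_component_matrix(rgb_image, row_exposure, gyro_xyz, frame_ts, standard_ts):
    """Assemble the 8-channel component matrix of one video frame.

    Channels 0-3: B, G, R pixel data plus per-row exposure time (BGR-E).
    Channels 4-7: gyroscope X/Y/Z angular velocities and the timestamp of the
    current frame minus the timestamp of the standard frame, each expanded to
    the full image size.
    """
    h, w, _ = rgb_image.shape
    bgr = rgb_image[..., ::-1].astype(np.float32)              # RGB -> BGR
    exposure = np.repeat(np.asarray(row_exposure, dtype=np.float32).reshape(h, 1), w, axis=1)
    first_four = np.dstack([bgr, exposure])                    # BGR-E, shape (h, w, 4)

    gx, gy, gz = (np.full((h, w), v, dtype=np.float32) for v in gyro_xyz)
    dt = np.full((h, w), frame_ts - standard_ts, dtype=np.float32)
    last_four = np.dstack([gx, gy, gz, dt])                    # shape (h, w, 4)

    return np.dstack([first_four, last_four])                  # shape (h, w, 8)
```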
By the embodiment, the first matrix determined according to the image matrix and the row exposure time and the second matrix determined according to the image offset data and the acquisition time are combined to obtain the corresponding component matrix, so that the fusion of multimode data is realized, and the whole network of the video frame processing model can be made robust.
Firstly, video frames and the corresponding multimode data stored in a preset database are obtained. In general, any two video frames can be selected from a plurality of video frames generated while the camera is in motion, and the image data and corresponding multimode data of those two video frames are acquired to generate a training sample. Each training sample comprises information such as the image pixel information, line exposure time, gyroscope bias data, and image acquisition time corresponding to the initial video frame and to the current video frame, respectively. In some embodiments of the present application, each training sample is represented as a two-tuple <image pixel information, line exposure time, gyroscope bias data, and image acquisition time corresponding to the initial video frame; image pixel information, line exposure time, gyroscope bias data, and image acquisition time corresponding to the current video frame>. The training sample is then further processed into a two-tuple of component matrices, <the first component matrix corresponding to the initial video frame, the second component matrix corresponding to the current video frame>.
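Continuing the earlier sketches, one hypothetical way to turn such a raw two-tuple into the two-tuple of component matrices is shown below; it assumes the FrameMultimodeData record and build_component_matrix function sketched above, and treats the initial frame as the standard frame for the timestamp difference.

```python
def make_training_sample(initial_raw, current_raw):
    """Turn the raw two-tuple of per-frame records into the two-tuple of
    component matrices used as model input. Both arguments are assumed to be
    FrameMultimodeData-style records as sketched earlier."""
    first = build_component_matrix(
        initial_raw.image, initial_raw.row_exposure,
        initial_raw.gyro_xyz, initial_raw.timestamp, initial_raw.timestamp)
    second = build_component_matrix(
        current_raw.image, current_raw.row_exposure,
        current_raw.gyro_xyz, current_raw.timestamp, initial_raw.timestamp)
    return first, second
```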
Next, the image processing model is trained based on the constructed training sample set. And taking a first component matrix corresponding to an initial video frame and a second component matrix corresponding to a current video frame in the training sample as model inputs, and taking the reference image as a model target to train the image processing model.
Optionally, in this embodiment, the image processing model includes an encoding module and a decoding module, where the initial video frame, the current video frame, the first multimode data, and the second multimode data are input into the image processing model obtained by pre-training to obtain a target video frame corresponding to the current video frame, including but not limited to: inputting the first component matrix and the second component matrix into an encoding module of an image processing model to obtain a first target component matrix; and inputting the first target component matrix into a decoding module of the image processing model to obtain a target video frame.
Specifically, as shown in the schematic structural diagram of the image processing model in fig. 3, the image processing model 30 in this embodiment includes an encoder 310 and a decoder 320 that are sequentially arranged, and an output end of the encoder 310 is connected to an input end of the decoder 320. The first component matrix a300 and the second component matrix B300 are input to the encoder 310 to obtain a first target component matrix C300, the target component matrix is input to the decoder 320 to obtain a component matrix corresponding to the target video frame, and then the component matrix corresponding to the target video frame is subjected to image conversion to obtain a target video frame D300.
With the above embodiment, the first component matrix and the second component matrix are input into the encoding module to obtain the first target component matrix, and the first target component matrix is input into the decoding module to obtain the target video frame, so that the target video frame can combine the correlations among the various modalities of the initial video frame and the current video frame, improving the enhancement effect on the target video frame.
Optionally, in this embodiment, the encoding module includes at least one encoded convolutional layer and a first preset convolutional layer, and the at least one encoded convolutional layer is connected in series with the first preset convolutional layer, where the first component matrix and the second component matrix are input into the encoding module of the image processing model to obtain a first target component matrix, which includes but is not limited to: inputting the first component matrix and the second component matrix into at least one coding convolutional layer to obtain a first output corresponding to the first component matrix and a second output corresponding to the second component matrix; the first output and the second output are input to a first preset convolution layer to obtain a first target component matrix.
In an example, as shown in the schematic diagram of the encoding module shown in fig. 4, the encoding module 40 in this embodiment includes 2 encoding convolutional layers 410 and 1 first preset convolutional layer 420, where the encoding convolutional layer 410-1, the encoding convolutional layer 410-2 and the first preset convolutional layer 420 are connected in series, an output end of the encoding convolutional layer 410-1 is connected to an input end of the encoding convolutional layer 410-2, and an output end of the encoding convolutional layer 410-2 is connected to an input end of the first preset convolutional layer 420. The first component matrix a40 and the second component matrix a42 are input into the encoded convolutional layer 410-1 and the encoded convolutional layer 410-2 to obtain a first output B40 and a second output B42, and the first output B40 and the second output B42 are input into the first preset convolutional layer 420 to obtain a first target component matrix C40.
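The two-branch encoder of FIG. 4 might be sketched roughly as follows in PyTorch. This is a non-authoritative sketch: EncodedConvLayer stands for the shared-weight, per-branch-offset layer sketched after the description of FIG. 6 below, the channel counts are arbitrary, and fusing the two branches by channel concatenation inside the first preset convolutional layer is an assumption rather than the application's stated mechanism.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoding module of FIG. 4: two encoded convolutional
    layers in series followed by a first preset convolutional layer that
    fuses the two branches into the first target component matrix."""
    def __init__(self, channels=8):
        super().__init__()
        self.coded1 = EncodedConvLayer(channels, channels)   # layer 410-1
        self.coded2 = EncodedConvLayer(channels, channels)   # layer 410-2
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)  # layer 420

    def forward(self, first_component, second_component):
        out1a, out2a = self.coded1(first_component, second_component)
        out1b, out2b = self.coded2(out1a, out2a)               # first / second outputs
        fused = self.fuse(torch.cat([out1b, out2b], dim=1))    # first target component matrix
        # Per-layer output pairs are kept for the decoder's skip connections.
        return fused, [(out1b, out2b), (out1a, out2a)]
```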
Optionally, in this embodiment, the encoded convolutional layers include a first bias layer corresponding to the first component matrix, a second bias layer corresponding to the second component matrix, and a shared convolutional layer, where the first component matrix and the second component matrix are input to at least one encoded convolutional layer, including but not limited to: inputting the first component matrix to a first bias layer to obtain a third component matrix; inputting the second component matrix to a second bias layer to obtain a fourth component matrix; the third component matrix and the fourth component matrix are input to the shared convolutional layer respectively to obtain the first output and the second output.
Specifically, in this embodiment, the encoded convolutional layer includes three channels, which are the convolution kernel weight, the x component offset, and the y component offset. The first component matrix and the second component matrix are processed based on different x component offsets and y component offsets but the same convolution kernel weight, so as to adapt to different jitter amplitudes and motion blur influence ranges. As shown in fig. 5, the component matrix indexes corresponding to the initial video frame and the current video frame are offset by different x component offsets and y component offsets, and then the two inputs are convolved with the same shared convolution kernel to obtain a first output corresponding to the initial video frame and a second output corresponding to the current video frame.
Further, the component matrix index is biased according to formula (1), which is reproduced only as an image (reference BDA0002880653350000081) in the original publication; in formula (1), S is the index range of the standard convolution, a_ij is the input value corresponding to an index, and k_mn are the convolution kernel parameters.
In one example, as shown in fig. 6, encoded convolutional layer 60 includes a first bias layer 610 corresponding to a first component matrix, a second bias layer 620 corresponding to a second component matrix, and a shared convolutional layer 630, the first component matrix a60 is input to the first bias layer 610 to obtain a third component matrix B60, and the second component matrix a62 is input to the second bias layer 620 to obtain a fourth component matrix B62. The third component matrix B60 and the fourth component matrix B62 are input to the shared convolutional layer 630 to obtain a first output C60 corresponding to the first component matrix a60 and a second output C62 corresponding to the second component matrix a 62.
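A rough PyTorch sketch of such an encoded convolutional layer follows. The application biases the convolution sampling indexes per formula (1); as a simplification this sketch predicts per-pixel x/y offsets in each branch's bias layer, resamples the input with grid_sample, and then applies one shared convolution to both branches. The class names, the offset-prediction convolution, and the resampling strategy are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasLayer(nn.Module):
    """Predicts per-pixel x/y offsets and resamples the input accordingly
    (a simplified stand-in for the index biasing of FIG. 5)."""
    def __init__(self, channels):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, x):
        n, _, h, w = x.shape
        # Base sampling grid in normalized [-1, 1] coordinates (x first, then y).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)
        # Predicted pixel offsets, converted to normalized coordinates.
        delta = self.offset(x).permute(0, 2, 3, 1)
        delta = torch.stack(
            (delta[..., 0] * 2.0 / max(w - 1, 1),
             delta[..., 1] * 2.0 / max(h - 1, 1)), dim=-1)
        return F.grid_sample(x, grid + delta, mode="bilinear", align_corners=True)

class EncodedConvLayer(nn.Module):
    """Sketch of FIG. 6: one bias layer per branch, one shared convolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.bias_first = BiasLayer(in_channels)    # for the first component matrix
        self.bias_second = BiasLayer(in_channels)   # for the second component matrix
        self.shared = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, first, second):
        third = self.bias_first(first)       # third component matrix
        fourth = self.bias_second(second)    # fourth component matrix
        return self.shared(third), self.shared(fourth)
```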
Optionally, in this embodiment, the first output and the second output are input to the first predetermined convolutional layer to obtain the first target component matrix, which includes but is not limited to: inputting the first output and the second output to the first predetermined convolutional layer respectively; and performing convolution operation on the first output and the second output through a first preset convolution layer respectively to obtain a first target component matrix.
Specifically, in this embodiment, the features of the initial video frame and the current video frame after the x and y offset processing are convolved by the first preset convolution layer to obtain abstract semantic information of the initial video frame and the current video frame, where the semantic information includes the correlation between the multimode data and the video frame images.
In an example, still taking the encoding module shown in fig. 4 as an illustration, the encoding convolutional layer 410-1 and the encoding convolutional layer 410-2 produce a first output B40 and a second output B42, and the first output B40 and the second output B42 are input to the first preset convolutional layer 420 to obtain the first target component matrix.
Optionally, in this embodiment, the decoding module includes at least one decoding convolutional layer and a second preset convolutional layer connected in series, where the first target component matrix is input into the decoding module of the image processing model to obtain the target video frame, including but not limited to: inputting the first target component matrix into at least one decoding convolution layer to obtain a second target component matrix; and inputting the second target component matrix into a second preset convolution layer to obtain a target video frame.
Specifically, in this embodiment, by obtaining high-level abstract semantic information output by a coding module in an image processing model, the currently input coding information is correspondingly fused on the basis of the semantic information for decoding, and finally, a target video frame is output.
In one example, as shown in FIG. 7, decoding module 70 comprises decoded convolutional layer 710-1, decoded convolutional layer 710-2 and a second predetermined convolutional layer 720, wherein the output of decoded convolutional layer 710-1 is connected to the input of decoded convolutional layer 710-2, and the output of decoded convolutional layer 710-2 is connected to the input of a second predetermined convolutional layer 720. The first target component matrix is input into the decoded convolutional layer 710-1, and the target video frame is output in the second preset convolutional layer 720.
Optionally, in this embodiment, the number of decoding convolutional layers in the decoding module is the same as the number of encoding convolutional layers in the encoding module, and the decoding convolutional layers correspond to the levels of the encoding convolutional layers one to one, where the first target component matrix is input into at least one decoding convolutional layer to obtain a second target component matrix, which includes but is not limited to: acquiring a third output and a fourth output of the coding convolutional layer corresponding to the current decoding convolutional layer; and, obtaining a fifth output of a last decoded convolutional layer adjacent to the current decoded convolutional layer; determining a first input corresponding to the current decoding convolutional layer according to the third output, the fourth output and the fifth output; the first input is input to the currently decoded convolutional layer.
Specifically, in one example, the image processing model shown in fig. 8 includes an encoding module 80 and a decoding module 82, wherein the encoding module 80 includes 3 encoded convolutional layers 810 and 1 first preset convolutional layer 820, and the decoding module 82 includes a decoded convolutional layer 830-1, a decoded convolutional layer 830-2, a decoded convolutional layer 830-3, and 1 second preset convolutional layer 840. Wherein each encoded convolutional layer includes a first bias layer 8102, a second bias layer 8104, and a shared convolutional layer 8106. Taking the decoded convolutional layer 830-1 and the decoded convolutional layer 830-2 as an example, the third output of the encoded convolutional layer 810 corresponding to the decoded convolutional layer 830-1 is the component matrix a80, the fourth output is the component matrix a82, the output of the first preset convolutional layer 820 is the first target component matrix B80, the component matrix a80, the component matrix a82 and the first target component matrix B80 are subjected to component matrix splicing or matrix point addition to obtain a first input C80, and the first input is input to the decoded convolutional layer 830-1 to obtain the component matrix B82. Next, for the decoded convolutional layer 830-2, based on the third output of the corresponding encoded convolutional layer 810 being the component matrix a84 and the fourth output being the component matrix a86, the component matrix a84, the component matrix a86 and the component matrix B82 are subjected to component matrix splicing or matrix point addition to obtain a first input C82, and C82 is input to the decoded convolutional layer 830-2.
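A corresponding decoder sketch is given below, again as an assumption-laden PyTorch illustration: each decoding layer splices (here, channel-concatenates; the application also permits matrix point addition) the matching encoded layer's two outputs with the previous decoder output, and a second preset convolutional layer emits the target frame. Layer counts, channel sizes, and the three-channel output are illustrative.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoding module of FIG. 7 / FIG. 8: each decoding
    convolutional layer fuses its matching encoded layer's outputs (the skip
    connection) with the previous decoder output; a second preset
    convolutional layer then produces the target video frame."""
    def __init__(self, channels=8, num_layers=2, out_channels=3):
        super().__init__()
        # Each decoding layer sees skip (2 * channels) + previous output (channels).
        self.decode_layers = nn.ModuleList(
            nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)
            for _ in range(num_layers)
        )
        self.final = nn.Conv2d(channels, out_channels, kernel_size=3, padding=1)

    def forward(self, target_component, encoder_outputs):
        # encoder_outputs: list of (first_output, second_output) pairs, one per
        # encoded convolutional layer, ordered to match the decoding layers.
        x = target_component
        for layer, (skip1, skip2) in zip(self.decode_layers, encoder_outputs):
            x = layer(torch.cat([skip1, skip2, x], dim=1))  # splice, then convolve
        return self.final(x)  # component matrix / image of the target video frame
```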
Through the above embodiment, the original encoded information of the current input is merged into the high-level semantic information for decoding, and the target video frame is finally output, so that image enhancement of the video frame is realized and the image enhancement effect is improved.
Optionally, in this embodiment, before inputting the initial video frame, the current video frame, the first multimodal data, and the second multimodal data into the pre-trained image processing model, the method further includes, but is not limited to: acquiring reference images and a preset number of training images; acquiring multimode data corresponding to the reference images and a preset number of training images respectively; constructing reference data and a training data set corresponding to the image processing model; training the image processing model based on the loss function, the training data set and the reference data so as to enable the fitting degree of the image processing model to reach a preset threshold value; wherein the loss function is as follows:
(Formula (2) is reproduced only as an image, reference BDA0002880653350000101, in the original publication.)
wherein X_predict is the network output image, X_Target is the reference image, i.e. the standard anti-shake image, and N is the number of pixels.
Specifically, in an example, as shown in the image processing model training scenario of fig. 9, two mobile phones may be mounted in the same place on a support to acquire training image material. One mobile phone runs a vibration motor, whose vibration frequency may be varied appropriately to simulate real shaking, while the other mobile phone remains still so as to provide the training reference images; after training, the network can be used. The loss function used for training is as described in formula (2) above.
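A minimal training-loop sketch under stated assumptions: batches are assumed to provide the two component matrices and the corresponding reference (standard anti-shake) image from the two-phone capture setup, and formula (2) is stood in for by a mean per-pixel squared error between X_predict and X_Target, since the original formula is published only as an image. Names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

def train_image_processing_model(model, loader, epochs=10, lr=1e-4):
    """Minimal training loop sketch. `loader` is assumed to yield
    (first_component, second_component, reference_image) batches built from
    the shaking-phone / still-phone capture setup described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Stand-in for formula (2): a mean per-pixel error between the network
    # output X_predict and the reference image X_Target (assumed L2 form).
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for first_component, second_component, reference in loader:
            prediction = model(first_component, second_component)
            loss = criterion(prediction, reference)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```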
In the above embodiment, the reference data and the training data set corresponding to the image processing model are constructed; and training the image processing model based on the loss function, the training data set and the reference data so as to realize the rapid training of the image processing model and effectively reduce the fitting degree of the image processing model.
According to the embodiment of the application, an initial video frame and a current video frame are obtained; first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame are obtained; and the initial video frame, the current video frame, the first multimode data and the second multimode data are input into a pre-trained image processing model to obtain a target video frame corresponding to the current video frame. Image processing of the video frame is realized through the multimode data, various factors in the image shooting process are integrated, and effective enhancement of the video frame is achieved. This solves the prior-art problem that the image enhancement effect is poor because electronic anti-shake during camera shooting requires excessive manual intervention and adjustment.
It should be noted that, in the video frame processing method provided in the embodiment of the present application, the execution subject may be a video frame processing apparatus, or a control module in the video frame processing apparatus for executing the video frame processing method. In the embodiments of the present application, the video frame processing method is described by taking, as an example, a video frame processing apparatus executing the video frame processing method.
According to another aspect of the present application, there is also provided a video frame processing apparatus, as shown in fig. 10, including:
1) a first image capturing unit 100, configured to obtain an initial video frame and a current video frame;
2) a first obtaining unit 102, configured to obtain first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame;
3) a first processing unit 104, configured to input the initial video frame, the current video frame, the first multimode data, and the second multimode data into an image processing model obtained through pre-training, so as to obtain a target video frame corresponding to the current video frame.
Optionally, in this embodiment, the first obtaining unit 102 includes:
1) the acquisition module is used for acquiring image data, line exposure time, image offset data and acquisition time which respectively correspond to the initial video frame and the current video frame;
2) and the determining module is used for determining a first component matrix corresponding to the initial video frame and a second component matrix corresponding to the current video frame according to the image data, the row exposure time, the image offset data and the acquisition time.
Optionally, in this embodiment, the determining module includes:
1) the conversion submodule is used for performing BGR conversion on the image data before determining a first component matrix corresponding to the initial video frame and a second component matrix corresponding to the current video frame according to the image data, the line exposure time, the image offset data and the acquisition time so as to obtain an image matrix corresponding to the image data;
2) the first determining submodule is used for determining a first matrix according to the image matrix and the row exposure time;
3) the second determining submodule is used for determining a second matrix according to the image offset data and the acquisition time;
4) and the first processing submodule is used for merging the first matrix and the second matrix to obtain the first component matrix and the second component matrix.
Optionally, in this embodiment, the image processing model includes an encoding module and a decoding module, wherein the first processing unit 104 includes:
1) the first processing module is used for inputting the first component matrix and the second component matrix into an encoding module of the image processing model so as to obtain a first target component matrix;
2) and the second processing module is used for inputting the first target component matrix into a decoding module of the image processing model so as to obtain the target video frame.
Optionally, in this embodiment, the encoding module includes at least one encoding convolutional layer and a first preset convolutional layer, and the at least one encoding convolutional layer is connected in series with the first preset convolutional layer, where the first processing module includes:
1) a second processing submodule, configured to input the first component matrix and the second component matrix to the at least one code convolutional layer, so as to obtain a first output corresponding to the first component matrix and a second output corresponding to the second component matrix;
2) a third processing submodule, configured to input the first output and the second output to the first preset convolution layer, so as to obtain the first target component matrix.
Optionally, in this embodiment, the encoded convolutional layers include a first bias layer corresponding to the first component matrix, a second bias layer corresponding to the second component matrix, and a shared convolutional layer, where the second processing sub-module is further configured to:
s1, inputting the first component matrix to the first bias layer to obtain a third component matrix;
s2, inputting the second component matrix to the second bias layer to obtain a fourth component matrix;
s3, inputting the third component matrix and the fourth component matrix to the shared convolution layer respectively to obtain the first output and the second output.
Optionally, in this embodiment, the third processing sub-module is further configured to:
s1, inputting the first output and the second output to the first predetermined convolutional layer respectively;
s2, performing convolution operations on the first output and the second output respectively through the first pre-set convolution layer to obtain the first target component matrix.
Optionally, in this embodiment, the decoding module includes at least one decoding convolutional layer and a second preset convolutional layer connected in series, where the second processing module includes:
1) a fourth processing submodule, configured to input the first target component matrix into the at least one decoding convolutional layer to obtain a second target component matrix;
2) and the fifth processing submodule is used for inputting the second target component matrix into the second preset convolution layer so as to obtain the target video frame.
Optionally, in this embodiment, the number of decoding convolutional layers in the decoding module is the same as the number of encoding convolutional layers in the encoding module, and the decoding convolutional layers correspond to the levels of the encoding convolutional layers one to one, where the fourth processing sub-module is further configured to:
s1, obtaining the third output and the fourth output of the coding convolution layer corresponding to the current decoding convolution layer; and,
s2, obtaining a fifth output of the last decoded convolutional layer adjacent to the current decoded convolutional layer;
s3, determining a first input corresponding to the current decoded convolutional layer according to the third output, the fourth output and the fifth output;
s4, inputting the first input to the current decoding convolutional layer.
Optionally, in this embodiment, the method further includes:
1) the second image acquisition unit is used for acquiring training images and reference images in preset quantity before the initial video frame, the current video frame, the first multimode data and the second multimode data are input into an image processing model obtained by training in advance;
2) the second acquisition unit is used for acquiring the multimode data corresponding to the reference image and the training images in the preset number respectively;
3) the second processing unit is used for constructing reference data and a training data set corresponding to the image processing model;
4) the training unit is used for training the image processing model based on a loss function, the training data set and the reference data so as to enable the fitting degree of the image processing model to reach a preset threshold value;
wherein the loss function is as follows:
(Formula (2) is reproduced only as an image, reference BDA0002880653350000141, in the original publication.)
wherein X_predict is the output image of the image processing model, X_Target is the reference image, and N is the number of pixels.
The video frame processing apparatus in the embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The video frame processing apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android (Android) operating system, an ios operating system, or other possible operating systems, and embodiments of the present application are not limited specifically.
The video frame processing apparatus provided in the embodiment of the present application can implement each process implemented by the video frame processing apparatus in the method embodiments of fig. 1 to fig. 10, and for avoiding repetition, details are not repeated here.
With the video frame processing apparatus provided by the embodiment of the application, an initial video frame and a current video frame are obtained; first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame are obtained; and the initial video frame, the current video frame, the first multimode data and the second multimode data are input into a pre-trained image processing model to obtain a target video frame corresponding to the current video frame. Image processing of the video frame is realized through the multimode data, various factors in the image shooting process are integrated, and effective enhancement of the video frame is achieved. This solves the prior-art problem that the image enhancement effect is poor because electronic anti-shake during camera shooting requires excessive manual intervention and adjustment.
Optionally, an electronic device is further provided in this embodiment of the present application, and includes a processor 1110, a memory 1109, and a program or an instruction stored in the memory 1109 and executable on the processor 1110, where the program or the instruction is executed by the processor 1110 to implement each process of the video frame processing method embodiment, and can achieve the same technical effect, and details are not repeated here to avoid repetition.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.
Fig. 11 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1100 includes, but is not limited to: a radio frequency unit 1101, a network module 1102, an audio output unit 1103, an input unit 1104, a sensor 1105, a display unit 1106, a user input unit 1107, an interface unit 1108, a memory 1109, a processor 1110, and the like.
Those skilled in the art will appreciate that the electronic device 1100 may further include a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 1110 via a power management system, so as to manage charging, discharging, and power consumption management functions via the power management system. The electronic device structure shown in fig. 11 does not constitute a limitation to the electronic device, and the electronic device may include more or less components than those shown in the drawings, or combine some components, or arrange different components, and thus, the description is omitted here.
The input unit 1104, which is a camera in the embodiment of the present application, is configured to acquire an initial video frame and a current video frame;
a sensor 1105, configured to obtain first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame;
a processor 1110, configured to input the initial video frame, the current video frame, the first multimode data, and the second multimode data into an image processing model obtained through pre-training, so as to obtain a target video frame corresponding to the current video frame.
It should be understood that in the embodiment of the present application, the input Unit 1104 may include a Graphics Processing Unit (GPU) 11041 and a microphone 11042, and the Graphics processor 11041 processes image data of still pictures or video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1106 may include a display panel 11061, and the display panel 11061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1107 includes a touch panel 11071 and other input devices 11072. A touch panel 11071, also called a touch screen. The touch panel 11071 may include two portions of a touch detection device and a touch controller. Other input devices 11072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 1109 may be used for storing software programs and various data including, but not limited to, application programs and an operating system. Processor 1110 may integrate an application processor that handles primarily operating systems, user interfaces, applications, etc. and a modem processor that handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 1110.
With the electronic equipment in the embodiment of the application, an initial video frame and a current video frame are obtained; first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame are obtained; and the initial video frame, the current video frame, the first multimode data and the second multimode data are input into a pre-trained image processing model to obtain a target video frame corresponding to the current video frame. Image processing of the video frame is realized through the multimode data, various factors in the image shooting process are integrated, and effective enhancement of the video frame is achieved. This solves the prior-art problem that the image enhancement effect is poor because electronic anti-shake during camera shooting requires excessive manual intervention and adjustment.
The embodiments of the present application further provide a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above video frame processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the method embodiment of the video frame processing method, and the same technical effect can be achieved, and in order to avoid repetition, details are not repeated here.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A method for processing video frames, the method comprising:
acquiring an initial video frame and a current video frame;
acquiring first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame;
inputting the initial video frame, the current video frame, the first multimode data and the second multimode data into an image processing model obtained by pre-training to obtain a target video frame corresponding to the current video frame;
acquiring first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame, including:
acquiring image data, line exposure time, image offset data and acquisition time which respectively correspond to the initial video frame and the current video frame;
and determining a first component matrix corresponding to the initial video frame and a second component matrix corresponding to the current video frame according to the image data, the line exposure time, the image offset data and the acquisition time.
2. The method of claim 1, wherein determining a first component matrix corresponding to the initial video frame and a second component matrix corresponding to the current video frame based on the image data, the line exposure time, the image offset data, and the acquisition time comprises:
performing BGR conversion on the image data to obtain an image matrix corresponding to the image data;
determining a first matrix according to the image matrix and the row exposure time;
determining a second matrix according to the image offset data and the acquisition time;
and combining the first matrix and the second matrix to obtain the first component matrix and the second component matrix.
3. The method of claim 1, wherein the image processing model comprises an encoding module and a decoding module, wherein,
inputting the initial video frame, the current video frame, the first multimode data and the second multimode data into an image processing model obtained by pre-training to obtain a target video frame corresponding to the current video frame, wherein the method comprises the following steps:
inputting the first component matrix and the second component matrix into an encoding module of the image processing model to obtain a first target component matrix;
and inputting the first target component matrix into a decoding module of the image processing model to obtain the target video frame.
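As a structural sketch of the encoder-decoder split in claim 3 (the internal layers are detailed in claims 4-8), assuming the encoding module takes both component matrices and the decoding module consumes only the first target component matrix:

```python
import torch.nn as nn

class ImageProcessingModel(nn.Module):
    """Hedged skeleton of claim 3: the encoder fuses the two component
    matrices into a first target component matrix, and the decoder turns
    it into the target video frame. `encoder` and `decoder` stand in for
    the structures of claims 4-8."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, first_component, second_component):
        first_target = self.encoder(first_component, second_component)
        return self.decoder(first_target)
```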
4. The method of claim 3, wherein the encoding module comprises at least one coding convolutional layer and a first preset convolutional layer, the at least one coding convolutional layer being connected in series with the first preset convolutional layer, wherein,
inputting the first component matrix and the second component matrix into an encoding module of the image processing model to obtain the first target component matrix, including:
inputting the first component matrix and the second component matrix to the at least one coding convolutional layer to obtain a first output corresponding to the first component matrix and a second output corresponding to the second component matrix;
inputting the first output and the second output to the first preset convolutional layer to obtain the first target component matrix.
5. The method of claim 4, wherein the coding convolutional layer comprises a first bias layer corresponding to the first component matrix, a second bias layer corresponding to the second component matrix, and a shared convolutional layer, wherein,
inputting the first component matrix and the second component matrix to the at least one coding convolutional layer, comprising:
inputting the first component matrix to the first bias layer to obtain a third component matrix;
inputting the second component matrix to the second bias layer to obtain a fourth component matrix;
inputting the third component matrix and the fourth component matrix to the shared convolutional layer, respectively, to obtain the first output and the second output.
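A possible reading of claims 4-5 is sketched below: each coding convolutional layer holds one learnable additive bias per branch and a single convolution whose weights are shared by both branches. The exact form of the bias layer is not fixed by the claims, so a channel-wise offset is assumed here.

```python
import torch
import torch.nn as nn

class EncodingConvLayer(nn.Module):
    """Sketch of one coding convolutional layer (claims 4-5): a bias layer
    per input branch plus one convolution shared by both branches."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.first_bias = nn.Parameter(torch.zeros(1, in_ch, 1, 1))   # for the first component matrix
        self.second_bias = nn.Parameter(torch.zeros(1, in_ch, 1, 1))  # for the second component matrix
        self.shared_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, first, second):
        third = first + self.first_bias     # third component matrix
        fourth = second + self.second_bias  # fourth component matrix
        # Both biased matrices pass through the same shared convolution.
        return self.shared_conv(third), self.shared_conv(fourth)  # first and second outputs
```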
6. The method of claim 4, wherein inputting the first output and the second output to the first preset convolutional layer to obtain the first target component matrix comprises:
inputting the first output and the second output to the first preset convolutional layer, respectively;
and performing a convolution operation on the first output and the second output through the first preset convolutional layer, respectively, to obtain the first target component matrix.
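Claim 6 applies the first preset convolutional layer to the first and second outputs separately yet yields a single first target component matrix; how the two convolved results are merged is not stated, so the sketch below assumes an element-wise sum.

```python
import torch.nn as nn

class FusionConvLayer(nn.Module):
    """Sketch of the first preset convolutional layer (claim 6): the same
    kernel is applied to each encoder output, and the results are merged
    (sum assumed) into the first target component matrix."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, first_output, second_output):
        return self.conv(first_output) + self.conv(second_output)
```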
7. The method of claim 3, wherein the decoding module comprises at least one decoding convolutional layer and a second preset convolutional layer connected in series, wherein,
inputting the first target component matrix into a decoding module of the image processing model to obtain the target video frame, including:
inputting the first target component matrix into the at least one decoding convolutional layer to obtain a second target component matrix;
and inputting the second target component matrix into the second preset convolutional layer to obtain the target video frame.
8. The method of claim 7, wherein the number of decoding convolutional layers in the decoding module is the same as the number of coding convolutional layers in the encoding module, and the decoding convolutional layers correspond one-to-one to the levels of the coding convolutional layers, wherein,
inputting the first target component matrix into the at least one decoding convolutional layer to obtain a second target component matrix, comprising:
acquiring a third output and a fourth output of the coding convolutional layer corresponding to the current decoding convolutional layer; and
obtaining a fifth output of a previous decoding convolutional layer adjacent to the current decoding convolutional layer;
determining a first input corresponding to the current decoding convolutional layer according to the third output, the fourth output and the fifth output;
inputting the first input to the current decoding convolutional layer.
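Claim 8 only states that the input of a decoding convolutional layer is determined from the level-matched encoder outputs (third and fourth) and the previous decoder output (fifth); the sketch below assumes channel-wise concatenation followed by a convolution.

```python
import torch
import torch.nn as nn

class DecodingConvLayer(nn.Module):
    """Sketch of one decoding convolutional layer (claim 8): its input is
    formed from the two skip-connected encoder outputs and the previous
    decoder output (concatenation assumed)."""
    def __init__(self, enc_ch: int, dec_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(2 * enc_ch + dec_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, third_output, fourth_output, fifth_output):
        # First input: third, fourth and fifth outputs stacked along channels.
        first_input = torch.cat([third_output, fourth_output, fifth_output], dim=1)
        return self.conv(first_input)
```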
9. The method of claim 1, further comprising, prior to inputting the initial video frame, the current video frame, the first multimode data, and the second multimode data into the pre-trained image processing model:
acquiring a reference image and a preset number of training images;
acquiring multimode data corresponding to the reference image and the preset number of training images respectively;
constructing reference data and a training data set corresponding to the image processing model;
training the image processing model based on a loss function, the training data set and the reference data so that the fitting degree of the image processing model reaches a preset threshold value;
wherein the loss function is as follows:
(The loss function formula is rendered only as an image, FDA0003715438870000051, in the published text.)
wherein X_predict is the output image of the image processing model, X_Target is the reference image, and N is the number of pixels.
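Since the published text shows the claim 9 loss only as an image, the sketch below assumes a per-pixel mean squared error between the model output X_predict and the reference image X_Target over the N pixels; if the original formula uses an L1 or other norm, the expression would differ accordingly.

```python
import torch

def training_loss(x_predict: torch.Tensor, x_target: torch.Tensor) -> torch.Tensor:
    """Hedged stand-in for the claim 9 loss: mean squared error over all
    N pixels of the output image and the reference image (assumption)."""
    n = x_target.numel()
    return ((x_predict - x_target) ** 2).sum() / n
```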
10. A video frame processing apparatus, characterized in that the apparatus comprises:
the first image acquisition unit is used for acquiring an initial video frame and a current video frame;
a first obtaining unit, configured to obtain first multimode data corresponding to the initial video frame and second multimode data corresponding to the current video frame;
the first processing unit is used for inputting the initial video frame, the current video frame, the first multimode data and the second multimode data into an image processing model obtained by pre-training so as to obtain a target video frame corresponding to the current video frame;
the first acquisition unit includes:
the acquisition module is used for acquiring image data, line exposure time, image offset data and acquisition time which respectively correspond to the initial video frame and the current video frame;
and the determining module is used for determining a first component matrix corresponding to the initial video frame and a second component matrix corresponding to the current video frame according to the image data, the row exposure time, the image offset data and the acquisition time.
11. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the video frame processing method according to any one of claims 1 to 9.
12. A readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the video frame processing method according to any one of claims 1 to 9.
CN202011643661.8A 2020-12-31 2020-12-31 Video frame processing method and device, electronic equipment and readable storage medium Active CN112788236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011643661.8A CN112788236B (en) 2020-12-31 2020-12-31 Video frame processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011643661.8A CN112788236B (en) 2020-12-31 2020-12-31 Video frame processing method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112788236A CN112788236A (en) 2021-05-11
CN112788236B true CN112788236B (en) 2022-08-09

Family

ID=75753581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011643661.8A Active CN112788236B (en) 2020-12-31 2020-12-31 Video frame processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112788236B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115410133A (en) * 2022-09-29 2022-11-29 维沃移动通信有限公司 Video dense prediction method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109068174B (en) * 2018-09-12 2019-12-27 上海交通大学 Video frame rate up-conversion method and system based on cyclic convolution neural network
CN111371983A (en) * 2018-12-26 2020-07-03 清华大学 Video online stabilization method and system
CN110166695B (en) * 2019-06-26 2021-10-01 Oppo广东移动通信有限公司 Camera anti-shake method and device, electronic equipment and computer readable storage medium
CN110610465B (en) * 2019-08-26 2022-05-17 Oppo广东移动通信有限公司 Image correction method and device, electronic equipment and computer readable storage medium
CN110493522A (en) * 2019-08-26 2019-11-22 Oppo广东移动通信有限公司 Anti-fluttering method and device, electronic equipment, computer readable storage medium
CN111654723B (en) * 2020-05-14 2022-04-12 北京百度网讯科技有限公司 Video quality improving method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112788236A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
KR102445193B1 (en) Image processing method and apparatus, electronic device, and storage medium
US20210110522A1 (en) Image processing method and apparatus, and storage medium
CN107948529B (en) Image processing method and device
WO2020151281A9 (en) Image processing method and device, electronic equipment and storage medium
CN112785507A (en) Image processing method and device, storage medium and terminal
CN112037160B (en) Image processing method, device and equipment
CN114390201A (en) Focusing method and device thereof
CN112788236B (en) Video frame processing method and device, electronic equipment and readable storage medium
CN113891018A (en) Shooting method and device and electronic equipment
CN112261262B (en) Image calibration method and device, electronic equipment and readable storage medium
CN114390197A (en) Shooting method and device, electronic equipment and readable storage medium
CN112672055A (en) Photographing method, device and equipment
CN112565603A (en) Image processing method and device and electronic equipment
CN116805282A (en) Image super-resolution reconstruction method, model training method, device and electronic equipment
CN111724306A (en) Image reduction method and system based on convolutional neural network
CN116261043A (en) Focusing distance determining method, device, electronic equipment and readable storage medium
JP6155349B2 (en) Method, apparatus and computer program product for reducing chromatic aberration in deconvolved images
CN114339051A (en) Shooting method, shooting device, electronic equipment and readable storage medium
CN114821730A (en) Face recognition method, device, equipment and computer readable storage medium
CN114339029A (en) Shooting method and device and electronic equipment
CN111654623B (en) Photographing method and device and electronic equipment
CN112734658A (en) Image enhancement method and device and electronic equipment
CN117750215A (en) Shooting parameter updating method and electronic equipment
CN113709372B (en) Image generation method and electronic device
CN116342992A (en) Image processing method and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant