CN114339395A - Video jitter detection method, detection device, electronic equipment and readable storage medium - Google Patents

Video jitter detection method, detection device, electronic equipment and readable storage medium

Info

Publication number
CN114339395A
Authority
CN
China
Prior art keywords
video
foreground
adjacent
background
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111529901.6A
Other languages
Chinese (zh)
Inventor
张鎏锟
毛礼建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202111529901.6A
Publication of CN114339395A
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a video jitter detection method, a detection apparatus, an electronic device and a readable storage medium. The method comprises the following steps: extracting at least two adjacent first video frames from video data, and determining a foreground region in each first video frame; obtaining a background region outside the foreground region in each first video frame; determining an image offset value between every two adjacent first video frames based on the background regions in the two frames; and determining a video jitter detection result of the video data based on the plurality of image offset values. With this scheme, the accuracy of jitter detection on video data can be improved.

Description

Video jitter detection method, detection device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of video image processing technologies, and in particular, to a video jitter detection method, a detection apparatus, an electronic device, and a readable storage medium.
Background
With the growing role of monitoring equipment in city safety, civil management and control, industrial construction and other fields, anomaly detection in video data has drawn increasing attention. Anomalies such as video jitter are bursty and short-lived, which makes them very difficult to discover, and manual review of massive video data can hardly give accurate feedback on sudden video jitter. In view of this, how to improve the accuracy of jitter detection on video data has become an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the application is to provide a video jitter detection method, a detection device, an electronic device and a readable storage medium, which can improve the accuracy of jitter detection on video data.
To solve the above technical problem, a first aspect of the present application provides a video jitter detection method, including: extracting at least two adjacent first video frames from video data, and determining a foreground region in each first video frame; obtaining a background region outside the foreground region in each of the first video frames; determining an image offset value between each two adjacent first video frames based on the background region in each two adjacent first video frames; determining a video shake detection result of the video data based on a plurality of the image offset values.
To solve the above technical problem, a second aspect of the present application provides a video jitter detection device, including: a foreground region determining module, a background region extracting module, an offset value determining module and a result determining module. The foreground region determining module is used for extracting at least two adjacent first video frames from video data and determining a foreground region in each first video frame; the background region extracting module is used for obtaining a background region outside the foreground region in each first video frame; the offset value determining module is used for determining an image offset value between every two adjacent first video frames based on the background region in every two adjacent first video frames; and the result determining module is configured to determine a video jitter detection result for the video data based on a plurality of the image offset values.
To solve the above technical problem, a third aspect of the present application provides an electronic device, including: a memory and a processor coupled to each other, wherein the memory stores program data, and the processor calls the program data to execute the method of the first aspect.
To solve the above technical problem, a fourth aspect of the present application provides a computer storage medium having stored thereon program data, which when executed by a processor implements the method of the first aspect.
According to the above scheme, at least two adjacent first video frames are extracted from the video data, a foreground region in each first video frame is determined, and a background region outside the foreground region is obtained, which reduces the influence of moving objects in the foreground region on video jitter detection. An image offset value between every two adjacent first video frames is then determined based on the background regions, and a video jitter detection result for the corresponding time period is determined based on the plurality of image offset values, so that the variation pattern of the image offset values can be mined from more samples. This improves the accuracy of jitter detection on the video data, reduces the influence of a viewing-angle movement of the monitoring equipment on the detection result, and improves the robustness of the detection result.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort. Wherein:
FIG. 1 is a flowchart illustrating an embodiment of a video jitter detection method according to the present application;
FIG. 2 is a flow chart illustrating a video jitter detection method according to another embodiment of the present application;
FIG. 3 is a schematic structural diagram of an embodiment of a video jitter detection apparatus according to the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a computer storage medium according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a video jitter detection method according to the present application, the method including:
S101: at least two adjacent first video frames are extracted from the video data, and a foreground region in each first video frame is determined.
Specifically, video data acquired by monitoring equipment is obtained, at least two adjacent video frames are extracted from the video data to serve as first video frames, the first video frames are analyzed, and a foreground area in each first video frame is determined.
In an application mode, at least part of first video frames are extracted from video data, the extracted first video frames are continuous video frames, foreground object detection is performed on each first video frame, a foreground object in each first video frame is determined, and a region corresponding to the foreground object is used as a foreground region, so that the foreground region in the first video frames is determined.
In another application mode, a starting video frame is determined from video data, continuous first video frames are extracted by taking the starting video frame as a starting point, foreground object detection is performed on the extracted first video frames, a foreground object in each first video frame is determined, and a region corresponding to the foreground object is taken as a foreground region, so that the foreground region in the first video frame is determined.
In a specific application scene, foreground object detection is performed on a first video frame by using a foreground object detection model, wherein the foreground object detection model can identify a foreground object of a preset category through pre-training, the first video frame is input into the foreground object detection model, so that the foreground object detection model outputs the foreground object in the first video frame, and a region corresponding to the foreground object in the first video frame is used as a foreground region.
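As an illustrative sketch of this scenario (and not the patent's own model), a generic pretrained semantic segmentation network such as torchvision's DeepLabV3 can stand in for the foreground object detection model, with its person, vehicle and animal classes treated as the preset foreground categories (assumes a recent torchvision; the class-index set is an assumption based on the Pascal VOC label list):

# Sketch only: DeepLabV3 (Pascal VOC label set) stands in for the foreground
# object detection model described above; class indices are assumptions.
import torch
import torchvision
from torchvision import transforms

model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
FOREGROUND_CLASSES = {3, 7, 8, 10, 12, 13, 15, 17}  # bird, car, cat, cow, dog, horse, person, sheep

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def foreground_mask(frame_rgb):
    """Return a boolean H x W mask that is True on foreground pixels."""
    with torch.no_grad():
        logits = model(preprocess(frame_rgb).unsqueeze(0))["out"][0]  # (21, H, W)
    labels = logits.argmax(0)                                         # per-pixel class index
    mask = torch.zeros_like(labels, dtype=torch.bool)
    for c in FOREGROUND_CLASSES:
        mask |= labels == c
    return mask.numpy()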
S102: a background region outside the foreground region in each first video frame is obtained.
Specifically, a background region other than the foreground region is obtained from the first video frame, thereby determining a background object in the first video frame, which has a very small probability of motion.
In an application mode, pixels corresponding to the foreground region in each first video frame are eliminated, and therefore the background region in the first video frame is obtained.
S103: an image offset value between each two adjacent first video frames is determined based on the background area in each two adjacent first video frames.
Specifically, the background regions in every two adjacent first video frames are compared to determine the image offset value resulting from the change of the background region between the two frames, which reduces the influence of moving objects in the foreground region on video jitter detection.
In an application mode, each first video frame is converted into a second video frame in which only the background region is retained. A transformation matrix between every two adjacent second video frames is generated to characterize the changes between them in the horizontal and vertical directions, and the image offset value of the later video frame relative to the earlier video frame is determined based on the transformation matrix and the corresponding second video frames.
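As a minimal stand-in for this step in the first embodiment (the embodiment of fig. 2 below uses a trained registration model instead), the horizontal and vertical shift between two background-only frames can be estimated with classical phase correlation; the function name here is an illustrative assumption:

# Sketch only: phase correlation as a lightweight substitute for the trained
# registration model; it returns the (dx, dy) shift of the later frame
# relative to the earlier one.
import cv2
import numpy as np

def background_offset(bg_earlier_gray, bg_later_gray):
    f1 = np.float32(bg_earlier_gray)
    f2 = np.float32(bg_later_gray)
    (dx, dy), _response = cv2.phaseCorrelate(f1, f2)
    return dx, dy  # image offset value in pixels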
S104: a video jitter detection result of the video data is determined based on the plurality of image offset values.
Specifically, whether the image offset values vary periodically across the plurality of adjacent video frames is determined, the frequency of occurrence of the periodic variation is counted, and the jitter detection result corresponding to the video data in the corresponding time period is determined based on that frequency.
In an application mode, the image offset values of every two adjacent video frames among consecutive video frames within a period of time are obtained as they change over time. For video frames with no jitter or shift, the offsets in the horizontal and vertical directions are substantially 0. If an image offset value is greater than the offset threshold, the change segment from the change starting point to the change ending point is intercepted, whether a periodic variation exists in the change segment is judged, and the frequency of occurrence within the change segment is counted. When that frequency is greater than the frequency threshold, the video data in the corresponding change segment is judged to be jittering; otherwise the video data is normal. This reduces the probability that a viewing-angle movement of the monitoring equipment is also judged to be video jitter, and improves the robustness and accuracy of the detection result.
According to the above scheme, at least two adjacent first video frames are extracted from the video data, a foreground region in each first video frame is determined, and a background region outside the foreground region is obtained, which reduces the influence of moving objects in the foreground region on video jitter detection. An image offset value between every two adjacent first video frames is then determined based on the background regions, and a video jitter detection result for the corresponding time period is determined based on the plurality of image offset values, so that the variation pattern of the image offset values can be mined from more samples. This improves the accuracy of jitter detection on the video data, reduces the influence of a viewing-angle movement of the monitoring equipment on the detection result, and improves the robustness of the detection result.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating another embodiment of a video jitter detection method according to the present application, the method including:
S201: at least two adjacent first video frames are extracted from the video data, and a foreground region in each first video frame is determined.
Specifically, video data shot by monitoring equipment is obtained, at least two adjacent first video frames are extracted from the video data, and a foreground area in each first video frame is determined. The foreground region is a region corresponding to a foreground object, and the foreground object includes but is not limited to an object that can actually move, such as a person, an animal, and a vehicle.
In one application, the step of determining the foreground region in each first video frame includes: each first video frame is input to a foreground segmentation model such that the foreground segmentation model determines a foreground region and a background region in the first video frame based on pixels on the first video frame.
Specifically, the foreground region includes pixels corresponding to a foreground object in the first video frame, and the background region includes pixels corresponding to a background object other than the foreground object in the first video frame. In addition, the foreground segmentation model is obtained after being trained in advance based on a plurality of second training images, and the plurality of second training images comprise a plurality of foreground objects.
Furthermore, each first video frame is input into the foreground segmentation model, which detects the frame, determines the foreground objects in it and takes the pixels corresponding to the foreground objects as the foreground region. The trained foreground segmentation model thus judges the foreground regions in the first video frames in batch, so that the foreground regions are determined and the influence of the displacement of foreground objects on the jitter detection accuracy is reduced.
In an application scenario, the training process of the foreground segmentation model includes the following steps: second training images covering multiple categories are obtained and labeled with the labels corresponding to their foreground objects, where the categories of foreground objects include pedestrians, vehicles and animals; the second training images are input to the foreground segmentation model to be trained so that it outputs segmentation results for the foreground objects; the prediction loss between the segmentation results and the labels on the second training images is determined using a cross-entropy loss function; and the parameters of the foreground segmentation model are adjusted based on the prediction loss until the prediction loss converges, yielding the trained foreground segmentation model. The loss function is formulated as follows:
L = -\sum_{c=1}^{M} y_c \log(p_c)

where M represents the number of foreground categories, y_c is the label indicating the class corresponding to the pixel, and p_c is the predicted probability that the sample belongs to class c.
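In code this is the standard multi-class cross entropy over per-pixel class predictions; a minimal PyTorch sketch with assumed tensor shapes (random tensors stand in for the real model outputs and label maps):

# Sketch only: pixel-wise cross entropy used to train the foreground segmentation
# model; logits shape (N, M, H, W), labels shape (N, H, W) with class indices.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.randn(2, 4, 64, 64, requires_grad=True)  # e.g. M = 4 classes (assumed)
labels = torch.randint(0, 4, (2, 64, 64))               # ground-truth class per pixel
loss = criterion(logits, labels)                        # L = -sum_c y_c * log(p_c), averaged over pixels
loss.backward()                                         # drives the parameter update described above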
In a specific application scenario, the trained foreground segmentation model is used to determine pixel classes corresponding to pixels on an image, where the pixel classes include a foreground and a background. Wherein the step of determining the foreground region and the background region in the first video frame based on the pixels on the first video frame comprises: the pixel type corresponding to the pixels of the foreground object is judged as the foreground, the pixel type corresponding to the pixels of the background object is judged as the background, and therefore the foreground area and the background area in the first video frame are determined.
Specifically, the pixel types are divided into foreground and background. At least two first video frames are input into the foreground segmentation model so that it judges the pixel type of each pixel in the first video frames: the pixel type corresponding to the pixels of a foreground object is judged to be foreground, which determines the foreground region in the first video frame, and the pixel type corresponding to the pixels of a background object is judged to be background, which determines the background region in the first video frame. Every pixel in the first video frame is thus assigned to either the foreground or the background, so that the foreground and the background in the first video frame are clearly distinguished.
S202: the pixels of the foreground region in the first video frame are set to a preset value or cleared to obtain the background region.
Specifically, the foreground pixels in the first video frame are set to a preset value or cleared so as to determine the background region in the first video frame; that is, in the resulting background region image, the pixels belonging to the foreground region are set to the preset value or zeroed.
In an application mode, zero value filling is performed on pixels corresponding to a foreground region in a first video frame, so that the pixels corresponding to the foreground region in the first video frame are eliminated, and only the pixels corresponding to a background region are reserved, so as to determine the background region in the first video frame. The above process is formulated as follows:
IR_{i,j} = \begin{cases} I_{i,j}, & c_{i,j} \notin C \\ 0, & c_{i,j} \in C \end{cases}

where IR denotes the result image, C is the set of foreground pixel classes, c_{i,j} is the pixel class at position (i, j) of the image, and I_{i,j} is the pixel of the original image at that position. When c_{i,j} does not belong to the foreground class set C, the pixel is background and the result pixel equals the original pixel; when c_{i,j} belongs to C, the pixel is foreground and is filled with a zero value.
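A short numpy sketch of this zero-value filling, assuming the per-pixel class map output by the foreground segmentation model is available:

# Sketch only: keep background pixels, zero out pixels whose class belongs to the
# foreground class set C, producing the result image IR of the formula above.
import numpy as np

def remove_foreground(frame, class_map, foreground_classes):
    """frame: (H, W, 3) uint8 image; class_map: (H, W) per-pixel class index."""
    is_foreground = np.isin(class_map, list(foreground_classes))  # c_{i,j} in C
    result = frame.copy()
    result[is_foreground] = 0                                     # zero-value filling
    return result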
Optionally, the pixels corresponding to the foreground region in the first video frame are set to a preset value, so that the foreground region is clearly distinguished from the background region and the background region in the first video frame is determined. The targets that would otherwise attract attention in the first video frame thus become background objects, and the influence of foreground object motion on the jitter detection result is reduced.
S203: and respectively inputting every two adjacent background areas into the registration model, so that the registration model outputs affine transformation matrixes respectively corresponding to every two adjacent background areas.
Specifically, the registration model is obtained after being trained in advance based on a plurality of first training images, the plurality of first training images include a plurality of groups of adjacent video frames, and the trained registration model can output affine transformation matrixes corresponding to two input video frames based on the two input video frames.
Further, the training process of the registration model includes: obtaining a plurality of first training images, where the first training images are at least some adjacent video frames extracted from video data; labeling the adjacent first training images with label matrices, the label matrices being affine transformation matrices; inputting the first training images to the registration model to be trained so that it outputs prediction matrices; and adjusting the parameters of the registration model based on the error between the prediction matrices and the label matrices until the error converges, yielding the trained registration model. The affine transformation matrix is expressed by the following formula:
M = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \end{bmatrix}

where a_{11}, a_{12}, a_{21}, a_{22} are the affine transformation parameters, used to feed back whether the image is rotated, and t_x, t_y are the translation transformation parameters, representing the translation proportions in the horizontal and vertical directions respectively and used to feed back whether the image is translated.
Specifically, the affine transformation preserves the straightness and parallelism of the two-dimensional image. Straightness means that a straight line is still a straight line after the affine transformation and an arc is still an arc; parallelism means that the relative positional relationship between straight lines remains unchanged, parallel lines are still parallel after the affine transformation, and the order of points on a straight line does not change. The registration network can mine the depth features between video frames and generate the affine transformation matrix between two adjacent video frames, so that the image offset values subsequently generated based on the affine transformation matrix are more accurate and the detection result obtained from the image offsets is more robust.
S204: an image offset value between each two adjacent video frames is determined based on the respective affine transformation matrices and the corresponding background regions.
Specifically, after obtaining the affine transformation matrix corresponding to each two adjacent background regions, the image offset value between each two adjacent video frames can be determined based on the affine transformation matrix and the corresponding background regions.
In an application scene, the earlier of two adjacent background regions in time order is multiplied by the corresponding affine transformation matrix to obtain the image offset value of the later background region relative to the earlier one; every two adjacent background regions are traversed in this way to determine the image offset value between every two adjacent video frames.
Specifically, since the jitter of the monitoring equipment approximates a rigid-body transformation, the offsets of the affine transformation matrix in the horizontal and vertical directions are the displacement proportions of the video frame jitter. Multiplying the affine transformation matrix by the width and height of the earlier background region yields the displacement of the later background region relative to the earlier one in the horizontal and vertical directions, thereby determining the image offset value between the two adjacent video frames.
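A short sketch of this conversion, assuming the registration model returns the affine matrix given above with t_x and t_y expressed as translation proportions of the frame size:

# Sketch only: turn the translation terms of the 2 x 3 affine matrix into a
# pixel-level image offset value of the later frame relative to the earlier one.
import numpy as np

def image_offset(affine_2x3, bg_width, bg_height):
    tx = affine_2x3[0, 2]                   # horizontal translation proportion
    ty = affine_2x3[1, 2]                   # vertical translation proportion
    return tx * bg_width, ty * bg_height    # displacement in pixels (dx, dy)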
In a specific application scenario, when two adjacent video frames exhibit no jitter, the affine transformation matrix is

M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}

When the two adjacent video frames jitter, the affine transformation matrix inevitably changes, so the affine transformation matrix feeds back the change of the depth features between the two adjacent video frames, and the image offset value between them can be accurately obtained based on the affine transformation matrix and the corresponding background regions.
S205: and acquiring a plurality of image offset values corresponding to at least two adjacent first video frames.
Specifically, image offset values corresponding to each two adjacent video frames in at least two adjacent first video frames extracted from the video data are obtained to obtain a plurality of image offset values, so that the influence of random values on the shake detection result is reduced.
S206: a variation trend map is generated based on the plurality of image offset values, wherein a distribution of signal points in the variation trend map is related to a timing of the first video frame.
Specifically, a change trend graph is generated according to the image offset value of each time-sequence later video frame relative to the time-sequence earlier video frame in the continuous video frames, wherein the change trend graph changes along with the change of time, and the distribution of signal points in the change trend graph is related to the time sequence of the first video frame.
In an application scene, a plurality of image offset values of consecutive adjacent video frames changing over time within a preset time period are obtained, and a signal graph of the image offset values over time is generated. For adjacent video frames with no jitter or shift, the offsets in the horizontal and vertical directions are substantially 0, and the offset shown on the signal graph is 0.
S207: and determining a video jitter detection result of the video data based on the plurality of image offset values, the variation trend graph and the offset threshold values corresponding to the image offset values.
Specifically, the positions where the image offset value exceeds the offset threshold are determined in the variation trend graph and the change frequency is counted. When the change frequency exceeds the frequency threshold, the video jitter detection result is judged to be that jitter occurs; when the change frequency does not exceed the frequency threshold, the video jitter detection result is judged to be that no jitter occurs, since image changes in the video data are then generally caused by a movement of the viewing angle.
In an application mode, the step of determining the video jitter detection result of the video data based on the plurality of image offset values, the variation trend graph and the offset thresholds corresponding to the image offset values includes: extracting at least one change segment from the variation trend graph in time order, where the starting point of each change segment is a signal point whose image offset value is greater than the offset threshold, and the ending point of each change segment is the first signal point after the corresponding starting point whose image offset value is less than the offset threshold; determining the change frequency corresponding to the image offset values based on each change segment; and determining the video jitter detection result within the preset time period based on the change frequency and the corresponding frequency threshold.
Specifically, in the variation trend graph, the starting points corresponding to image offset values greater than the offset threshold and the end points corresponding to image offset values less than or equal to the offset threshold are determined from front to back in time order, and the section between a starting point and the following end point is taken as a change segment, so that at least one change segment is obtained from the variation trend graph. The image offset value at the starting point of each change segment exceeds the offset threshold, and the end point of each change segment is the first signal point after the starting point whose image offset value falls below the offset threshold. Whether the image offset values within each change segment vary periodically is then determined based on the autocorrelation function, and the frequency of the periodic variation is counted. When the frequency exceeds the frequency threshold, the video jitter detection result within the change segment is judged to be that jitter occurs; when it does not exceed the frequency threshold, the result is judged to be that no jitter occurs.
In a specific application scenario, a plurality of image offset values of consecutive adjacent video frames changing over time within a preset time period are obtained, and a signal graph of the image offset values over time is generated, where the preset time period is the time period corresponding to the consecutive first video frames. If the signal offset is greater than the offset threshold, the change segment between the change starting point and the change ending point is intercepted, whether a periodic variation exists in the change segment is judged using the autocorrelation function, and the frequency of occurrence within the change segment is counted. If it is greater than the frequency threshold, it is judged that video jitter occurs in the video frames corresponding to the change segment; otherwise, it is judged that no video jitter occurs. The autocorrelation function is formulated as follows:
R(k) = \frac{E[(X_i - \mu_i)(X_{i+k} - \mu_{i+k})]}{\sigma^2}

where E is the expectation, X_i is the value of the random variable at time t(i), \mu_i is the expected value at time t(i), X_{i+k} is the value of the random variable at time t(i+k), \mu_{i+k} is the expected value at time t(i+k), and \sigma^2 is the variance. The resulting autocorrelation value R(k) lies in the range [-1, 1], where 1 is maximum positive correlation, -1 is maximum negative correlation, and 0 means uncorrelated. In signal processing, the autocorrelation function reflects the degree of correlation between the values of the same signal at different times; if the signal is periodic, the autocorrelation function attains its maximum when k equals the corresponding period value, so the autocorrelation function can be used to analyze the periodicity of the signal. The offset periods in the horizontal and vertical directions during jitter are counted through the autocorrelation function, avoiding the influence of non-jitter states, such as a unidirectional movement of the monitoring equipment, on the detection result.
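A sketch tying these last steps together, with assumed example thresholds rather than values taken from the patent: change segments are extracted where the offset signal exceeds the offset threshold, each segment is tested for periodicity with an autocorrelation of the form above (in a stationary simplification), and jitter is reported when the counted frequency exceeds the frequency threshold:

# Sketch only: offsets is the per-frame image offset signal (e.g. its horizontal
# component); all thresholds are illustrative assumptions.
import numpy as np

def autocorrelation(x, k):
    """Stationary simplification of R(k); returns a value in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    if var == 0 or k >= len(x):
        return 0.0
    return np.mean((x[:len(x) - k] - mu) * (x[k:] - mu)) / var

def detect_jitter(offsets, offset_thr=2.0, freq_thr=3, min_corr=0.5):
    offsets = np.asarray(offsets, dtype=float)
    above = np.abs(offsets) > offset_thr
    segments, start = [], None
    for i, flag in enumerate(above):          # change segment: starts where the offset
        if flag and start is None:            # first exceeds the threshold, ends at the
            start = i                         # first point that falls back below it
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(offsets)))
    for s, e in segments:
        seg = offsets[s:e]
        # count lags with strong autocorrelation as a rough proxy for the
        # frequency of the periodic variation within the segment
        peaks = [k for k in range(1, len(seg) // 2) if autocorrelation(seg, k) > min_corr]
        if len(peaks) > freq_thr:
            return True                       # periodic oscillation: judged as jitter
    return False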
In this embodiment, the background regions of adjacent video frames are extracted, depth-feature matching is then performed with the registration model to obtain an affine transformation matrix, the horizontal and vertical offsets between adjacent video frames are obtained from the affine transformation matrix as the image offset values, a variation trend graph is generated from the plurality of image offset values, and a more accurate video jitter detection result is determined in the variation trend graph based on the autocorrelation function.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of a video jitter detection apparatus 30 of the present application, where the video jitter detection apparatus includes a foreground region determining module 301, a background region extracting module 302, an offset value determining module 303, and a result determining module 304. The foreground region determining module 301 is configured to extract at least two adjacent first video frames from the video data, and determine a foreground region in each first video frame; the background region extraction module 302 is configured to obtain a background region in each first video frame, where the background region is outside a foreground region; the offset value determining module 303 is configured to determine an image offset value between each two adjacent first video frames based on the background area in each two adjacent first video frames; the result determination module 304 is configured to determine a video shake detection result of the video data based on the plurality of image offset values.
It should be noted that the video shake detection apparatus in this embodiment is applied to the video shake detection method in any one of the above embodiments, wherein the foreground region determination module 301, the background region extraction module 302, the offset value determination module 303, and the result determination module 304 are applicable to the video shake detection method in any one of the above embodiments.
In the above solution, the foreground region determining module 301 extracts at least two adjacent first video frames from the video data and determines a foreground region in each first video frame; the background region extracting module 302 obtains a background region outside the foreground region, reducing the influence of moving objects in the foreground region on video jitter detection; the offset value determining module 303 determines an image offset value between every two adjacent first video frames based on the background regions in the two frames; and the result determining module 304 determines a video jitter detection result within the corresponding time period based on the plurality of image offset values. The variation pattern of the image offset values can thus be mined from more samples, which improves the accuracy of jitter detection on the video data, reduces the influence on the detection result when the viewing angle of the monitoring equipment moves, and improves the robustness of the detection result.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an embodiment of an electronic device 40 of the present application, where the electronic device includes a memory 401 and a processor 402 coupled to each other, where the memory 401 stores program data (not shown), and the processor 402 calls the program data to implement the video jitter detection method in any of the embodiments described above, and the description of relevant contents refers to the detailed description of the embodiments of the method described above, which is not repeated herein.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of a computer storage medium 50 of the present application, the computer storage medium 50 stores program data 500, and the program data 500 is executed by a processor to implement the video jitter detection method in any of the above embodiments.
It should be noted that, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application or are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (11)

1. A video jitter detection method, the method comprising:
extracting at least two adjacent first video frames from video data, and determining a foreground region in each first video frame;
obtaining a background region outside the foreground region in each of the first video frames;
determining an image offset value between each two adjacent first video frames based on the background region in each two adjacent first video frames;
determining a video shake detection result of the video data based on a plurality of the image offset values.
2. The method according to claim 1, wherein said step of obtaining a background region of each of the first video frames other than the foreground region comprises:
and setting the pixels of the foreground region in the first video frame to a preset value or clearing the pixels to obtain the background region.
3. The method according to claim 1, wherein the step of determining an image offset value between every two adjacent first video frames based on the background area in every two adjacent first video frames comprises:
inputting every two adjacent background regions into a registration model respectively, so that the registration model outputs affine transformation matrixes corresponding to every two adjacent background regions respectively;
determining the image offset value between each two adjacent video frames based on each affine transformation matrix and the corresponding background region;
the registration model is obtained after being trained in advance based on a plurality of first training images, and the plurality of first training images comprise a plurality of groups of adjacent video frames.
4. The method according to claim 3, wherein said step of determining said image offset value between every two adjacent video frames based on each said affine transformation matrix and the corresponding said background region comprises:
multiplying the background region that comes earlier in time order, of the two adjacent background regions, by the corresponding affine transformation matrix to obtain an image offset value of the later background region relative to the earlier background region;
traversing every two adjacent background regions to determine the image offset value between every two adjacent video frames.
5. The video jitter detection method according to claim 1, wherein said step of determining a video shake detection result of said video data based on a plurality of said image offset values comprises:
acquiring a plurality of image offset values corresponding to the at least two adjacent first video frames;
generating a variation trend graph based on a plurality of the image offset values; wherein the distribution of signal points in the variation trend graph is related to the timing of the first video frames;
and determining a video jitter detection result of the video data based on a plurality of image offset values, the variation trend graph and offset threshold values corresponding to the image offset values.
6. The video jitter detection method according to claim 5, wherein the step of determining the video jitter detection result of the video data based on the plurality of image offset values, the variation trend graph and the offset threshold values corresponding to the image offset values comprises:
extracting at least one change segment from the variation trend graph in time order; wherein the starting point of each change segment is a signal point whose image offset value is greater than the offset threshold value, and the ending point of each change segment is the first signal point after the corresponding starting point whose image offset value is less than the offset threshold value;
determining a change frequency corresponding to the image offset values based on each change segment;
and determining the video jitter detection result within a preset time period based on the change frequency and the corresponding frequency threshold value.
7. The method of claim 1, wherein the step of determining the foreground region in each of the first video frames comprises:
inputting each of the first video frames to a foreground segmentation model such that the foreground segmentation model determines a foreground region and a background region in the first video frame based on pixels on the first video frame;
the foreground region comprises pixels corresponding to a foreground object in the first video frame, and the background region comprises pixels corresponding to a background object except the foreground object in the first video frame; the foreground segmentation model is obtained after being trained in advance based on a plurality of second training images, and the plurality of second training images comprise a plurality of foreground objects.
8. The method according to claim 7, wherein the trained foreground segmentation model is used to determine pixel classes corresponding to pixels on an image, wherein the pixel classes include a foreground and a background;
the step of determining a foreground region and a background region in the first video frame based on pixels on the first video frame comprises:
and determining the pixel type corresponding to the pixels of the foreground object as a foreground, and determining the pixel type corresponding to the pixels of the background object as a background, so as to determine the foreground area and the background area in the first video frame.
9. A video shake detection apparatus, characterized in that the detection apparatus comprises:
the foreground region determining module is used for extracting at least two adjacent first video frames from video data and determining a foreground region in each first video frame;
a background region extraction module, configured to obtain a background region in each of the first video frames, where the background region is outside the foreground region;
an offset value determining module, configured to determine an image offset value between each two adjacent first video frames based on the background area in each two adjacent first video frames;
a result determination module to determine a video shake detection result for the video data based on a plurality of the image offset values.
10. An electronic device, comprising: a memory and a processor coupled to each other, wherein the memory stores program data that the processor calls to perform the method of any of claims 1-8.
11. A computer-readable storage medium, on which program data are stored, which program data, when being executed by a processor, carry out the method of any one of claims 1-8.
CN202111529901.6A 2021-12-14 2021-12-14 Video jitter detection method, detection device, electronic equipment and readable storage medium Pending CN114339395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111529901.6A CN114339395A (en) 2021-12-14 2021-12-14 Video jitter detection method, detection device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111529901.6A CN114339395A (en) 2021-12-14 2021-12-14 Video jitter detection method, detection device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114339395A true CN114339395A (en) 2022-04-12

Family

ID=81051628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111529901.6A Pending CN114339395A (en) 2021-12-14 2021-12-14 Video jitter detection method, detection device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114339395A (en)



Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070076982A1 (en) * 2005-09-30 2007-04-05 Petrescu Doina I System and method for video stabilization
JP2009260671A (en) * 2008-04-17 2009-11-05 Canon Inc Image processing apparatus and imaging device
EP2180446A1 (en) * 2008-10-27 2010-04-28 Sony Corporation Image processing apparatus, image processing method, and program
US20100104217A1 (en) * 2008-10-27 2010-04-29 Sony Corporation Image processing apparatus, image processing method, and program
WO2013047954A1 (en) * 2011-09-30 2013-04-04 고려대학교 산학협력단 Image-capturing device and method for stabilizing images by using global motion obtained from features in background
US20150146022A1 (en) * 2013-11-25 2015-05-28 Canon Kabushiki Kaisha Rapid shake detection using a cascade of quad-tree motion detectors
CN104881640A (en) * 2015-05-15 2015-09-02 华为技术有限公司 Method and device for acquiring vectors
CN105374051A (en) * 2015-10-29 2016-03-02 宁波大学 Lens jitter prevention video movement target detection method for intelligent mobile terminal
CN105872370A (en) * 2016-03-31 2016-08-17 深圳中兴力维技术有限公司 Video jitter removing method and device
JP2019021297A (en) * 2017-07-14 2019-02-07 富士通株式会社 Image processing device and method, and electronic apparatus
CN111193923A (en) * 2019-09-24 2020-05-22 腾讯科技(深圳)有限公司 Video quality evaluation method and device, electronic equipment and computer storage medium
CN110753181A (en) * 2019-09-29 2020-02-04 湖北工业大学 Video image stabilization method based on feature tracking and grid path motion
WO2021115136A1 (en) * 2019-12-10 2021-06-17 闻泰科技(深圳)有限公司 Anti-shake method and apparatus for video image, electronic device, and storage medium
CN112887708A (en) * 2021-01-22 2021-06-01 北京锐马视讯科技有限公司 Video jitter detection method and device, video jitter detection equipment and storage medium
CN112995678A (en) * 2021-02-22 2021-06-18 深圳创维-Rgb电子有限公司 Video motion compensation method and device and computer equipment
CN113284134A (en) * 2021-06-17 2021-08-20 张清坡 Unmanned aerial vehicle flight platform for geological survey
CN113691733A (en) * 2021-09-18 2021-11-23 北京百度网讯科技有限公司 Video jitter detection method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
司红伟; 全蕾; 张杰: "Motion detection algorithm based on background estimation", Computer Engineering and Design (计算机工程与设计), no. 01, 16 January 2011 (2011-01-16) *
屠礼芬; 彭祺; 仲思东: "A moving object detection method adapted to camera shake", Journal of Electronics & Information Technology (电子与信息学报), no. 08, 15 August 2013 (2013-08-15) *
李佳希: "Research on facial expression recognition algorithms based on time-sequence images", Computer & Digital Engineering (计算机与数字工程), 20 December 2020 (2020-12-20), pages 3-4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036831A (en) * 2023-10-09 2023-11-10 视睿(杭州)信息科技有限公司 Method, system and medium for detecting splash screen based on time sequence feature modeling
CN117036831B (en) * 2023-10-09 2024-03-22 视睿(杭州)信息科技有限公司 Method, system and medium for detecting splash screen based on time sequence feature modeling

Similar Documents

Publication Publication Date Title
CN109272509B (en) Target detection method, device and equipment for continuous images and storage medium
CN112837303A (en) Defect detection method, device, equipment and medium for mold monitoring
Ma et al. Towards a universal model for cross-dataset crowd counting
EP2858008A2 (en) Target detecting method and system
CN112669349A (en) Passenger flow statistical method, electronic equipment and storage medium
CN106971401B (en) Multi-target tracking device and method
CN103955930B (en) Motion parameter estimation method based on gray integral projection cross-correlation function characteristics
CN112215794B (en) Method and device for detecting dirt of binocular ADAS camera
CN107220647B (en) Crop center point positioning method and system under blade crossing condition
US20150178573A1 (en) Ground plane detection
CN112367474A (en) Self-adaptive light field imaging method, device and equipment
CN111178193A (en) Lane line detection method, lane line detection device and computer-readable storage medium
US11669978B2 (en) Method and device for estimating background motion of infrared image sequences and storage medium
CN111402293A (en) Vehicle tracking method and device for intelligent traffic
CN108596032B (en) Detection method, device, equipment and medium for fighting behavior in video
CN103500454A (en) Method for extracting moving target of shaking video
CN114339395A (en) Video jitter detection method, detection device, electronic equipment and readable storage medium
CN113256683B (en) Target tracking method and related equipment
CN113542868A (en) Video key frame selection method and device, electronic equipment and storage medium
CN114022531A (en) Image processing method, electronic device, and storage medium
JP7165353B2 (en) Image feature output device, image recognition device, image feature output program, and image recognition program
CN110059695B (en) Character segmentation method based on vertical projection and terminal
Okarma et al. A fast image analysis technique for the line tracking robots
JPWO2019003709A1 (en) Information processing apparatus, control method, and program
CN111161304B (en) Remote sensing video target track tracking method for rapid background estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination