CN114359333A - Moving object extraction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN114359333A
Authority
CN
China
Prior art keywords
image
foreground
mask
acquiring
position information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111671998.4A
Other languages
Chinese (zh)
Inventor
沈丰毅
肖春林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuncong Technology Group Co Ltd
Original Assignee
Yuncong Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuncong Technology Group Co Ltd filed Critical Yuncong Technology Group Co Ltd
Priority to CN202111671998.4A
Publication of CN114359333A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the field of video processing and specifically provides a moving target extraction method and device, computer equipment and a storage medium, aiming at solving the problem of how to segment a moving target from an image and accurately acquire its contour and motion track. To this end, the method of the invention comprises: selecting video images in a frame skipping manner; sending the image to be processed and the background image into a trained foreground extraction network for semantic segmentation, thereby obtaining mask images containing the foreground target; and fusing the information of multiple mask images to extract the moving target. By applying the method, a convolutional neural network based on semantic segmentation improves the speed of identifying the moving target; a frame skipping strategy further accelerates video processing; and sending the image to be processed and the background image into the convolutional neural network simultaneously improves the segmentation accuracy of the foreground target while greatly improving the anti-interference capability of the network.

Description

Moving object extraction method and device, computer equipment and storage medium
Technical Field
The invention belongs to the field of video processing, and particularly provides a moving object extraction method and device, computer equipment and a storage medium.
Background
Video concentration is a brief summary of video content. Moving objects in a video are extracted by algorithmic analysis, the motion tracks of all objects are then analyzed, and different objects are spliced into a common background scene and combined in a certain way, so that objects and activities occurring at different times are displayed simultaneously. In this way, a viewer can review several or even tens of hours of video within a few minutes. One of the most critical steps of video concentration is extracting the moving objects in the video, which directly determines the quality and accuracy of the final composite video.
Current moving target extraction mainly faces the following difficulties. First, the category of the moving target is uncertain: common target extraction algorithms such as target detection or instance segmentation can only extract targets of fixed categories, and for categories absent from their training sets the recall rate is low, so these algorithms are not general enough for video concentration. Second, a target may remain relatively static for a certain time: some algorithms extract the moving target by analyzing the information of multiple preceding and following frames, but they often suffer from low accuracy, and the processing speed is low because many frames of images need to be analyzed. Third, small disturbances exist in the video: more traditional image processing algorithms such as the frame difference method can obtain moving targets but have low accuracy and are prone to false detection, while methods such as the optical flow method have high accuracy but low processing speed and can hardly meet the requirement of real-time detection. Therefore, how to segment the moving target from the background image and accurately acquire its contour and motion track has become an urgent problem to be solved.
Accordingly, there is a need in the art for a new solution to the above-mentioned problems.
Disclosure of Invention
The invention aims to solve the technical problem of how to segment a moving object from a background image and accurately acquire the contour of the moving object and the motion track of the moving object.
In a first aspect, the present invention provides a moving object extraction method, the method comprising:
acquiring a background image;
acquiring a first image from a first video to be processed according to a first frame skipping rule;
obtaining a first mask image sequence based on the first image and the background image;
acquiring a second video to be processed from the first video to be processed according to the first mask image sequence;
acquiring a second image from the second video to be processed according to a second frame skipping rule;
obtaining a second mask image sequence based on the second image and the background image;
and acquiring the motion trail of the moving target based on the second mask image sequence.
In an embodiment of the above moving object extraction method, "obtaining a first mask image sequence based on the first image and the background image" specifically includes sending the first image and the background image to a trained foreground extraction network in sequence for semantic segmentation to obtain the first mask image sequence, where the first mask image sequence includes a plurality of first mask images;
the step of obtaining a second mask image sequence based on the second image and the background image specifically includes sending the second image and the background image to the trained foreground extraction network in sequence for semantic segmentation to obtain the second mask image sequence, where the second mask image sequence includes a plurality of second mask images;
the foreground extraction network is a convolutional neural network;
the semantic information of the first mask image and the semantic information of the second mask image are the same, the semantic information includes that the position with the pixel value of 1 represents that a foreground target exists, and the position with the pixel value of 0 represents that the foreground target does not exist.
In one embodiment of the above moving object extracting method, the method further comprises:
marking a connected domain with a pixel value of 1 in the second mask image;
and acquiring the position information of the foreground target in the second mask image according to the connected domain, wherein the position information comprises a foreground target ID and a rectangular frame corresponding to the foreground target ID, and the number of the foreground target IDs is one or more.
In an embodiment of the moving object extracting method, the foreground object includes a first foreground object and a second foreground object, the first foreground object is the foreground object in a first result image, the second foreground object is the foreground object in a second result image, the first result image and the second result image are two adjacent second mask images in the second mask image sequence, the second result image is the second mask image of a frame before the first result image, the position information includes first position information, second position information and predicted position information, and the first position information, the second position information and the predicted position information each include a respective foreground object ID and a rectangular frame corresponding to the respective foreground object ID;
the step of acquiring the motion trail of the moving object based on the second mask map sequence specifically includes:
acquiring the first position information of the first foreground target;
acquiring the predicted position information of the second foreground target at the moment corresponding to the first result image;
obtaining IoU values of a rectangular frame in the first position information and a rectangular frame in the predicted position information;
and acquiring the motion trail of the motion target according to the IoU value.
In an embodiment of the above moving object extracting method, the step of "obtaining the moving track of the moving object according to the IoU value" specifically includes:
acquiring the second position information of the second foreground target;
taking the IoU value as a weight value of a KM algorithm, and obtaining the matching degree of the moving targets of the first foreground target and the second foreground target through the KM algorithm;
when the matching degree of the moving target is greater than or equal to a threshold value of the matching degree of the moving target, judging that the first foreground target and the second foreground target are the same moving target;
and acquiring the motion track of the moving target according to the first position information and the second position information.
In an embodiment of the above moving object extracting method, the foreground object further includes a third foreground object, the third foreground object is the foreground object in a third result image, the third result image is the second mask image M frames before the second result image in the second mask image sequence, and M is an integer greater than or equal to 1;
the step of "obtaining the predicted position information of the second foreground target at the time corresponding to the first result image" specifically includes:
acquiring the current speed of the second foreground target;
obtaining the historical speed of the third foreground target, wherein the moving targets corresponding to the third foreground target and the second foreground target are the same;
acquiring the predicted speed of the second foreground target according to the current speed and the historical speed;
obtaining the central point of the rectangular frame of the predicted position information according to the predicted speed and the time difference between the second result image and the first result image;
and acquiring the predicted position information according to the central point of the rectangular frame of the predicted position information.
In an embodiment of the above moving object extracting method, the step of "obtaining the predicted position information according to a central point of a rectangular frame of the predicted position information" specifically includes:
the width and the height of the rectangular frame of the predicted position information are respectively the average values of the widths and the heights of the rectangular frames in the position information of each foreground object corresponding to the same moving object in the second result image and in the second mask images of the N frames before the second result image, wherein N is an integer greater than or equal to 1.
In one embodiment of the above moving object extracting method, the method of "acquiring a background image" includes:
acquiring an initial background image;
the step of "acquiring the initial background image" specifically includes:
acquiring a third image from a time range specified by the initial stage of the first video to be processed according to a third frame skipping rule;
and obtaining the initial background image through a median filtering algorithm based on the third image.
In one embodiment of the above moving object extracting method, the method of "acquiring a background image" further includes:
maintaining the background image;
the step of "maintaining the background image" specifically includes:
acquiring a fourth image from the first video to be processed according to a fourth frame skipping rule;
sending the fourth image and the historical background image into the trained foreground extraction network to obtain a third mask image, wherein the historical background image is the background image before the moment corresponding to the fourth image;
and updating the background image according to the fourth image, the third mask image and the historical background image.
In one embodiment of the above moving object extracting method, the background image is maintained during the process of acquiring the first mask image sequence;
and/or maintaining the background image during the acquisition of the second mask map sequence.
In an embodiment of the above moving object extracting method, "acquiring a second to-be-processed video from the first to-be-processed video according to the first mask map sequence" specifically includes:
obtaining a foreground target ratio of the first mask image according to the number of pixels whose value is 1 in the first mask image;
when the foreground target proportion value is larger than or equal to a foreground target proportion threshold value, judging that the foreground target exists in the first mask image;
checking the first mask images in the first mask image sequence one by one to determine whether the foreground target exists;
when the foreground object exists in the two adjacent first mask images, the first to-be-processed video in the time range corresponding to the two adjacent first mask images is obtained, and the second to-be-processed video is obtained.
In a second aspect, the present invention provides a moving object extraction apparatus, the apparatus comprising:
a context acquisition module configured to:
the initial background image is acquired and the background image,
maintaining the background image;
an image acquisition module configured to perform the following operations:
acquiring a first image from a first video to be processed according to a first frame skipping rule,
acquiring a second video to be processed from the first video to be processed according to the first mask image sequence,
acquiring a second image from the second video to be processed according to a second frame skipping rule;
a foreground target segmentation module configured to:
obtaining a first sequence of mask patterns based on the first image and the background image,
obtaining a second mask image sequence based on the second image and the background image;
a moving object extraction module configured to obtain a moving trajectory of the moving object according to the second mask map sequence.
In an embodiment of the above moving object extracting apparatus, the foreground object segmentation module is configured to perform the following specific operations:
sending the first image and the background image to a trained foreground extraction network in sequence for semantic segmentation to obtain a first mask image sequence, wherein the first mask image sequence comprises a plurality of first mask images;
sending the second image and the background image to the trained foreground extraction network in sequence for semantic segmentation to obtain a second mask image sequence, wherein the second mask image sequence comprises a plurality of second mask images;
the foreground extraction network is a convolutional neural network;
the semantic information of the first mask image and the semantic information of the second mask image are the same, the semantic information includes that the position with the pixel value of 1 represents that a foreground target exists, and the position with the pixel value of 0 represents that the foreground target does not exist.
In a third aspect, the invention proposes a computer device comprising a processor and storage means adapted to store a plurality of program codes, characterized in that said program codes are adapted to be loaded and run by said processor to perform a moving object extraction method according to any of the previous solutions.
In a fourth aspect, the present invention proposes a storage medium adapted to store a plurality of program codes adapted to be loaded and run by a processor to perform the moving object extraction method according to any one of the above aspects.
By adopting the above technical solution, the invention uses a semantic segmentation convolutional neural network to improve the speed of identifying the moving target, and further accelerates video processing through a frame skipping strategy. Since the image of the current frame of the video and the background image are simultaneously sent into the convolutional neural network, any kind of object that does not exist in the background can be segmented, and small disturbances in the video can be eliminated through the strong learning capacity of the neural network, thereby improving the anti-interference capability of the network.
Drawings
Preferred embodiments of the present invention are described below with reference to the accompanying drawings, in which:
fig. 1 is a flowchart of main steps of a moving object extraction method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of semantic segmentation of a foreground extraction network according to an embodiment of the present invention.
Fig. 3 is a flowchart of a specific implementation of step S107 in fig. 1.
Fig. 4 is a flowchart of a specific implementation of step S1072 in fig. 3.
Fig. 5 is a schematic diagram of a composition structure of a moving object extracting apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention. And can be adjusted as needed by those skilled in the art to suit particular applications.
Referring to fig. 1, fig. 1 is a flow chart of main steps of a moving object extracting method according to an embodiment of the present invention. As shown in fig. 1, the moving object extraction method according to the embodiment of the present invention includes:
step S101: acquiring a background image;
step S102: acquiring a first image from a first video to be processed according to a first frame skipping rule;
step S103: obtaining a first mask image sequence based on the first image and the background image;
step S104: acquiring a second video to be processed from the first video to be processed according to the first mask image sequence;
step S105: acquiring a second image from a second video to be processed according to a second frame skipping rule;
step S106: obtaining a second mask image sequence based on the second image and the background image;
step S107: and acquiring the motion trail of the moving target based on the second mask image sequence.
In this embodiment, the video capture device is a camera with a fixed position and angle, such as a security camera in a supermarket or a security camera on a road, so that the scene shot by the camera is relatively fixed within a certain time range, that is, the video contains a relatively fixed background image.
When a historical video of a certain time period needs to be condensed, in step S101 an initial background image of the working scene of the video to be processed needs to be acquired first. Preferably, according to a third frame skipping rule, images within a specified time range at the initial stage of the first video to be processed are intercepted to obtain third images, where the first video to be processed is the video file that needs to be condensed. As an example, the third frame skipping rule may be set to intercept one frame every 100 frames, and the specified time range may be the first 30 seconds of the video file; when the first video to be processed runs at 25 frames per second and interception starts from the first frame, this yields 8 third images of the same size.
Preferably, the 8 third images are fused by a median filtering algorithm to obtain the initial background image. As an example, when the third images are in RGB format, the color data of the pixels at the same position in the 8 third images may be regarded as three one-dimensional sequences (R, G and B) of 8 values each; one-dimensional median filtering is applied to the 8 R, G and B values of the pixels at the same position, thereby obtaining an initial background image that better reflects the real background and fuses the information of the 8 third images.
In another embodiment, the 8 third images are first subjected to two-dimensional median filtering of the image matrix, and the size and/or shape of the two-dimensional median filter window may be selected according to the actual situation, such as a 3 × 3 rectangular window. Then, according to the method described above, one-dimensional median filtering is applied to the pixels at the same position of the 8 two-dimensionally filtered third images, thereby obtaining the initial background image. The one-dimensional or two-dimensional median filtering algorithm can be implemented using the C language, the OpenCV software library, or other computer tools.
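As a non-limiting illustration of the background initialization described above, the fusion of the sampled frames can be sketched in Python with OpenCV and NumPy roughly as follows; the function name and the 3 × 3 window are illustrative assumptions rather than part of the embodiment:

```python
import cv2
import numpy as np

def initial_background(frames, use_2d_median=True):
    """Fuse sampled frames (e.g. the 8 "third images") into an initial background.

    frames: list of equally sized color images sampled by the third frame
    skipping rule. Each pixel of the result is the temporal median of the
    corresponding pixels, computed independently for each color channel.
    """
    if use_2d_median:
        # Optional per-frame 2-D median filtering with a 3 x 3 window,
        # as in the alternative embodiment described above.
        frames = [cv2.medianBlur(f, 3) for f in frames]
    stack = np.stack(frames, axis=0)        # shape (T, H, W, 3)
    background = np.median(stack, axis=0)   # 1-D median over the time axis
    return background.astype(np.uint8)
```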
In this embodiment, in order to increase the video processing speed, a frame skipping manner is adopted to detect whether a foreground target exists in the first video to be processed. In step S102, a first image is acquired from the first video to be processed according to a first frame skipping rule. The more frames the first frame skipping rule skips, the faster the detection, but when the interval is too large the probability of missed detection increases greatly, so the first frame skipping rule needs to be chosen reasonably to balance efficiency and effect. As an example, when the video file is compressed with the H.264 standard, the first frame skipping rule may be set to intercept the I frames of the decoded video. The I frame is chosen because the I frame, also called a key frame, is an important frame in inter-frame compression coding: it is a full-frame compressed coded frame, and a complete image can be reconstructed from the I-frame data alone during decoding.
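Purely for illustration, a simplified frame skipping sampler is sketched below; it takes every N-th frame with OpenCV rather than selecting true I frames, since extracting keyframe flags would require access to the demuxer, and all names are assumptions:

```python
import cv2

def sample_frames(video_path, skip=100, max_seconds=None):
    """Yield (frame_index, frame) for every `skip`-th frame of the video.

    This is a simplified stand-in for the first/third frame skipping rules;
    selecting I frames of an H.264 stream would instead rely on the
    decoder's keyframe information.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if max_seconds is not None and index >= fps * max_seconds:
            break
        if index % skip == 0:
            yield index, frame
        index += 1
    cap.release()
```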
In step S103, according to the time sequence of the video, the first image and the background image are sequentially sent to the trained foreground extraction network for semantic segmentation, so as to obtain a first mask sequence composed of a plurality of first mask images. Preferably, the foreground extraction network is a BiSeNet v2 network in a convolutional neural network. The BiSeNet v2 network is a real-time semantic segmentation network that separately processes spatial details and classification semantics to achieve high-precision and high-efficiency real-time semantic segmentation.
As shown in fig. 2, the output of the BiSeNet v2 network is a mask map after binarization, and in this embodiment, the semantic information of the mask map is: the position with the pixel value of 1 (white part in the mask image) represents that the foreground object exists, the position with the pixel value of 0 (black part in the mask image) represents that the foreground object does not exist, and the position is the background image part, namely the area with the pixel value of 1 in the mask image is the area which can be the moving object in the image to be processed. The mask patterns in this embodiment include the first mask pattern in step S103, the second mask pattern in step S106, and the third mask pattern.
In step S104, the ratio of the number of pixels with a pixel value of 1 in each first mask image of the first mask image sequence to the total number of pixels in the first mask image is calculated to obtain the foreground target ratio; when the foreground target ratio is greater than or equal to a preset foreground target ratio threshold, it is determined that a foreground target exists in the first mask image. As an example, the foreground target ratio threshold may be set according to the types of moving targets that usually appear in the scene, the size of the moving targets in the camera's field of view, and so on; for example, the threshold may be set to 5%, so that when the proportion of pixels with a value of 1 in the first mask image exceeds 5%, the first mask image is judged to contain a foreground target.
And checking whether two adjacent first mask images in the first mask image sequence contain foreground targets one by one. And when the two adjacent first mask images contain the foreground target, selecting the first to-be-processed video in the time range corresponding to the two first mask images as the second to-be-processed video.
It can be known from the above method for obtaining the second to-be-processed video that the second to-be-processed video is the video including the foreground object, so that in the subsequent data processing, only the first to-be-processed video including the foreground object needs to be processed, and the to-be-processed video only including the background does not need to be processed, which greatly improves the speed of image processing.
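The foreground ratio test and the selection of the second video to be processed in step S104 can be sketched as follows; the 5% threshold matches the example above, while the helper names are assumptions:

```python
import numpy as np

FOREGROUND_RATIO_THRESHOLD = 0.05   # 5%, as in the example above

def has_foreground(first_mask):
    """first_mask: binary array (values 0/1) produced by the foreground network."""
    ratio = np.count_nonzero(first_mask) / first_mask.size
    return ratio >= FOREGROUND_RATIO_THRESHOLD

def select_second_video_ranges(first_masks, frame_indices):
    """Return (start_frame, end_frame) ranges of the first video whose two
    adjacent first mask images both contain a foreground target; together
    these ranges form the second video to be processed."""
    pairs = list(zip(frame_indices, first_masks))
    ranges = []
    for (i0, m0), (i1, m1) in zip(pairs[:-1], pairs[1:]):
        if has_foreground(m0) and has_foreground(m1):
            ranges.append((i0, i1))
    return ranges
```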
In step S105, a second image is acquired from the second video to be processed according to the second frame skipping rule. In order to obtain more accurate extraction of the moving object, the second frame skipping rule may be set to frame by frame. In other embodiments, in order to obtain a faster processing speed, the second frame skipping rule may also be set to every other frame or frames, but this may reduce the accuracy of the moving object extraction.
In step S106, similarly, according to the time sequence of the video, the second image and the background image obtained in step S105 are simultaneously sent to the same trained foreground extraction network as that in step S103 for semantic segmentation, so as to obtain a second mask map sequence composed of a plurality of second mask maps. Similarly, the second mask image is the binarized mask image, and a position with a pixel value of 1 indicates that a foreground object is present, and a position with a pixel value of 0 indicates that a foreground object is not present.
Before step S107 is executed, the position information of each second foreground target in the second mask image needs to be acquired. The connected domains with a pixel value of 1 in the second mask image are marked; the marking method of the connected domain is not limited in the invention, and as an example it can be realized by the connectedComponents function in OpenCV, which distinguishes different connected domains and thereby different foreground targets.
A rectangular frame of each connected domain is then drawn in the second mask image according to the connected domain. The drawing method of the rectangular frame is not limited in the invention; as an example, the rectangular outline of a connected domain can be obtained through the boundingRect function in OpenCV.
When a plurality of moving targets exist in the video, the second mask image correspondingly contains a plurality of rectangular frames. Each rectangular frame is numbered one by one, and the position information of the foreground targets in the second mask image is obtained, where the position information includes a foreground target ID and the rectangular frame corresponding to the foreground target ID, and there may be one or more foreground target IDs. In this embodiment, taking the top-left vertex of the second mask image as the origin of the pixel coordinate system, a rectangular frame can be represented as S_n(u, v, w, h), where u and v are the row and column coordinates of the center point of the rectangular frame in the pixel coordinate system, w is the width of the rectangular frame, h is the height of the rectangular frame, and n is the number of the foreground target ID.
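As a sketch of this step, the connected domains and their rectangular frames can be obtained in one call with cv2.connectedComponentsWithStats, a variant of connectedComponents that also returns bounding rectangles; the dictionary layout is an illustrative assumption:

```python
import cv2
import numpy as np

def foreground_positions(second_mask):
    """Return {foreground_id: (u, v, w, h)} for a binary second mask image,
    where (u, v) is the center of the bounding rectangle of a connected
    domain and (w, h) its width and height, with the pixel-coordinate
    origin at the top-left vertex of the mask."""
    num, labels, stats, _ = cv2.connectedComponentsWithStats(
        second_mask.astype(np.uint8))
    positions = {}
    for n in range(1, num):                    # label 0 is the background
        x, y, w, h, area = stats[n]
        positions[n] = (x + w / 2.0, y + h / 2.0, float(w), float(h))
    return positions
```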
It should be noted that, in step S104 of this embodiment, only the number of pixels 0 and 1 in the first mask map is counted, and the presence or absence of the foreground object can be determined according to the number of pixels 0 and 1 in the first mask map, and the connected component and the rectangular frame are not processed, which can further increase the speed of video processing. Similarly, step S104 may also refer to the method in step S107 to mark the connected component, obtain the rectangular frame, and determine whether the foreground object exists in the first mask map according to the ratio of the connected component or the rectangular frame.
Next, a specific implementation method of step S107 will be described with reference to fig. 3.
According to the time sequence, two adjacent second mask images, namely a first result image and a second result image, are selected, the second result image is the second mask image of the frame before the first result image, the time corresponding to the first result image is A, and the time corresponding to the second result image is B.
It should be noted that, in step S107, the foreground object includes a first foreground object and a second foreground object, the first foreground object is a foreground object in the first result image, and the second foreground object is a foreground object in the second result image. In this embodiment, for the position information in different images or for different purposes, the position information further includes first position information, second position information, third position information, predicted position information, and the like, which all have a common technical feature, that is, each includes a respective foreground object ID and a rectangular frame corresponding to the respective foreground object ID.
In step S1071, the first position information of the first foreground targets in the first result image is obtained, where the time corresponding to the first result image is A and the first result image contains n first foreground targets. The first position information of a first foreground target may be represented as S(A)_n(u1, v1, w1, h1).
In step S1072, the predicted position information of the second foreground target of the second result image at the time corresponding to the first result image is obtained. Its content is: according to the second result image and the position information in the second mask images within a specified time range before the time B corresponding to the second result image, the position that the second foreground target is predicted to occupy in the second mask image at time A. The predicted position information can be expressed as S′(A)_n(u′, v′, w′, h′).
Next, a specific implementation method of step S1072 will be described with reference to fig. 4. In step S401, the current speed of the second foreground target is acquired. For this purpose, the information of a third result image is further required, where the third result image is the second mask image M frames before the second result image in the second mask image sequence, the time corresponding to the third result image is C, and M is an integer greater than or equal to 1; as an example, M may be set to 1.
The foreground targets further comprise a third foreground target, which is the foreground target in the third result image. As can be seen from the foregoing, the position information further includes second position information and third position information, the second position information being the position information of the second foreground target and the third position information being the position information of the third foreground target. The second position information may be represented as S(B)_n(u2, v2, w2, h2), and the third position information as S(C)_n(u3, v3, w3, h3).
The position differences along the U axis and the V axis between the center point (u2, v2) of the rectangular frame in the second position information and the center point (u3, v3) of the rectangular frame in the third position information are divided by the time difference between the second result image and the third result image, giving the current speed of the second foreground target:
on the U axis: vU(B)_n = (u2 - u3) / |B - C|,
on the V axis: vV(B)_n = (v2 - v3) / |B - C|,
where the sign of the speed represents the moving direction of the moving target along the U axis or the V axis.
In the same way, according to the information of the second mask image W frames before the third result image (W being an integer greater than or equal to 1), the historical speed of the third foreground target is acquired in step S402, obtaining the historical speed vU(C)_n on the U axis and vV(C)_n on the V axis. As an example, W may be set to 10.
In step S403, the current speed and the historical speed are fused to obtain the predicted speed of the second foreground target. As an example, the predicted speed on the U axis is obtained by:
v′U(B)_n = α * vU(B)_n + (1 - α) * vU(C)_n,
and the predicted speed on the V axis is obtained by:
v′V(B)_n = β * vV(B)_n + (1 - β) * vV(C)_n,
where α and β are predetermined coefficients between 0 and 1; for example, α = 0.1 and β = 0.1.
After the predicted speed is obtained, in step S404 the predicted center point of the rectangular frame of the predicted position information is obtained from the time difference between the second result image and the first result image and the position information of the second foreground target in the second result image, as follows:
u′ = u2 + v′U(B)_n * |A - B|,
v′ = v2 + v′V(B)_n * |A - B|.
In step S405, after the predicted center point of the rectangular frame is obtained, the widths and heights of the rectangular frames in the position information of each foreground target corresponding to the same moving target, in the second result image and in the second mask images of the N frames before the second result image, are averaged to obtain the values of w′ and h′, thereby giving the predicted position information S′(A)_n(u′, v′, w′, h′).
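Steps S401 to S405 can be summarized by the following sketch; the argument names, the default α = β = 0.1 and the handling of the historical speed are illustrative assumptions consistent with the formulas above:

```python
def predict_position(box_B, box_C, time_A, time_B, time_C,
                     hist_speed=(0.0, 0.0), recent_boxes=None,
                     alpha=0.1, beta=0.1):
    """Predict S'(A)_n = (u', v', w', h') for a second foreground target.

    box_B, box_C: (u, v, w, h) of the same moving target in the second and
    third result images (times B and C); hist_speed is (vU(C)_n, vV(C)_n);
    recent_boxes are the rectangles whose widths/heights are averaged.
    """
    u2, v2, w2, h2 = box_B
    u3, v3, _, _ = box_C
    dt_bc = abs(time_B - time_C)
    vU_B = (u2 - u3) / dt_bc                       # current speed on the U axis
    vV_B = (v2 - v3) / dt_bc                       # current speed on the V axis
    vU_C, vV_C = hist_speed
    vU_pred = alpha * vU_B + (1 - alpha) * vU_C    # fused speed, step S403
    vV_pred = beta * vV_B + (1 - beta) * vV_C
    dt_ab = abs(time_A - time_B)
    u_pred = u2 + vU_pred * dt_ab                  # predicted center, step S404
    v_pred = v2 + vV_pred * dt_ab
    boxes = recent_boxes or [box_B]                # average w and h, step S405
    w_pred = sum(b[2] for b in boxes) / len(boxes)
    h_pred = sum(b[3] for b in boxes) / len(boxes)
    return (u_pred, v_pred, w_pred, h_pred)
```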
Continuing to step S1073, the IoU value (Intersection over Union) of the rectangular frame in the first position information and the rectangular frame in the predicted position information is calculated. The invention does not limit the method for calculating IoU; as an example, IoU can be calculated with the help of the Rect class in OpenCV.
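A plain IoU computation for two center-format rectangles (u, v, w, h) is sketched below as one possible implementation (the Rect route in OpenCV would rely on rectangle intersection instead); it is reused by the matching sketch further down:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (u, v, w, h) rectangles in center format."""
    ax1, ay1 = box_a[0] - box_a[2] / 2.0, box_a[1] - box_a[3] / 2.0
    ax2, ay2 = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx1, by1 = box_b[0] - box_b[2] / 2.0, box_b[1] - box_b[3] / 2.0
    bx2, by2 = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # width of the overlap
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # height of the overlap
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```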
In step S1074, the IoU value obtained in step S1073 is used as a weight of the KM algorithm to obtain a moving object matching degree of the first foreground object in the first result image and the second foreground object in the second result image. The KM algorithm is an algorithm for optimal matching of weighted bipartite graphs, the KM algorithm is a technique known in the art, and a specific implementation method is not described herein again.
When the matching degree of the moving target is greater than or equal to the moving target matching degree threshold, the first foreground target and the second foreground target are judged to be the same moving target, and the foreground target ID of the first foreground target and the foreground target ID of the second foreground target correspond to the same moving target (which has a unique moving target ID). The moving target ID and the center point position of the moving target's rectangular frame in the different second mask images are recorded in a database, thereby obtaining correspondence data between the moving target ID, the position in the video to be processed and the position of the moving target in the image. As an example, the moving target matching degree threshold may be set to 0.3, or to another value according to the actual situation.
The adjacent frames in the second mask image sequence are matched according to the above method in time order, and the moving target ID of each moving target and its position in each frame image of the video to be processed are recorded, thereby extracting all moving targets in the video to be processed.
It should be noted that the KM algorithm can achieve multi-target matching, that is, when there are a plurality of first foreground targets and a plurality of second foreground targets, multi-target motion trajectory extraction can be achieved through the KM algorithm.
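For illustration only, the weighted bipartite matching of step S1074 can be approximated with the assignment solver in SciPy (scipy.optimize.linear_sum_assignment), used here as a stand-in for the KM algorithm; it reuses the iou helper above, and the 0.3 threshold matches the example:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_moving_targets(first_boxes, predicted_boxes, threshold=0.3):
    """Match first foreground targets (time A) to second foreground targets
    whose positions were predicted to time A.

    first_boxes, predicted_boxes: {foreground_id: (u, v, w, h)}.
    Returns (first_id, second_id) pairs judged to be the same moving target.
    """
    first_ids = list(first_boxes)
    second_ids = list(predicted_boxes)
    if not first_ids or not second_ids:
        return []
    weight = np.zeros((len(first_ids), len(second_ids)))
    for i, fid in enumerate(first_ids):
        for j, sid in enumerate(second_ids):
            weight[i, j] = iou(first_boxes[fid], predicted_boxes[sid])
    rows, cols = linear_sum_assignment(-weight)      # maximize total IoU weight
    return [(first_ids[i], second_ids[j])
            for i, j in zip(rows, cols)
            if weight[i, j] >= threshold]
```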
In order to solve the problem of discontinuous foreground targets caused by occlusion or missed detection of some frames, it may be set that, when a second foreground target has no matched first foreground target in the first result image, the second foreground target is retained for a certain time and continues to participate in the matching of moving targets. As an example, the maximum number of subsequent frames for which the second foreground target is retained is K frames, and K may be set to 50. That is, within the subsequent 50 frames at most, the second foreground target continues to participate in matching; if the matching succeeds within those 50 frames, it is still regarded as the same moving target and the track of the moving target is recorded; if the matching does not succeed for more than 50 frames, the track recording of the moving target corresponding to the second foreground target is terminated. Therefore, even if occlusion, missed detection of some frames and the like occur, the method can continuously track the moving target without splitting the motion track into two segments.
In the process of foreground target segmentation, the initial background image of the starting stage of the video to be processed, obtained by the median filtering algorithm, is first sent to the foreground extraction network. In order to obtain a more accurate background image throughout the video concentration process, the background image is usually updated regularly. The user may choose to update the background image only during the acquisition of the first mask image sequence or only during the acquisition of the second mask image sequence, or may update it during both, which gives a more accurate target segmentation effect.
A method of maintaining the background image is explained next. According to the fourth frame skipping rule, a fourth image is obtained from the first video to be processed, and then the fourth image and the historical background image are sent to the same trained foreground extraction network in the step S103 to obtain a third mask image. The third mask map has the same semantic information as the first mask map and the second mask map, and similarly, a position with a pixel value of 1 indicates that a foreground object exists, and a position with a pixel value of 0 indicates that a foreground object does not exist. The historical background image is the background image before the moment corresponding to the fourth image.
The fourth image, the third mask image and the historical background image are fused to obtain a new background image. The specific method is as follows: if the pixel value of the position in the third mask image corresponding to the fourth image is 1 (a moving target exists), that position of the fourth image does not participate in the fusion of the background image, and its pixel value is multiplied by 0; if the pixel value in the third mask image is 0 (no moving target), the pixel value at that position of the fourth image is multiplied by a coefficient θ, thereby obtaining a first fused background image.
Similarly, if the pixel value of the position in the third mask image corresponding to the historical background image is 1 (a moving target exists), that position of the historical background image does not participate in the fusion of the background image, and its pixel value is multiplied by 0; if the pixel value in the third mask image is 0 (no moving target), the pixel value at that position of the historical background image is multiplied by a coefficient 1 - θ, thereby obtaining a second fused background image.
The pixel values at the corresponding positions of the first fused background image and the second fused background image are added to obtain the new background image, and the historical background image is updated with the new background image. θ is a preset coefficient between 0 and 1, and may be set to 0.5 as an example.
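A sketch of the background maintenance step follows. Note that a strictly literal reading of the fusion above sets foreground pixels to zero in both fused images; the sketch instead keeps the historical background unchanged at those pixels, which is an assumption about the intended behavior, and θ = 0.5 matches the example:

```python
import numpy as np

def update_background(fourth_image, third_mask, history_background, theta=0.5):
    """Update the background from the fourth image, the third mask image and
    the historical background image.

    Where the third mask is 0 (no moving target), the new background is
    theta * fourth_image + (1 - theta) * history_background; where it is 1,
    the historical background is kept (assumption, see the note above).
    """
    mask = third_mask.astype(np.float32)
    if mask.ndim == 2 and fourth_image.ndim == 3:
        mask = mask[..., None]                       # broadcast over channels
    frame = fourth_image.astype(np.float32)
    old = history_background.astype(np.float32)
    blended = theta * frame + (1.0 - theta) * old
    new_background = (1.0 - mask) * blended + mask * old
    return new_background.astype(np.uint8)
```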
It should be noted that, the maintenance of the background image may also determine the fourth frame skipping rule by combining whether the foreground object exists in the first mask image of the first mask image sequence, and select the video frame without the foreground object as much as possible to update the background image, so as to obtain the background image with a better effect, which is helpful to improve the accuracy of semantic segmentation.
It should be noted that the background image, the first image, the second image, the third image, and the like sent to the foreground extraction network all need to be subjected to image preprocessing according to the input requirement of the foreground extraction network. Methods of image pre-processing include, and are not limited to, image scaling, image filling, image storage format conversion, normalization, and the like.
In this embodiment, the foreground extraction network needs to receive the background image and the to-be-processed image intercepted from the video at the same time, so the network of the present invention has two inputs with identical requirements, for example an image resolution of 512 × 288 and 3 input channels. Therefore, the intercepted video image is first converted into a first intermediate image with a resolution of 512 × 288 by scaling, padding, etc.; the format of the first intermediate image is converted into RGB to obtain a second intermediate image; and to speed up image processing, the R, G and B values of each pixel of the second intermediate image are normalized, for example from 0-255 to 0-1, resulting in an image that meets the input requirements of the foreground extraction network.
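The preprocessing described above can be sketched as follows; the zero padding, centered placement and default resizing are assumptions, while the 512 × 288 resolution and 0-1 normalization come from the example:

```python
import cv2
import numpy as np

def preprocess(image, target_w=512, target_h=288):
    """Scale with preserved aspect ratio, pad to target size, convert the
    OpenCV BGR frame to RGB and normalize pixel values to [0, 1]."""
    h, w = image.shape[:2]
    scale = min(target_w / w, target_h / h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h))
    canvas = np.zeros((target_h, target_w, 3), dtype=np.uint8)   # zero padding
    top = (target_h - new_h) // 2
    left = (target_w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    rgb = cv2.cvtColor(canvas, cv2.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0
```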
It should be noted that, the training method of the convolutional neural network is not limited in the present invention, and the convolutional neural network may be trained by using the CDNet 2014 data set or the self-established data set as an example.
It should be noted that the reason for setting the frame skipping rule is to obtain a faster video processing speed on the premise of ensuring the video processing effect. Although examples of several frame skipping rules are listed here, it will be understood by those skilled in the art that these examples should not constitute any limitation to the scope of the present invention. Without changing the basic principle of the present invention, a person skilled in the art can set a first frame skipping rule, a second frame skipping rule, a third frame skipping rule and a fourth frame skipping rule according to the actual situation of a moving object in a video.
It should be noted that the object extraction method of the present invention includes and is not limited to the application scenario of video compression. The method is also suitable for other application scenes needing to extract the moving target, such as video abstraction, moving target tracking, moving target identification and the like.
Furthermore, the invention also provides a target extraction device. As shown in fig. 5, the target extracting apparatus 5 according to the embodiment of the present invention mainly includes: a background acquisition module 51, an image acquisition module 52, a foreground object segmentation module 53 and a moving object extraction module 54.
As an example, the background acquisition module 51 is configured to perform the operation in step S101. The image acquisition module 52 is configured to perform the operations in step S102, step S104, and step S105. The foreground object segmentation module 53 is configured to perform the operations in step S103 and step S106. The moving object extraction module 54 is configured to perform the operations of step S107, steps S1071 to S1074, and steps S401 to S405.
Further, the present invention also provides a computer device including a processor and a storage device. The storage device may be configured to store the program for executing the moving object extraction method of the above method embodiment, and the processor may be configured to execute the program in the storage device, the program including but not limited to the program for the moving object extraction method of the above method embodiment. For convenience of explanation, only the parts related to the embodiments of the present invention are shown, and specific technical details are not disclosed. The computer device may be a control apparatus formed of various electronic devices.
Further, the present invention also provides a storage medium that may be configured to store a program for executing the moving object extraction method of the above-described method embodiment, which may be loaded and executed by a processor to implement the above-described moving object extraction method. For convenience of explanation, only the parts related to the embodiments of the present invention are shown, and details of the specific techniques are not disclosed. The storage medium may be a storage device formed by various electronic devices, and optionally, the storage medium in the embodiment of the present invention is a non-transitory readable and writable storage medium.
Those of skill in the art will appreciate that the method steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It should be noted that in the description of the present application, the term "a and/or B" indicates all possible combinations of a and B, such as a alone, B alone, or a and B.
It should be noted that the terms "first," "second," "third," "fourth," and the like in the description and in the claims of the invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing or implying any particular order or sequence. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (15)

1. A moving object extraction method, characterized in that the method comprises:
acquiring a background image;
acquiring a first image from a first video to be processed according to a first frame skipping rule;
obtaining a first mask image sequence based on the first image and the background image;
acquiring a second video to be processed from the first video to be processed according to the first mask image sequence;
acquiring a second image from the second video to be processed according to a second frame skipping rule;
obtaining a second mask image sequence based on the second image and the background image;
and acquiring the motion trail of the moving target based on the second mask image sequence.
2. The moving object extraction method according to claim 1,
the step of obtaining a first mask image sequence based on the first image and the background image specifically includes sending the first image and the background image to a trained foreground extraction network in sequence for semantic segmentation to obtain the first mask image sequence, where the first mask image sequence includes a plurality of first mask images;
the step of obtaining a second mask image sequence based on the second image and the background image specifically includes sending the second image and the background image to the trained foreground extraction network in sequence for semantic segmentation to obtain the second mask image sequence, where the second mask image sequence includes a plurality of second mask images;
the foreground extraction network is a convolutional neural network;
the semantic information of the first mask image and the semantic information of the second mask image are the same, the semantic information includes that the position with the pixel value of 1 represents that a foreground target exists, and the position with the pixel value of 0 represents that the foreground target does not exist.
3. The moving object extraction method according to claim 2, characterized in that the method further comprises:
marking a connected domain with a pixel value of 1 in the second mask image;
and acquiring the position information of the foreground target in the second mask image according to the connected domain, wherein the position information comprises a foreground target ID and a rectangular frame corresponding to the foreground target ID, and the number of the foreground target IDs is one or more.
4. The moving object extraction method according to claim 3, wherein the foreground object includes a first foreground object and a second foreground object, the first foreground target is the foreground target in a first resultant image, the second foreground target is the foreground target in a second resultant image, the first result image and the second result image are two adjacent second mask images in the second mask image sequence, the second result image is the second mask image of a frame previous to the first result image, the position information includes first position information, second position information, and predicted position information, the first position information, the second position information and the predicted position information all comprise respective foreground object IDs and rectangular frames corresponding to the respective foreground object IDs;
the step of acquiring the motion trail of the moving object based on the second mask map sequence specifically includes:
acquiring the first position information of the first foreground target;
acquiring the predicted position information of the second foreground target at the moment corresponding to the first result image;
obtaining IoU values of a rectangular frame in the first position information and a rectangular frame in the predicted position information;
and acquiring the motion trail of the motion target according to the IoU value.
5. The method according to claim 4, wherein the step of obtaining the motion trajectory of the moving object according to the IoU value specifically comprises:
acquiring the second position information of the second foreground target;
taking the IoU value as a weight value of a KM algorithm, and obtaining the matching degree of the moving targets of the first foreground target and the second foreground target through the KM algorithm;
when the matching degree of the moving target is greater than or equal to a threshold value of the matching degree of the moving target, judging that the first foreground target and the second foreground target are the same moving target;
and acquiring the motion track of the moving target according to the first position information and the second position information.
6. The method according to claim 4, wherein the foreground objects further include a third foreground object, the third foreground object is the foreground object in a third result image, the third result image is the second mask image M frames before the second result image in the second mask image sequence, and M is an integer greater than or equal to 1;
the step of "obtaining the predicted position information of the second foreground target at the time corresponding to the first result image" specifically includes:
acquiring the current speed of the second foreground target;
obtaining the historical speed of the third foreground target, wherein the moving targets corresponding to the third foreground target and the second foreground target are the same;
acquiring the predicted speed of the second foreground target according to the current speed and the historical speed;
obtaining the central point of the rectangular frame of the predicted position information according to the predicted speed and the time difference between the second result image and the first result image;
and acquiring the predicted position information according to the central point of the rectangular frame of the predicted position information.
7. The moving object extraction method according to claim 6, wherein the step of acquiring the predicted position information from a center point of a rectangular frame of the predicted position information specifically includes:
the width and the height of the rectangular frame of the predicted position information are respectively the average values of the widths and the heights of the rectangular frames in the position information of each foreground object corresponding to the same moving object in the second result image and in the second mask images of the N frames before the second result image, wherein N is an integer greater than or equal to 1.
8. The moving object extraction method according to claim 1, wherein the method of "acquiring a background image" includes:
acquiring an initial background image;
the step of "acquiring the initial background image" specifically includes:
acquiring a third image, according to a third frame skipping rule, from a specified time range in the initial stage of the first video to be processed;
and obtaining the initial background image through a median filtering algorithm based on the third image.
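A minimal sketch of the initial background estimate in claim 8, assuming OpenCV for decoding; the sampling step and time range are illustrative values, not the frame skipping rule or range fixed by the claims:

```python
import cv2
import numpy as np

def initial_background(video_path, num_seconds=10.0, skip=5):
    """Per-pixel temporal median of frames sampled from the initial stage
    of the first video to be processed."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    frames = []
    for idx in range(int(num_seconds * fps)):
        ok, frame = cap.read()
        if not ok:
            break
        if idx % skip == 0:  # illustrative third frame skipping rule
            frames.append(frame)
    cap.release()
    return np.median(np.stack(frames), axis=0).astype(np.uint8)
```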
9. The moving object extraction method according to claim 1, wherein the method of "acquiring a background image" further comprises:
maintaining the background image;
the step of "maintaining the background image" specifically includes:
acquiring a fourth image from the first video to be processed according to a fourth frame skipping rule;
sending the fourth image and the historical background image into the trained foreground extraction network to obtain a third mask image, wherein the historical background image is the background image before the moment corresponding to the fourth image;
and updating the background image according to the fourth image, the third mask image and the historical background image.
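The claims do not fix the update rule for claim 9; one plausible sketch blends the background toward the fourth image only where the third mask image reports no foreground target, and keeps the historical background where a foreground target is present:

```python
import numpy as np

def update_background(fourth_image, third_mask, historical_background, rate=0.1):
    """Update the background image. `third_mask` uses 1 = foreground, 0 = background;
    `rate` is an illustrative learning rate, not a value from the claims."""
    new = fourth_image.astype(np.float32)
    hist = historical_background.astype(np.float32)
    is_background = (third_mask == 0)[..., None]  # broadcast over color channels
    updated = np.where(is_background, (1.0 - rate) * hist + rate * new, hist)
    return updated.astype(np.uint8)
```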
10. The moving object extraction method according to claim 9,
maintaining the background image in the process of acquiring the first mask image sequence;
and/or maintaining the background image during the acquisition of the second mask image sequence.
11. The method according to claim 2, wherein the step of acquiring a second to-be-processed video from the first to-be-processed video according to the first mask image sequence specifically comprises:
obtaining a foreground target ratio of the first mask image according to the number of pixels whose value is 1 in the first mask image;
when the foreground target ratio is greater than or equal to a foreground target ratio threshold, judging that the foreground target exists in the first mask image;
checking the first mask images in the first mask image sequence one by one to determine whether the foreground target exists;
and when the foreground target exists in two adjacent first mask images, acquiring the first to-be-processed video within the time range corresponding to the two adjacent first mask images to obtain the second to-be-processed video.
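A minimal sketch of the check in claim 11, assuming each first mask image is a binary NumPy array; the ratio threshold is an illustrative value:

```python
import numpy as np

def foreground_ratio(mask):
    """Share of pixels whose value is 1 in a binary first mask image."""
    return float(np.count_nonzero(mask)) / mask.size

def adjacent_segments_with_targets(mask_sequence, ratio_threshold=0.01):
    """Return index pairs (i, i + 1) of adjacent first mask images that both contain
    a foreground target; the corresponding time ranges of the first to-be-processed
    video are what claim 11 takes as the second to-be-processed video."""
    has_target = [foreground_ratio(m) >= ratio_threshold for m in mask_sequence]
    return [(i, i + 1) for i in range(len(has_target) - 1)
            if has_target[i] and has_target[i + 1]]
```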
12. A moving object extraction apparatus, characterized in that the apparatus comprises:
a background acquisition module configured to perform the following operations:
acquiring an initial background image,
maintaining the background image;
an image acquisition module configured to perform the following operations:
acquiring a first image from a first video to be processed according to a first frame skipping rule,
acquiring a second video to be processed from the first video to be processed according to the first mask image sequence,
acquiring a second image from the second video to be processed according to a second frame skipping rule;
a foreground target segmentation module configured to:
obtaining a first mask image sequence based on the first image and the background image,
obtaining a second mask image sequence based on the second image and the background image;
a moving object extraction module configured to obtain the motion trail of the moving object according to the second mask image sequence.
13. The moving object extraction device of claim 12, wherein the foreground object segmentation module is configured to perform the following specific operations:
sending the first image and the background image to a trained foreground extraction network in sequence for semantic segmentation to obtain a first mask image sequence, wherein the first mask image sequence comprises a plurality of first mask images;
sending the second image and the background image to the trained foreground extraction network in sequence for semantic segmentation to obtain a second mask image sequence, wherein the second mask image sequence comprises a plurality of second mask images;
the foreground extraction network is a convolutional neural network;
the semantic information of the first mask image and the semantic information of the second mask image are the same, wherein a position with a pixel value of 1 indicates that a foreground target exists, and a position with a pixel value of 0 indicates that no foreground target exists.
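The claims state only that the image to be processed and the background image are both sent into the foreground extraction network; how they are combined at the input is not specified. A channel-wise concatenation is one common choice, sketched below as an assumption:

```python
import numpy as np

def build_network_input(image, background):
    """Stack an H x W x 3 image and its H x W x 3 background along the channel axis
    and normalize to [0, 1], yielding a 1 x 6 x H x W array in NCHW layout
    (an assumed input format, not one fixed by the claims)."""
    stacked = np.concatenate([image, background], axis=-1).astype(np.float32) / 255.0
    return np.transpose(stacked, (2, 0, 1))[None, ...]
```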
14. A computer apparatus comprising a processor and a storage device adapted to store a plurality of program codes, wherein said program codes are adapted to be loaded and run by said processor to perform a moving object extraction method according to any one of claims 1 to 11.
15. A storage medium adapted to store a plurality of program codes, wherein the program codes are adapted to be loaded and run by a processor to perform the moving object extraction method of any one of claims 1 to 11.
CN202111671998.4A 2021-12-31 2021-12-31 Moving object extraction method and device, computer equipment and storage medium Pending CN114359333A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111671998.4A CN114359333A (en) 2021-12-31 2021-12-31 Moving object extraction method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111671998.4A CN114359333A (en) 2021-12-31 2021-12-31 Moving object extraction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114359333A 2022-04-15

Family

ID=81105538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111671998.4A Pending CN114359333A (en) 2021-12-31 2021-12-31 Moving object extraction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114359333A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984780A (en) * 2023-02-23 2023-04-18 合肥英特灵达信息技术有限公司 Industrial solid waste warehouse-in and warehouse-out distinguishing method and device, electronic equipment and medium
CN117291952A (en) * 2023-10-31 2023-12-26 中国矿业大学(北京) Multi-target tracking method and device based on speed prediction and image reconstruction
CN117291952B (en) * 2023-10-31 2024-05-17 中国矿业大学(北京) Multi-target tracking method and device based on speed prediction and image reconstruction

Similar Documents

Publication Publication Date Title
KR101942808B1 (en) Apparatus for CCTV Video Analytics Based on Object-Image Recognition DCNN
CN110008797B (en) Multi-camera multi-face video continuous acquisition method
CN107944427B (en) Dynamic face recognition method and computer readable storage medium
CN109740572B (en) Human face living body detection method based on local color texture features
CN114359333A (en) Moving object extraction method and device, computer equipment and storage medium
EP2549759B1 (en) Method and system for facilitating color balance synchronization between a plurality of video cameras as well as method and system for obtaining object tracking between two or more video cameras
CN110059634B (en) Large-scene face snapshot method
CN105893963B (en) A kind of method of the best frame easy to identify of single pedestrian target in screening video
CN103093198A (en) Crowd density monitoring method and device
CN112381132A (en) Target object tracking method and system based on fusion of multiple cameras
CN112633221A (en) Face direction detection method and related device
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN117409476A (en) Gait recognition method based on event camera
CN108520496B (en) Sea-air background monitoring video image splicing method based on optical flow method
US20240161461A1 (en) Object detection method, object detection apparatus, and object detection system
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN114266952A (en) Real-time semantic segmentation method based on deep supervision
CN113011408A (en) Method and system for recognizing characters and vehicle identification codes of multi-frame picture sequence
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
CN112949451A (en) Cross-modal target tracking method and system through modal perception feature learning
CN110298229B (en) Video image processing method and device
CN111147815A (en) Video monitoring system
CN116110095A (en) Training method of face filtering model, face recognition method and device
CN111127355A (en) Method for finely complementing defective light flow graph and application thereof
CN106778675B (en) A kind of recognition methods of target in video image object and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination