CN118097566B - Scene change detection method, device, medium and equipment based on deep learning - Google Patents

Scene change detection method, device, medium and equipment based on deep learning

Info

Publication number
CN118097566B
CN118097566B (application number CN202410487285.XA)
Authority
CN
China
Prior art keywords
image
images
feature map
module
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410487285.XA
Other languages
Chinese (zh)
Other versions
CN118097566A (en)
Inventor
杨国锴
卓涛
程志勇
高赞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Shandong Institute of Artificial Intelligence
Original Assignee
Qilu University of Technology
Shandong Institute of Artificial Intelligence
Filing date
Publication date
Application filed by Qilu University of Technology, Shandong Institute of Artificial Intelligence filed Critical Qilu University of Technology
Priority to CN202410487285.XA
Publication of CN118097566A
Application granted
Publication of CN118097566B
Legal status: Active


Abstract

The invention relates to the technical field of image recognition, and in particular to a scene change detection method, device, medium and equipment based on deep learning. The method comprises the following steps: acquiring an image pair to be detected; inputting the image pair into a homography-based alignment module to obtain images whose scenes are mutually aligned; extracting features from the aligned images and inputting them into a preliminary change detection network to obtain change information; and inputting the change information into the respective positioning networks, which output the bounding boxes of the changed regions of the two images. By aligning the two images through homography, the invention compensates for the difficulty of directly capturing change features between the two images; by capturing the correspondence between the two images with a cross-attention mechanism, it compensates for the information lost in unaligned images. The network adopts a twin (Siamese) neural network architecture that processes the two images simultaneously, and a feature fusion module improves the identification of changed regions.

Description

Scene change detection method, device, medium and equipment based on deep learning
Technical Field
The invention relates to the technical field of image recognition, in particular to a scene change detection method, device, medium and equipment based on deep learning.
Background
With the rapid development of computer vision technology, exploring scene changes plays an important role in image processing and computer graphics. Scene change research aims to develop algorithms and techniques to detect, analyze and understand changes between different scenes, driven by the need to extract useful information from dynamic scenes. The subject involves modeling, detecting and describing changes in image sequences or video to provide an understanding and analysis of scene evolution. With the popularization of digital image capturing devices and the increase in computing power, it has become easier to acquire and process image sequences and video data, so accurately understanding and analyzing changes in scenes becomes increasingly important. However, there are still knowledge gaps in understanding scene changes. Changes in a scene may involve the appearance, disappearance, movement or deformation of objects, as well as changes in the illumination, background, and so on of the scene. For example, given a pair of images, the task is to determine the locations of the changes between them. A primary difficulty is guarding against extraneous "noise" or "disturbance" variables. In a fixed-camera surveillance application, for instance, the "disturbance" parameters may be changes in scene illumination or weather conditions (e.g., rain, fog), all of which hinder the application of common methods. Furthermore, the two images may come from entirely different shooting angles, so there may be geometric variations between them in addition to photometric variations. In such cases, the effect of detecting changes in the image pair is not ideal, which shows that the current understanding and analysis of these changes is still not deep enough. Therefore, how to provide a method that can accurately and reliably detect changes in an image pair, unaffected by the external scene environment and robust to geometric changes, is a challenging problem.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a scene change detection method, device, medium and equipment based on deep learning.
The technical scheme for solving the technical problems is as follows: in one aspect, the invention provides a scene change detection method based on deep learning, which comprises the following steps:
a) Preprocessing the two original images to obtain two preprocessed images $L, R \in \mathbb{R}^{3\times h\times w}$, where L and R denote the two images, $\mathbb{R}$ indicates that the matrix elements are real numbers, $h$ is the image height, $w$ is the image width, and 3 is the number of image channels; that is, image L is represented by a real matrix of shape $3\times h\times w$, and likewise image R;
b) Constructing a homography-based alignment module and inputting the two preprocessed images L and R into the module to obtain the corresponding aligned images $L'$ and $R'$, where $L'$ is image L aligned to the coordinate system of image R and $R'$ is image R aligned to the coordinate system of image L, so that the spatial positions of $L'$ and R are consistent and the spatial positions of $R'$ and L are consistent;
c) Constructing a preliminary change detection network composed of a feature extraction module and a change extraction module. Based on the spatial consistency of the images, the two aligned images $L'$ and $R'$ are channel-combined with the corresponding preprocessed images R and L: L and $R'$ are combined along the channel dimension to obtain a 6-channel image $LR' \in \mathbb{R}^{6\times h\times w}$, where h and w are the image height and width; similarly, R and $L'$ are combined along the channel dimension to obtain a 6-channel image $RL'$ of the same size. The combined images $LR'$ and $RL'$ are respectively input into the corresponding preliminary change detection networks to obtain the change information $D_L$ and $D_R$ of images L and R, where $LR'$ is the combination of the preprocessed image L with the aligned image $R'$, $RL'$ is the combination of the preprocessed image R with the aligned image $L'$, $D_L$ is the change information of image L, and $D_R$ is the change information of image R;
d) Constructing a positioning network composed of a feature fusion module and a frame detection module, inputting the change information $D_L$ and $D_R$ obtained by the preliminary change detection network into the corresponding positioning networks, which then output the bounding boxes of the changed regions of the two images L and R;
e) Training a positioning network.
On the basis of the above deep learning-based scene change detection method, step b) comprises the following steps:
b-1) The homography-based alignment module consists of image feature point matching (feature point detection, feature point description and feature point matching) and image alignment (homography transformation matrix computation and image registration);
b-2) Inputting the preprocessed images L and R into the feature point matching stage of the alignment module, obtaining the feature points of each image, matching the feature points of the two images, and outputting the successfully matched feature points $KP_L$ and $KP_R$, where $KP_L$ are points with distinctive local structure in image L and $KP_R$ are points with distinctive local structure in image R;
b-3) Inputting the matched feature points $KP_L$ and $KP_R$ into the homography transformation matrix computation to obtain the transformation matrices $H_{L\to R}$ and $H_{R\to L}$, where $H_{L\to R}$ is the transformation matrix aligning image L to image R and $H_{R\to L}$ is the transformation matrix aligning image R to image L; the computed matrices are then applied to the corresponding images to realize image alignment, outputting the aligned images $L'$ and $R'$, where $L'$ is image L aligned to the scene of image R and $R'$ is image R aligned to the scene of image L.
On the basis of the above deep learning-based scene change detection method, step c) comprises the following steps:
c-1) The preliminary change detection network is composed of image channel concatenation, a U-Net encoder and a change information extraction module, where the change information extraction module consists of a subtraction operation and a cross-attention mechanism;
c-2) The aligned images $R'$ and $L'$ are channel-combined with the corresponding images L and R to obtain the image pairs $LR'$ and $RL'$; the two combined image pairs are respectively input into the U-Net encoder, which outputs two groups of five intermediate feature maps at different scales, $F^i_{LR'}$ and $F^i_{RL'}$, $i \in \{1,2,3,4,5\}$;
c-3) The channels of the two generated feature maps are split in half: $F^i_{LR'}$ is split into $f^i_L$ and $f^i_{R'}$, where $f^i_L$ denotes the feature map corresponding to image L and $f^i_{R'}$ denotes the feature map corresponding to image $R'$; $F^i_{RL'}$ is split into $f^i_R$ and $f^i_{L'}$, where $f^i_R$ denotes the feature map corresponding to image R and $f^i_{L'}$ denotes the feature map corresponding to image $L'$;
c-4) Using the change extraction module to process $f^i_L, f^i_{R'}$ and $f^i_R, f^i_{L'}$ to obtain the change information $D_L$ and $D_R$ corresponding to images L and R. Taking the change information of image L at the first level as an example: in the first-level intermediate feature maps, the subtraction $f^1_L - f^1_{R'}$ is performed, and the resulting feature map is fused with $f^1_L$ to obtain the change information of image L in the first-level intermediate feature map, $D^1_L = \mathrm{Fuse}(f^1_L - f^1_{R'},\, f^1_L)$, where $D^1_L$ is the change information of image L at the first level, $f^1_L$ is the first-level intermediate feature map of image L, $f^1_{R'}$ is the first-level intermediate feature map of image $R'$, and $\mathrm{Fuse}$ is the fusion mechanism. In the same way, $D^1_R = \mathrm{Fuse}(f^1_R - f^1_{L'},\, f^1_R)$ is obtained, where $D^1_R$ is the change information of image R at the first level, $f^1_R$ is the first-level intermediate feature map of image R, and $f^1_{L'}$ is the first-level intermediate feature map of image $L'$. In the second through fifth levels of the two groups of intermediate feature maps, $i \in \{2,3,4,5\}$, taking the change information of image L as an example: first the subtraction $S^i_L = f^i_L - f^i_{R'}$ is performed; then cross-attention is applied to $f^i_L$ and $f^i_{R'}$ to obtain $C^i_L = \mathrm{CA}(f^i_L, f^i_{R'})$; the sum $S^i_L + C^i_L$ is then fused with $f^i_L$ to obtain the change information of image L at level $i$, $D^i_L = \mathrm{Fuse}(S^i_L + C^i_L,\, f^i_L)$, where $\mathrm{Fuse}$ is the fusion mechanism and $\mathrm{CA}$ is the cross-attention mechanism. Similarly, the change information $D^i_R$ of image R at the second through fifth levels is obtained. The change information of image L is collectively denoted $D_L = \{D^1_L, \dots, D^5_L\}$, and the change information of image R is collectively denoted $D_R = \{D^1_R, \dots, D^5_R\}$.
On the basis of the above deep learning-based scene change detection method, step d) comprises the following steps:
d-1) The change information $D_L$ and $D_R$ generated by the preliminary change detection network is upsampled and decoded by a U-Net decoder, finally generating feature maps $M_L$ and $M_R$ at the original image resolution;
d-2) The feature maps $M_L$ and $M_R$ are input into the target bounding box prediction component, which outputs the changed regions of the two images and generates a bounding box around each region.
On the basis of the above deep learning-based scene change detection method, step e) comprises the following steps:
e-1) The preprocessed image pairs are divided into a training set, a validation set and a test set at a ratio of 20:1:2;
e-2) The network is trained with a keypoint loss and an offset loss; the overall objective is optimized with Adam (learning rate 0.00001, weight decay 0.0005) under a DDP training strategy with a batch size of 16; training runs for 200 iterations, and validation on the validation set is performed every epoch.
In another aspect, an embodiment of the present invention provides a scene change detection apparatus based on deep learning, including:
a homography-based alignment module, which processes the two preprocessed images L and R to obtain the corresponding aligned images $L'$ and $R'$; a preliminary change detection network module, comprising a feature extraction module and a change extraction module, into which the channel-combined preprocessed and aligned images are input to obtain the change information of the two images; and a positioning network module, comprising a feature fusion module and a frame detection module, which obtains the bounding boxes of the changed regions of the two images L and R.
In yet another aspect, embodiments of the present invention provide a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform steps in the scene change detection method.
In a final aspect, an embodiment of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the scene change detection method when executing the program.
The effects described in this summary are merely effects of the embodiments, not all effects of the invention; the above technical solution has the following advantages or beneficial effects:
A network combining a homography-based image registration structure with a cross-attention mechanism structure is adopted. Aligning the two images in the same coordinate system through homography compensates for the difficulty of directly capturing change features between the two images, while the cross-attention structure captures the correspondence between the two images and compensates for the information lost in unaligned images. The network adopts a twin (Siamese) neural network architecture so that the two images can be processed simultaneously, and the feature fusion module strengthens the fusion of change features, so that changed regions are better identified.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a network configuration diagram of scene change detection according to the present invention.
FIG. 3 is a diagram of the change extraction operation on the first-level intermediate feature maps according to the present invention.
Fig. 4 is a diagram of the change extraction operation on the second- through fifth-level intermediate feature maps according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
A scene change detection method based on deep learning comprises the following steps:
a) Preprocessing the two original images to obtain two preprocessed images $L, R \in \mathbb{R}^{3\times h\times w}$, where L and R denote the two images, $\mathbb{R}$ indicates that the matrix elements are real numbers, $h$ is the image height, $w$ is the image width, and 3 is the number of image channels; that is, image L is represented by a real matrix of shape $3\times h\times w$, and likewise image R;
b) Constructing a homography-based alignment module and inputting the two preprocessed images L and R into the module to obtain the corresponding aligned images $L'$ and $R'$, where $L'$ is image L aligned to the coordinate system of image R and $R'$ is image R aligned to the coordinate system of image L, so that the spatial positions of $L'$ and R are consistent and the spatial positions of $R'$ and L are consistent;
c) Constructing a preliminary change detection network composed of a feature extraction module and a change extraction module. Based on the spatial consistency of the images, the two aligned images $L'$ and $R'$ are channel-combined with the corresponding preprocessed images R and L: L and $R'$ are combined along the channel dimension to obtain a 6-channel image $LR' \in \mathbb{R}^{6\times h\times w}$, where h and w are the image height and width; similarly, R and $L'$ are combined along the channel dimension to obtain a 6-channel image $RL'$ of the same size. The combined images $LR'$ and $RL'$ are respectively input into the corresponding preliminary change detection networks to obtain the change information $D_L$ and $D_R$ of images L and R, where $LR'$ is the combination of the preprocessed image L with the aligned image $R'$, $RL'$ is the combination of the preprocessed image R with the aligned image $L'$, $D_L$ is the change information of image L, and $D_R$ is the change information of image R;
d) Constructing a positioning network composed of a feature fusion module and a frame detection module, inputting the change information $D_L$ and $D_R$ obtained by the preliminary change detection network into the corresponding positioning networks, which then output the bounding box of the changed region of each of the two images L and R, each changed region being localized by a positioning box;
e) Training a positioning network.
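For concreteness, the following sketches illustrate one possible realization of the above steps in Python; they are illustrative assumptions, not the prescribed implementation of the invention. The preprocessing of step a) might look as follows (the target resolution and the resize/normalization choices are assumptions; the method only requires a common $3\times h\times w$ representation):

```python
# Illustrative sketch of step a): load two images and convert each to a
# 3×h×w real-valued tensor. The resize dimensions and ToTensor scaling
# are assumptions, not specified by the patent.
from PIL import Image
from torchvision import transforms

def preprocess_pair(path_l: str, path_r: str, h: int = 256, w: int = 256):
    tf = transforms.Compose([
        transforms.Resize((h, w)),
        transforms.ToTensor(),  # float tensor in [0, 1], shape 3×h×w
    ])
    L = tf(Image.open(path_l).convert("RGB"))
    R = tf(Image.open(path_r).convert("RGB"))
    return L, R  # L, R each of shape 3×h×w
```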
In this embodiment, step b) includes the steps of:
b-1) The homography-based alignment module consists of image feature point matching (feature point detection, feature point description and feature point matching) and image alignment (homography transformation matrix computation and image registration);
b-2) Inputting the preprocessed images L and R into the feature point matching stage of the alignment module, obtaining the feature points of each image, matching the feature points of the two images, and outputting the successfully matched feature points $KP_L$ and $KP_R$, where $KP_L$ are points with distinctive local structure in image L and $KP_R$ are points with distinctive local structure in image R;
b-3) Inputting the matched feature points $KP_L$ and $KP_R$ into the homography transformation matrix computation to obtain the transformation matrices $H_{L\to R}$ and $H_{R\to L}$, where $H_{L\to R}$ is the transformation matrix aligning image L to image R and $H_{R\to L}$ is the transformation matrix aligning image R to image L; the computed matrices are then applied to the corresponding images to realize image alignment, outputting the aligned images $L'$ and $R'$, where $L'$ is image L aligned to the scene of image R and $R'$ is image R aligned to the scene of image L.
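A minimal sketch of steps b-1) through b-3), assuming ORB feature points, brute-force Hamming matching and RANSAC homography estimation via OpenCV (the patent does not fix a particular detector or estimator):

```python
# Sketch of the homography-based alignment module (steps b-1 to b-3).
import cv2
import numpy as np

def align_pair(img_l: np.ndarray, img_r: np.ndarray):
    # b-1)/b-2): feature point detection, description and matching.
    orb = cv2.ORB_create(2000)
    kp_l, des_l = orb.detectAndCompute(img_l, None)
    kp_r, des_r = orb.detectAndCompute(img_r, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_l, des_r)
    pts_l = np.float32([kp_l[m.queryIdx].pt for m in matches])  # KP_L
    pts_r = np.float32([kp_r[m.trainIdx].pt for m in matches])  # KP_R
    # b-3): estimate H_{L->R} and H_{R->L}, then warp each image.
    H_lr, _ = cv2.findHomography(pts_l, pts_r, cv2.RANSAC, 5.0)
    H_rl, _ = cv2.findHomography(pts_r, pts_l, cv2.RANSAC, 5.0)
    h_r, w_r = img_r.shape[:2]
    h_l, w_l = img_l.shape[:2]
    L_prime = cv2.warpPerspective(img_l, H_lr, (w_r, h_r))  # L aligned to R
    R_prime = cv2.warpPerspective(img_r, H_rl, (w_l, h_l))  # R aligned to L
    return L_prime, R_prime
```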
In this embodiment, step c) includes the steps of:
c-1) The preliminary change detection network is composed of image channel concatenation, a U-Net encoder and a change information extraction module, where the change information extraction module consists of a subtraction operation and a cross-attention mechanism.
c-2) The aligned images $R'$ and $L'$ are channel-combined with the corresponding images L and R to obtain the image pairs $LR'$ and $RL'$; the two combined image pairs are respectively input into the U-Net encoder, which outputs two groups of five intermediate feature maps at different scales, $F^i_{LR'}$ and $F^i_{RL'}$, $i \in \{1,2,3,4,5\}$;
c-3) The channels of the two generated feature maps are split in half: $F^i_{LR'}$ is split into $f^i_L$ and $f^i_{R'}$, where $f^i_L$ denotes the feature map corresponding to image L and $f^i_{R'}$ denotes the feature map corresponding to image $R'$; $F^i_{RL'}$ is split into $f^i_R$ and $f^i_{L'}$, where $f^i_R$ denotes the feature map corresponding to image R and $f^i_{L'}$ denotes the feature map corresponding to image $L'$;
c-4) Using the change extraction module to process $f^i_L, f^i_{R'}$ and $f^i_R, f^i_{L'}$ to obtain the change information $D_L$ and $D_R$ corresponding to images L and R. Taking the change information of image L at the first level as an example: in the first-level intermediate feature maps, the subtraction $f^1_L - f^1_{R'}$ is performed, and the resulting feature map is fused with $f^1_L$ to obtain the change information of image L in the first-level intermediate feature map, $D^1_L = \mathrm{Fuse}(f^1_L - f^1_{R'},\, f^1_L)$, where $D^1_L$ is the change information of image L at the first level, $f^1_L$ is the first-level intermediate feature map of image L, $f^1_{R'}$ is the first-level intermediate feature map of image $R'$, and $\mathrm{Fuse}$ is the fusion mechanism. In the same way, $D^1_R = \mathrm{Fuse}(f^1_R - f^1_{L'},\, f^1_R)$ is obtained, where $D^1_R$ is the change information of image R at the first level, $f^1_R$ is the first-level intermediate feature map of image R, and $f^1_{L'}$ is the first-level intermediate feature map of image $L'$. In the second through fifth levels of the two groups of intermediate feature maps, $i \in \{2,3,4,5\}$, taking the change information of image L as an example: first the subtraction $S^i_L = f^i_L - f^i_{R'}$ is performed; then cross-attention is applied to $f^i_L$ and $f^i_{R'}$ to obtain $C^i_L = \mathrm{CA}(f^i_L, f^i_{R'})$; the sum $S^i_L + C^i_L$ is then fused with $f^i_L$ to obtain the change information of image L at level $i$, $D^i_L = \mathrm{Fuse}(S^i_L + C^i_L,\, f^i_L)$, where $\mathrm{Fuse}$ is the fusion mechanism and $\mathrm{CA}$ is the cross-attention mechanism. Similarly, the change information $D^i_R$ of image R at the second through fifth levels is obtained. The change information of image L is collectively denoted $D_L = \{D^1_L, \dots, D^5_L\}$, and the change information of image R is collectively denoted $D_R = \{D^1_R, \dots, D^5_R\}$.
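The channel combination of step c) and the level-wise change extraction of c-4) could be sketched as follows; the 1×1-convolution fusion and the multi-head cross-attention are assumed concrete realizations of the "fusion" and "cross-attention" mechanisms named above, not the patented designs:

```python
# Sketch of step c): 6-channel inputs and one level of change extraction.
import torch
import torch.nn as nn

def make_inputs(L, R, L_prime, R_prime):
    # Channel combination per step c): dim 0 is the channel axis (CHW).
    LR = torch.cat([L, R_prime], dim=0)  # 6×h×w image LR'
    RL = torch.cat([R, L_prime], dim=0)  # 6×h×w image RL'
    return LR, RL

class ChangeExtraction(nn.Module):
    """One level of c-4): D = Fuse(subtract [+ cross-attention], f_L)."""
    def __init__(self, channels: int, use_attention: bool = True):
        super().__init__()
        self.use_attention = use_attention  # False for the first level
        if use_attention:
            self.attn = nn.MultiheadAttention(channels, num_heads=4,
                                              batch_first=True)
        # Fusion assumed as channel concatenation followed by a 1×1 conv.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_l, f_r_aligned):
        s = f_l - f_r_aligned                   # subtraction branch S
        if self.use_attention:                  # levels 2-5 only
            b, c, h, w = f_l.shape
            q = f_l.flatten(2).transpose(1, 2)  # B×(h·w)×C queries from f_L
            kv = f_r_aligned.flatten(2).transpose(1, 2)
            ca, _ = self.attn(q, kv, kv)        # cross-attention CA
            s = s + ca.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([s, f_l], dim=1))  # change info D
```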
In this embodiment, step d) includes the steps of:
d-1) The change information $D_L$ and $D_R$ generated by the preliminary change detection network is upsampled and decoded by a U-Net decoder, finally generating feature maps $M_L$ and $M_R$ at the original image resolution;
d-2) The feature maps $M_L$ and $M_R$ are input into the target bounding box prediction component, which outputs the changed regions of the two images and generates a bounding box around each region.
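Step d) can be read as a U-Net-style decoder followed by a detection head. The sketch below assumes a CenterNet-style head (center heatmap, offset and box size), which is consistent with the keypoint and offset losses of step e) but is an assumption rather than the prescribed design:

```python
# Sketch of step d): decoded full-resolution features -> bounding boxes.
import torch.nn as nn

class BoxHead(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.heatmap = nn.Conv2d(channels, 1, kernel_size=1)  # change centers
        self.offset = nn.Conv2d(channels, 2, kernel_size=1)   # sub-pixel offset
        self.size = nn.Conv2d(channels, 2, kernel_size=1)     # box width/height

    def forward(self, feat):
        # feat: B×C×H×W decoder output at the original image resolution.
        return (self.heatmap(feat).sigmoid(),
                self.offset(feat),
                self.size(feat))
```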
In this embodiment, step e) includes the steps of:
e-1) The preprocessed image pairs are divided into a training set, a validation set and a test set at a ratio of 20:1:2;
e-2) The network is trained with a keypoint loss and an offset loss; the overall objective is optimized with Adam (learning rate 0.00001, weight decay 0.0005) under a DDP training strategy with a batch size of 16; training runs for 200 iterations, and validation on the validation set is performed every epoch.
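A training-loop sketch matching the settings of e-2); the loss functions and data loaders are injected placeholders, and the DDP wrapping assumes an already-initialized process group:

```python
# Sketch of step e): Adam, lr 1e-5, weight decay 5e-4, batch size 16,
# 200 epochs, DDP, validation every epoch. Losses are placeholders for
# the keypoint and offset losses named in e-2).
import torch

def train(model, train_loader, val_loader, keypoint_loss, offset_loss,
          validate):
    model = torch.nn.parallel.DistributedDataParallel(model)
    opt = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=5e-4)
    for epoch in range(200):
        model.train()
        for images, targets in train_loader:  # loader built with batch_size=16
            heatmap, offset, size = model(images)
            loss = keypoint_loss(heatmap, targets) + offset_loss(offset, targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
        validate(model, val_loader)  # validation every epoch, per e-2)
```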
To verify the effectiveness of the present invention, evaluations were performed on the COCO-INPAINTED, Synthtext-Change, VIRAT-STD and Kubric-Change datasets. COCO-INPAINTED is a change-detection test set that we collated from the COCO test subset. In this embodiment, the test set is divided into three categories according to the size of the changed object (small, medium and large), with "all" denoting the union of the three categories; we collated 1655 image pairs for small objects, 1747 pairs for medium objects and 1006 pairs for large objects, for a total of 4408 pairs in the COCO-INPAINTED test set. The Synthtext-Change dataset adds random text to "background" images by synthesis techniques, generating 5000 image pairs in a geometrically consistent manner. To detect changes in outdoor scenes, 1000 image pairs were randomly selected from the VIRAT-STD dataset; since STD does not provide change ground truth, an automated tool was used to obtain it. Because the camera is static, there is a single, identical geometric transformation between the images, but the photometric conditions may change with time of day, weather conditions, and so on. The Kubric-Change dataset consists of 1605 changing pairs of realistic images, where each scene consists of a set of randomly selected 3D objects lying on a randomly textured ground plane; for a given scene, objects are iteratively removed and "before" and "after" image pairs are captured.
For quantitative evaluation, following prior related work, we compute the average precision (AP) as the evaluation metric based on the predicted bounding boxes and the ground-truth bounding boxes.
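For reference, a box-level AP under an IoU threshold can be computed along the following lines (an illustrative sketch; the exact protocol follows the prior related work, not this code):

```python
# Illustrative AP computation from predicted and ground-truth boxes.
import numpy as np

def iou(a, b):
    # Boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def average_precision(preds, gts, thr=0.5):
    # preds: list of (score, box); gts: list of ground-truth boxes.
    preds = sorted(preds, key=lambda p: -p[0])  # descending confidence
    matched, hits = set(), []
    for _, box in preds:
        best, best_iou = None, 0.0
        for j, g in enumerate(gts):
            v = iou(box, g)
            if j not in matched and v >= thr and v > best_iou:
                best, best_iou = j, v
        if best is not None:
            matched.add(best)
        hits.append(1.0 if best is not None else 0.0)
    tp = np.cumsum(hits)
    recall = tp / max(len(gts), 1)
    precision = tp / np.arange(1, len(tp) + 1)
    return float(np.trapz(precision, recall))  # area under the P-R curve
```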
A comparison between classical image change detection algorithms and the present invention is shown in the following table. The experiments use 200 epochs, the Adam optimizer with a default learning rate of 0.00001, and a weight decay of 0.0005; to enhance the model's ability to fit the data, random affine transformation, contrast enhancement, illumination enhancement and saturation enhancement are applied as data augmentation.
Table 1. Comparison of the currently optimal change detection model with the present invention on different datasets
The CYWS model is the optimal change detection model in the current research field. From Table 1 it can be seen that, compared with the CYWS model, the proposed model achieves excellent performance on the COCO-INPAINTED and VIRAT-STD datasets, while its performance on the other datasets remains stable.
Example 2
The embodiment of the invention provides a scene change detection device based on deep learning, comprising: a homography-based alignment module, which processes the two preprocessed images L and R to obtain the corresponding aligned images $L'$ and $R'$; a preliminary change detection network module, comprising a feature extraction module and a change extraction module, into which the channel-combined preprocessed and aligned images are input to obtain the change information of the two images; and a positioning network module, comprising a feature fusion module and a frame detection module, which obtains the bounding boxes of the changed regions of the two images L and R.
Example 3
Embodiments of the present invention provide a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the scene change detection method. The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
Example 4
An embodiment of the application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the scene change detection method when executing the program. The computer device in the embodiment of the application may be a terminal or a server, and the terminal may be a terminal device such as a smartphone, a tablet computer, a notebook computer, a touch-screen device, a game console, a personal computer (PC) or a personal digital assistant (PDA).
While the foregoing embodiments of the present invention have been described with reference to the drawings, they do not limit the scope of the invention; various modifications or variations that can be made by those skilled in the art, without inventive effort, on the basis of the technical solutions of the present invention remain within its scope.

Claims (7)

1. The scene change detection method based on deep learning is characterized by comprising the following steps of:
a) preprocessing the two original images to obtain two preprocessed images $L, R \in \mathbb{R}^{3\times h\times w}$, where L and R are the two images, $\mathbb{R}$ indicates that the matrix elements are real numbers, $h$ is the image height, $w$ is the image width, and 3 is the number of image channels, so that image L is represented by a real matrix of shape $3\times h\times w$ and image R is represented by a real matrix of shape $3\times h\times w$;
b) constructing a homography-based alignment module and inputting the two preprocessed images L and R into the module to obtain the corresponding aligned images $L'$ and $R'$, where $L'$ is image L aligned to the coordinate system of image R and $R'$ is image R aligned to the coordinate system of image L, so that the spatial positions of $L'$ and R are consistent and the spatial positions of $R'$ and L are consistent;
c) constructing a preliminary change detection network composed of a feature extraction module and a change extraction module, and channel-combining the two aligned images $L'$ and $R'$ with the corresponding preprocessed images R and L: L and $R'$ are combined along the channel dimension to obtain a 6-channel image $LR' \in \mathbb{R}^{6\times h\times w}$, where h and w are the image height and width; R and $L'$ are combined along the channel dimension to obtain a 6-channel image $RL' \in \mathbb{R}^{6\times h\times w}$; the combined images $LR'$ and $RL'$ are respectively input into the corresponding preliminary change detection networks to obtain the change information $D_L$ and $D_R$ of images L and R, where $LR'$ is the combination of the preprocessed image L with the aligned image $R'$, $RL'$ is the combination of the preprocessed image R with the aligned image $L'$, $D_L$ is the change information of image L, and $D_R$ is the change information of image R;
wherein step c) comprises the following steps:
c-1) the preliminary change detection network is composed of image channel concatenation, a U-Net encoder and a change information extraction module, where the change information extraction module consists of a subtraction operation and a cross-attention mechanism;
c-2) the aligned images $R'$ and $L'$ are channel-combined with the corresponding images L and R to obtain the image pairs $LR'$ and $RL'$; the two combined image pairs are respectively input into the U-Net encoder, which outputs two groups of five intermediate feature maps at different scales, $F^i_{LR'}$ and $F^i_{RL'}$, $i \in \{1,2,3,4,5\}$;
c-3) the channels of the two generated feature maps are split in half: $F^i_{LR'}$ is split into $f^i_L$ and $f^i_{R'}$, where $f^i_L$ denotes the feature map corresponding to image L and $f^i_{R'}$ denotes the feature map corresponding to image $R'$; $F^i_{RL'}$ is split into $f^i_R$ and $f^i_{L'}$, where $f^i_R$ denotes the feature map corresponding to image R and $f^i_{L'}$ denotes the feature map corresponding to image $L'$;
c-4) using the change extraction module to process $f^i_L, f^i_{R'}$ and $f^i_R, f^i_{L'}$ to obtain the change information $D_L$ and $D_R$ corresponding to images L and R;
in the first-level intermediate feature maps, to acquire the change information of image L at the first level, the subtraction $f^1_L - f^1_{R'}$ is performed and the resulting feature map is fused with $f^1_L$,
thereby obtaining the change information of image L in the first-level intermediate feature map, $D^1_L = \mathrm{Fuse}(f^1_L - f^1_{R'},\, f^1_L)$, where $D^1_L$ is the change information of image L at the first level, $f^1_L$ is the first-level intermediate feature map of image L, $f^1_{R'}$ is the first-level intermediate feature map of image $R'$, and $\mathrm{Fuse}$ is the fusion mechanism;
by the same procedure as the acquisition of the change information of image L at the first level, $D^1_R = \mathrm{Fuse}(f^1_R - f^1_{L'},\, f^1_R)$ is obtained, where $D^1_R$ is the change information of image R at the first level, $f^1_R$ is the first-level intermediate feature map of image R, and $f^1_{L'}$ is the first-level intermediate feature map of image $L'$;
in the second through fifth levels of the intermediate feature maps, $i \in \{2,3,4,5\}$, the change information of images L and R is acquired as follows:
first the subtraction $S^i_L = f^i_L - f^i_{R'}$ is performed; then cross-attention is applied to $f^i_L$ and $f^i_{R'}$ to obtain $C^i_L = \mathrm{CA}(f^i_L, f^i_{R'})$; the sum $S^i_L + C^i_L$ is fused with $f^i_L$ to obtain the change information of image L at level $i$, $D^i_L = \mathrm{Fuse}(S^i_L + C^i_L,\, f^i_L)$, where $D^i_L$ is the change information of image L in the second- through fifth-level intermediate feature maps, $f^i_L$ is the intermediate feature map of image L at the second through fifth levels, $f^i_{R'}$ is the intermediate feature map of image $R'$ at the second through fifth levels, $\mathrm{Fuse}$ is the fusion mechanism, and $\mathrm{CA}$ is the cross-attention mechanism;
the change information $D^i_R$ of image R at the second through fifth levels is obtained by the same steps as for image L; the change information of image L is collectively denoted $D_L = \{D^1_L, \dots, D^5_L\}$ and the change information of image R is collectively denoted $D_R = \{D^1_R, \dots, D^5_R\}$;
d) constructing a positioning network composed of a feature fusion module and a frame detection module, inputting the change information $D_L$ and $D_R$ obtained by the preliminary change detection network into the corresponding positioning networks, which then output the bounding boxes of the changed regions of the two images L and R;
e) Training a positioning network.
2. The scene change detection method based on deep learning according to claim 1, wherein the step b) includes the steps of:
b-1) the homography-based alignment module consists of image feature point matching (feature point detection, feature point description and feature point matching) and image alignment (homography transformation matrix computation and image registration);
b-2) inputting the preprocessed images L and R into the feature point matching stage of the alignment module, obtaining the feature points of each image, matching the feature points of the two images, and outputting the successfully matched feature points $KP_L$ and $KP_R$, where $KP_L$ are points with distinctive local structure in image L and $KP_R$ are points with distinctive local structure in image R;
b-3) inputting the matched feature points $KP_L$ and $KP_R$ into the homography transformation matrix computation to obtain the transformation matrices $H_{L\to R}$ and $H_{R\to L}$, where $H_{L\to R}$ is the transformation matrix aligning image L to image R and $H_{R\to L}$ is the transformation matrix aligning image R to image L; the computed matrices are then applied to the corresponding images to realize image alignment, outputting the aligned images $L'$ and $R'$, where $L'$ is image L aligned to the scene of image R and $R'$ is image R aligned to the scene of image L.
3. The scene change detection method based on deep learning according to claim 1, wherein the step d) includes the steps of:
d-1) the change information $D_L$ and $D_R$ generated by the preliminary change detection network is upsampled and decoded by a U-Net decoder, finally generating feature maps $M_L$ and $M_R$ at the original image resolution;
d-2) the feature maps $M_L$ and $M_R$ are input into the target bounding box prediction component, which outputs the changed regions of the two images and generates a bounding box around each region.
4. The scene change detection method based on deep learning according to claim 1, wherein the step e) includes the steps of:
e-1) the preprocessed image pairs are divided into a training set, a validation set and a test set at a ratio of 20:1:2;
e-2) the network is trained with a keypoint loss and an offset loss; the overall objective is optimized with Adam (learning rate 0.00001, weight decay 0.0005) under a DDP training strategy with a batch size of 16; training runs for 200 iterations, and validation on the validation set is performed every epoch.
5. A scene change detection device based on deep learning, configured to perform the steps of the scene change detection method according to any one of claims 1 to 4, comprising:
a homography-based alignment module, which processes the two preprocessed images L and R to obtain the corresponding aligned images $L'$ and $R'$;
a preliminary change detection network module, comprising a feature extraction module and a change extraction module, into which the channel-combined preprocessed and aligned images are input to obtain the change information of the two images; and
a positioning network module, comprising a feature fusion module and a frame detection module, which obtains the bounding boxes of the changed regions of the two images L and R.
6. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the scene change detection method of any of claims 1 to 4.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the scene change detection method according to any of claims 1 to 4 when the program is executed.
CN202410487285.XA 2024-04-23 Scene change detection method, device, medium and equipment based on deep learning Active CN118097566B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410487285.XA CN118097566B (en) 2024-04-23 Scene change detection method, device, medium and equipment based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410487285.XA CN118097566B (en) 2024-04-23 Scene change detection method, device, medium and equipment based on deep learning

Publications (2)

Publication Number Publication Date
CN118097566A CN118097566A (en) 2024-05-28
CN118097566B (en) 2024-06-28


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160291A (en) * 2021-04-12 2021-07-23 华雁智科(杭州)信息技术有限公司 Change detection method based on image registration
US11222217B1 (en) * 2020-08-14 2022-01-11 Tsinghua University Detection method using fusion network based on attention mechanism, and terminal device


Similar Documents

Publication Publication Date Title
CN108230278B (en) Image raindrop removing method based on generation countermeasure network
CN110992238B (en) Digital image tampering blind detection method based on dual-channel network
CN107481279A (en) A kind of monocular video depth map computational methods
CN103971399A (en) Street view image transition method and device
CN107767358B (en) Method and device for determining ambiguity of object in image
CN110825900A (en) Training method of feature reconstruction layer, reconstruction method of image features and related device
CN112085835A (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
Swaminathan et al. Component forensics
Liu et al. Overview of image inpainting and forensic technology
CN112329771A (en) Building material sample identification method based on deep learning
CN110163095B (en) Loop detection method, loop detection device and terminal equipment
CN114677722A (en) Multi-supervision human face in-vivo detection method integrating multi-scale features
Hassner et al. SIFTing through scales
Matusiak et al. Unbiased evaluation of keypoint detectors with respect to rotation invariance
CN112149528A (en) Panorama target detection method, system, medium and equipment
Ma et al. Light field image quality assessment using natural scene statistics and texture degradation
CN118097566B (en) Scene change detection method, device, medium and equipment based on deep learning
CN116188956A (en) Method and related equipment for detecting deep fake face image
CN116128919A (en) Multi-temporal image abnormal target detection method and system based on polar constraint
CN118097566A (en) Scene change detection method, device, medium and equipment based on deep learning
CN114863132A (en) Method, system, equipment and storage medium for modeling and capturing image spatial domain information
Nasiri et al. Exposing forgeries in soccer images using geometric clues
RU2538319C1 (en) Device of searching image duplicates
CN114612798B (en) Satellite image tampering detection method based on Flow model
Borkowski 2d to 3d conversion with direct geometrical search and approximation spaces

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant