CN115546652A - Multi-time-state target detection model and construction method, device and application thereof - Google Patents


Info

Publication number
CN115546652A
Authority
CN
China
Prior art keywords: temporal, difference, coding information, picture, information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211504037.9A
Other languages
Chinese (zh)
Other versions
CN115546652B (en)
Inventor
李圣权
黄乾玮
王国梁
毛云青
韩致远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202211504037.9A
Publication of CN115546652A
Application granted
Publication of CN115546652B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/176 Urban or other man-made structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

This scheme provides a multi-temporal target detection model and a construction method, device and application thereof. The method comprises: acquiring a first temporal picture and a second temporal picture as training samples; constructing a multi-temporal target detection model; sending the first temporal picture into a first encoder and the second temporal picture into a second encoder to obtain multi-level coding information; decoding the coding information with a panorama decoder and a difference decoder to obtain a panoramic classification result graph and a difference classification result graph; and obtaining the target to be detected according to the panoramic classification result graph and the difference classification result graph. The scheme uses semi-supervised learning and adds the panorama decoder during training, so that differences between pictures of different temporal states are judged better and illegal buildings are detected.

Description

Multi-time-state target detection model and construction method, device and application thereof
Technical Field
The application relates to the field of computer algorithms and machine learning, in particular to a multi-temporal target detection model and a construction method, a device and application thereof.
Background
In recent years, as the scale of urban construction keeps expanding, urban functions improve and new communities increase, more and more illegal buildings have appeared alongside this development; in particular, some old communities, villages, towns and factories build indiscriminately for their own benefit. Such indiscriminate construction not only affects the appearance of the city but also endangers public safety and the builders' own safety. A method for discovering illegal buildings in time is therefore urgently needed to better guarantee social and personal safety.
With the continuous growth of machine learning technology, illegal buildings can be detected intelligently. However, samples of illegal-building pictures are scarce, and learning them requires pictures of the same place at multiple temporal phases, so model-based detection of illegal buildings has not performed well. The conventional multi-temporal change detection algorithm is Binary Change Detection (BCD), i.e., traditional matching and similar algorithms; in BCD, the change map distinguishes changed pixels from unchanged pixels with binary labels. BCD can thus be regarded as a two-class technique, with the drawbacks that neither the range nor the type of change can be determined and that semantic information is lacking. The level of intelligence in supervising urban illegal buildings is still low today, and merely distinguishing changed from unchanged is far from sufficient.
In the prior art, a segmentation network based on deep learning is trained on an illegal-building data set in a fully supervised manner, but this overfits easily, generalizes poorly, requires labeling a large number of training samples, and involves a huge number of network parameters during classification. Although semi-supervised techniques can effectively alleviate the sample problem, too few labeled nodes in semi-supervised training degrade learning performance. If illegal buildings are classified with a pure semantic segmentation network, classification according to the training samples is possible, but sample differences cannot be distinguished or learned, leaving a large gap in comprehensively recognizing and understanding changes.
Disclosure of Invention
This scheme provides a multi-temporal target detection model and a construction method, device and application thereof. Aiming at the problem that insufficient training samples currently make intelligent model-based detection inaccurate, the scheme uses semi-supervised learning and a dual-encoder, dual-decoder structure to detect illegal buildings, thereby improving detection accuracy.
In a first aspect, the present application provides a method for constructing a multi-temporal target detection model, including:
the method comprises the steps of obtaining at least one group of temporal pictures of at least one to-be-detected place, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture which are taken from different time points, and marking a to-be-detected target in each group of temporal pictures to obtain a training sample;
constructing a multi-temporal target detection model, wherein the multi-temporal target detection model consists of a first encoder, a second encoder, a panoramic decoder and a difference decoder, the first temporal picture of each group of temporal pictures is sent into the first encoder to obtain multi-level first coding information with the depth from low to high, the second temporal picture is sent into the second encoder to be encoded to obtain multi-level second coding information with the depth from low to high, and the depth of the first coding information and the depth of the second coding information of each level are the same;
sending the first coding information and the second coding information of each level into the panorama decoder for decoding to obtain a final panorama decoding result;
sending the first coding information and the second coding information of each level into the difference decoder; performing a splicing-convolution operation on the first coding information and the second coding information of the last level to obtain a difference decoding result with the same depth as the first coding information of the previous level; performing a difference skip connection on the difference decoding result to obtain difference-decoding skip information, and reducing the depth of the skip information again to obtain a new difference decoding result; traversing the difference skip connection operations to obtain a final difference decoding result, wherein each difference skip connection splices the difference decoding result with the difference information of the same depth; and inputting the final difference decoding result into the prediction head to obtain a difference classification result graph;
and obtaining the target to be detected according to the panoramic classification result graph and the difference classification result graph.
In a second aspect, the present disclosure provides a multi-temporal target detection model, which is constructed by using the method of the first aspect.
In a third aspect, the present disclosure provides a multi-temporal target detection method, including:
acquiring at least one group of temporal pictures of a to-be-detected place, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture which are taken from different time points;
and sending the first temporal image and the second temporal image of each group of temporal images into a multi-temporal target detection model to obtain a target to be detected.
In a fourth aspect, the present application provides a violation building detection method, including:
acquiring a group of first temporal pictures and second temporal pictures of a to-be-detected place;
sending a first temporal picture and a second temporal picture of each group of temporal pictures into a violation building detection model to obtain a panoramic classification result picture and a difference classification result picture, wherein the violation building detection model is obtained by training a multi-temporal target detection model by using the temporal pictures marked with buildings as training samples, a target to be detected corresponding to a panoramic decoder is the panoramic classification result picture, a target to be detected corresponding to a difference decoder is the difference classification result picture, and the acquisition time of the first temporal picture is earlier than that of the second temporal picture;
and combining the classification result of the panoramic classification result graph and the classification result of the difference classification result graph to obtain a change result graph, and judging the illegal building according to the change result graph.
In a fifth aspect, the present application provides a multi-temporal object detection model building apparatus, including:
an acquisition module: the method comprises the steps of obtaining at least one group of temporal pictures of at least one to-be-detected place, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture which are taken from different time points, and marking a to-be-detected target in each group of temporal pictures to obtain a training sample;
and an encoding module: constructing a multi-temporal target detection model, wherein the multi-temporal target detection model consists of a first encoder, a second encoder, a panoramic decoder and a difference decoder, the first temporal picture of each group of temporal pictures is sent into the first encoder to obtain multi-level first coding information with the depth from low to high, the second temporal picture is sent into the second encoder to be encoded to obtain multi-level second coding information with the depth from low to high, and the depth of the first coding information and the depth of the second coding information of each level are the same;
a first decoding module: sending the first coding information and the second coding information of each level into the panoramic decoder for decoding to obtain a final panoramic decoding result;
a second decoding module: sending the first coding information and the second coding information of each level into the difference decoder; performing a splicing-convolution operation on the first coding information and the second coding information of the last level to obtain a difference decoding result with the same depth as the first coding information of the previous level; performing a difference skip connection on the difference decoding result to obtain difference-decoding skip information, and reducing the depth of the skip information again to obtain a new difference decoding result; traversing the difference skip connection operations to obtain a final difference decoding result, wherein each difference skip connection splices the difference decoding result with the difference information of the same depth; and inputting the final difference decoding result into the prediction head to obtain a difference classification result graph;
a detection module: obtaining the target to be detected according to the panoramic classification result graph and the difference classification result graph.
In a sixth aspect, the present application provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the method of constructing a multi-temporal object detection model, the multi-temporal object detection method, or the violation building detection method.
In a seventh aspect, the present application provides a readable storage medium having stored therein a computer program comprising instructions for controlling a process to perform the method of constructing a multi-temporal object detection model, the multi-temporal object detection method, or the violation building detection method.
Compared with the prior art, the technical scheme has the following characteristics and beneficial effects:
the scheme uses semi-supervised learning, solves the problem of poor accuracy caused by insufficient training samples in the prior art, and ensures normal fusion between feature maps by using perturbation processing and jump connection on the basis of the semi-supervised learning; the scheme divides the illegal building detection problem into two sub-problems, namely a panoramic building classification problem and a difference detection classification problem, through multi-task learning, combines the results of two decoders to output a change result graph in a parameter sharing mode, and classifies the results; the loss of the difference decoder in the scheme is composed of two classification losses and similarity losses, and the classification of the images can be realized while inputting a difference classification result graph; the two encoders of the scheme both adopt variable row convolution, pixel offset information is added in the variable row convolution, and compared with the common convolution, the pixel offset can be learned, so that the network learning change is facilitated; the decoder of the scheme adopts a full convolution form to prevent the high-resolution image from generating artifacts.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart of a method for constructing a multi-temporal object detection model according to an embodiment of the present application;
FIG. 2 is a flow chart of encoding when decoding using a panorama decoder in a multi-temporal object detection model according to an embodiment of the present application;
FIG. 3 is a decoding flow diagram of a panorama decoder in a multi-temporal target detection model according to an embodiment of the present application;
FIG. 4 is a flow chart of encoding when decoding using a disparity decoder in a multi-temporal object detection model according to an embodiment of the present application;
FIG. 5 is a decoding flow diagram of a disparity decoder in a multi-temporal target detection model according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus according to an embodiment of the present application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification. In some other embodiments, the methods may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
Example one
The scheme of the application provides a method for constructing a multi-temporal target detection model, and with reference to fig. 1, the method comprises the following steps:
the method comprises the steps of obtaining at least one group of temporal pictures of at least one to-be-detected place, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture which are taken from different time points, and marking a to-be-detected target in each group of temporal pictures to obtain a training sample;
constructing a multi-temporal target detection model, wherein the multi-temporal target detection model consists of a first encoder, a second encoder, a panoramic decoder and a difference decoder, the first temporal picture of each group of temporal pictures is sent into the first encoder to obtain multi-level first coding information with the depth from low to high, the second temporal picture is sent into the second encoder to be encoded to obtain multi-level second coding information with the depth from low to high, and the depth of the first coding information and the depth of the second coding information of each level are the same;
sending the first coding information and the second coding information of each level into the panorama decoder for decoding to obtain a final panorama decoding result;
sending the first coding information and the second coding information of each level into the difference decoder; performing a splicing-convolution operation on the first coding information and the second coding information of the last level to obtain a difference decoding result with the same depth as the first coding information of the previous level; performing a difference skip connection on the difference decoding result to obtain difference-decoding skip information, and reducing the depth of the skip information again to obtain a new difference decoding result; traversing the difference skip connection operations to obtain a final difference decoding result, wherein each difference skip connection splices the difference decoding result with the difference information of the same depth; and inputting the final difference decoding result into the prediction head to obtain a difference classification result graph;
and obtaining the target to be detected according to the panoramic classification result graph and the difference classification result graph.
In some embodiments, the first temporal image and the second temporal image are two images with different shooting times, and the first temporal image and the second temporal image are cut into the same size to obtain the training sample.
Specifically, OpenCV may be used to uniformly crop the urban building orthophotos into 512 × 512 pixel sizes.
In some specific embodiments of this scheme, an unmanned aerial vehicle with photographing equipment is used to obtain the first temporal picture and the second temporal picture, and the UAV's flight route and shooting location are fixed, so as to guarantee that the two pictures differ only in time.
In some embodiments, the application scenario of this scheme is illegal building detection; in this case the target to be detected in the temporal pictures is a building, and the interval between the shooting times of the first and second temporal pictures should not be too short. In this scheme it may be one month, two months or three months, which is not limited here.
In some embodiments, the first encoder and the second encoder are identical in structure, and the first coding information and the second coding information are obtained by respectively coding the first temporal picture and the second temporal picture by using the deformable convolutional layer.
Specifically, compared with a standard convolutional layer, the deformable convolutional layer learns one additional piece of pixel offset information, so it better represents the geometric transformation of a picture and better captures scale changes and complex geometric changes between the first temporal picture and the second temporal picture. The deformable convolutional layer is characterized by the formula:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

where Δp_n represents the pixel offset, R is the size and dilation range of the receptive field, p_0 represents the position of the first coding information, p_n enumerates the positions within the receptive field range R, and w represents the weight.
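As an illustration of the above formula, the following is a minimal sketch of a deformable convolution block in PyTorch, assuming torchvision's DeformConv2d plus an offset-predicting convolution; the channel sizes and kernel size are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A plain convolution predicts the 2*k*k offsets (the delta p_n terms),
        # one (dx, dy) pair per sampling position of the deformable kernel.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))

feat = torch.randn(1, 64, 128, 128)
out = DeformableBlock(64, 128)(feat)   # shape: (1, 128, 128, 128)
```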
In some embodiments, the first encoder and the second encoder encode in a weight sharing manner.
Specifically, weight sharing means that the two encoders synchronously update their network weights during back propagation; the purpose of encoding with shared weights is to reduce model computation while keeping the feature information of the two branches continuously associated.
In some embodiments, the first encoded information and the second encoded information of the same level are perturbed.
Specifically, the perturbation processing randomly flips each pair of coding information horizontally and rotates both by the same number of degrees; illustratively, the pair of coding information of the first level is rotated ninety degrees clockwise, the pair of the second level is rotated a further ninety degrees on top of the first level's rotation, and so on.
Specifically, the purpose of the perturbation processing is to make the detection result of the trained model more accurate.
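A minimal sketch of this level-wise perturbation, assuming PyTorch tensors; the exact sampling scheme is an assumption based on the example above (a shared random horizontal flip, plus a clockwise rotation that accumulates ninety degrees per level).

```python
import torch

def perturb_pair(f1, f2, level):
    # Shared random horizontal flip, applied identically to both feature maps.
    if torch.rand(()) < 0.5:
        f1 = torch.flip(f1, dims=[-1])
        f2 = torch.flip(f2, dims=[-1])
    # Rotation accumulates with the level: level 0 -> 90 deg, level 1 -> 180 deg, ...
    k = -(level + 1) % 4          # multiples of 90 degrees, clockwise
    f1 = torch.rot90(f1, k, dims=[-2, -1])
    f2 = torch.rot90(f2, k, dims=[-2, -1])
    return f1, f2

a, b = perturb_pair(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32), level=0)
```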
Specifically, the panorama decoder updates its weights and learns classification by back-propagating the classification loss and the IoU loss, which gives the first encoder and the second encoder better encoding effects; the first coding information and the second coding information are represented in the form of feature maps.
Specifically, the first encoder and the second encoder adopt a semantic segmentation approach in which different categories in the first and second temporal pictures are segmented with different colors; adding the panorama decoder during training improves the encoders' segmentation, so better results are obtained when the difference decoder is subsequently applied.
Illustratively, as shown in fig. 2, the first encoder and the second encoder encode the first temporal picture and the second temporal picture respectively, obtaining 4 pairs of coding information at 4 levels; the coding information is represented in the form of feature maps, and the depths of the 4 levels are 64, 128, 256 and 512, respectively.
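A minimal sketch of the weight-shared twin encoding described above: one backbone module is applied to both temporal pictures, returning the four feature levels with depths 64/128/256/512. Plain strided convolutions stand in for the deformable blocks; the stage layout is an assumption for illustration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, depths=(64, 128, 256, 512)):
        super().__init__()
        chans = [3] + list(depths)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(len(depths))
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                              # [FM1..FM4], depths 64..512

encoder = Encoder()                               # weight sharing: one module, two calls
fms  = encoder(torch.randn(1, 3, 512, 512))       # first temporal picture
fms_ = encoder(torch.randn(1, 3, 512, 512))       # second temporal picture
```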
In some embodiments, in the step of "sending the first coding information and the second coding information of each level into the panorama decoder for decoding to obtain a final panorama decoding result", a splicing-convolution operation is performed on the first coding information and the second coding information of the last level to obtain a panorama decoding result with the same depth as the first coding information of the previous level; a panorama skip connection is performed on the panorama decoding result to obtain panorama-decoding skip information, whose depth is reduced again to serve as the new panorama decoding result; and the panorama skip connection operations are traversed to obtain the final panorama decoding result, wherein each panorama skip connection splices the panorama decoding result with the first coding information and the second coding information of the same depth.
For example, as shown in fig. 3, the pair of coding information with depth 512 at the last level is spliced, and a 1 × 1 convolution is applied to the spliced result to obtain a first convolution result with depth 256; the pair of coding information with depth 256 and the first convolution result with depth 256 undergo a first skip connection to obtain first panorama skip information; a 1 × 1 convolution is applied to the first panorama skip information to obtain a second convolution result with depth 128; the pair of coding information with depth 128 and the second convolution result with depth 128 undergo a second skip connection to obtain second panorama skip information; a 1 × 1 convolution is applied to the result of the second skip connection to obtain a third convolution result with depth 64; and the pair of coding information with depth 64 and the third convolution result with depth 64 undergo a third skip connection to obtain the final panorama decoding result, which is input into the FCN-Head prediction head to obtain the panoramic classification result graph.
Specifically, the skip connections supplement the panorama decoding result with shallower, more local coding information, so that boundaries can be segmented more accurately when the picture is segmented, producing accurate classification predictions.
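A minimal sketch of this panorama decoding path, assuming PyTorch; the bilinear upsampling used to align resolutions before each splice is an assumption, as the patent specifies only the concatenations, the 1 × 1 convolutions and the FCN-Head. The class count is also an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PanoramaDecoder(nn.Module):
    def __init__(self, depths=(64, 128, 256, 512), num_classes=2):
        super().__init__()
        d = list(depths)
        self.fuse = nn.Conv2d(2 * d[3], d[2], kernel_size=1)         # 1024 -> 256
        self.reduce = nn.ModuleList([
            nn.Conv2d(3 * d[2], d[1], kernel_size=1),                # skip concat -> 128
            nn.Conv2d(3 * d[1], d[0], kernel_size=1),                # skip concat -> 64
        ])
        self.head = nn.Conv2d(3 * d[0], num_classes, kernel_size=1)  # FCN-Head stand-in

    def forward(self, fms, fms_):
        # fms / fms_: lists [FM1..FM4] from the two encoders, shallow to deep.
        x = self.fuse(torch.cat([fms[3], fms_[3]], dim=1))
        for i, conv in zip((2, 1, 0), list(self.reduce) + [self.head]):
            x = F.interpolate(x, size=fms[i].shape[-2:], mode="bilinear",
                              align_corners=False)
            x = conv(torch.cat([x, fms[i], fms_[i]], dim=1))  # panorama skip connection
        return x            # logits of the panoramic classification result graph
```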
In some embodiments, the difference information is obtained by subtracting the second coding information corresponding to the second temporal picture from the first coding information corresponding to the first temporal picture, where the shooting time point of the first temporal picture is earlier than that of the second temporal picture.
Illustratively, as shown in fig. 4, the first encoder and the second encoder encode the first temporal picture and the second temporal picture respectively to obtain 4 pairs of encoded information of 4 levels, the encoded information is represented in the form of a feature map, where a pair of encoded information of the first level is FM1 and FM1', a pair of encoded information of the second level is FM2 and FM2', a pair of encoded information of the third level is FM3 and FM3', and a pair of encoded information of the fourth level is FM4 and FM4', respectively, where the temporal states of FM1, FM2, FM3, and FM4 are earlier than the temporal states of FM1', FM2', FM3', and FM4'.
For example, as shown in fig. 5, FM5 is obtained by splicing FM4 and FM4'; FM6 is obtained by applying a 1 × 1 convolution to FM5 and splicing the result with FM3 − FM3'; FM7 is obtained by applying a 1 × 1 convolution to FM6 and splicing the result with FM2 − FM2'; the final difference decoding result is obtained by applying a 1 × 1 convolution to FM7 and splicing the result with FM1 − FM1', and is input into the FCN-Head prediction head to obtain the difference classification result graph.
Specifically, the panorama decoder and the difference decoder each take a fully convolutional form in order to prevent artifacts from occurring in the high-resolution image.
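For comparison with the panorama decoder sketch above, a minimal sketch of the difference decoding path of fig. 5, where each skip connection splices the running result with the difference FMi − FMi' of the same level; the head and the upsampling are the same illustrative assumptions as before.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferenceDecoder(nn.Module):
    def __init__(self, depths=(64, 128, 256, 512), num_classes=2):
        super().__init__()
        d = list(depths)
        self.fuse = nn.Conv2d(2 * d[3], d[2], kernel_size=1)         # FM5: 1024 -> 256
        self.reduce = nn.ModuleList([
            nn.Conv2d(2 * d[2], d[1], kernel_size=1),                # FM6: 512 -> 128
            nn.Conv2d(2 * d[1], d[0], kernel_size=1),                # FM7: 256 -> 64
        ])
        self.head = nn.Conv2d(2 * d[0], num_classes, kernel_size=1)  # FCN-Head stand-in

    def forward(self, fms, fms_):
        x = self.fuse(torch.cat([fms[3], fms_[3]], dim=1))
        for i, conv in zip((2, 1, 0), list(self.reduce) + [self.head]):
            x = F.interpolate(x, size=fms[i].shape[-2:], mode="bilinear",
                              align_corners=False)
            diff = fms[i] - fms_[i]               # difference information FMi - FMi'
            x = conv(torch.cat([x, diff], dim=1)) # difference skip connection
        return x            # logits of the difference classification result graph
```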
In some embodiments, the multi-temporal object detection model employs semi-supervised learning, i.e., only part of the training samples are labeled; the number of training iterations is 100, the batch size is 16, Adam is used as the optimizer, and the initial learning rate is 10⁻³.
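A minimal sketch of this training configuration; the placeholder network and synthetic data are hypothetical stand-ins so the loop runs, and only the quoted hyperparameters (Adam, learning rate 10⁻³, batch size 16, 100 iterations) come from the text.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; the real model is the dual-encoder / dual-decoder
# network described in this section, fed with paired temporal pictures.
model = nn.Conv2d(6, 2, kernel_size=1)                      # placeholder network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # initial lr 10^-3

for it in range(100):                                       # 100 training iterations
    t1 = torch.randn(16, 3, 64, 64)                         # batch size 16: first and
    t2 = torch.randn(16, 3, 64, 64)                         # second temporal pictures
    labels = torch.randint(0, 2, (16, 64, 64))              # per-pixel class labels
    optimizer.zero_grad()
    logits = model(torch.cat([t1, t2], dim=1))
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
```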
In some embodiments, the loss function of the multi-temporal target detection model is composed of a similarity loss, a panorama decoder classification loss, a difference decoder classification loss and a two-classification loss.
Further, the similarity loss is composed of a preliminary difference loss L_d and a cross entropy loss L_ce. The preliminary difference loss takes the second-order norm of each pixel difference in a pair of coding information of the same level, characterized as:

L_d = w(t) · (1 / (B · C)) · Σ_i ‖z_i − z'_i‖²

where z_i denotes the pixel values of the first temporal picture at the i-th level, z'_i denotes the pixel values of the second temporal picture at the i-th level, w(t) is a weight function whose weight grows with time t, B is the batch input of each training step, and C represents the number of channels.
The cross entropy loss L_ce is obtained by taking the cross entropy between the panoramic result graph y and the corresponding labeled training sample ŷ, characterized as:

L_ce = −Σ ŷ · log y

The preliminary difference loss and the cross entropy loss are added with a weight to obtain the similarity loss, characterized as:

L_s = L_ce + λ · L_d

where the weight λ can be regarded as manually set; acting as a penalty term, λ prevents overfitting of the model.
Specifically, the similarity loss ensures that the two results are approximate, i.e., that the output vectors are close, by comparing their output spatial distributions; at the same time, closer edge features can be better learned from the labeled training samples.
Further, the panoramic result graph Pc1 is compared with the labeled training sample ŷ to obtain the panorama decoder classification loss, characterized as:

L_class1 = −Σ ŷ · log(Pc1)

where L_class1 represents the panorama decoder classification loss.
Specifically, the panorama decoder classification loss adopts pixel-by-pixel cross entropy, i.e., the prediction result is compared with the label of each corresponding pixel in the training data.
Further, the difference classification result graph Pc2 is compared with the labeled training sample ŷ to obtain the difference decoder classification loss, characterized as:

L_class2 = −Σ ŷ · log(Pc2)

where L_class2 represents the difference classification loss.
Further, the two-classification loss is calculated from the overlapping portion of the same level in the first encoder and the second encoder, characterized as:

L_Dice = 1 − (2 · |FM ∩ FM'|) / (|FM| + |FM'|)

where L_Dice represents the two-classification (Dice coefficient) loss, FM represents the coding information corresponding to the first temporal picture, and FM' represents the coding information corresponding to the second temporal picture.
Specifically, the two-classification loss treats the comparison as a binary problem, and the Dice coefficient loss measures the overlapping portion of two pictures.
Further, the similarity loss, the panorama decoder classification loss, the difference decoder classification loss and the two-classification loss are combined to obtain the total loss function of the violation building model, characterized as:

L = L_s + L_class1 + L_class2 + L_Dice
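A sketch assembling the four loss terms above, assuming PyTorch; the mean-squared form of the preliminary difference loss, the soft-Dice form of the two-classification loss, and the use of the same label tensor for both decoder outputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(pc1, pc2, y, fm, fm_, w_t=1.0, lam=0.1):
    # pc1/pc2: decoder logits (N, C, H, W); y: labels (N, H, W); fm/fm_: a
    # same-level encoding pair from the first and second encoder.
    l_d = w_t * F.mse_loss(fm, fm_)            # preliminary difference loss L_d
    l_ce = F.cross_entropy(pc1, y)             # cross entropy loss L_ce
    l_s = l_ce + lam * l_d                     # similarity loss L_s
    l_class1 = F.cross_entropy(pc1, y)         # panorama classification loss
    l_class2 = F.cross_entropy(pc2, y)         # difference classification loss
    # Soft-Dice overlap of the encoding pair as the two-classification loss.
    p, q = torch.sigmoid(fm), torch.sigmoid(fm_)
    l_dice = 1 - (2 * (p * q).sum() + 1) / (p.sum() + q.sum() + 1)
    return l_s + l_class1 + l_class2 + l_dice
```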
example two
A multi-temporal target detection model is constructed by the method of the first embodiment.
EXAMPLE III
A violation building detection model is obtained by training the multi-temporal target detection model in the second embodiment by taking a temporal picture marked with a building as a training sample.
Example four
A multi-temporal target detection method comprises the following steps:
acquiring at least one group of temporal pictures of a to-be-detected place, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture which are taken from different time points;
and sending the first temporal image and the second temporal image of each group of temporal images into the multi-temporal target detection model described in the second embodiment to obtain the target to be detected.
EXAMPLE five
A method of violation building detection comprising:
acquiring a group of first temporal pictures and second temporal pictures of a to-be-detected place;
sending a first temporal image and a second temporal image of each group of temporal images into a violation building detection model to obtain a panoramic classification result image and a difference classification result image, wherein the violation building detection model is obtained by training a multi-temporal target detection model by using the temporal images marked with buildings as training samples, the target to be detected corresponding to a panoramic decoder is the panoramic classification result image, the target to be detected corresponding to a difference decoder is the difference classification result image, and the acquisition time of the first temporal image is earlier than that of the second temporal image;
and combining the classification result of the panoramic classification result graph and the classification result of the difference classification result graph to obtain a change result graph, and judging the illegal building according to the change result graph.
In some embodiments, the change result graph represents the change situation of two different temporal pictures of the same place, and if there is a change, there may be a violation building in the place.
In some embodiments, the first temporal pictures are used to construct a comparison sample library; the area difference between the difference classification result graph and the panoramic classification result graph is calculated and compared with a first set threshold; if the result is larger than the first set threshold, the place is considered to contain an illegal building, and the position information of the place is output.
In some embodiments, if illegal demolition of a house needs to be judged and the result is smaller than the first set threshold, the difference classification result is compared with the first temporal picture of the same place in the comparison sample library, and the place information is output.
Specifically, the first set threshold is set manually and is used for judging the change in building area at the same place in different temporal states.
Specifically, after the position information is output, law enforcement personnel can be dispatched to inspect the site.
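A minimal sketch of this area-difference judgment, assuming binary NumPy masks derived from the two result graphs; the mask semantics, the normalization by picture area and the threshold value are illustrative assumptions.

```python
import numpy as np

def has_new_violation(diff_mask, pano_mask, first_threshold=0.05):
    # diff_mask / pano_mask: binary HxW arrays where 1 marks building pixels
    # in the difference and panoramic classification result graphs.
    area_diff = abs(float(diff_mask.sum()) - float(pano_mask.sum()))
    ratio = area_diff / diff_mask.size        # normalise by picture area
    return ratio > first_threshold            # above threshold -> suspected illegal building
```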
EXAMPLE six
Based on the same concept, referring to fig. 6, the present application further provides a device for constructing a multi-temporal target detection model, including:
an acquisition module: the method comprises the steps of obtaining at least one group of temporal pictures of at least one to-be-detected place, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture which are taken from different time points, and marking a to-be-detected target in each group of temporal pictures to obtain a training sample;
and an encoding module: constructing a multi-temporal target detection model, wherein the multi-temporal target detection model consists of a first encoder, a second encoder, a panoramic decoder and a difference decoder, the first temporal picture of each group of temporal pictures is sent to the first encoder to obtain multi-level first coding information with the depth from low to high, the second temporal picture is sent to the second encoder to be coded to obtain multi-level second coding information with the depth from low to high, and the depth of the first coding information and the depth of the second coding information of each level are the same;
a first decoding module: sending the first coding information and the second coding information of each level into the panorama decoder for decoding to obtain a final panorama decoding result;
a second decoding module: sending the first coding information and the second coding information of each level into the difference decoder; performing a splicing-convolution operation on the first coding information and the second coding information of the last level to obtain a difference decoding result with the same depth as the first coding information of the previous level; performing a difference skip connection on the difference decoding result to obtain difference-decoding skip information, and reducing the depth of the skip information again to obtain a new difference decoding result; traversing the difference skip connection operations to obtain a final difference decoding result, wherein each difference skip connection splices the difference decoding result with the difference information of the same depth; and inputting the final difference decoding result into the prediction head to obtain a difference classification result graph;
a detection module: obtaining the target to be detected according to the panoramic classification result graph and the difference classification result graph.
EXAMPLE seven
The present embodiment further provides an electronic apparatus, referring to fig. 7, comprising a memory 404 and a processor 402, wherein the memory 404 stores a computer program, and the processor 402 is configured to run the computer program to perform the steps in any one of the above-described embodiments of the method for constructing a multi-temporal object detection model.
Specifically, the processor 402 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 404 may include, among other things, mass storage for data or instructions. By way of example and not limitation, memory 404 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 404 may include removable or non-removable (or fixed) media, where appropriate. Memory 404 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, memory 404 is Non-Volatile memory. In certain embodiments, memory 404 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or FLASH memory, or a combination of two or more of these, where appropriate. The RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), where the DRAM may be fast page mode dynamic random-access memory (FPMDRAM), extended data output dynamic random-access memory (EDODRAM), synchronous dynamic random-access memory (SDRAM), or the like.
Memory 404 may be used to store or cache various data files needed for processing and/or communication purposes, as well as possibly computer program instructions executed by processor 402.
The processor 402 reads and executes the computer program instructions stored in the memory 404 to implement the implementation process of the method for constructing the multi-temporal object detection model in any one of the above embodiments.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402, and the input/output device 408 is connected to the processor 402.
The transmitting device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include wired or wireless networks provided by communication providers of the electronic devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmitting device 406 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The input and output devices 408 are used to input or output information. In this embodiment, the input information may be a first temporal picture, a second temporal picture, and the like, and the output information may be a place where a violation building exists, and the like.
Optionally, in this embodiment, the processor 402 may be configured to execute the following steps by a computer program:
s101, obtaining at least one group of temporal pictures of at least one to-be-detected place, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture which are taken from different time points, and marking a to-be-detected target in each group of temporal pictures to obtain a training sample;
s102, constructing a multi-temporal target detection model, wherein the multi-temporal target detection model is composed of a first encoder, a second encoder, a panoramic decoder and a difference decoder, the first temporal picture of each group of temporal pictures is sent to the first encoder to obtain multi-level first coding information with the depth from low to high, the second temporal picture is sent to the second encoder to be coded to obtain multi-level second coding information with the depth from low to high, and the depth of the first coding information and the depth of the second coding information of each level are the same;
s103, sending the first coding information and the second coding information of each level into the panorama decoder for decoding to obtain a final panorama decoding result;
s104, sending the first coding information and the second coding information of each level into the difference decoder, performing splicing-convolution operation on the first coding information and the second coding information of the last level to obtain a difference decoding result with the same depth as the first coding information of the previous level, performing difference jump connection on the difference decoding result to obtain difference decoding jump information, reducing the depth of the difference decoding jump information again to obtain a new difference decoding result, traversing the difference jump connection operation to obtain a final difference decoding result, splicing the difference decoding result and the difference information with the same depth by the difference jump connection operation, wherein the difference information is the difference value of the first coding information and the second coding information, and inputting the final difference decoding result into the predictor to obtain a difference classification result graph;
and S105, obtaining the target to be detected according to the panoramic classification result graph and the difference classification result graph.
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also called program products) including software routines, applets and/or macros can be stored in any device-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may comprise one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. Further in this regard it should be noted that any block of the logic flow as in figure 7 may represent a program step, or an interconnected logic circuit, block and function, or a combination of a program step and a logic circuit, block and function. The software may be stored on physical media such as memory chips or memory blocks implemented within the processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVDs and data variants thereof, CDs. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that various features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the features in the above embodiments are not described, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the features.
The above examples are merely illustrative of several embodiments of the present application, and the description is more specific and detailed, but not to be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application should be subject to the appended claims.

Claims (13)

1. A method for constructing a multi-temporal target detection model is characterized by comprising the following steps:
the method comprises the steps of obtaining at least one group of temporal pictures of at least one to-be-detected place, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture which are taken from different time points, and marking a to-be-detected target in each group of temporal pictures to obtain a training sample;
constructing a multi-temporal target detection model, wherein the multi-temporal target detection model consists of a first encoder, a second encoder, a panoramic decoder and a difference decoder, the first temporal picture of each group of temporal pictures is sent to the first encoder to obtain multi-level first coding information with the depth from low to high, the second temporal picture is sent to the second encoder to be coded to obtain multi-level second coding information with the depth from low to high, and the depth of the first coding information and the depth of the second coding information of each level are the same;
sending the first coding information and the second coding information of each level into the panoramic decoder for decoding to obtain a final panoramic decoding result;
sending the first coding information and the second coding information of each level into the difference decoder; performing a splicing-convolution operation on the first coding information and the second coding information of the last level to obtain a difference decoding result with the same depth as the first coding information of the previous level; performing a difference skip connection on the difference decoding result to obtain difference-decoding skip information, and reducing the depth of the skip information again to obtain a new difference decoding result; traversing the difference skip connection operations to obtain a final difference decoding result, wherein each difference skip connection splices the difference decoding result with the difference information of the same depth; and inputting the final difference decoding result into the prediction head to obtain a difference classification result graph;
and obtaining the target to be detected according to the panoramic classification result graph and the difference classification result graph.
2. The method of claim 1, wherein the first encoder and the second encoder have the same structure, and the first temporal picture and the second temporal picture are encoded by using a deformable convolutional layer to obtain first encoding information and second encoding information.
3. The method for constructing the multi-temporal object detection model according to claim 1, wherein the first encoder and the second encoder perform encoding in a weight sharing manner, and perform perturbation processing on the first encoding information and the second encoding information of the same level.
4. The method for constructing the multi-temporal target detection model according to claim 1, wherein in the step of sending the first coding information and the second coding information of each level into the panorama decoder for decoding to obtain the final panorama decoding result, a splicing-convolution operation is performed on the first coding information and the second coding information of the last level to obtain a panorama decoding result with the same depth as the first coding information of the previous level; a panorama skip connection is performed on the panorama decoding result to obtain panorama-decoding skip information, whose depth is reduced again to serve as the new panorama decoding result; and the panorama skip connection operations are traversed to obtain the final panorama decoding result, wherein each panorama skip connection splices the panorama decoding result with the first coding information and the second coding information of the same depth.
5. The method as claimed in claim 1, wherein the difference information is obtained by subtracting the second coding information corresponding to the second temporal picture from the first coding information corresponding to the first temporal picture, and the shooting time point of the first temporal picture is earlier than that of the second temporal picture.
6. A multi-temporal object detection model constructed using the method of any one of claims 1 to 5.
7. A violation building detection model, characterized in that the multi-temporal target detection model of claim 6 is trained by using a temporal picture marked with a building as a training sample.
8. A multi-temporal target detection method, characterized by comprising the following steps:
acquiring at least one group of temporal pictures of a place to be detected, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture taken at different time points;
and sending the first temporal picture and the second temporal picture of each group of temporal pictures into the multi-temporal target detection model of claim 6 to obtain the target to be detected.
9. A method for detecting illegal buildings, characterized by comprising the following steps:
acquiring a group of temporal pictures of a place to be detected, the group comprising a first temporal picture and a second temporal picture;
sending the first temporal picture and the second temporal picture of each group of temporal pictures into the violation building detection model of claim 7 to obtain a panoramic classification result graph and a difference classification result graph, wherein the target to be detected corresponding to the panorama decoder is the panoramic classification result graph, the target to be detected corresponding to the difference decoder is the difference classification result graph, and the acquisition time of the first temporal picture is earlier than that of the second temporal picture;
and combining the classification result of the panoramic classification result graph with that of the difference classification result graph to obtain a change result graph, and identifying illegal buildings according to the change result graph.
10. The illegal building detection method according to claim 9, wherein the change result graph represents the change between two temporal pictures of the same place; if a change exists, the area difference between the difference classification result graph and the panoramic classification result graph is calculated, and if the area difference is greater than a first set threshold, it is determined that an illegal building exists at the place.
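The claim-10 decision rule reduces to an area comparison; here is a sketch under assumed binary masks (1 = building pixel) and an arbitrary example threshold:

```python
import numpy as np

def has_illegal_building(pano_mask: np.ndarray,
                         diff_mask: np.ndarray,
                         area_threshold: float = 50.0) -> bool:
    # area difference between the difference and panoramic classification maps
    area_difference = abs(float(diff_mask.sum()) - float(pano_mask.sum()))
    return area_difference > area_threshold  # the "first set threshold"

pano = np.zeros((256, 256), dtype=np.uint8)
diff = np.zeros((256, 256), dtype=np.uint8)
diff[100:120, 100:120] = 1  # a newly appeared 400-pixel structure
print(has_illegal_building(pano, diff))  # True under the assumed threshold
```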
11. A device for constructing a multi-temporal target detection model, characterized by comprising:
an acquisition module: configured to acquire at least one group of temporal pictures of at least one place to be detected and mark the target to be detected in each group of temporal pictures to obtain training samples, wherein each group of temporal pictures comprises a first temporal picture and a second temporal picture taken at different time points;
a coding module: configured to construct a multi-temporal target detection model consisting of a first encoder, a second encoder, a panorama decoder and a difference decoder, wherein the first temporal picture of each group of temporal pictures is sent to the first encoder to obtain multi-level first coding information with depth from low to high, the second temporal picture is sent to the second encoder for encoding to obtain multi-level second coding information with depth from low to high, and the first coding information and the second coding information of each level have the same depth;
a first decoding module: configured to send the first coding information and the second coding information of each level into the panorama decoder for decoding to obtain a final panorama decoding result;
a second decoding module: configured to send the first coding information and the second coding information of each level into the difference decoder, perform a splicing-convolution operation on the first coding information and the second coding information of the last level to obtain a difference decoding result with the same depth as the first coding information of the previous level, perform a difference skip connection on the difference decoding result to obtain difference decoding skip information, reduce the depth of the difference decoding skip information again to obtain a new difference decoding result, and traverse the difference skip connection operation to obtain a final difference decoding result, wherein the difference skip connection operation splices the difference decoding result with the difference information of the same depth, and the final difference decoding result is input into the predictor to obtain a difference classification result graph;
a detection module: configured to input the panoramic classification result graph and the difference classification result graph into a prediction head, respectively, to obtain the target to be detected.
12. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is arranged to execute the computer program to perform the method for constructing a multi-temporal target detection model according to any one of claims 1 to 5, the multi-temporal target detection method according to claim 8, or the illegal building detection method according to claim 9.
13. A readable storage medium, having stored thereon a computer program comprising instructions for controlling a process to perform the method for constructing a multi-temporal target detection model according to any one of claims 1 to 5, the multi-temporal target detection method according to claim 8, or the illegal building detection method according to claim 9.
CN202211504037.9A 2022-11-29 2022-11-29 Multi-temporal target detection model, and construction method, device and application thereof Active CN115546652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211504037.9A CN115546652B (en) 2022-11-29 2022-11-29 Multi-temporal target detection model, and construction method, device and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211504037.9A CN115546652B (en) 2022-11-29 2022-11-29 Multi-temporal target detection model, and construction method, device and application thereof

Publications (2)

Publication Number Publication Date
CN115546652A true CN115546652A (en) 2022-12-30
CN115546652B CN115546652B (en) 2023-04-07

Family

ID=84722688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211504037.9A Active CN115546652B (en) 2022-11-29 2022-11-29 Multi-temporal target detection model, and construction method, device and application thereof

Country Status (1)

Country Link
CN (1) CN115546652B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002015587A2 (en) * 2000-08-16 2002-02-21 Dolby Laboratories Licensing Corporation Modulating one or more parameters of an audio or video perceptual coding system in response to supplemental information

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2001284910A1 (en) * 2000-08-16 2002-05-23 Dolby Laboratories Licensing Corporation Modulating one or more parameters of an audio or video perceptual coding system in response to supplemental information
KR20070074487A (en) * 2006-01-07 2007-07-12 한국전자통신연구원 Method and apparatus for video data encoding and decoding
AU2013206265A1 (en) * 2008-07-11 2013-07-04 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Providing a time warp activation signal and encoding an audio signal therewith
CN108696756A (en) * 2012-09-28 2018-10-23 夏普株式会社 Picture coding device
CN104620575A (en) * 2012-09-28 2015-05-13 夏普株式会社 Image decoding device
CN104662912A (en) * 2012-09-28 2015-05-27 夏普株式会社 Image decoding device
US20170126456A1 (en) * 2015-11-03 2017-05-04 Newracom, Inc. Apparatus and method for scrambling control field information for wireless communications
CN109155865A (en) * 2016-05-24 2019-01-04 高通股份有限公司 The first inlet signal in most interested region in the picture transmits
CN108416059A (en) * 2018-03-22 2018-08-17 北京市商汤科技开发有限公司 Training method and device, equipment, medium, the program of image description model
US10379995B1 (en) * 2018-07-06 2019-08-13 Capital One Services, Llc Systems and methods to identify breaking application program interface changes
CN111797799A (en) * 2020-07-13 2020-10-20 郑州昂达信息科技有限公司 Subway passenger waiting area planning method based on artificial intelligence
CN112616014A (en) * 2020-12-09 2021-04-06 福州大学 GAN-based panoramic video adaptive streaming transmission method
CN112991207A (en) * 2021-03-11 2021-06-18 五邑大学 Panoramic depth estimation method and device, terminal equipment and storage medium
CN113947524A (en) * 2021-10-22 2022-01-18 上海交通大学 Panoramic picture saliency prediction method and device based on full-convolution graph neural network
CN114120041A (en) * 2021-11-29 2022-03-01 暨南大学 Small sample classification method based on double-pair anti-variation self-encoder
CN114298997A (en) * 2021-12-23 2022-04-08 北京瑞莱智慧科技有限公司 Method and device for detecting forged picture and storage medium
CN114648714A (en) * 2022-01-25 2022-06-21 湖南中南智能装备有限公司 YOLO-based workshop normative behavior monitoring method
CN115131680A (en) * 2022-07-05 2022-09-30 西安电子科技大学 Remote sensing image water body extraction method based on depth separable convolution and jump connection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DANIELA FERNANDEZ ESPINOSA et al.: "Twitter Users' Privacy Concerns: What do Their Accounts' First Names Tell Us?", Journal of Data and Information Science *

Also Published As

Publication number Publication date
CN115546652B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
Alsabhan et al. Automatic building extraction on satellite images using Unet and ResNet50
CN113409361B (en) Multi-target tracking method and device, computer and storage medium
CN113989305B (en) Target semantic segmentation method and street target abnormity detection method applying same
CN115546601B (en) Multi-target recognition model and construction method, device and application thereof
CN110826429A (en) Scenic spot video-based method and system for automatically monitoring travel emergency
CN106951828B (en) Urban area function attribute identification method based on satellite images and network
CN117217368A (en) Training method, device, equipment, medium and program product of prediction model
CN117576569B (en) Multi-target detection model and method for urban capacity event management
CN115546274B (en) Image depth judgment model and construction method, device and application thereof
CN115546652B (en) Multi-temporal target detection model, and construction method, device and application thereof
CN115620242B (en) Multi-line human target re-identification method, device and application
CN115659268A (en) Scene recognition method based on ADCP flow measurement data and application thereof
CN116091964A (en) High-order video scene analysis method and system
CN112001211B (en) Object detection method, device, equipment and computer readable storage medium
CN115880650B (en) Cross-view vehicle re-identification model, construction method, device and application thereof
CN116821699B (en) Perception model training method and device, electronic equipment and storage medium
CN116824277B (en) Visual target detection model for road disease detection, construction method and application
CN113689472A (en) Moving target detection method, device and application
CN115546704B (en) Vehicle projectile identification method, device and application
CN115544259B (en) Long text classification preprocessing model and construction method, device and application thereof
CN115546472B (en) Method and device for recognizing weight of road vehicle and application
CN115546780B (en) License plate recognition method, model and device
CN116721315B (en) Living body detection model training method, living body detection model training device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant