CN117496130A

CN117496130A - Basic model weak supervision target detection method based on context awareness self-training

Info

Publication number: CN117496130A
Application number: CN202311561164.7A
Authority: CN
Inventors: 冯瑛超; 许光銮; 杜璇仪; 闫志远; 尹文昕; 吴有明
Original assignee: Aerospace Information Research Institute of CAS
Current assignee: Aerospace Information Research Institute of CAS
Priority date: 2023-11-22
Filing date: 2023-11-22
Publication date: 2024-02-02
Anticipated expiration: 2043-11-22
Also published as: CN117496130B

Abstract

The invention provides a basic model weak supervision target detection method and a device based on context awareness self-training, wherein the method comprises the following steps: inputting a plurality of input images into a backbone network for image processing to obtain a feature map of each input image, and selectively searching the plurality of input images to obtain a plurality of suggestion boxes corresponding to each input image; inputting the feature map of each input image and a plurality of suggestion boxes corresponding to each input image into a ROIAlign structure to perform feature extraction to obtain the internal region feature, the peripheral region feature and the internal peripheral region feature of each suggestion box; obtaining a positioning score matrix and a classification score matrix of each input image according to the internal region features, the external region features and the internal region features of each suggestion frame; obtaining a target scoring matrix of each input image according to the positioning scoring matrix and the classification scoring matrix; and obtaining a significance loss function according to the target score matrix of each input image and the preset target true value.

Description

Basic model weak supervision target detection method based on context awareness self-training

Technical Field

The invention relates to the technical field of computer vision, in particular to a basic model weak supervision target detection method and device based on context awareness self-training, electronic equipment and a storage medium.

Background

The increase in the amount of data has prompted the development of target detection techniques, but it has also brought about an increasing need for manual labeling. The data tagging process is typically time consuming and labor intensive due to the need to label the location and class of each instance in the most common fully supervised target detection tasks. To alleviate such problems, a weak supervision paradigm has been developed that requires only image-level classification labels, which is called weak supervision target detection (Weakly Supervised Object Detection, WSOD). When the image is wide in breadth and complex in background, the weak supervision method can greatly reduce the labeling difficulty, and has a wide prospect.

Although the weakly supervised detection method reduces the cost of target detection, the model effect is generally not fully supervised superior due to the use of low precision labeling. The method proposed later is often used for reducing the gap between the weak supervision algorithm and the full supervision algorithm, and the main effort is to improve the positioning accuracy of the target. One of the significant problems is that the detector tends to focus on the most discriminating part of the object rather than its entirety, and confusion between foreground and background frames may also occur.

In addition, in the method using the self-training algorithm, the selection process of the pseudo-true value lacks robustness. Once the false positive boxes are selected based on the previous scores, they serve as the partitioning criteria for other suggested boxes. Thus, incomplete analysis and utilization of scores may lead to a large number of instances of false label inaccuracy.

In the related method in the prior art, the score vectors of the suggestion boxes are analyzed usually category by category, and the suggestion boxes with high scores in each category are assigned to candidate false true values of the category. However, this method ignores the constraint relation between the categories, and the order of consideration of the categories brings about different results.

Disclosure of Invention

In order to solve the problems in the prior art, the method, the device, the electronic equipment and the storage medium for detecting the weak supervision target of the basic model based on the context awareness self-training provided by the embodiment of the invention optimize the basic model by utilizing the overall loss, and the optimized model is used for monitoring the target and has excellent performance on weak supervision detection.

The first aspect of the invention provides a method for detecting a weak supervision target of a basic model based on context awareness self-training, wherein the basic model comprises a backbone network and a ROIAlign structure, and the method comprises the following steps: inputting a plurality of input images into a backbone network for image processing to obtain a feature map of each input image, and selectively searching the plurality of input images to obtain a plurality of suggestion boxes corresponding to each input image; respectively inputting the feature map of each input image and a plurality of suggestion boxes corresponding to each input image into a ROIAlign structure to perform feature extraction to obtain the internal region feature, the peripheral region feature and the internal peripheral region feature of each suggestion box; obtaining a positioning score matrix and a classification score matrix of each input image according to the internal region features, the external region features and the internal region features of each suggestion frame; obtaining a target score matrix of each input image according to the positioning score matrix and the classification score matrix of each input image; obtaining a significance loss function according to the target score matrix of each input image and a preset target true value; the significance loss function is used for optimizing a basic model, and the basic model is used for target detection.

Further, the basic model further includes a semantic balance positioning module and a classification full-connection layer, where obtaining a positioning score matrix and a classification score matrix of each input image according to the internal region feature, the external region feature and the internal region feature of each suggestion frame includes: inputting the internal region features, the peripheral region features and the internal region features of each suggestion frame to a semantic balance positioning module to obtain a positioning score matrix; and inputting the internal region characteristics of each suggestion frame into a classification full-connection layer to obtain a classification score matrix.

Further, the basic model also comprises a comprehensive clustering module and a label filtering module; according to the target score matrix and the preset target true value of each input image, obtaining a significance loss function, wherein the significance loss function comprises the following steps: inputting the target score matrix of each input image into a comprehensive clustering module for candidate analysis to obtain candidate false true values of each input image; and inputting the candidate false true value of each input image to a label filtering module to obtain a significance loss function.

Further, inputting the candidate false true value of each input image to a label filtering module to obtain a significance loss function, including: denoising the region features corresponding to the candidate false true values of each input image to obtain each region feature after denoising; inputting each region characteristic into a tag filtering module to obtain a salient picture corresponding to each region characteristic; and obtaining a significance loss function according to the significance picture corresponding to each region characteristic.

Further, the backbone network is a res net50 network, where inputting a plurality of input images into the backbone network for image processing, obtaining a feature map of each input image includes: inputting a plurality of input images into a ResNet50 network for image processing to obtain a feature map of each input image; wherein the ResNet50 network comprises a 33-sized convolutional layer.

Further, according to the positioning score matrix and the classification score matrix of each input image, a target score matrix of each input image is obtained, including: and carrying out Hadamard product processing on the positioning score matrix and the classification score matrix of each input image to obtain a target score matrix of each input image.

Further, the target is one or more of an aircraft, an automobile, a building, a court, and a ship.

A second aspect of the present invention provides a basic model weakly supervised target detection apparatus based on context awareness self training, including: the input image processing module is configured to input a plurality of input images into a backbone network for image processing to obtain a feature map of each input image, and selectively search the plurality of input images to obtain a plurality of suggestion boxes corresponding to each input image; the feature extraction module is configured to input a feature map of each input image and a plurality of suggestion boxes corresponding to each input image into the ROIAlign structure respectively for feature extraction to obtain an internal region feature, an external region feature and an internal region feature of each suggestion box; the feature processing module is configured to obtain a positioning score matrix and a classification score matrix of each input image according to the internal region feature, the peripheral region feature and the internal region feature of each suggestion frame, and obtain a target score matrix of each input image according to the positioning score matrix and the classification score matrix of each input image; the model optimization module is configured to obtain a significance loss function according to the target score matrix of each input image and a preset target true value; the significance loss function is used for optimizing a basic model, and the basic model is used for target detection.

A third aspect of the present invention provides an electronic device comprising: the system comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the basic model weak supervision target detection method based on the context awareness self-training provided by the first aspect of the invention.

A fourth aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the context-aware self-training based basic model weakly supervised target detection method provided by the first aspect of the present invention.

The invention provides a basic model weak supervision target detection method based on context awareness self-training. Inputting the extracted features into a semantic balance positioning module to obtain positioning scores of all the suggestion frames. And then analyzing each scoring through a comprehensive clustering module to obtain candidate false truth boxes, and then evaluating the candidate boxes by using a label filtering module based on a saliency map to calculate loss. The method completes optimization of the whole network in iterative self-training, so that the whole network has excellent performance in weak supervision detection.

Drawings

For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 schematically illustrates a flow chart of a basic model weakly supervised target detection method based on context awareness self training, in accordance with an embodiment of the present invention;

FIG. 2 schematically illustrates a block diagram of a basic model weakly supervised target detection apparatus based on context awareness self training, in accordance with an embodiment of the present invention; and

fig. 3 schematically shows a block diagram of an electronic device adapted to implement the method described above according to an embodiment of the invention.

Detailed Description

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.

Where expressions like at least one of "A, B and C, etc. are used, the expressions should generally be interpreted in accordance with the meaning as commonly understood by those skilled in the art (e.g.," a system having at least one of A, B and C "shall include, but not be limited to, a system having a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.). Where a formulation similar to at least one of "A, B or C, etc." is used, in general such a formulation should be interpreted in accordance with the ordinary understanding of one skilled in the art (e.g. "a system with at least one of A, B or C" would include but not be limited to systems with a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).

Some of the block diagrams and/or flowchart illustrations are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, when executed by the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart. The techniques of the present invention may be implemented in hardware and/or software (including firmware, microcode, etc.). Furthermore, the techniques of the present invention may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system.

The technical scheme of the invention will be described in detail below with reference to a specific flow of a basic model weak supervision target detection method based on context awareness self-training in a specific embodiment of the invention. It should be understood that the flow and the computing structure of the basic model weakly supervised target detection method based on context awareness self training and the like shown in the drawings are only exemplary to help those skilled in the art understand the technical scheme of the present invention, and are not used to limit the protection scope of the present invention.

FIG. 1 schematically illustrates a flow chart of a method of weakly supervised target detection based on a context awareness self-training based basis model.

As shown in fig. 1, the method for detecting the basic model weak supervision target based on the context awareness self-training comprises the following steps: and step S1 to S5.

In the embodiment of the invention, the basic model at least comprises a Backbone Network (Backbone Network), a ROIAlign structure, a Semantic balance positioning module (Semantic-balanced Localization Module, SBLM), a classification full-connection layer and a comprehensive clustering module (Comprehensive Clustering Method, CCM).

In operation S1, a plurality of input images are input into a backbone network for image processing to obtain a feature map of each input image, and a plurality of input images are selectively searched to obtain a plurality of suggestion boxes corresponding to each input image.

In an embodiment of the present invention, the plurality of input images may be optical remote sensing images. For example, the optical remote sensing image may be an image that images various types of objects.

For example, the backbone network may be a ResNet50 network. And inputting the plurality of input images into a ResNet50 network for feature extraction to obtain a feature map of each input image. Wherein the ResNet50 network comprises a 33-sized convolutional layer.

In the embodiment of the invention, the selective search is a non-parametric suggestion box generation method, and a plurality of suggestion boxes corresponding to each input image can be obtained by selectively searching a plurality of input images.

For example, the number of the plurality of suggestion boxes corresponding to each input image may be more than 1000.

In operation S2, the feature map of each input image and the plurality of suggestion boxes corresponding to each input image are respectively input into the ROIAlign structure to perform feature extraction, so as to obtain an inner region feature, an outer region feature and an inner region feature of each suggestion box.

In the embodiment of the invention, the ROIAlign structure uses a bilinear interpolation method to obtain the image value on the pixel point with the floating point number coordinate of each suggestion frame, and converts the whole characteristic aggregation process into a continuous operation. The ROIAlign structure has less error than the conventional ROI pooling method.

For example, the suggestion box interior region feature is a suggestion box complete region feature. The suggested frame peripheral region features and the inner peripheral region features refer to inner and outer region features, respectively, near the edge of the suggested frame.

In operation S3, a location score matrix and a classification score matrix of each input image are obtained according to the internal region feature, the external region feature, and the internal region feature of each suggestion frame.

In the embodiment of the invention, the semantic balance positioning module is utilized to combine the internal and external semantic information of the suggestion frames, and the characteristics of each suggestion frame are subjected to positioning scoring to obtain a positioning scoring matrix of each input image. And classifying and scoring each suggested frame characteristic by using the positioning full-connection layer to obtain a classification score matrix of each input image.

In operation S4, a target score matrix of each input image is obtained according to the location score matrix and the classification score matrix of each input image.

In the embodiment of the invention, hadamard product processing is carried out on the positioning score matrix and the classification score matrix of each input image to obtain the target score matrix of each input image.

In operation S5, a significance loss function is obtained according to the target score matrix and the preset target truth value of each input image. The significance loss function is used for optimizing a basic model, and the basic model is used for target detection.

In the embodiment of the invention, the preset target true value is a known value, which is a target prediction value and can be set according to actual conditions.

In the embodiment of the invention, the significance loss function is used for optimizing the basic model, so that the quality of candidate false true values can be improved, and the robustness in the model training process can be improved.

For example, the target detection may be one or more of an aircraft, an automobile, a building, a court, and a ship.

According to an embodiment of the present invention, S3 specifically includes: steps S31 to S32.

In operation S31, the internal region feature, the external region feature, and the internal region feature of each suggestion frame are input to the semantic balance positioning module to obtain a positioning score matrix.

In the embodiment of the invention, the semantic balance positioning module is utilized to perform positioning scoring on each feature by combining the internal and external semantic information of the suggestion frame, specifically, three regional features of the suggestion frame are input into the same positioning full-connection layer to generate respective positioning scoring matrixes, namely, a suggestion frame internal scoring matrix M is generated ^roi Inner periphery context scoring matrix M ⁱⁿ And a peripheral context score matrix M ^out 。

For example, by calculating the residual error of the context information of the inner and outer periphery of the proposed frame, a large residual error of the information inside and outside one proposed frame means accurate positioning, since the objects in the optical remote sensing image are usually not overlapping. Meanwhile, the internal information of the suggestion frame is considered, and a positioning score matrix M is generated ^l Can be expressed as:

wherein the method comprises the steps of，Is a super parameter set to 0.5. r and c represent the suggestion box and the category number, respectively. />A score indicating the presence of class c objects within the r suggestion box. />A score representing the presence of Zhou Leibie c object in the r suggestion box.A score representing the presence of the r-th suggested out-of-box Zhou Leibie c object.

Normalizing the score matrices to obtain normalized positioning score matricesThe method comprises the following steps:

wherein R represents the total number of suggested boxes.Representing the index as a location score matrix M ^l Is a function of the exponent of (a).

In operation S32, the internal region features of each suggestion box are input to the classification full-connection layer, resulting in a classification score matrix.

In the embodiment of the invention, the input suggestion frames are classified and scored by using the classification full-connection layer, specifically, the internal area characteristics of each suggestion frame are input into one classification full-connection layer, so that the method can be obtainedSize classification score matrix M ^c 。

In the embodiment of the invention, the classification score matrix M ^c Normalized along the suggestion box dimension and the category dimension,a final classification score matrix can be obtained。

For example, a location score matrix will be locatedAnd a classification score matrix->Carrying out Hadamard product calculation to obtain a target score matrix of each input image>Can be expressed as:

wherein,representing the hadamard product.

According to an embodiment of the present invention, S4 specifically includes: and S41-S42.

In operation S41, a target scoring matrix is computed along the suggestion box dimensionSumming to obtain probability of one category existing in the image>Can be expressed as:

wherein,refers to the probability that category c exists in the image.

Due to image category truth value in weak supervision detectionImage classification Loss function Loss _cls Can be expressed as:

in operation S42, the target score matrix of each input image is input to the comprehensive clustering module for candidate analysis, so as to obtain candidate pseudo-true values of each input image.

In the embodiment of the invention, the comprehensive clustering module is used as a clustering standard according to the scoring matrix of all the existing categories in the image. Input data for clustering is a target scoring matrixIts size is +.>Where R is the number of suggested boxes and C is the number of categories present in the image. Since mean-teacher clustering is employed, a hyper-parameter is required to represent the number of clusters. In the embodiment of the invention, the number of possible categories (C+1) is set as follows: foreground classes present in the image plus background. In this way, the suggestion boxes of the background class can be promptly excluded. Considering the scores of multiple categories also reduces the randomness of the candidate box partitions.

In operation S43, candidate pseudo-true values for each input image are input to the label filter module, resulting in a significance loss function.

In the embodiment of the present invention, after determining the candidate pseudo-true value in step S42, the embodiment of the present invention extracts the region in the pseudo-true value box as an independent image, and may perform gaussian denoising processing on the candidate pseudo-true value of each input image to suppress background noise. Then the saliency target detection model is utilized to obtain a corresponding saliency map, and the centroid of the saliency map is calculated, namely a saliency map center c _s . While the center of the suggestion box is called the geometric center c _g . Based on saliency map center c _s And geometric center c _g Combined with the length a of the suggestion frame _r Sum of width and b _r Normalized Euclidean distance D between them _r Can be expressed as:

since a larger distance means a larger positioning deviation, the disclosed embodiment proposes a D with a false true value box _r A proportional significance loss function. Loss of significance function Loss _saliency Can be expressed as:

wherein N is _pgt Representing the number of false truth boxes. D (D) _m And the normalized euclidean distance of the mth candidate box is represented. l (D) _m ) The following custom functions are represented:

where u (x) represents a step function, x represents an argument,the value was constant and was 0.5.C represents a constant set.

In the significance Loss function Loss _saliency The loss size ranges from 0 to e. Embodiments of the present invention select an exponential function because a convex function facilitates optimization. It follows D _r Is an increasing gradient increase, indicating an excessive D _r Will be subject to a more severe penalty. l (x) is a step function which is also used for the same purpose, i.e. to determine the D of the frame _r Giving a certain latitude.

Pseudo-true value based on predictive probability yRegression Loss in the self-training process can be performed _ref Can be expressed as:

the network model is optimized in such a way that the loss function counter-propagates and the overall loss can be expressed as:

according to the embodiment of the invention, the method optimizes the basic model by utilizing the overall loss, and the optimized model is used for monitoring the target and has excellent performance on weak supervision detection.

According to the basic model weak supervision target detection method based on context awareness self-training, the balance of the context information and the information in the suggestion frame is fully utilized to optimize positioning capability, multiple types of vectors are combined, candidate false true values are further screened, and accuracy of model optimization is improved.

FIG. 2 schematically illustrates a block diagram of a basic model weakly supervised target detection apparatus based on context awareness self training, in accordance with an embodiment of the present invention.

As shown in fig. 2, the context-aware self-training-based basic model weak supervision target detection device 200 includes: an input image processing module 210, a feature extraction module 220, a feature processing module 230, and a model optimization module 240. The apparatus 200 may be used to implement the context-aware self-training based basic model weakly supervised target detection method described with reference to FIG. 1.

The input image processing module 210 is configured to input a plurality of input images into a backbone network for image processing, obtain a feature map of each input image, and selectively search the plurality of input images to obtain a plurality of suggestion boxes corresponding to each input image. The input image processing module 210 may be used, for example, to perform the step S1 described above with reference to fig. 1, which is not described herein.

The feature extraction module 220 is configured to input the feature map of each input image and the plurality of suggestion boxes corresponding to each input image into the ROIAlign structure to perform feature extraction, so as to obtain an internal region feature, an external region feature and an internal region feature of each suggestion box. The feature extraction module 220 may be used, for example, to perform the step S2 described above with reference to fig. 1, which is not described herein.

The feature processing module 230 is configured to obtain a location score matrix and a classification score matrix of each input image according to the internal region feature, the external region feature and the internal region feature of each suggestion frame, and obtain a target score matrix of each input image according to the location score matrix and the classification score matrix of each input image. The feature processing module 230 may be used for executing steps S3 to S4 described above with reference to fig. 1, which are not described herein.

The model optimization module 240 is configured to obtain a significance loss function according to the target score matrix and a preset target true value of each input image; the significance loss function is used for optimizing a basic model, and the basic model is used for target detection. The model optimization module 240 may be used, for example, to perform the step S5 described above with reference to fig. 1, which is not described herein.

Any number of the modules, sub-modules, units, sub-units, or at least part of the functionality of any number of the sub-units according to embodiments of the invention may be implemented in one module. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present invention may be implemented as a split into multiple modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present invention may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a device on a chip, a device on a substrate, a device on a package, an Application Specific Integrated Circuit (ASIC), or in any other reasonable manner of hardware or firmware that integrates or packages the circuit, or in any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the invention may be at least partly implemented as computer program modules, which, when run, may perform the respective functions.

For example, any of the input image processing module 210, the feature extraction module 220, the feature processing module 230, and the model optimization module 240 may be combined in one module to be implemented, or any of the modules may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. At least one of the input image processing module 210, the feature extraction module 220, the feature processing module 230, and the model optimization module 240 may be implemented, at least in part, as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a device on a chip, a device on a substrate, a device on a package, an Application Specific Integrated Circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or as any one of or a suitable combination of any of the three. Alternatively, at least one of the input image processing module 210, the feature extraction module 220, the feature processing module 230, and the model optimization module 240 may be at least partially implemented as a computer program module that, when executed, performs the corresponding functions.

Fig. 3 schematically shows a block diagram of an electronic device adapted to implement the method described above, according to an embodiment of the invention. The electronic device shown in fig. 3 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the invention.

As shown in fig. 3, the electronic device 300 described in the present embodiment includes: a processor 301 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. Processor 301 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 301 may also include on-board memory for caching purposes. Processor 301 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the invention.

In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are stored. The processor 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. The processor 301 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 302 and/or the RAM 303. Note that the program may be stored in one or more memories other than the ROM 302 and the RAM 303. The processor 301 may also perform various operations of the method flow according to embodiments of the present invention by executing programs stored in the one or more memories.

According to an embodiment of the invention, the electronic device 300 may further comprise an input/output (I/O) interface 305, the input/output (I/O) interface 305 also being connected to the bus 304. The electronic device 300 may also include one or more of the following components connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output portion 307 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 308 including a hard disk or the like; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. The drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 310 as needed, so that a computer program read therefrom is installed into the storage section 308 as needed.

According to an embodiment of the present invention, the method flow according to an embodiment of the present invention may be implemented as a computer software program. For example, embodiments of the present invention include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 309, and/or installed from the removable medium 311. The above-described functions defined in the apparatus of the embodiment of the present invention are performed when the computer program is executed by the processor 301. The above-described apparatuses, devices, means, modules, units, etc. may be implemented by computer program modules according to an embodiment of the invention.

Embodiments of the present invention also provide a computer-readable storage medium that may be included in the apparatus/device/means described in the above embodiments; or may exist alone without being assembled into the apparatus/device/means. The computer-readable storage medium carries one or more programs that, when executed, implement a context-aware self-training based basic model weakly supervised target detection method according to an embodiment of the present invention.

According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the invention, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution apparatus, device, or apparatus. For example, according to an embodiment of the invention, the computer-readable storage medium may include ROM 302 and/or RAM 303 and/or one or more memories other than ROM 302 and RAM 303 described above.

Embodiments of the present invention also include a computer program product comprising a computer program containing program code for performing the method shown in the flowcharts. When the computer program product runs in a computer device, the program code is used for enabling the computer device to realize the basic model weak supervision target detection method based on the context awareness self-training provided by the embodiment of the invention.

The above-described functions defined in the apparatus/means of the embodiments of the present invention are performed when the computer program is executed by the processor 301. The above-described apparatuses, modules, units, etc. may be implemented by computer program modules according to an embodiment of the invention.

In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed over a network medium in the form of signals, downloaded and installed via the communication part 309, and/or installed from the removable medium 311. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 309, and/or installed from the removable medium 311. The above-described functions defined in the apparatus of the embodiment of the present invention are performed when the computer program is executed by the processor 301. The above-described apparatuses, devices, means, modules, units, etc. may be implemented by computer program modules according to an embodiment of the invention.

According to embodiments of the present invention, program code for carrying out computer programs provided by embodiments of the present invention may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, such as Java, c++, python, "C" or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

It should be noted that, in each embodiment of the present invention, each functional module may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such an understanding, the technical solution of the invention may be embodied essentially or partly in the form of a software product or in part in addition to the prior art.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments and/or scope of the invention can be implemented in a variety of combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the present invention. In particular, the features recited in the various embodiments and/or the scope of the invention can be combined in various combinations and/or combinations without departing from the spirit and teachings of the invention. All such combinations and/or combinations fall within the scope of the invention.

While the invention has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents. The scope of the invention should therefore not be limited to the embodiments described above, but should be determined not only by the appended claims, but also by the equivalents of the claims.

Claims

1. A basic model weak supervision target detection method based on context awareness self-training is characterized in that the basic model comprises a backbone network and a ROIAlign structure, and the method comprises the following steps:

inputting a plurality of input images into the backbone network for image processing to obtain a feature map of each input image, and selectively searching the plurality of input images to obtain a plurality of suggestion boxes corresponding to each input image;

respectively inputting the feature map of each input image and a plurality of suggestion boxes corresponding to each input image into a ROIAlign structure to perform feature extraction, so as to obtain the internal region feature, the peripheral region feature and the internal region feature of each suggestion box;

obtaining a positioning score matrix and a classification score matrix of each input image according to the internal region characteristics, the external region characteristics and the internal region characteristics of each suggestion frame;

obtaining a target score matrix of each input image according to the positioning score matrix and the classification score matrix of each input image;

obtaining a significance loss function according to the target score matrix of each input image and a preset target true value; the significance loss function is used for optimizing the basic model, and the basic model is used for target detection.

2. The method for detecting a weakly supervised target based on context aware self training as set forth in claim 1, wherein the base model further includes a semantic balance positioning module and a classification fully connected layer, wherein the obtaining the positioning score matrix and the classification score matrix of each input image according to the internal region feature, the peripheral region feature, and the internal region feature of each suggestion frame includes:

inputting the internal region features, the peripheral region features and the internal region features of each suggestion frame to the semantic balance positioning module to obtain the positioning score matrix;

and inputting the internal region characteristics of each suggestion frame to the classification full-connection layer to obtain the classification score matrix.

3. The method for detecting the weak supervision target of the basic model based on the context awareness self-training according to claim 1, wherein the basic model further comprises a comprehensive clustering module and a label filtering module; the obtaining a significance loss function according to the target score matrix of each input image and a preset target true value comprises the following steps:

inputting the target score matrix of each input image into the comprehensive clustering module for candidate analysis to obtain candidate false true values of each input image;

and inputting the candidate false true value of each input image to the tag filtering module to obtain the significance loss function.

4. The method for detecting a weakly supervised target based on context aware self training as set forth in claim 3, wherein said inputting candidate false true values for each of said input images to said label filter module results in said significance loss function comprising:

denoising the region features corresponding to the candidate false true values of each input image to obtain denoised region features;

inputting each region characteristic into the tag filtering module to obtain a significant picture corresponding to each region characteristic;

and obtaining the significance loss function according to the significance pictures corresponding to each region characteristic.

5. The method for detecting the basic model weak supervision target based on context-aware self-training according to claim 1, wherein the backbone network is a res net50 network, wherein the inputting the plurality of input images into the backbone network for image processing to obtain the feature map of each input image comprises:

inputting the plurality of input images into the ResNet50 network for image processing to obtain a feature map of each input image; wherein the ResNet50 network comprises a 33-sized convolutional layer.

6. The method for detecting a weakly supervised target based on context aware self training as set forth in claim 1, wherein the obtaining the target score matrix for each input image based on the localization score matrix and the classification score matrix for each input image comprises:

and carrying out Hadamard product processing on the positioning score matrix and the classification score matrix of each input image to obtain a target score matrix of each input image.

7. The context-aware self-training based basic model weak supervision target detection method according to claim 1, wherein the target is one or more of an aircraft, an automobile, a building, a court, and a ship.

8. A basic model weak supervision target detection device based on context awareness self-training, comprising:

the input image processing module is configured to input a plurality of input images into the backbone network for image processing to obtain a feature map of each input image, and selectively search the plurality of input images to obtain a plurality of suggestion boxes corresponding to each input image;

the feature extraction module is configured to input the feature map of each input image and a plurality of suggestion boxes corresponding to each input image into the ROIAlign structure respectively for feature extraction to obtain the internal region feature, the peripheral region feature and the internal region feature of each suggestion box;

the feature processing module is configured to obtain a positioning score matrix and a classification score matrix of each input image according to the internal region feature, the peripheral region feature and the internal region feature of each suggestion frame, and obtain a target score matrix of each input image according to the positioning score matrix and the classification score matrix of each input image;

the model optimization module is configured to obtain a significance loss function according to the target score matrix of each input image and a preset target true value; the significance loss function is used for optimizing the basic model, and the basic model is used for target detection.

9. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the context-aware self-training based basic model weakly supervised target detection method as set forth in any one of claims 1-7.

10. A computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to perform the context-aware self-training based basic model weakly supervised target detection method as set forth in any one of claims 1-7.