CN116958919A - Target detection method, target detection device, computer readable medium and electronic equipment - Google Patents

Target detection method, target detection device, computer readable medium and electronic equipment

Info

Publication number
CN116958919A
Authority
CN
China
Prior art keywords
sample
target
head
regression
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211476436.9A
Other languages
Chinese (zh)
Inventor
高斌斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211476436.9A priority Critical patent/CN116958919A/en
Publication of CN116958919A publication Critical patent/CN116958919A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a target detection method, a target detection device, a computer readable medium and electronic equipment, wherein the method comprises the following steps: obtaining an original model; extracting a feature map of a target sample image through a feature extraction network, and generating candidate frames through a region recommendation network; dividing the candidate frames into positive samples and negative samples, and generating candidate frame features through an information integration network; generating a class prediction result through a classification head, generating a bounding box regression position through a regression head, and determining a total loss value, wherein a loss function of the classification head comprises a first loss function corresponding to positive samples and a second loss function corresponding to negative samples, and the second loss function comprises an adjustment parameter positively correlated with the class prediction result; and training a target model according to the total loss value, and carrying out target detection according to the target model. The application improves the learning effect on new-class instances under a small number of annotations. The embodiment of the application can be applied to various scenes such as industrial defect quality inspection, maps, automatic driving, intelligent traffic and assisted driving.

Description

Target detection method, target detection device, computer readable medium and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a target detection method, a target detection device, a computer readable medium and electronic equipment.
Background
In the fields of object detection and instance segmentation, the dataset used to train a model may contain object instances that lack labeling information. Although these object instances are not targets of the current learning task, they reduce the learning effect on a new learning task when the model is subsequently trained on that new task and the new task needs to learn the classes of these object instances.
Disclosure of Invention
The embodiment of the application provides a target detection method, a target detection device, a computer-readable medium and electronic equipment, which can, at least to a certain extent, improve the learning effect on new-class instances under a small number of annotations and avoid forgetting the learned old classes.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a target detection method, the method including: acquiring an original model, wherein the original model comprises a feature extraction network, a region recommendation network, an information integration network, a classification head and a regression head; extracting a feature map of a target sample image through the feature extraction network, and generating a plurality of candidate frames according to the feature map through the region recommendation network; dividing each candidate frame into a positive sample and a negative sample, and integrating each candidate frame with the feature map through the information integration network respectively to generate candidate frame features corresponding to each candidate frame; generating category prediction results corresponding to each positive sample and each negative sample respectively through the classification head, generating a boundary box regression position corresponding to each positive sample through the regression head, and determining total loss values of the classification head and the regression head according to the category prediction results, the boundary box regression position and marking information corresponding to the target sample image, wherein a loss function of the classification head comprises a first loss function corresponding to the positive sample and a second loss function corresponding to the negative sample, and the second loss function comprises an adjustment parameter positively correlated with the category prediction results; and adjusting parameters of the original model according to the total loss value to obtain a target model, and performing target detection according to the target model.
According to an aspect of an embodiment of the present application, there is provided an object detection apparatus, the apparatus including: an acquisition unit, configured to acquire an original model, where the original model comprises a feature extraction network, a region recommendation network, an information integration network, a classification head and a regression head; an extraction and generation unit, configured to extract a feature map of the target sample image through the feature extraction network and generate a plurality of candidate frames according to the feature map through the region recommendation network; an integration and division unit, configured to divide each candidate frame into a positive sample or a negative sample, and integrate each candidate frame with the feature map through the information integration network respectively so as to generate candidate frame features corresponding to each candidate frame; a loss value determining unit, configured to generate, by using the classification head, a class prediction result corresponding to each positive sample and each negative sample, generate, by using the regression head, a bounding box regression position corresponding to each positive sample, and determine a total loss value of the classification head and the regression head according to the class prediction result, the bounding box regression position, and labeling information corresponding to the target sample image, where a loss function of the classification head includes a first loss function corresponding to the positive sample and a second loss function corresponding to the negative sample, and the second loss function includes an adjustment parameter positively correlated to the class prediction result; and a parameter adjustment and target detection unit, configured to adjust the parameters of the original model according to the total loss value to obtain a target model, and carry out target detection according to the target model.
In some embodiments of the present application, based on the foregoing solution, the information integration network includes a size correction module and a plurality of convolution layers connected to the size correction module, where the size correction module is configured to convert a mapping result of each candidate frame in the feature map into a corresponding target size feature, and the plurality of convolution layers are configured to map each target size feature into a corresponding candidate frame feature.
In some embodiments of the present application, based on the foregoing solution, the original model further includes an instance segmentation header connected to the size correction module, the target model is further configured to perform instance segmentation, and the loss value determining unit is further configured to, before determining the total loss value of the classification header and the regression header according to the class prediction result, the bounding box regression location, and the labeling information corresponding to the target sample image: generating an instance segmentation mask corresponding to each positive sample through the instance segmentation head; the loss value determination unit is configured to: and determining the total loss values of the classification head, the regression head and the example segmentation head according to the category prediction result, the regression position of the boundary box, the example segmentation mask and the labeling information corresponding to the target sample image.
In some embodiments of the application, based on the foregoing scheme, the integrating and dividing unit is configured to: determining the intersection ratio of each candidate frame and a real annotation frame in the annotation information corresponding to the target sample image; and dividing each candidate frame into a positive sample and a negative sample according to the magnitude relation between the cross-over ratio corresponding to each candidate frame and a preset cross-over ratio threshold, wherein the cross-over ratio corresponding to the candidate frame of the positive sample is larger than the cross-over ratio corresponding to the candidate frame of the negative sample.
In some embodiments of the present application, based on the foregoing solution, before each of the candidate boxes is integrated with the feature map through the information integration network, the integration and partitioning unit is further configured to: the positive and negative samples are sampled at a predetermined ratio.
In some embodiments of the present application, based on the foregoing, the target sample image is one sample image in a set of original sample images for training the original model, and the set of original sample images includes a plurality of sample images and labeling information corresponding to each sample image.
In some embodiments of the present application, based on the foregoing solution, at least one sample image in the original sample image set includes at least one first target instance without corresponding labeling information, and after adjusting parameters of the original model according to the total loss value to obtain a target model, the parameter adjustment and target detection unit is further configured to: training the target model is continued based on a first sample data set, wherein the first sample data set comprises a plurality of first sample images and labeling information corresponding to each first sample image, at least one first sample image comprises a first designated instance which is the same as the category of the first target instance, and the first sample data set comprises labeling information corresponding to each first designated instance.
In some embodiments of the present application, based on the foregoing solution, after adjusting the parameters of the original model according to the total loss value to obtain a target model, the parameter adjustment and target detection unit is further configured to: training the target model is continued based on a second sample data set, wherein the second sample data set comprises a plurality of second sample images and labeling information corresponding to each second sample image, at least one second sample image comprises at least one second target instance without corresponding labeling information, at least one sample image in the original sample image set comprises a second designated instance with the same category as the second target instance, and the original sample image set comprises the labeling information corresponding to the second designated instance.
In some embodiments of the present application, based on the foregoing solution, the set of original sample images is used for performing small sample learning on the target model, the labeling information in the set of original sample images includes labeling information corresponding to a predetermined number of instances of each category, and the number of instances belonging to at least one category in the sample images in the set of original sample images is greater than the predetermined number.
According to an aspect of the embodiments of the present application, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the object detection method as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the target detection method as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided a computer program product including computer instructions stored in a computer-readable storage medium, from which computer instructions a processor of a computer device reads, the processor executing the computer instructions, causing the computer device to perform the object detection method as described in the above embodiment.
In the technical solutions provided in some embodiments of the present application, when an original model including a feature extraction network, a region recommendation network, an information integration network, a classification head and a regression head is trained with a target sample image, a feature map of the target sample image is extracted by the feature extraction network, and a plurality of candidate frames are then generated by the region recommendation network according to the feature map. Each candidate frame is divided into a positive sample or a negative sample; each positive sample and each negative sample are then input to the classification head to generate corresponding class prediction results, and each positive sample is input to the regression head to obtain a corresponding bounding box regression position, so that a total loss value can be determined according to the class prediction results and the bounding box regression positions and used to guide further training of the original model. Because the classification head adopts separate loss functions for positive samples and negative samples, and the loss function corresponding to negative samples includes an adjustment parameter positively correlated with the output result of the classification head, when the output result of the classification head is smaller, that is, when the negative sample is more likely to correspond to a background region, the adjustment parameter is smaller, the contribution of the negative sample to the loss function is smaller, and the loss value obtained from the loss function corresponding to the negative sample is smaller; in other words, the information learned from the negative sample is reduced. Therefore, even if the target sample image contains some unlabeled instances belonging to new categories, these instances belong to the negative samples of the current learning task and cause little interference, and after the target model is obtained by training, the learning effect on new-class instances under a small number of annotations can be improved if new categories need to be learned. In addition, when learning a new category, the learned categories are not learned again, and forgetting of the learned old categories can be avoided by reducing the information learned from the negative samples.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture that may be used to implement aspects of embodiments of the present application;
FIG. 2 shows a flow chart of a method of object detection according to one embodiment of the application;
FIG. 3 shows a network structure diagram of an original model according to one embodiment of the application;
FIG. 4 shows a flowchart of details of step 230 in the embodiment of FIG. 2, according to one embodiment of the application;
FIG. 5 shows a flowchart of details of step 230 in the embodiment of FIG. 2, according to another embodiment of the present application;
FIG. 6 shows a flowchart of the details of step 240 in the embodiment of FIG. 2, according to one embodiment of the application;
FIG. 7 shows a flowchart of steps subsequent to step 250 in the embodiment of FIG. 2, according to one embodiment of the application;
FIG. 8 shows a flowchart of steps subsequent to step 250 in the embodiment of FIG. 2, according to another embodiment of the present application;
FIG. 9 shows the modified cross entropy loss function under different parameters and its comparison with the standard cross entropy loss function, according to an embodiment of the application;
FIG. 10 shows a block diagram of an object detection device according to one embodiment of the application;
fig. 11 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Object detection is a task of detecting the class and position of an object or instance of interest in an image, and instance segmentation is a task of segmenting an instance from the pixel level on the basis of object detection.
For both object detection and instance segmentation tasks, a sample dataset is required to train the model, however, the problem of sample datasets containing instances of missing annotation information is widespread in real-world applications.
For a model, it may be necessary to learn multiple tasks, for example, after the model learns to classify instances of an old class, it may be necessary to learn to classify instances of a new class later. This learning approach often occurs in an incremental learning scenario and a small sample detection scenario.
In an incremental learning scenario, tasks built at past times may only relate to a portion of the categories based on the circumstances at the time, and new categories typically need to be added at future times as demands change. It is apparent that data at a future time may contain the categories defined at the past time, and if the new task data does not label the categories defined for the old task, the problem of missing labels arises. In small-sample detection and instance segmentation scenarios, the learning of the model involves both base-class and new-class data; the purpose of training is that, given only a small number of annotated new-class samples, the model can quickly transfer the knowledge learned on the base classes to the recognition of the new classes. Here, the lack of labels is more serious and appears mainly in two aspects. First, similar to the incremental learning scenario, labels are missing between the old and new tasks: the base-class data inevitably contains new-class instances, and these are unavoidable in the new-class data. Second, the small-sample setting requires that only a few instances be annotated for each new class, and when the number of instances in an annotated image exceeds the number defined by the small-sample setting, some potential new-class instances will miss labels. The lack of labeling in this case generally shows an increasing trend as more categories are defined.
In practical application scenarios, such as industrial defect detection, on the one hand, due to changes in production materials and product designs, the appearance defect characteristics of products on a production line may shift significantly, which means that the defect type distribution will also change; on the other hand, some defects occur with very low probability, so that collecting the corresponding samples is very difficult, and the small data volume is a very prominent characteristic.
In order to adapt to the environment of defect drift and small data volume on the industrial production line and to improve the accuracy of defect quality inspection in this environment, a robust few-sample instance segmentation algorithm needs to be developed. Considering that existing few-sample instance segmentation algorithms mainly follow the fully supervised learning paradigm, directly applying the fully supervised learning framework under few-sample or incremental settings suffers from the missing-label problem and a serious conflict: foreground and background are confused. In the incremental learning scenario, the conflict appears between new and old tasks: if the purpose of old-task learning is to identify old categories, potentially unlabeled new categories will be identified as background, which limits the learning of new tasks; when learning a new task, the goal is just the opposite, the new model learns to identify new classes and must treat the old classes as background, which causes the model to catastrophically forget the old classes. In the small-sample scenario, the conflict between new- and old-task learning is the same as in incremental learning. In addition, when learning a new task, incomplete labeling creates a large conflict within the learning itself, which disturbs the training of the model and thereby limits its recognition capability on new classes.
To this end, the present application first provides a target detection method. The target detection method provided by the embodiment of the application can overcome the defects, can relieve the interference caused by the lack of the labeling information, can improve the learning effect of the new class instance under a small number of labels, and can avoid forgetting the learned old class.
FIG. 1 shows a schematic diagram of an exemplary system architecture that may be used to implement the technical solution of an embodiment of the present application. As shown in fig. 1, the system architecture 100 includes a vehicle 110, a cloud 120 and a user terminal 130, and communication can be performed between the vehicle 110 and the cloud 120, and between the user terminal 130 and the cloud 120. The vehicle 110 specifically includes a camera 130 for capturing an image of the environment around the vehicle 110, and an original model to be trained is set in the cloud 120. When the object detection method provided by the embodiment of the present application is applied to the system architecture shown in fig. 1, one process may be as follows: firstly, starting and running a vehicle 110, acquiring environmental images around the vehicle 110 by a camera 130 of the vehicle 110, uploading the environmental images to a cloud end 120 through a communication module of the vehicle 110, accessing each environmental image in the cloud end 120 by a user through a user terminal 130, and submitting labeling information corresponding to an instance in the environmental images to the cloud end 120; the cloud end 120 builds a data set by utilizing the environment image and the corresponding labeling information, and trains an original model in the cloud end 120 by utilizing the target detection method provided by the embodiment of the application based on the data set, so as to obtain a target model; finally, the cloud 120 may issue the target model to the vehicle 110 or other vehicles for target detection or driving assistance of the vehicle.
In some embodiments of the present application, when the number of unlabeled environmental images stored in the cloud 120 reaches a predetermined number, the cloud 120 sends a reminder to the user terminal 130 to prompt the user to label the unlabeled environmental images.
In some embodiments of the present application, the vehicle 110 may periodically upload the acquired environmental images to the cloud 120.
In some embodiments of the present application, the user terminal 130 submits only annotation information corresponding to a portion of the instances in the environmental image to the cloud 120.
In some embodiments of the present application, after training in the cloud 120 to obtain the target model, the vehicle 110 may continuously acquire a new environment image and upload the new environment image to the cloud 120, the user terminal 130 may submit the labeling information corresponding to the new class of examples in the new environment image to the cloud 120, the cloud 120 constructs a new data set based on the new environment image and the labeling information corresponding to the new class of examples, and continuously trains the target model based on the new data set.
It should be understood that the number of vehicles, cameras on vehicles, and user terminals in fig. 1 is merely illustrative. According to the implementation requirement, the vehicle can be provided with any number, any number of cameras can be arranged at various positions on the vehicle, and a plurality of user terminals for submitting the labeling information can also be arranged.
It should be noted that fig. 1 shows only one embodiment of the present application. Although the solution of the embodiment of fig. 1 is used in the intelligent driving field of vehicles, and is used for training environmental images on roads, in other embodiments of the present application, the solution may be applied to various other fields, for example, in an automatic article sorting scene in the intelligent logistics field, and may also be applied to an industrial defect detection scene for detecting product appearance defects on a production line; although in the solution of the embodiment of fig. 1, the user terminal is a desktop computer, and the training of the model is performed at the cloud end, in other embodiments of the present application, the user terminal may also be various other types of terminal devices, such as a smart phone, a tablet computer, a notebook computer, a wearable device, and the model may also be trained on various types of terminal devices, such as a desktop computer, a notebook computer, a workstation, and the like; although in the solution of the embodiment of fig. 1, the training of the model, the deployment of the model and the labeling of the images are performed on different terminal devices, in other embodiments of the present application, any two of the three may be performed on the same terminal device. The embodiments of the present application should not be limited in any way, nor should the scope of the application be limited in any way.
It is easy to understand that the object detection method provided by the embodiment of the present application is generally performed by a user terminal, and accordingly, the object detection device is generally disposed in the user terminal. However, in other embodiments of the present application, the server may also have a similar function to the user terminal, so as to execute the target detection scheme provided by the embodiments of the present application.
Therefore, the embodiment of the application can be applied to a terminal or a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The implementation details of the technical scheme of the embodiment of the application are described in detail below:
Fig. 2 shows a flow chart of a method of object detection according to an embodiment of the present application, which may be performed by various computing and processing capable devices, such as a user terminal including, but not limited to, a cell phone, a computer, a smart voice interaction device, a smart home appliance, a vehicle terminal, an aircraft, etc., or a cloud server. The embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like. Referring to fig. 2, the target detection method at least includes the following steps:
in step 210, an original model is obtained, the original model including a feature extraction network, a region recommendation network, an information integration network, a classification header, and a regression header.
The original model is a model which needs to be further trained, and the original model can be a model which is subjected to preliminary training or a model which is not subjected to complete training.
The feature extraction network may be a pre-trained network, and the region recommendation network is trained based on the pre-trained feature extraction network.
Fig. 3 shows a network structure diagram of an original model according to an embodiment of the present application. Referring to fig. 3, the original model is a two-stage model. The first stage mainly includes three sub-modules, namely a Backbone (backbone network), an RPN (Region Proposal Network, i.e., the region recommendation network) and ROI-Align (Region of Interest Align); the Backbone is the feature extraction network, and ROI-Align is a part of the information integration network. The second stage includes a classification head and a regression head.
In step 220, a feature map of the target sample image is extracted by the feature extraction network, and a plurality of candidate boxes are generated from the feature map by the region recommendation network.
The target sample image may contain one or more examples, wherein the examples are targets to be detected, examples may be various entities such as people, animals and the like which can be recorded in the image, and examples may also be part of a certain entity, for example, may be defects in appearance of products. Specifically, the target sample image may be a picture or a photograph in various formats such as jpeg and bmp, or may be a video image frame included in a certain video. Typically, the target sample image may be captured by a camera, but in some cases, the target sample image may also be automatically generated by a computer device.
In one embodiment of the present application, the target sample image is one sample image in a set of original sample images for training the original model, the set of original sample images including a plurality of sample images and labeling information corresponding to each sample image.
With continued reference to fig. 3, the image input to the original model is first fed to the Backbone, which extracts the corresponding feature map. The Backbone passes the feature map to the RPN, which outputs corresponding candidate boxes (proposals) based on the input feature map. Specifically, the RPN generally includes two branches: one branch classifies the generated anchors, determining whether each anchor represents background (i.e., the anchor does not contain an object) or foreground (i.e., the anchor contains an object); the other branch predicts the bounding box regression position corresponding to each anchor. The RPN finally selects a plurality of anchors from all anchors according to their classification results, and adjusts the position and size of the selected anchors according to the bounding box regression positions, thereby obtaining a plurality of candidate boxes (proposals).
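As an illustrative sketch only (written in PyTorch; the layer widths, anchor count and input sizes are assumed values, not taken from the application), the two RPN branches described above can be expressed as:

    import torch
    import torch.nn as nn

    class TinyRPNHead(nn.Module):
        """Illustrative RPN head: one shared conv, then a classification branch
        (foreground/background score per anchor) and a regression branch
        (4 box offsets per anchor)."""
        def __init__(self, in_channels=256, num_anchors=9):
            super().__init__()
            self.shared = nn.Conv2d(in_channels, in_channels, 3, padding=1)
            self.cls_branch = nn.Conv2d(in_channels, num_anchors, 1)      # objectness logits
            self.reg_branch = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box deltas

        def forward(self, feature_map):
            x = torch.relu(self.shared(feature_map))
            return self.cls_branch(x), self.reg_branch(x)

    # usage: the scores select the best anchors, the deltas adjust their position/size
    rpn = TinyRPNHead()
    fmap = torch.randn(1, 256, 50, 50)             # feature map from the Backbone
    objectness, box_deltas = rpn(fmap)
    print(objectness.shape, box_deltas.shape)      # (1, 9, 50, 50), (1, 36, 50, 50)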
In step 230, each candidate frame is divided into a positive sample and a negative sample, and each candidate frame is integrated with the feature map through the information integration network, so as to generate candidate frame features corresponding to each candidate frame.
Positive samples are candidate boxes corresponding to foreground objects, and negative samples are candidate boxes corresponding to background areas.
Fig. 4 shows a flowchart of the details of step 230 in the embodiment of fig. 2, according to one embodiment of the application. Referring to fig. 4, the step of dividing each candidate box into a positive sample and a negative sample may specifically include the following steps:
in step 231, the intersection ratio of each candidate frame and the true annotation frame in the annotation information corresponding to the target sample image is determined.
The intersection ratio (Intersection over Union, IoU) is the ratio of the intersection to the union of the candidate frame and the true annotation frame.
The annotation information corresponding to the target sample image can comprise a real annotation frame for identifying the position of the instance and the category of the instance.
In step 232, each candidate frame is divided into a positive sample and a negative sample according to the magnitude relation between the corresponding intersection ratio of each candidate frame and the preset intersection ratio threshold, wherein the intersection ratio of the candidate frames of the positive sample is greater than the intersection ratio of the candidate frames of the negative sample.
Specifically, a candidate frame whose intersection ratio with the real annotation frame reaches a first predetermined intersection ratio threshold may be taken as a positive sample, and a candidate frame whose intersection ratio with the real annotation frame is smaller than a second predetermined intersection ratio threshold may be taken as a negative sample. The first predetermined intersection ratio threshold and the second predetermined intersection ratio threshold may be the same, for example, both may be set to 0.5; alternatively, the first predetermined intersection ratio threshold may be greater than the second predetermined intersection ratio threshold, for example, the first predetermined intersection ratio threshold is set to 0.7 and the second predetermined intersection ratio threshold is set to 0.3.
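For illustration, the division described above can be sketched as follows (a minimal PyTorch sketch; the 0.5/0.5 thresholds are just the example values mentioned above):

    import torch

    def box_iou(boxes1, boxes2):
        """Intersection over Union between two sets of (x1, y1, x2, y2) boxes."""
        area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
        area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
        lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])   # intersection top-left
        rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])   # intersection bottom-right
        wh = (rb - lt).clamp(min=0)
        inter = wh[..., 0] * wh[..., 1]
        return inter / (area1[:, None] + area2[None, :] - inter)

    def split_proposals(proposals, gt_boxes, pos_thresh=0.5, neg_thresh=0.5):
        """A proposal is a positive sample if its best IoU with any ground-truth
        box reaches pos_thresh, and a negative sample if it is below neg_thresh."""
        max_iou, _ = box_iou(proposals, gt_boxes).max(dim=1)
        positives = proposals[max_iou >= pos_thresh]
        negatives = proposals[max_iou < neg_thresh]
        return positives, negatives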
In some embodiments of the present application, the information integration network includes a size correction module for converting a mapping result of each candidate frame in the feature map into a corresponding target size feature, and a plurality of convolution layers connected to the size correction module for mapping each target size feature into a corresponding candidate frame feature.
Specifically, the size correction module may adopt an ROI-Align module or an ROI Pooling (Region of Interest Pooling) module. The ROI Pooling module may be adopted in the case that the second stage includes only a classification head and a regression head, and the ROI-Align module may be adopted in the case that the second stage further includes an instance segmentation head. The ROI Pooling module performs two quantization operations according to a candidate frame on the original image scale to obtain a mapping result of a target size corresponding to the candidate frame in the feature map, while the ROI-Align module realizes pixel alignment from the original image to the feature map and then to the candidate frame feature through a bilinear interpolation algorithm. The region corresponding to a candidate frame can thus be uniformly processed into a feature of fixed size through either the ROI-Align module or the ROI Pooling module. The several convolution layers may, for example, comprise a fully connected layer. With continued reference to fig. 3, ROI features, which are the target size features, are obtained by the ROI-Align module; although not shown in fig. 3, it is easy to understand that the first stage may also include several convolution layers after the ROI features. The candidate frame features obtained by the mapping of the several convolution layers may each be a feature vector.
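For illustration, a fixed-size candidate frame feature can be obtained with the ROI-Align operator provided by torchvision; the 7x7 output size and the 1/16 spatial scale below are assumed values, not specifics of the application:

    import torch
    from torchvision.ops import roi_align

    feature_map = torch.randn(1, 256, 50, 50)           # from the feature extraction network
    # candidate boxes in original-image coordinates, prefixed with the batch index
    proposals = torch.tensor([[0, 10.0, 20.0, 180.0, 200.0],
                              [0, 30.0, 40.0, 120.0, 150.0]])
    roi_feats = roi_align(feature_map, proposals, output_size=(7, 7),
                          spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
    print(roi_feats.shape)    # (2, 256, 7, 7): one fixed-size feature per candidate box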
Fig. 5 shows a flowchart of the details of step 230 in the embodiment of fig. 2 according to another embodiment of the application. Referring to fig. 5, step 230 may further specifically include the following steps:
in step 230', each candidate frame is divided into a positive sample and a negative sample, the positive sample and the negative sample are sampled according to a predetermined ratio, and each sampling result is integrated with the feature map through the information integration network, so as to generate candidate frame features corresponding to each sampling result.
It can be seen that, before integrating each candidate box with the feature map through the information integration network, the positive sample and the negative sample are sampled according to a predetermined proportion, and then the sampled positive sample and negative sample are input to the information integration network to be integrated with the feature map respectively. The predetermined ratio between positive and negative samples may be 1:1, 1:3, etc.
The number of candidate frames used subsequently can be reduced by sampling, so that the training speed is increased.
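A minimal sketch of such sampling (the total of 512 sampled candidate boxes and the 1:3 positive-to-negative ratio are assumed example values):

    import torch

    def sample_proposals(positives, negatives, num_samples=512, pos_fraction=0.25):
        """Randomly keep at most num_samples proposals at the predetermined
        positive:negative ratio (pos_fraction=0.25 corresponds to 1:3)."""
        num_pos = min(len(positives), int(num_samples * pos_fraction))
        num_neg = min(len(negatives), num_samples - num_pos)
        pos_idx = torch.randperm(len(positives))[:num_pos]
        neg_idx = torch.randperm(len(negatives))[:num_neg]
        return positives[pos_idx], negatives[neg_idx]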
Based on the above, the flow of the first stage in the original model shown in fig. 3 can be expressed as:

$$F_{\mathrm{ROI}} = \mathrm{ROI}\big(F_{\mathrm{RPN}} \odot F_{\mathrm{EF}};\ \theta_{\mathrm{s1}}\big)$$

where $\odot$ denotes element-by-element multiplication between vectors, s1 denotes the first stage, and $\theta_{\mathrm{s1}}$ denotes all parameters of the original model in the first stage. After the target sample image is input into the feature extraction network, the feature extraction network extracts a feature map $F_{\mathrm{EF}}$; the feature map $F_{\mathrm{EF}}$ is a robust and discriminative high-level feature. Category-independent candidate boxes $F_{\mathrm{RPN}}$ are then generated from these extracted features by the RPN. Finally, the information integration network containing the ROI-Align module or the ROI Pooling module generates candidate frame features $F_{\mathrm{ROI}}$ from the candidate boxes $F_{\mathrm{RPN}}$ for the next stage.
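For illustration, the first-stage flow can be sketched as the following composition, where backbone, rpn and roi_align_fn are placeholders standing in for the feature extraction network, the region recommendation network and the ROI-Align/ROI Pooling based information integration network described above:

    def first_stage(image, backbone, rpn, roi_align_fn):
        """F_EF = Backbone(X); F_RPN = RPN(F_EF); F_ROI = ROI(F_EF, F_RPN)."""
        f_ef = backbone(image)              # feature map of the target sample image
        f_rpn = rpn(f_ef)                   # category-independent candidate boxes
        f_roi = roi_align_fn(f_ef, f_rpn)   # candidate frame features for the second stage
        return f_ef, f_rpn, f_roi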
In step 240, a classification head generates a class prediction result corresponding to each positive sample and each negative sample, a regression head generates a bounding box regression position corresponding to each positive sample, and a total loss value of the classification head and the regression head is determined according to the class prediction result, the bounding box regression position and the labeling information corresponding to the target sample image, wherein a loss function of the classification head comprises a first loss function corresponding to the positive sample and a second loss function corresponding to the negative sample, and the second loss function comprises an adjustment parameter positively correlated with the class prediction result.
Referring to fig. 3, a classification head and a regression head may be used for target detection tasks, positive samples and negative samples are input to the classification head and the regression head in parallel, the candidate frames are classified by the classification head to obtain a classification prediction result, and the candidate frames are subjected to positioning regression by the regression head to generate a bounding box regression position for correcting the position and size of the candidate frames.
The loss function of the target detection task is:

$$\mathcal{L}_{\mathrm{s2}}^{\mathrm{OD}}(\theta_{\mathrm{cls}},\ \theta_{\mathrm{reg}}) = L_{\mathrm{CLS}} + L_{\mathrm{REG}}$$

where OD denotes object detection (Object Detection), s2 denotes the second stage, $\mathcal{L}_{\mathrm{s2}}^{\mathrm{OD}}$ is the loss function of the target detection task, $\theta_{\mathrm{cls}}$ are the parameters of the classification head, $\theta_{\mathrm{reg}}$ are the parameters of the regression head, $L_{\mathrm{CLS}}$ is the loss function of the classification head, and $L_{\mathrm{REG}}$ is the loss function of the regression head.
The total loss value of the classification head and the regression head can be obtained through the loss function.
For the target detection task, given an input image X and its corresponding annotation information $Y = \{(b_k, c_k)\}_{k=1}^{M}$, this loss function enables end-to-end optimization of the model, where $b_k$ is the true annotation box of the k-th instance, $c_k$ is the category of the k-th instance, and M is the number of instances contained in image X.
In some embodiments of the application, the original model further comprises an instance segmentation head coupled to the size correction module, and the target model is further configured to perform instance segmentation.
With continued reference to fig. 3, in the second stage, a segmentation head is further included in addition to the classification head and the regression head; the output of the ROI-Align module is up-sampled by the Up-Sample module, and the up-sampling result is input into the segmentation head.
FIG. 6 shows a flowchart of the details of step 240 in the embodiment of FIG. 2, according to one embodiment of the application. As shown in fig. 6, in determining the total loss value of the classification head and the regression head according to the classification prediction result, the regression position of the bounding box and the labeling information corresponding to the target sample image, the target detection method may further include:
In step 241, an instance segmentation mask corresponding to each positive sample is generated by the instance segmentation head.
The instance segmentation Mask is a Mask (Mask) that distinguishes instances and backgrounds from the pixel level. The pixel-level instance segmentation task can be completed through the instance segmentation head, so that the target model can be used as an instance segmentation model.
From the above, it can be seen that the regression head and the segmentation head only process the foreground target region corresponding to the positive sample, and the input of the classification head includes the background region corresponding to the negative sample in addition to the foreground target region corresponding to the positive sample.
The step of determining the total loss value of the classification head and the regression head according to the category prediction result, the regression position of the boundary box and the labeling information corresponding to the target sample image may specifically include:
in step 242, the total loss values of the classification header, the regression header, and the instance segmentation header are determined according to the classification prediction result, the bounding box regression location, the instance segmentation mask, and the labeling information corresponding to the target sample image.
The loss function of the instance segmentation task is:

$$\mathcal{L}_{\mathrm{s2}}^{\mathrm{IS}}(\theta_{\mathrm{cls}},\ \theta_{\mathrm{reg}},\ \theta_{\mathrm{seg}}) = L_{\mathrm{CLS}} + L_{\mathrm{REG}} + L_{\mathrm{SEG}}$$

where $\mathcal{L}_{\mathrm{s2}}^{\mathrm{IS}}$ is the loss function of the instance segmentation task, IS denotes instance segmentation (Instance Segmentation), $\theta_{\mathrm{seg}}$ are the parameters of the instance segmentation head, $L_{\mathrm{SEG}}$ is the loss function of the instance segmentation head, and the meaning of the other symbols is identical to that in the loss function of the target detection task and will not be repeated here.
For the instance segmentation task, given an input image X and its corresponding annotation information $Y = \{(b_k, c_k, m_k)\}_{k=1}^{M}$, this loss function enables end-to-end optimization of the model, where $b_k$ is the true annotation box of the k-th instance, $c_k$ is the category of the k-th instance, $m_k$ is the mask of the k-th instance, and M is the number of instances contained in image X.
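For illustration only, the total loss of the three heads can be accumulated as in the following sketch; the specific loss terms used here (cross entropy for classification, smooth L1 for box regression, binary cross entropy for masks) are common choices and are assumptions rather than requirements of the application:

    import torch
    import torch.nn.functional as F

    def total_loss(cls_logits, cls_targets,     # all sampled candidate boxes
                   box_preds, box_targets,      # positive samples only
                   mask_preds, mask_targets):   # positive samples only
        l_cls = F.cross_entropy(cls_logits, cls_targets)
        l_reg = F.smooth_l1_loss(box_preds, box_targets)
        l_seg = F.binary_cross_entropy_with_logits(mask_preds, mask_targets)
        return l_cls + l_reg + l_seg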
In step 250, the parameters of the original model are adjusted according to the total loss value to obtain a target model, and target detection is performed according to the target model.
Parameters of the original model can be adjusted and optimized according to the total loss value, and training of the original model is achieved.
Based on the foregoing, it is apparent that, when only very few samples are annotated, instances with missing labels exist in the annotation data. Thus, the candidate box features corresponding to negative samples fed into the classification head may not be true background, that is, they are quite likely to be potential foreground target regions whose labels are missing; for the regression head and the instance segmentation head, since the foreground regions are manually annotated, the sampled foreground regions must be accurate. In summary, under the condition of few labels, the box regression and instance segmentation of the instance segmentation model can be learned normally, but the learning of the classification head may affect the generalization capability of the model on new classes because of the noisy data existing in the input background samples.
Based on this, further technical details of the embodiment of the present application are as follows:
to be used forFor candidate frame features mapped by several convolution layers, it can be seen that it is a feature vector, R is the real number field,/v>The corresponding class label vector is +.>Wherein C is the number of foreground classes, 1 represents background class, and y is the number of foreground classes if the candidate frame feature belongs to the ith class i =1, otherwise, y i =0, then, by the softmax function pair of the sorting head +.>One probability prediction distribution generated by the transformation is:
wherein, the liquid crystal display device comprises a liquid crystal display device,for prediction +.>Probability of belonging to the ith class, x i Is->The result is output at the ith neural pathway of the classification head.
The cross entropy loss can be used to measure the discrepancy between the true class label vector $\mathbf{y}$ and the predicted distribution $\hat{\mathbf{p}}$; the classification head typically employs the loss function:

$$L_{\mathrm{CLS}} = -\sum_{i=1}^{C+1} y_i \log \hat{p}_i$$

where $L_{\mathrm{CLS}}$ is the loss function of the classification head, $y_i$ is the element of the i-th class in the class label vector, $\hat{p}_i$ is the predicted probability that the candidate frame feature belongs to the i-th class, and C is the number of foreground classes.
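As a reference point, the softmax probabilities and the standard cross entropy loss above can be computed as follows (a minimal sketch with an arbitrary example of C + 1 = 4 classes):

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([1.2, -0.3, 0.5, 2.0])   # classification-head outputs x_i for one box
    probs = F.softmax(logits, dim=0)                # p_i = exp(x_i) / sum_j exp(x_j)
    y = torch.tensor([0.0, 0.0, 0.0, 1.0])          # one-hot class label vector (last = background)
    l_cls = -(y * probs.log()).sum()                # standard cross entropy loss
    print(probs, l_cls)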
In order to process positive and negative samples differently, embodiments of the present application propose the following modified loss functions for the classification head:
it can be seen that the first loss function is a first loss function corresponding to a positive sample and the second loss function is a second loss function corresponding to a negative sample, wherein the second loss function is one of I.e. the adjustment parameter positively correlated with the category prediction result,/->For the prediction result, α and γ are parameters.
The first loss function is a standard cross entropy loss function, and when α=1 and γ=0, the second loss function is a standard cross entropy loss function.
The second loss function proposed by the embodiment of the application is different from the Focal Loss function. Specifically, the form of the Focal loss function is:

$$FL(p_t) = -\alpha\,(1 - p_t)^{\gamma}\,\log(p_t)$$

where $p_t$ measures the proximity of the predicted result to the true category; the larger $p_t$ is, the closer the predicted result is to the true category.
It can be seen that the main objective of the Focal loss function is to mine the contribution of difficult samples to the objective. However, given that in the case of small samples, the target is subject to a miss-mark, so that the difficult samples mined out may be noise samples, appropriate relaxed conditions should be given to reduce their contribution to the loss function.
Based on the above, the adjustment parameter introduced in the modified loss function of the embodiment of the present application is $\alpha\,\hat{p}^{\,\gamma}$, rather than the $\alpha\,(1 - p_t)^{\gamma}$ used in the Focal loss function.
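A minimal sketch of the modified classification loss as reconstructed above: standard cross entropy for positive samples, and cross entropy scaled by the adjustment parameter alpha * p_hat^gamma for negative samples. Treating p_hat as the probability predicted for the sample's labeled class, and the default alpha and gamma values, are assumptions rather than specifics of the application:

    import torch

    def modified_cls_loss(probs, labels, is_positive, alpha=0.25, gamma=1.0):
        """probs: (N, C+1) softmax outputs; labels: (N,) ground-truth class indices
        (background index for negative samples); is_positive: (N,) bool mask.
        With alpha=1 and gamma=0 the negative-sample branch reduces to standard CE."""
        p_hat = probs.gather(1, labels.unsqueeze(1)).squeeze(1)   # predicted prob of the labeled class
        ce = -torch.log(p_hat.clamp(min=1e-8))
        weight = torch.where(is_positive,
                             torch.ones_like(p_hat),              # first loss: plain CE for positives
                             alpha * p_hat.pow(gamma))            # second loss: down-weighted negatives
        return (weight * ce).mean()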
In one embodiment of the application, at least one sample image in the original sample image set includes at least one first target instance without corresponding annotation information, and the steps after obtaining the target model are as shown in FIG. 7.
FIG. 7 shows a flowchart of steps subsequent to step 250 in the embodiment of FIG. 2, according to one embodiment of the application. Referring to fig. 7, the following steps may be further included after step 250:
in step 260, training of the target model is continued based on a first sample dataset comprising a plurality of first sample images and annotation information corresponding to each first sample image, at least one of the first sample images comprising a first specified instance of the same class as the first target instance, the first sample dataset comprising annotation information corresponding to each first specified instance.
It is easy to understand that the annotation information corresponding to the sample image includes annotation information corresponding to each of the plurality of instances in the sample image.
Since the first target instance in the original sample image set does not have the corresponding labeling information, and the first sample data set also needs to learn the first designated instance with the same class as the first target instance, the learning of the new class of the first sample data set is limited when training based on the traditional loss function, and the training based on the scheme of the embodiment of the application can overcome the defect.
Fig. 8 shows a flow chart of steps subsequent to step 250 in the embodiment of fig. 2 according to another embodiment of the application. Referring to fig. 8, the following steps may be further included after step 250:
in step 270, training of the target model is continued based on a second sample dataset comprising a plurality of second sample images and labeling information corresponding to each second sample image, at least one second sample image comprising at least one second target instance without corresponding labeling information, at least one sample image in the original sample image set comprising a second designated instance of the same class as the second target instance, the original sample image set comprising labeling information corresponding to the second designated instance.
In other words, when training of the target model is continued based on the second sample data set, the original model has already been trained based on the original sample image set.
Since the original sample image set includes the second designated instance with corresponding labeling information, the target model has already completed learning the class of the second designated instance. If training of the target model is then continued based on the second sample data set, which includes a second target instance of the same class as the second designated instance but without corresponding labeling information, a conventional training manner may cause the target model to recognize the region corresponding to the class of the second target instance as background, i.e., cause catastrophic forgetting of the old class. Training according to the scheme of the embodiment of the present application can effectively overcome this drawback.
In one embodiment of the present application, the set of original sample images is used for small sample learning of the target model, the labeling information in the set of original sample images includes labeling information corresponding to a predetermined number of instances of each category, and the number of instances belonging to at least one category in the sample images of the set of original sample images is greater than the predetermined number.
When the number of instances belonging to at least one category in the sample images of the original sample image set is greater than the number of instances labeled for each category in the small-sample learning task, the original sample image set also contains instances that lack labeling information. By performing small-sample model training according to the scheme of the embodiment of the present application, the learning effect is not affected even when such instances lacking labeling information exist.
FIG. 9 shows the modified cross-entropy loss function under different parameter settings and its comparison with the standard cross-entropy loss function, according to an embodiment of the application. Referring to FIG. 9, the abscissa is the prediction result p̂ and the ordinate is the loss value; in the legend, the first value after "cali" is α and the second value is γ, e.g., cali-0.25-0 represents α=0.25 and γ=0; each curve represents a specific loss function, and "normal" denotes the standard cross-entropy loss function. As can be seen from FIG. 9, the adjustment parameter α·p̂^γ is set such that the loss value of the classification head is in most cases smaller than the loss value given by the standard cross-entropy loss function. When α=1 and γ=0, the curve of the modified cross-entropy loss function is covered by the curve of the standard cross-entropy loss function, indicating that the two loss functions are equivalent.
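Purely as an illustration (not a reproduction of the original figure), a comparable plot can be sketched under the assumed negative-sample loss -α·p̂^γ·log(p̂); the parameter pairs other than (1, 0), (0.25, 0) and (0.25, 0.2) are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 1.0, 500)          # prediction result p_hat
standard_ce = -np.log(p)                 # standard cross-entropy loss for a negative sample

for alpha, gamma in [(1.0, 0.0), (0.25, 0.0), (0.25, 0.2), (0.25, 0.5)]:
    plt.plot(p, alpha * p**gamma * standard_ce, label=f"cali-{alpha}-{gamma}")
plt.plot(p, standard_ce, "k--", label="normal")

plt.xlabel("prediction result")
plt.ylabel("loss value")
plt.legend()
plt.show()
```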
On the MS-COCO dataset, the same 20 classes as in the PASCAL-VOC dataset are used as new classes, the remaining 60 classes are used as base classes, and only one instance is labeled for each class. Experimental data for the new-class instance detection results and segmentation results under different configurations of α and γ are shown in Table 1:
TABLE 1
As can be seen from Table 1, the best instance detection and segmentation results were obtained when α=0.25 and γ=0.2; compared with the Baseline method, the AP50 index was improved by 2.67 and 1.86, respectively.
In summary, the inventors of the present application have identified a novel and challenging scenario for few-sample instance detection: most real object instances are present in the dataset but lack any annotation, so during training these unannotated regions are treated as background, which leads the model to misrecognize potential new-class objects as background and thus limits its detection and segmentation performance on new classes. For this scenario, the embodiment of the application provides a target detection method that achieves at least the following technical effects: a loss function based on negative-sample recalibration is provided, which can automatically calibrate the loss signal and alleviate the biased-classification problem caused by missing labeling information; on the common MS COCO standard dataset, the proposed method is significantly superior to the baseline method under the single-instance annotation condition. The method provided by the embodiment of the application can be applied directly to industrial defect quality inspection, and is particularly suitable for scenarios where data are difficult to collect at scale and instance labeling information is incomplete.
The following describes an embodiment of the apparatus of the present application, which may be used to perform the object detection method in the above-described embodiment of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the above-mentioned embodiments of the target detection method of the present application.
Fig. 10 shows a block diagram of an object detection device according to an embodiment of the application.
Referring to fig. 10, an object detection apparatus 1000 according to an embodiment of the present application includes: an acquisition unit 1010, an extraction and generation unit 1020, an integration and division unit 1030, a loss value determination unit 1040, and a parameter adjustment and target detection unit 1050. The acquiring unit 1010 is configured to acquire an original model, where the original model includes a feature extraction network, an area recommendation network, an information integration network, a classification header, and a regression header; the extracting and generating unit 1020 is configured to extract a feature map of a target sample image through the feature extraction network, and generate a plurality of candidate frames according to the feature map through the region recommendation network; the integrating and dividing unit 1030 is configured to divide each candidate frame into a positive sample and a negative sample, and integrate each candidate frame with the feature map through the information integration network, so as to generate a candidate frame feature corresponding to each candidate frame; the loss value determining unit 1040 is configured to generate, by using the classification head, a class prediction result corresponding to each positive sample and each negative sample, generate, by using the regression head, a bounding box regression position corresponding to each positive sample, and determine a total loss value of the classification head and the regression head according to the class prediction result, the bounding box regression position, and labeling information corresponding to the target sample image, where a loss function of the classification head includes a first loss function corresponding to the positive sample and a second loss function corresponding to the negative sample, and the second loss function includes an adjustment parameter positively correlated to the class prediction result; the parameter adjustment and target detection unit 1050 is configured to adjust parameters of the original model according to the total loss value, obtain a target model, and perform target detection according to the target model.
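By way of a non-authoritative sketch, one training iteration through these units might be organized as follows; the attribute names on the model (backbone, rpn, roi_align, cls_head, reg_head) and the target keys are hypothetical placeholders, recalibrated_cls_loss refers to the loss sketch given earlier, and the equal weighting of the two heads is an assumption.

```python
import torch.nn.functional as F

def training_step(model, image, targets, optimizer):
    feats = model.backbone(image)                     # feature extraction network: feature map
    proposals = model.rpn(feats)                      # region recommendation network: candidate boxes
    roi_feats = model.roi_align(feats, proposals)     # information integration network: candidate-box features
    cls_logits = model.cls_head(roi_feats)            # class prediction for positive and negative samples
    box_deltas = model.reg_head(roi_feats)            # bounding-box regression (positive samples)
    cls_loss = recalibrated_cls_loss(cls_logits, targets["labels"])
    reg_loss = F.smooth_l1_loss(box_deltas, targets["reg_targets"])
    total = cls_loss + reg_loss                       # total loss of classification and regression heads
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```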
In some embodiments of the present application, based on the foregoing solution, the information integration network includes a size correction module and a plurality of convolution layers connected to the size correction module, where the size correction module is configured to convert a mapping result of each candidate frame in the feature map into a corresponding target size feature, and the plurality of convolution layers are configured to map each target size feature into a corresponding candidate frame feature.
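A hedged sketch of such an information integration network is shown below, using torchvision's RoIAlign as one plausible realization of the size correction module; the channel count, the 7×7 output size and the two-layer convolution stack are assumptions.

```python
import torch.nn as nn
from torchvision.ops import roi_align

class InfoIntegration(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # convolution layers that map each fixed-size feature to a candidate-box feature
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, feature_map, boxes):
        # boxes: list with one (N, 4) tensor of candidate boxes per image
        pooled = roi_align(feature_map, boxes, output_size=(7, 7),
                           spatial_scale=1.0)   # size correction: fixed target size per box
        return self.convs(pooled)               # candidate-box features
```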
In some embodiments of the present application, based on the foregoing solution, the original model further includes an instance segmentation header connected to the size correction module, the target model is further configured to perform instance segmentation, and before determining the total loss value of the classification header and the regression header according to the class prediction result, the bounding box regression location, and the labeling information corresponding to the target sample image, the loss value determining unit 1040 is further configured to: generating an instance segmentation mask corresponding to each positive sample through the instance segmentation head; the loss value determining unit 1040 is configured to: and determining the total loss values of the classification head, the regression head and the example segmentation head according to the category prediction result, the regression position of the boundary box, the example segmentation mask and the labeling information corresponding to the target sample image.
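As a small hedged sketch, the three-head total loss could be combined as below; binary cross-entropy on the positive-sample masks and equal weighting of the heads are assumptions, not prescriptions of the application.

```python
import torch.nn.functional as F

def total_loss_with_mask(cls_loss, reg_loss, mask_logits, mask_targets):
    # mask_logits / mask_targets: predicted and ground-truth masks of the positive samples
    mask_loss = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    return cls_loss + reg_loss + mask_loss
```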
In some embodiments of the present application, based on the foregoing scheme, the integrating and dividing unit 1030 is configured to: determine the intersection-over-union (IoU) ratio of each candidate frame with a real annotation frame in the annotation information corresponding to the target sample image; and divide each candidate frame into positive samples and negative samples according to the magnitude relation between the IoU ratio corresponding to each candidate frame and a predetermined IoU threshold, wherein the IoU ratio corresponding to a positive-sample candidate frame is greater than that corresponding to a negative-sample candidate frame. A minimal sketch of this division is given below.
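The sketch assumes boxes in (x1, y1, x2, y2) format and a 0.5 IoU threshold; the embodiment only requires that positive candidate frames have a larger IoU than negative ones, so the exact threshold is an assumption.

```python
import torch

def split_candidates(boxes, gt_boxes, iou_threshold=0.5):
    """Return boolean masks marking positive and negative candidate boxes."""
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    lt = torch.max(boxes[:, None, :2], gt_boxes[None, :, :2])   # intersection top-left
    rb = torch.min(boxes[:, None, 2:], gt_boxes[None, :, 2:])   # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    iou = inter / (area_b[:, None] + area_g[None, :] - inter)   # (N, M) pairwise IoU
    best_iou, _ = iou.max(dim=1)                                # best-matching annotation per box
    positive = best_iou >= iou_threshold
    return positive, ~positive
```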
In some embodiments of the present application, based on the foregoing solution, before each of the candidate frames is integrated with the feature map through the information integration network, the integrating and dividing unit 1030 is further configured to: sample the positive and negative samples at a predetermined ratio.
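A small sketch of such ratio-controlled sampling follows; the 1:3 positive-to-negative ratio and the total of 512 samples are assumptions, since the application only states that a predetermined ratio is used.

```python
import torch

def sample_at_ratio(pos_mask, neg_mask, num_samples=512, pos_fraction=0.25):
    pos_idx = pos_mask.nonzero(as_tuple=False).squeeze(1)
    neg_idx = neg_mask.nonzero(as_tuple=False).squeeze(1)
    num_pos = min(int(num_samples * pos_fraction), pos_idx.numel())
    num_neg = min(num_samples - num_pos, neg_idx.numel())
    pos_sel = pos_idx[torch.randperm(pos_idx.numel())[:num_pos]]   # random subset of positives
    neg_sel = neg_idx[torch.randperm(neg_idx.numel())[:num_neg]]   # random subset of negatives
    return pos_sel, neg_sel
```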
In some embodiments of the present application, based on the foregoing, the target sample image is one sample image in a set of original sample images for training the original model, and the set of original sample images includes a plurality of sample images and labeling information corresponding to each sample image.
In some embodiments of the present application, based on the foregoing solution, at least one sample image in the original sample image set includes at least one first target instance without corresponding labeling information, and after adjusting parameters of the original model according to the total loss value to obtain a target model, the parameter adjustment and target detection unit 1050 is further configured to: training the target model is continued based on a first sample data set, wherein the first sample data set comprises a plurality of first sample images and labeling information corresponding to each first sample image, at least one first sample image comprises a first designated instance which is the same as the category of the first target instance, and the first sample data set comprises labeling information corresponding to each first designated instance.
In some embodiments of the present application, based on the foregoing solution, after adjusting the parameters of the original model according to the total loss value to obtain the target model, the parameter adjustment and target detection unit 1050 is further configured to: training the target model is continued based on a second sample data set, wherein the second sample data set comprises a plurality of second sample images and labeling information corresponding to each second sample image, at least one second sample image comprises at least one second target instance without corresponding labeling information, at least one sample image in the original sample image set comprises a second designated instance with the same category as the second target instance, and the original sample image set comprises the labeling information corresponding to the second designated instance.
In some embodiments of the present application, based on the foregoing solution, the set of original sample images is used for performing small sample learning on the target model, the labeling information in the set of original sample images includes labeling information corresponding to a predetermined number of instances of each category, and the number of instances belonging to at least one category in the sample images in the set of original sample images is greater than the predetermined number.
Fig. 11 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system 1100 of the electronic device shown in fig. 11 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 11, the computer system 1100 includes a central processing unit (Central Processing Unit, CPU) 1101 that can perform various appropriate actions and processes, such as performing the method described in the above embodiment, according to a program stored in a Read-Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a random access Memory (Random Access Memory, RAM) 1103. In the RAM 1103, various programs and data required for system operation are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An Input/Output (I/O) interface 1105 is also connected to bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1109 performs communication processing via a network such as the internet. The drive 1110 is also connected to the I/O interface 1105 as needed. Removable media 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in drive 1110, so that a computer program read therefrom is installed as needed in storage section 1108.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1109, and/or installed from the removable media 1111. When executed by a Central Processing Unit (CPU) 1101, performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As an aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
It will be appreciated that in particular embodiments of the present application, where data relating to image processing is involved, user approval or consent is required when the above embodiments of the present application are applied to particular products or technologies, and the collection, use and processing of the relevant data is required to comply with relevant legal regulations and standards in the relevant countries and regions.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. A method of target detection, the method comprising:
acquiring an original model, wherein the original model comprises a feature extraction network, a region recommendation network, an information integration network, a classification head and a regression head;
Extracting a feature map of a target sample image through the feature extraction network, and generating a plurality of candidate frames according to the feature map through the region recommendation network;
dividing each candidate frame into a positive sample and a negative sample, and integrating each candidate frame with the feature map through the information integration network respectively to generate candidate frame features corresponding to each candidate frame;
generating category prediction results corresponding to each positive sample and each negative sample respectively through the classification head, generating a boundary box regression position corresponding to each positive sample through the regression head, and determining total loss values of the classification head and the regression head according to the category prediction results, the boundary box regression position and marking information corresponding to the target sample image, wherein a loss function of the classification head comprises a first loss function corresponding to the positive sample and a second loss function corresponding to the negative sample, and the second loss function comprises an adjustment parameter positively correlated with the category prediction results;
and adjusting parameters of the original model according to the total loss value to obtain a target model, and performing target detection according to the target model.
2. The method of claim 1, wherein the information integration network includes a size correction module and a plurality of convolution layers coupled to the size correction module, the size correction module configured to convert a mapping result of each of the candidate frames in the feature map to a corresponding target size feature, the plurality of convolution layers configured to map each of the target size features to a corresponding candidate frame feature.
3. The method of claim 2, wherein the original model further comprises an instance segmentation header coupled to the size correction module, the target model further configured to perform instance segmentation, the method further comprising, prior to determining the total loss value of the classification header and the regression header based on the class prediction result, the bounding box regression location, and the labeling information corresponding to the target sample image:
generating an instance segmentation mask corresponding to each positive sample through the instance segmentation head;
the determining the total loss value of the classification head and the regression head according to the category prediction result, the regression position of the boundary box and the labeling information corresponding to the target sample image comprises the following steps:
And determining the total loss values of the classification head, the regression head and the example segmentation head according to the category prediction result, the regression position of the boundary box, the example segmentation mask and the labeling information corresponding to the target sample image.
4. The method of claim 1, wherein the dividing each of the candidate boxes into positive and negative samples comprises:
determining the intersection-over-union (IoU) ratio of each candidate frame with a real annotation frame in the annotation information corresponding to the target sample image;
and dividing each candidate frame into a positive sample and a negative sample according to the magnitude relation between the IoU ratio corresponding to each candidate frame and a preset IoU threshold, wherein the IoU ratio corresponding to the candidate frame of the positive sample is larger than the IoU ratio corresponding to the candidate frame of the negative sample.
5. The object detection method according to claim 1, wherein before integrating each of the candidate boxes with the feature map via the information integration network, respectively, the method further comprises:
the positive and negative samples are sampled at a predetermined ratio.
6. The method according to any one of claims 1 to 5, wherein the target sample image is one sample image in a set of original sample images for training the original model, the set of original sample images including a plurality of sample images and labeling information corresponding to each sample image.
7. The method of claim 6, wherein at least one sample image in the set of original sample images includes at least one first target instance without corresponding labeling information, and wherein after adjusting parameters of the original model according to the total loss value to obtain a target model, the method further comprises:
training the target model is continued based on a first sample data set, wherein the first sample data set comprises a plurality of first sample images and labeling information corresponding to each first sample image, at least one first sample image comprises a first designated instance which is the same as the category of the first target instance, and the first sample data set comprises labeling information corresponding to each first designated instance.
8. The method according to claim 6, wherein after adjusting parameters of the original model according to the total loss value to obtain a target model, the method further comprises:
training the target model is continued based on a second sample data set, wherein the second sample data set comprises a plurality of second sample images and labeling information corresponding to each second sample image, at least one second sample image comprises at least one second target instance without corresponding labeling information, at least one sample image in the original sample image set comprises a second designated instance with the same category as the second target instance, and the original sample image set comprises the labeling information corresponding to the second designated instance.
9. The object detection method according to claim 8, wherein the set of original sample images is used for small sample learning of the object model, the labeling information in the set of original sample images includes labeling information corresponding to a predetermined number of instances of each category, respectively, and the number of instances belonging to at least one category in the sample images of the set of original sample images is greater than the predetermined number.
10. An object detection device, the device comprising:
the system comprises an acquisition unit, a regression unit and a processing unit, wherein the acquisition unit is used for acquiring an original model, and the original model comprises a feature extraction network, a region recommendation network, an information integration network, a classification head and a regression head;
the extraction and generation unit is used for extracting a feature map of the target sample image through the feature extraction network and generating a plurality of candidate frames according to the feature map through the region recommendation network;
the integration and division unit is used for dividing each candidate frame into a positive sample and a negative sample, and integrating each candidate frame with the feature map through the information integration network respectively so as to generate candidate frame features corresponding to each candidate frame;
a loss value determining unit, configured to generate, by using the classification head, a class prediction result corresponding to each positive sample and each negative sample, generate, by using the regression head, a bounding box regression position corresponding to each positive sample, and determine a total loss value of the classification head and the regression head according to the class prediction result, the bounding box regression position, and labeling information corresponding to the target sample image, where a loss function of the classification head includes a first loss function corresponding to the positive sample and a second loss function corresponding to the negative sample, and the second loss function includes an adjustment parameter positively correlated to the class prediction result;
And the parameter adjustment and target detection unit is used for adjusting the parameters of the original model according to the total loss value to obtain a target model, and carrying out target detection according to the target model.
11. A computer readable medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the object detection method according to any one of claims 1 to 9.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the target detection method of any of claims 1 to 9.
13. A computer program product, characterized in that the computer program product comprises computer instructions stored in a computer-readable storage medium, from which computer instructions a processor of a computer device reads, the processor executing the computer instructions, causing the computer device to perform the object detection method according to any one of claims 1 to 9.