CN114927236A - Detection method and system for multiple target images

Detection method and system for multiple target images

Info

Publication number
CN114927236A
Authority
CN
China
Prior art keywords
target area
target
image
detection
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210655674.XA
Other languages
Chinese (zh)
Inventor
梁浩
费伦科
苏建澎
江巧娴
梁立斌
张诗乔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202210655674.XA
Publication of CN114927236A
Legal status: Pending (Current)

Classifications

    • G16H 50/80 - ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
    • G06N 3/045 - Neural network architectures; combinations of networks
    • G06N 3/08 - Neural networks; learning methods
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/762 - Recognition or understanding using pattern recognition or machine learning: clustering, e.g. of similar faces in social networks
    • G06V 10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 - Fusion of extracted features
    • G06V 10/82 - Recognition or understanding using neural networks
    • G06V 20/69 - Microscopic objects, e.g. biological cells or cellular parts
    • G06V 20/695 - Microscopic objects: preprocessing, e.g. image segmentation
    • G06V 20/698 - Microscopic objects: matching; classification
    • G16H 50/70 - ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients
    • G06V 2201/07 - Indexing scheme: target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a detection method and a detection system for multiple target images, relating to the technical field of image detection. An original image data set is acquired, each original image in the data set comprising a large target area and a small target area, and a target detection model is constructed comprising a YOLO network model, an FPN network model, a PAN network model and a detection network model connected in sequence. The target detection model is trained with the preprocessed original image data set to obtain a trained target detection model. An image to be detected is then acquired and input into the trained target detection model, and the detection results for each target area of the image to be detected are output. The method improves the positioning accuracy for small targets and the effect of feature extraction on small targets.

Description

Detection method and system for multiple target images
Technical Field
The invention relates to the technical field of image detection, in particular to a detection method and a detection system for multiple target images.
Background
In recent years, demand for COVID-19 test reagents that detect the novel coronavirus has grown, yet classifying and counting the detection results of these reagents still requires manual work, including classifying and counting the reagent (a large target) and the reagent result (a small target) in the acquired reagent images.
Target detection refers to image segmentation based on the geometric and statistical characteristics of targets. It combines target extraction with target recognition, can process multiple targets in real time in complex scenes, and automatically extracts and identifies the required targets.
Conventional target detection methods are built on deep neural networks, with a convolutional network as the foundation and a classification network as the backbone. Because a small target is tiny relative to the size of the image to be detected, and the convolutional network downsamples the image several times, the small target occupies few pixels in the feature map output after feature extraction; the classification network therefore classifies small targets poorly, and small-target detection suffers. To address this, the prior art proposes a target detection method that, based on a YOLO network model, reduces the downsampling factor applied to the image by increasing the number of feature maps output by the feature extraction module of the YOLO network model, thereby strengthening small-target detection. However, detection of rectangular small targets, typified by reagent-result detection, generally uses images of higher resolution, and the YOLO network model cannot fully extract feature information from such high-resolution images. Moreover, these images contain many small targets of widely varying sizes, so the YOLO network model locates small targets with low accuracy and extracts their features poorly.
Disclosure of Invention
In order to solve the problems that existing target detection methods position multiple large and small targets in an image with low accuracy and, in particular, extract the features of small targets poorly, the invention provides a detection method and a detection system for multiple target images.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
A detection method for multiple target images comprises the following steps:
S1, acquiring an original image data set, wherein each original image in the original image data set comprises a large target area and a small target area;
S2, preprocessing the original image data set to obtain a preprocessed image data set, and dividing the image data set into a training set, a verification set and a test set;
S3, constructing a target detection model, wherein the target detection model comprises a YOLO network model, an FPN network model, a PAN network model and a detection network model connected in sequence;
S4, training the target detection model by using the training set, evaluating the target detection model during training by using the verification set, and testing the effectiveness of the target detection model by using the test set to obtain the trained target detection model;
S5, acquiring an image to be detected, wherein the image to be detected comprises multiple target areas formed by a large target area and a small target area, inputting the image to be detected into the trained target detection model, and outputting the detection result of each target area of the image to be detected.
Preferably, each original image in the original image data set is captured with a high-definition camera of a mobile phone.
Preferably, the process of preprocessing the original image data set includes:
labeling each original image of the original image data set, annotating the ground-truth box of the large target area and the ground-truth box of the small target area in each original image, and obtaining the image annotation data corresponding to each original image.
Preferably, the process of preprocessing the original image data set further includes:
performing a flipping operation, a scaling operation and a data enhancement operation on each original image in the original image data set, and changing the numerical information of the corresponding image annotation data in the image annotation data set according to those operations, wherein the numerical information comprises the coordinate information of the ground-truth box of the large target area and of the ground-truth box of the small target area in the image (a code sketch of these operations follows);
and splicing a plurality of original images in the original image data set into one image.
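As an illustration of the flipping, scaling and splicing operations above, a minimal Python sketch follows. It assumes normalized [class, x_center, y_center, width, height] annotations and four equally sized square images; the function names and the 640-pixel tile size are illustrative assumptions, not details from the patent.

```python
import numpy as np

def hflip_with_labels(image: np.ndarray, labels: np.ndarray):
    """Horizontally flip an HxWxC image and mirror the box x-centers."""
    flipped = image[:, ::-1, :].copy()
    labels = labels.copy()
    labels[:, 1] = 1.0 - labels[:, 1]  # x_center -> 1 - x_center
    return flipped, labels

def mosaic_stitch(images, labels_list, size=640):
    """Stitch four size x size images into one 2x2 image and remap
    their normalized box annotations into the stitched frame."""
    assert len(images) == 4 and len(labels_list) == 4
    canvas = np.zeros((2 * size, 2 * size, 3), dtype=images[0].dtype)
    offsets = [(0, 0), (0, size), (size, 0), (size, size)]  # (row, col)
    merged = []
    for img, labels, (r, c) in zip(images, labels_list, offsets):
        canvas[r:r + size, c:c + size] = img
        remapped = labels.copy()
        remapped[:, 1] = (labels[:, 1] * size + c) / (2 * size)  # x_center
        remapped[:, 2] = (labels[:, 2] * size + r) / (2 * size)  # y_center
        remapped[:, 3:5] = labels[:, 3:5] / 2.0                  # w, h halve
        merged.append(remapped)
    return canvas, np.concatenate(merged, axis=0)
```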
Preferably, the YOLO network model includes a CSPDarknet53 network and an SPPF module connected in sequence, and the network parameters and weights of the CSPDarknet53 network are those obtained by pre-training on the general ImageNet image classification dataset.
Preferably, in step S4, a feature extraction operation is performed on the preprocessed original image data set through the CSPDarknet53 network to obtain a first feature map, and a pooling operation and a feature fusion operation are performed on the first feature map through the SPPF module to obtain a second feature map. The second feature map is input into the FPN network model for multi-scale feature learning to obtain a third feature map, and the third feature map is input into the PAN network model for feature-size localization learning to obtain a fourth feature map. The fourth feature map is input into the detection network model, which performs automatic labeling and classification prediction based on the fourth feature map to obtain the predicted bounding box of the large target area, the predicted bounding box of the small target area, and the classification probabilities corresponding to each. When the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area is less than or equal to a preset difference, and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area is less than or equal to the preset difference, training is finished and the trained target detection model is obtained.
Preferably, the detection network model automatically labels the large target prediction area and the small target prediction area using predefined anchor boxes; the predefined anchor boxes are adaptive, and their adaptive computation proceeds as follows (a code sketch follows these steps):
setting the width and height of the initial anchor boxes used to label the large target prediction area and the small target prediction area;
scaling each feature image in the fourth feature map by a preset ratio according to its width and height to obtain scaled feature images;
introducing a K-means clustering algorithm and setting its cluster centers according to the scaled feature images, each cluster center being a rectangular box;
determining the intersection area and the union area of the initial anchor box and each cluster center, and updating the clustering result of the K-means clustering algorithm according to the ratio of the intersection area to the union area;
and updating the width and height of the initial anchor boxes according to the clustering result to obtain the predicted bounding box of the large target area and the predicted bounding box of the small target area.
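A minimal sketch of this clustering follows, using the common k-means-with-IoU-distance recipe over (width, height) pairs. It is one plausible reading of the steps above under that assumption; the function names and defaults are illustrative, not taken from the patent.

```python
import numpy as np

def wh_iou(boxes: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """IoU between every (w, h) box and every (w, h) anchor, treating
    both as rectangles that share the same top-left corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU; the cluster centers
    become the adaptive anchor-box widths and heights."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes, anchors), axis=1)  # max IoU = min 1-IoU
        for j in range(k):
            members = boxes[assign == j]
            if len(members):
                anchors[j] = members.mean(axis=0)  # update width and height
    return anchors
```

The IoU of two width-height pairs is computed as if the boxes shared a corner, so the ratio of intersection area to union area drives the cluster assignment, matching the update rule described above.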
Preferably, the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area, and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area, are both evaluated by the value of a Loss function: the value of Loss represents the difference, and a preset value of Loss represents the preset difference. The specifics are as follows:
setting parts in a large target area and a small target area in an original image as a foreground, setting parts outside the large target area and the small target area as a background, equally dividing the original image into a plurality of grids, and introducing a loss function formula as follows:
Loss = λ1·L_cls + λ2·L_obj + λ3·L_loc    (1)
wherein λ1, λ2 and λ3 are hyperparameters, L_cls is the error produced by classifying the original image, L_obj is the error produced by judging whether a target is a foreground target, and L_loc is the error produced by localizing the bounding boxes of the large target area and the small target area;
L_cls is given by formula (2), which is rendered as an image in the original publication. Its terms are: B, the number of ground-truth boxes of the large target area and the small target area; 1^obj_ij, an indicator of whether the j-th predicted bounding box in the i-th grid cell is a foreground target, taking the value 1 if so and 0 otherwise; p_i(c), the classification probability; p'_i(c) = 1 - p_i(c); and log(), the logarithmic function;
L_obj is given by formula (3), likewise rendered as an image in the original publication. Its terms are: 1^noobj_ij, an indicator of whether the j-th predicted bounding box in the i-th grid cell is a background target, taking the value 1 if so and 0 otherwise; c_i, the true confidence, which is 1 for a foreground target and 0 for a background target; and c'_i, the predicted confidence, which is 1 for a foreground target and 0 for a background target;
L_loc is given by formulas (4) to (7):
L_loc = L_CIoU = 1 - CIoU    (4)
CIoU = IoU - (ρ²(b, b^gt)/c² + αv)    (5)
α = v / ((1 - IoU) + v)    (6)
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))²    (7)
wherein IoU is the ratio of the intersection area to the union area of the ground-truth box and the predicted bounding box, ρ²(b, b^gt) is the squared distance between the center points of the predicted bounding box and the ground-truth box, c² is the square of the diagonal length of the minimum closure region that can contain both the predicted bounding box and the ground-truth box, w^gt/h^gt is the aspect ratio of the ground-truth box, w/h is the aspect ratio of the predicted bounding box, and arctan() is the arctangent function.
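For concreteness, a small Python sketch of formulas (4) to (7) follows. Boxes are assumed to be given as [x1, y1, x2, y2] corner coordinates; the function name and the epsilon guard are illustrative additions.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """L_loc = 1 - CIoU for one predicted box and one ground-truth box."""
    # IoU: intersection area over union area
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # rho^2: squared distance between the two box centers
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2 +
            ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4.0

    # c^2: squared diagonal of the smallest region enclosing both boxes
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v (eq. 7): aspect-ratio consistency; alpha (eq. 6): trade-off weight
    v = (4 / math.pi ** 2) * (math.atan((gt[2] - gt[0]) / (gt[3] - gt[1] + eps)) -
                              math.atan((pred[2] - pred[0]) / (pred[3] - pred[1] + eps))) ** 2
    alpha = v / ((1.0 - iou) + v + eps)

    ciou = iou - (rho2 / c2 + alpha * v)   # eq. (5)
    return 1.0 - ciou                      # eq. (4)
```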
The invention also provides a detection system for multiple target images, which comprises:
an acquisition unit, configured to acquire an original image data set, each original image in the original image data set including a large target area and a small target area;
a preprocessing unit, configured to preprocess the original image data set to obtain a preprocessed image data set and divide the image data set into a training set, a verification set and a test set;
a construction unit, configured to construct a target detection model, the target detection model comprising a YOLO network model, an FPN network model, a PAN network model and a detection network model connected in sequence;
a training unit, configured to train the target detection model with the training set, evaluate the target detection model during training with the verification set, and test the effectiveness of the target detection model with the test set to obtain the trained target detection model;
a detection unit, configured to acquire an image to be detected, the image to be detected comprising multiple target areas formed by a large target area and a small target area, input the image to be detected into the trained target detection model, and output the detection result of each target area of the image to be detected.
Preferably, the YOLO network model includes a CSPDarknet53 network and an SPPF module connected in sequence, and the network parameters and weights of the CSPDarknet53 network are those obtained by pre-training on the general ImageNet image classification dataset. The training unit is specifically configured to: perform a feature extraction operation on the preprocessed original image data set through the CSPDarknet53 network to obtain a first feature map; perform a pooling operation and a feature fusion operation on the first feature map through the SPPF module to obtain a second feature map; input the second feature map into the FPN network model for multi-scale feature learning to obtain a third feature map; input the third feature map into the PAN network model for feature-size localization learning to obtain a fourth feature map; input the fourth feature map into the detection network model, which performs automatic labeling and classification prediction based on the fourth feature map to obtain the predicted bounding box of the large target area, the predicted bounding box of the small target area, and the classification probabilities corresponding to each; and, when the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area is less than or equal to a preset difference and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area is less than or equal to the preset difference, finish training to obtain the trained target detection model.
The detection system for the multiple target images is used for executing the detection method for the multiple target images.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a detection method and a detection system for multiple target images, wherein when a target detection model is constructed, an FPN network model and a PAN network model are added on the basis of a YOLO network model, the FPN network model learns feature information with different sizes through a top-down structure, and the PAN network model strengthens the positioning effect on a small target through a bottom-up structure, so that the positioning precision of the small target and the feature extraction effect on the small target can be improved.
Drawings
FIG. 1 is a schematic flow chart of a detection method for multiple target images according to the present invention;
FIG. 2 is a schematic diagram of a CSP module in the CSPDarknet53 network according to the present invention;
FIG. 3 is a schematic diagram of an SPPF module according to the present invention;
FIG. 4 is a diagram illustrating an FPN network model and a PAN network model according to the present invention;
FIG. 5 is a diagram illustrating the parameters of the loss function proposed by the present invention;
FIG. 6 is a schematic diagram of a multi-object image-oriented detection system according to the present invention;
FIG. 7 is a schematic diagram showing an example of the preprocessing process in COVID-19 test reagent target detection according to the present invention;
fig. 8 is a schematic diagram showing an example of the COVID-19 test reagent target detection process according to the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, some parts of the drawings may be omitted, enlarged or reduced, and do not represent actual sizes;
it will be understood by those skilled in the art that certain descriptions of well-known structures in the drawings may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
example 1
Considering that existing target detection methods position multiple large and small targets with low accuracy and, in particular, extract the features of small targets poorly, this embodiment provides a detection method for multiple target images: by constructing an improved target detection model, it strengthens the localization of small targets and improves both the positioning accuracy for small targets and the effect of feature extraction on them. Taking current COVID-19 test reagent target detection as an example, the method is described with reference to the flow diagram of fig. 1 and comprises the following steps:
s1, acquiring an original image data set, wherein each original image in the original image data set comprises a large target area and a small target area;
In this step, each original image in the original image data set is obtained by photographing a COVID-19 test reagent and its corresponding reagent result with a high-definition mobile phone camera. Each original image includes a large target area and a small target area: the large target area is the region of the original image where the reagent is located, and the small target area is the region where the reagent result is located. The large and small targets come in a variety of sizes and together form the multiple targets; 'large' and 'small' are relative to each other.
S2, preprocessing an original image data set to obtain a preprocessed image data set, and dividing the image data set into a training set, a verification set and a test set;
In this step, the specific preprocessing process is as follows:
Each original image of the original image data set is labeled manually: the ground-truth box of the large target area and the ground-truth box of the small target area are annotated in each original image to obtain the image annotation data corresponding to each original image, and the original image data set together with the corresponding image annotation data is divided into a training set, a verification set and a test set.
Optionally, the training set, the verification set and the test set are divided in the ratio 6:2:2 (a brief code sketch of such a split follows).
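A brief sketch of such a 6:2:2 split, with hypothetical names:

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split into 60% train, 20% verification, 20% test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(0.6 * len(samples))
    n_val = int(0.2 * len(samples))
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```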
A flipping operation, a scaling operation and a data enhancement operation are carried out on each original image in the original image data set, and the numerical information of each corresponding image annotation data in the image annotation data set is changed according to those operations; the numerical information comprises the coordinate information of the ground-truth box of the large target area and of the ground-truth box of the small target area in the image. In addition, a plurality of original images in the original image data set are spliced into one image.
S3, constructing a target detection model, wherein the target detection model comprises a YOLO network model, an FPN network model, a PAN network model and a detection network model connected in sequence;
Referring to fig. 2, 3 and 4, in this step the YOLO network model comprises a CSPDarknet53 network and an SPPF module connected in sequence, and the network parameters and weights of the CSPDarknet53 network are those obtained by pre-training on the general ImageNet image classification dataset. The CSPDarknet53 network comprises CSP modules and the Darknet53 model; the design concept of the CSP module is shown in fig. 2, that of the SPPF module in fig. 3, and that of the FPN network model and the PAN network model in fig. 4 (a minimal sketch of the SPPF module follows).
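The sketch below is a minimal PyTorch rendering of that SPPF idea: three chained 5x5 max-pools whose outputs are concatenated with the input and fused by a 1x1 convolution. The channel sizes and the omission of normalization and activation layers are simplifying assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial pyramid pooling (fast): chained max-pools plus 1x1 fusion."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 5):
        super().__init__()
        hidden = in_ch // 2
        self.cv1 = nn.Conv2d(in_ch, hidden, kernel_size=1)
        self.cv2 = nn.Conv2d(hidden * 4, out_ch, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)    # receptive field grows with each pooling
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))  # feature fusion

# e.g.: second_feature_map = SPPF(1024, 1024)(first_feature_map)
```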
S4, training the target detection model by using the training set, evaluating the target detection model in the training process by using the verification set, and testing the effectiveness of the target detection model by using the test set to obtain the trained target detection model;
In this step, a feature extraction operation is performed on the preprocessed original image data set through the CSPDarknet53 network to obtain a first feature map, and a pooling operation and a feature fusion operation are performed on the first feature map through the SPPF module to obtain a second feature map. The second feature map is input into the FPN network model for multi-scale feature learning to obtain a third feature map, and the third feature map is input into the PAN network model for feature-size localization learning to obtain a fourth feature map. The fourth feature map is input into the detection network model, which performs automatic labeling and classification prediction based on the fourth feature map to obtain the predicted bounding box of the large target area, the predicted bounding box of the small target area, and the classification probabilities corresponding to each. When the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area is less than or equal to a preset difference, and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area is less than or equal to the preset difference, training finishes and the trained target detection model is obtained (a simplified training-loop sketch follows).
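The simplified training-loop sketch below illustrates the stopping criterion just described, with the mean epoch loss standing in for the 'difference' and a preset loss value for the preset difference. The model, data loader and loss function are assumed to exist elsewhere; all names and hyperparameters are illustrative.

```python
import torch

def train(model, loader, loss_fn, preset_loss=0.05, max_epochs=300, lr=1e-3):
    """Optimize the composite loss of equation (1); stop once the mean
    epoch loss reaches the preset value (the 'preset difference')."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(max_epochs):
        running = 0.0
        for images, targets in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), targets)  # lambda-weighted sum
            loss.backward()
            opt.step()
            running += loss.item()
        if running / max(len(loader), 1) <= preset_loss:
            break
    return model
```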
The detection network model automatically labels the large target prediction area and the small target prediction area using predefined anchor boxes; the predefined anchor boxes are adaptive, and their adaptive computation is as follows:
setting the width and height of the initial anchor boxes used to label the large target prediction area and the small target prediction area;
scaling each feature image in the fourth feature map by a preset ratio according to its width and height to obtain scaled feature images;
introducing a K-means clustering algorithm and setting its cluster centers according to the scaled feature images, each cluster center being a rectangular box;
determining the intersection area and the union area of the initial anchor box and each cluster center, and updating the clustering result of the K-means clustering algorithm according to the ratio of the intersection area to the union area;
and updating the width and height of the initial anchor boxes according to the clustering result to obtain the predicted bounding box of the large target area and the predicted bounding box of the small target area.
S5, obtaining an image to be detected, wherein the image to be detected comprises multiple target areas formed by a large target area and a small target area, inputting the image to be detected into the trained target detection model, and outputting the detection result of each target area of the image to be detected.
In this embodiment, overall, when the target detection model is constructed, the FPN network model and the PAN network model are added on top of the YOLO network model; the FPN network model learns feature information at different sizes through its top-down structure, and the PAN network model strengthens the localization of small targets through its bottom-up structure, so the positioning accuracy for small targets and the effect of feature extraction on small targets are improved.
Example 2
In this embodiment, building on embodiment 1, the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area, and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area, mentioned in embodiment 1, are evaluated.
Both differences are evaluated through the value of a Loss function: the value of Loss represents the difference, and a preset value of Loss represents the preset difference, specifically as follows:
setting parts in a large target area and a small target area in an original image as a foreground, setting parts outside the large target area and the small target area as a background, equally dividing the original image into a plurality of grids, and introducing a loss function formula as follows:
Loss = λ1·L_cls + λ2·L_obj + λ3·L_loc    (1)
wherein λ1, λ2 and λ3 are hyperparameters, L_cls is the error produced by classifying the original image, L_obj is the error produced by judging whether a target is a foreground target, and L_loc is the error produced by localizing the bounding boxes of the large target area and the small target area;
L_cls is given by formula (2), which is rendered as an image in the original publication. Its terms are: B, the number of ground-truth boxes of the large target area and the small target area; 1^obj_ij, an indicator of whether the j-th predicted bounding box in the i-th grid cell is a foreground target, taking the value 1 if so and 0 otherwise; p_i(c), the classification probability; p'_i(c) = 1 - p_i(c); and log(), the logarithmic function;
L_obj is given by formula (3), likewise rendered as an image in the original publication. Its terms are: 1^noobj_ij, an indicator of whether the j-th predicted bounding box in the i-th grid cell is a background target, taking the value 1 if so and 0 otherwise; c_i, the true confidence, which is 1 for a foreground target and 0 for a background target; and c'_i, the predicted confidence, which is 1 for a foreground target and 0 for a background target;
Referring to fig. 5, L_loc is given by formulas (4) to (7):
L_loc = L_CIoU = 1 - CIoU    (4)
CIoU = IoU - (ρ²(b, b^gt)/c² + αv)    (5)
α = v / ((1 - IoU) + v)    (6)
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))²    (7)
wherein IoU is the ratio of the intersection area to the union area of the ground-truth box and the predicted bounding box; ρ²(b, b^gt) is the squared distance between the center points of the predicted bounding box and the ground-truth box, i.e. the square of the value d shown in fig. 5; c² is, as shown in fig. 5, the square of the diagonal length of the minimum closure region that can contain both the predicted bounding box and the ground-truth box; w^gt/h^gt is the aspect ratio of the ground-truth box; w/h is the aspect ratio of the predicted bounding box; and arctan() is the arctangent function.
In this embodiment, the difference between the predicted bounding box of the large target area and its ground-truth box and the difference between the predicted bounding box of the small target area and its ground-truth box are evaluated with the CIoU loss function, which improves the performance of the network.
Example 3
Referring to fig. 6, the present embodiment describes a detection system for multiple target images in the present invention, and the detection system for multiple target images in the present embodiment includes:
an acquisition unit 601 configured to acquire an original image data set, each original image in the original image data set including a large target region and a small target region;
a preprocessing unit 602, configured to preprocess an original image data set to obtain a preprocessed image data set, and divide the image data set into a training set, a verification set, and a test set;
a construction unit 603, configured to construct a target detection model, where the target detection model includes a YOLO network model, an FPN network model, a PAN network model and a detection network model connected in sequence;
the training unit 604 is configured to train a target detection model using a training set, evaluate the target detection model in the training process using a validation set, and test the effectiveness of the target detection model using a test set to obtain a trained target detection model;
the detection unit 605 is configured to obtain an image to be detected, where the image to be detected includes multiple target regions formed by a large target region and a small target region, input the image to be detected into a trained target detection model, and output a detection result of each target region of the image to be detected.
The YOLO network model includes a CSPDarknet53 network and an SPPF module connected in sequence, where the network parameters and weights of the CSPDarknet53 network are those obtained by pre-training on the general ImageNet image classification dataset. The training unit 604 is specifically configured to: perform a feature extraction operation on the preprocessed original image data set through the CSPDarknet53 network to obtain a first feature map; perform a pooling operation and a feature fusion operation on the first feature map through the SPPF module to obtain a second feature map; input the second feature map into the FPN network model for multi-scale feature learning to obtain a third feature map; input the third feature map into the PAN network model for feature-size localization learning to obtain a fourth feature map; input the fourth feature map into the detection network model, which performs automatic labeling and classification prediction based on the fourth feature map to obtain the predicted bounding box of the large target area, the predicted bounding box of the small target area, and the classification probabilities corresponding to each; and, when the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area is less than or equal to a preset difference and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area is less than or equal to the preset difference, finish training to obtain the trained target detection model.
Example 4
In this embodiment, current COVID-19 test reagent target detection is taken as an example, and the detection process is described with reference to the schematic diagrams of fig. 7 and fig. 8. Referring to fig. 7, a high-definition mobile phone camera first photographs a COVID-19 test reagent and its corresponding reagent result to obtain an original image data set. The original image data set is then preprocessed: a flipping operation, a scaling operation and a data enhancement operation are applied to each original image, and the large target area and the small target area of each original image are labeled, the large target area being the region of the image where the reagent is located and the small target area being the region where the reagent result is located (the large and small targets come in various sizes, together form the multiple targets, and are 'large' and 'small' only relative to each other). This yields the preprocessed image data set, which is divided into a training set, a verification set and a test set. The target detection model is trained with the training set, evaluated during training with the verification set, and tested for effectiveness with the test set to obtain the trained target detection model. Referring to fig. 8, the image to be detected is input into the trained target detection model, and inference with the trained model yields the final detection result, namely the predicted bounding box of the large target area and the predicted bounding box of the small target area in the image to be detected.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A detection method for multiple target images, characterized by comprising the following steps:
S1, acquiring an original image data set, wherein each original image in the original image data set comprises a large target area and a small target area;
S2, preprocessing the original image data set to obtain a preprocessed image data set, and dividing the image data set into a training set, a verification set and a test set;
S3, constructing a target detection model, wherein the target detection model comprises a YOLO network model, an FPN network model, a PAN network model and a detection network model connected in sequence;
S4, training the target detection model by using the training set, evaluating the target detection model during training by using the verification set, and testing the effectiveness of the target detection model by using the test set to obtain the trained target detection model;
S5, acquiring an image to be detected, wherein the image to be detected comprises multiple target areas formed by a large target area and a small target area, inputting the image to be detected into the trained target detection model, and outputting the detection result of each target area of the image to be detected.
2. The method for detecting multiple target images according to claim 1, wherein each original image in the original image data set is captured by a high-definition camera of a mobile phone.
3. The method for detecting multiple target images according to claim 2, wherein preprocessing the original image data set comprises:
labeling each original image of the original image data set, annotating the ground-truth box of the large target area and the ground-truth box of the small target area in each original image, and obtaining the image annotation data corresponding to each original image.
4. The method for detecting multiple target images according to claim 3, wherein preprocessing the original image data set further comprises:
performing a flipping operation, a scaling operation and a data enhancement operation on each original image in the original image data set, and changing the numerical information of the corresponding image annotation data in the image annotation data set according to those operations, wherein the numerical information comprises the coordinate information of the ground-truth box of the large target area and of the ground-truth box of the small target area in the image;
and splicing a plurality of original images in the original image data set into one image.
5. The method for detecting multiple target images according to claim 4, wherein the YOLO network model comprises a CSPDarknet53 network and an SPPF module connected in sequence, and the network parameters and weights of the CSPDarknet53 network are those obtained by pre-training on the general ImageNet image classification data set.
6. The method for detecting multiple target images according to claim 5, wherein, in step S4:
a feature extraction operation is performed on the preprocessed original image data set through the CSPDarknet53 network to obtain a first feature map; a pooling operation and a feature fusion operation are performed on the first feature map through the SPPF module to obtain a second feature map; the second feature map is input into the FPN network model for multi-scale feature learning to obtain a third feature map; the third feature map is input into the PAN network model for feature-size localization learning to obtain a fourth feature map; the fourth feature map is input into the detection network model, which performs automatic labeling and classification prediction based on the fourth feature map to obtain the predicted bounding box of the large target area, the predicted bounding box of the small target area, and the classification probabilities corresponding to each; and when the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area is less than or equal to a preset difference, and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area is less than or equal to the preset difference, training is finished to obtain the trained target detection model.
7. The method for detecting multiple target images according to claim 6, wherein the detection network model automatically labels the large target prediction area and the small target prediction area using predefined anchor boxes, the predefined anchor boxes are adaptive, and the adaptive computation of the adaptive anchor boxes is as follows:
setting the width and height of the initial anchor boxes used to label the large target prediction area and the small target prediction area;
scaling each feature image in the fourth feature map by a preset ratio according to its width and height to obtain scaled feature images;
introducing a K-means clustering algorithm and setting its cluster centers according to the scaled feature images, each cluster center being a rectangular box;
determining the intersection area and the union area of the initial anchor box and each cluster center, and updating the clustering result of the K-means clustering algorithm according to the ratio of the intersection area to the union area;
and updating the width and height of the initial anchor boxes according to the clustering result to obtain the predicted bounding box of the large target area and the predicted bounding box of the small target area.
8. The method as claimed in claim 7, wherein the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area are both evaluated by the value of a Loss function: the value of Loss represents the difference, and a preset value of Loss represents the preset difference, specifically as follows:
setting the parts in the large target area and the small target area in the original image as the foreground, setting the parts outside the large target area and the small target area as the background, equally dividing the original image into a plurality of grids, and introducing a loss function formula as follows:
Loss = λ1·L_cls + λ2·L_obj + λ3·L_loc    (1)
wherein λ1, λ2 and λ3 are hyperparameters, L_cls is the error produced by classifying the original image, L_obj is the error produced by judging whether a target is a foreground target, and L_loc is the error produced by localizing the bounding boxes of the large target area and the small target area;
L_cls is given by formula (2), which is rendered as an image in the original publication. Its terms are: B, the number of ground-truth boxes of the large target area and the small target area; 1^obj_ij, an indicator of whether the j-th predicted bounding box in the i-th grid cell is a foreground target, taking the value 1 if so and 0 otherwise; p_i(c), the classification probability; p'_i(c) = 1 - p_i(c); and log(), the logarithmic function;
L_obj is given by formula (3), likewise rendered as an image in the original publication. Its terms are: 1^noobj_ij, an indicator of whether the j-th predicted bounding box in the i-th grid cell is a background target, taking the value 1 if so and 0 otherwise; c_i, the true confidence, which is 1 for a foreground target and 0 for a background target; and c'_i, the predicted confidence, which is 1 for a foreground target and 0 for a background target;
L_loc is given by formulas (4) to (7):
L_loc = L_CIoU = 1 - CIoU    (4)
CIoU = IoU - (ρ²(b, b^gt)/c² + αv)    (5)
α = v / ((1 - IoU) + v)    (6)
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))²    (7)
wherein IoU is the ratio of the intersection area to the union area of the ground-truth box and the predicted bounding box, ρ²(b, b^gt) is the squared distance between the center points of the predicted bounding box and the ground-truth box, c² is the square of the diagonal length of the minimum closure region that can contain both the predicted bounding box and the ground-truth box, w^gt/h^gt is the aspect ratio of the ground-truth box, w/h is the aspect ratio of the predicted bounding box, and arctan() is the arctangent function.
9. A detection system for multiple target images, characterized by comprising:
an acquisition unit, configured to acquire an original image data set, each original image in the original image data set including a large target area and a small target area;
a preprocessing unit, configured to preprocess the original image data set to obtain a preprocessed image data set and divide the image data set into a training set, a verification set and a test set;
a construction unit, configured to construct a target detection model, the target detection model comprising a YOLO network model, an FPN network model, a PAN network model and a detection network model connected in sequence;
a training unit, configured to train the target detection model with the training set, evaluate the target detection model during training with the verification set, and test the effectiveness of the target detection model with the test set to obtain the trained target detection model;
a detection unit, configured to acquire an image to be detected, the image to be detected comprising multiple target areas formed by a large target area and a small target area, input the image to be detected into the trained target detection model, and output the detection result of each target area of the image to be detected.
10. The detection system for multiple target images according to claim 9, wherein the YOLO network model comprises a CSPDarknet53 network and an SPPF module connected in sequence, and the network parameters and weights of the CSPDarknet53 network are those obtained by pre-training on the general ImageNet image classification data set; the training unit is specifically configured to: perform feature extraction on the preprocessed original image data set through the CSPDarknet53 network to obtain a first feature map; perform a pooling operation and a feature fusion operation on the first feature map through the SPPF module to obtain a second feature map; input the second feature map into the FPN network model for multi-scale feature learning to obtain a third feature map; input the third feature map into the PAN network model for feature-size localization learning to obtain a fourth feature map; input the fourth feature map into the detection network model, which performs automatic labeling and classification prediction based on the fourth feature map to obtain the predicted bounding box of the large target area, the predicted bounding box of the small target area, and the classification probabilities corresponding to each; and, when the difference between the predicted bounding box of the large target area and the ground-truth box of the large target area is less than or equal to a preset difference and the difference between the predicted bounding box of the small target area and the ground-truth box of the small target area is less than or equal to the preset difference, finish training to obtain the trained target detection model.
CN202210655674.XA 2022-06-10 2022-06-10 Detection method and system for multiple target images Pending CN114927236A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210655674.XA CN114927236A (en) 2022-06-10 2022-06-10 Detection method and system for multiple target images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210655674.XA CN114927236A (en) 2022-06-10 2022-06-10 Detection method and system for multiple target images

Publications (1)

Publication Number Publication Date
CN114927236A true CN114927236A (en) 2022-08-19

Family

ID=82814623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210655674.XA Pending CN114927236A (en) 2022-06-10 2022-06-10 Detection method and system for multiple target images

Country Status (1)

Country Link
CN (1) CN114927236A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994116A (en) * 2023-08-04 2023-11-03 北京泰策科技有限公司 Target detection method and system based on self-attention model and yolov5
CN116994116B (en) * 2023-08-04 2024-04-16 北京泰策科技有限公司 Target detection method and system based on self-attention model and yolov5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination