CN116486254A - Target counting model training method based on enhanced local context supervision information

Info

Publication number: CN116486254A
Application number: CN202310238457.5A
Authority: CN (China)
Inventors: 臧贺藏, 王言景, 申华磊, 周萌, 张�杰, 郑国清, 李国强, 赵晴
Applicant/Assignee: Institute of Agricultural Economics and Information, Henan Academy of Agricultural Sciences
Other languages: Chinese (zh)
Legal status: Pending

Classifications

    • G06V20/10 Terrestrial scenes
    • G06N3/084 Learning methods: backpropagation, e.g. using gradient descent
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/764 Recognition using pattern recognition or machine learning: classification, e.g. of video objects
    • G06V10/768 Recognition using pattern recognition or machine learning: context analysis, e.g. recognition aided by known co-occurring patterns
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • Y02T10/40 Engine management systems


Abstract

The invention belongs to the technical field of remote sensing image processing and discloses a target counting model training method based on enhanced local context supervision information. The method comprises the following steps: acquiring a sample image set and dividing it into a training set, a validation set and a test set, wherein the sample image set comprises a plurality of sample images containing target objects and a point annotation result corresponding to each sample image; and training the target counting model on the training set to obtain a final target counting model. The target counting model is obtained by adding a local segmentation branch and a feature fusion module on the basis of P2PNet. The target counting model has small error and better generalization performance, can accurately identify the positions and the number of the target objects, and solves the problem that the performance of existing counting models is limited by factors such as illumination, occlusion and overlapping of target objects.

Description

Target counting model training method based on enhanced local context supervision information
Technical Field
The invention belongs to the technical field of remote sensing image processing, and particularly relates to a target counting model training method based on enhanced local context supervision information.
Background
Wheat is an important grain crop in China; Henan's wheat production accounts for 1/4 of the national total and feeds about 400 million people, so maintaining continuously high wheat yields is of great significance for safeguarding China's grain security. During wheat growth, the wheat seedling number is a key factor limiting yield, and wheat yield is greatly affected by seedlings that are too sparse or too dense. Therefore, timely and accurate statistics of the wheat seedling number provide an important scientific basis for subsequent production tasks such as seedling-rate estimation, yield prediction and grain-quality estimation.
Traditional wheat seedling counting relies mainly on manual counting in the field, which is economically costly, labour-intensive and inefficient, and the counting results are easily influenced by subjective factors. With the development of deep learning, automatic target object counting using deep neural networks has become a new research hotspot. Compared with manual seedling counting, analysing collected wheat seedling images with a deep neural network to automatically detect the wheat seedling number can break through spatio-temporal limitations, remove dependence on agricultural experts and improve labour efficiency.
Scholars have applied deep learning techniques to counting targets such as cells, crowds, pigs and wheat ears, where detection-based methods and density-map-regression-based methods have been widely used. Detection-based counting methods mainly use detectors such as YOLO, SSD and Faster R-CNN to detect the target objects in an image and then obtain their number. Such methods provide not only the counting result but also the position information of the target objects through bounding boxes. However, they require bounding boxes as ground truth in the training stage, and because wheat seedling images suffer from occlusion, overlapping and distortion, direct bounding-box annotation is difficult; meanwhile, generating pseudo bounding boxes from point annotations is error-prone, and the subsequent processing is technically difficult. Counting methods based on density map regression use point annotations to generate a density map of the target objects in each training sample as the model's learning target, and the number of target objects is then obtained by integrating the density map predicted by the model. The annotation cost of counting wheat seedlings with this method is low, but the specific positions of the wheat seedlings cannot be clearly indicated, which hinders downstream tasks such as planting planning and fertile-farmland cultivation; the method is also easily affected by perspective distortion, and the robustness of the model is low. At present, the P2PNet proposed by Song et al. offers a new solution for target object counting. P2PNet directly takes the point annotations as the model's learning target and predicts the point coordinates of all target objects, from which the total number of target objects is obtained. Compared with the two types of counting models above, P2PNet requires neither bounding-box annotation of the target objects in the training samples nor the indirect generation of learning targets from point annotations via pseudo density maps or pseudo bounding boxes. In addition, P2PNet can clearly identify the positions of the target objects and can meet the requirements of downstream tasks.
However, the wheat seedling growing environment is complex, so wheat seedling images contain serious noise, and directly applying existing target object counting methods to wheat seedling counting performs poorly. On the one hand, dead leaves in the wheat field and varying illumination angles produce shadows of different directions and sizes in the wheat seedling images, which introduces interference noise into the counting model and seriously affects the performance of existing counting models; on the other hand, occlusion of wheat seedlings by soil clods in the field, and the overlapping of leaves when wheat seedlings grow densely, cause existing wheat seedling counting models to make misjudgements.
Disclosure of Invention
In view of the problems and shortcomings of the prior art, the invention aims to provide a target counting model training method based on enhanced local context supervision information.
Based on the above purpose, the invention adopts the following technical scheme:
the first aspect of the invention provides a target counting model training method based on enhanced local context supervision information, which comprises the following steps:
S1, acquiring a sample image set, wherein the sample image set comprises a plurality of sample images containing target objects and a point annotation result corresponding to each sample image; the point annotation result of a sample image is the position information corresponding to the annotation points of the target objects; and randomly dividing the sample image set into a training set, a validation set and a test set in proportion;
S2, inputting a sample image from the training set into a pre-constructed target counting model for counting to obtain a target counting result of the sample image, wherein the target counting result is the position information of the target objects obtained from the target counting model; constructing a loss function from the target counting result of the sample image and the point annotation result of the sample image, and updating the parameters of the target counting model by back-propagation according to the loss function to obtain a trained target counting model; the target counting model is obtained by adding a local segmentation branch and a feature fusion module between the base network and the point regression and classification branches of the point-to-point network P2PNet, which uses the classification framework VGG16_bn as its base network;
and S3, validating the trained target counting model on the validation set and then testing it on the test set to obtain the optimal target counting model.
More preferably, the point-to-point network P2PNet is a counting model based on point annotation: the position coordinates of the target objects are annotated as points, and the annotation result is then used directly as the model's learning target. The P2PNet before improvement comprises a base network, a point regression branch and a classification branch, and works as follows: first, global features of the target objects are extracted with VGG16_bn as the base network; then, the global features are fed simultaneously into the point regression branch and the classification branch, which respectively generate the candidate points of the target objects and the confidence scores (i.e. classification results) corresponding to those candidate points; finally, the position coordinates of the target objects are screened from the candidate points according to the classification results, and their total number is the target counting result.
Further, in the P2PNet model before improvement, the extracted global feature map F_0 is sent simultaneously to the point regression branch and the classification branch, whose structures are identical. The global feature map F_0 has dimensions 256×H×W, where H and W denote height and width respectively. First, a 3×3 convolution layer followed by a ReLU activation function is applied to F_0 twice in succession, leaving the tensor dimensions unchanged at 256×H×W; then one 3×3 convolution changes the number of channels from 256 to 2Z, giving an output tensor of dimension 2Z×H×W; finally, through operations such as dimension transposition and reshaping, a two-dimensional tensor of dimension M×2 is output. Here Z is a hyperparameter (4 is generally best) that determines the total number M of output candidate point coordinates and confidence scores, where M = H×W×Z. For the classification branch, this two-dimensional tensor represents M binary classification results, i.e. the results of classifying the M point coordinates as target or non-target; for the point regression branch, it represents M point-coordinate offsets, which are added to the predefined point coordinates in P2PNet to obtain the final M candidate point coordinates. In the inference stage, the final predicted target coordinates can be screened from the candidate points according to their corresponding confidence scores, and their total number is the target counting result.
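For orientation, the shape arithmetic of this shared branch structure can be sketched in a few lines of PyTorch. This is a hedged illustration of the description above, not the authors' code; class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class BranchHead(nn.Module):
    """Shared structure of the point regression and classification branches."""
    def __init__(self, channels: int = 256, z: int = 4):
        super().__init__()
        self.z = z
        # Two successive 3x3 conv + ReLU stages; tensor dimensions unchanged.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # One 3x3 conv changing the channel count from 256 to 2*Z.
        self.out = nn.Conv2d(channels, 2 * z, 3, padding=1)

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        x = self.out(self.body(f0))                      # (B, 2Z, H, W)
        b, _, h, w = x.shape
        # Transpose and reshape into an M x 2 tensor per image, M = H * W * Z.
        return x.permute(0, 2, 3, 1).reshape(b, h * w * self.z, 2)
```

The same head, instantiated twice, yields the coordinate offsets of the point regression branch and the two-class scores of the classification branch.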
Still preferably, the target counting model of the present invention is an improved P2PNet: it inherits the base network VGG16_bn, the point regression branch and the classification branch of the P2PNet before improvement, and inserts the local segmentation branch and the feature fusion module after the base network and before the point regression and classification branches.
Further, the base network VGG16_bn in the target counting network of the present invention still comprises 13 convolution blocks and 2 1×1 convolution layers. The 13 convolution blocks successively extract features from the input sample image and output 4 feature maps C_1, C_2, C_3, C_4 of different scales, whose corresponding widths and heights are successively halved. To reduce the number of model parameters, the 2 1×1 convolution layers perform channel compression on the feature maps C_3 and C_4, halving their channel counts and yielding Conv(C_3) and Conv(C_4). Finally, Conv(C_4) is upsampled by 2× nearest-neighbour interpolation and added to Conv(C_3) to obtain the global feature map F_0. The generation of F_0 can be expressed as:

F_0 = Conv(C_3) + Up(Conv(C_4))

where Conv and Up denote convolution and upsampling, respectively.
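Under the assumption that C_3 and C_4 carry the usual VGG16_bn channel counts (512 each, halved to 256 by the 1×1 convolutions), the construction of F_0 can be sketched as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalFeature(nn.Module):
    """F_0 = Conv(C_3) + Up(Conv(C_4)); the channel counts are assumptions."""
    def __init__(self, c3_channels: int = 512, c4_channels: int = 512):
        super().__init__()
        # 1x1 convolutions that halve the channel counts of C_3 and C_4.
        self.conv3 = nn.Conv2d(c3_channels, c3_channels // 2, kernel_size=1)
        self.conv4 = nn.Conv2d(c4_channels, c4_channels // 2, kernel_size=1)

    def forward(self, c3: torch.Tensor, c4: torch.Tensor) -> torch.Tensor:
        # 2x nearest-neighbour upsampling aligns Conv(C_4) with Conv(C_3).
        up = F.interpolate(self.conv4(c4), scale_factor=2, mode="nearest")
        return self.conv3(c3) + up               # global feature map F_0
```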
Preferably, the local segmentation branch comprises a local feature extraction module; the local feature extraction module performs local feature extraction on the global feature map extracted by the base network to obtain a local feature map; the local segmentation branch performs optimization judgment on the local feature map in combination with the point annotation result of the sample image to obtain an optimized local feature map; the feature fusion module fuses the optimized local feature map with the global feature map to obtain a fusion feature map, which is then used as the input of the point regression branch and the classification branch.
Preferably, the local feature extraction module comprises 3 structurally identical, sequentially connected dimension-reduction convolution modules and 1 3×3 convolution layer; the dimension-reduction convolution modules compress the channels of the input feature map. After the global feature map F_0 is passed through the first dimension-reduction convolution module, a once-reduced feature map with unchanged width and height and half the channels is obtained; this map is passed through the remaining 2 dimension-reduction convolution modules in turn, yielding the twice-reduced and thrice-reduced feature maps. The thrice-reduced feature map is then processed by the 3×3 convolution layer to obtain the local feature map F_1 with unchanged width and height and 2 channels. Here it is specified that the channel labelled 1 in the feature map represents local information and the channel labelled 0 represents non-local information, corresponding to the superscript l of the loss function L_G below. Furthermore, the 2-channel local feature map is the concatenation of the feature maps representing local and non-local information; it represents not only local high-level semantic features but also the local context supervision information emphasized by the invention.
More preferably, the generation of the local feature map can be expressed as:

F_1 = Conv(f(F_0))

where f denotes processing by the 3 successive dimension-reduction convolution modules and Conv denotes convolution.
More preferably, the dimension-reduction convolution module consists of 2 3×3 convolution layers alternating with 2 ReLU functions, where the output of the 1st convolution layer, after nonlinear activation by the 1st ReLU function, is connected to the 2nd convolution layer through a residual connection. Further, the ReLU function improves the nonlinear expression capability of the network model, and the residual connection reduces the risk of model overfitting.
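A minimal PyTorch sketch of the local feature extraction module as just described, assuming the residual join is placed between the first activation and the second convolution (names are hypothetical):

```python
import torch
import torch.nn as nn

class DimReduceBlock(nn.Module):
    """One dimension-reduction convolution module: two 3x3 convolutions
    alternating with two ReLUs, with a residual connection between them;
    the channel count is halved."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, in_channels // 2, 3, padding=1)
        self.conv2 = nn.Conv2d(in_channels // 2, in_channels // 2, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.relu(self.conv1(x))
        return self.relu(self.conv2(y) + y)   # residual connection

class LocalFeatureExtractor(nn.Module):
    """F_1 = Conv(f(F_0)): three dimension-reduction modules, then a 3x3
    convolution producing the 2-channel local feature map."""
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(
            DimReduceBlock(256),   # 256 -> 128
            DimReduceBlock(128),   # 128 -> 64
            DimReduceBlock(64),    # 64  -> 32
        )
        self.conv = nn.Conv2d(32, 2, 3, padding=1)

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        return self.conv(self.f(f0))          # (B, 2, H, W)
```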
Preferably, the process by which the local segmentation branch performs optimization judgment on the local feature map in combination with the point annotation result of the sample image specifically comprises:
(1) Generating a local segmentation map: first, according to the point annotation result of the sample image, a circular region centred on each annotated point coordinate with radius σ is generated; then the pixels inside and outside the circles are binarised, assigning pixel value 1 to pixels inside a circle and 0 otherwise; the local segmentation map G is finally obtained. The generation of the local segmentation map G and the circle radius σ is as follows:

G(p) = 1 if there exists p_i ∈ P with ||p − p_i|| ≤ σ, and G(p) = 0 otherwise

σ = (1/r(w,h)) · Σ_{a∈R(w,h)} (1/K) · Σ_{k=1}^{K} d_{k,a}

where p is a pixel position in the local segmentation map; p_i denotes the coordinates of the i-th annotation point; P = {p_i | i ∈ {1, ..., N}} denotes the coordinates of all annotation points; R(w,h) is a rectangular area of width w and height h centred on annotation point p_i (w and h are hyperparameters); r(w,h) is the number of annotation points contained in the rectangular area; a is any annotation point in the region R(w,h); K is the number of annotation points nearest to a (a hyperparameter); d_{k,a} is the Euclidean distance between the k-th such annotation point and a.
The local segmentation map is a binary image in which the circular regions with pixel value 1 represent annotation-point information (i.e. the local context supervision information regions), and the regions with pixel value 0 represent non-annotation-point information.
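The construction of G can be sketched as follows; this follows the reading of σ given above (the mean K-nearest-neighbour distance of the annotation points inside R(w, h)), so treat the radius computation as an interpretation rather than the definitive formula:

```python
import numpy as np

def local_segmentation_map(points: np.ndarray, height: int, width: int,
                           w: int = 64, h: int = 64, K: int = 3) -> np.ndarray:
    """points: (N, 2) array of (x, y) annotation coordinates.
    Window size (w, h) and K are hyperparameters; the defaults are assumptions."""
    G = np.zeros((height, width), dtype=np.uint8)
    ys, xs = np.mgrid[0:height, 0:width]
    for px, py in points:
        # Annotation points inside the rectangle R(w, h) centred on (px, py).
        in_rect = points[(np.abs(points[:, 0] - px) <= w / 2)
                         & (np.abs(points[:, 1] - py) <= h / 2)]
        # For each such point, the mean distance to its K nearest neighbours.
        dists = np.linalg.norm(in_rect[:, None, :] - in_rect[None, :, :], axis=-1)
        dists.sort(axis=1)
        knn = dists[:, 1:K + 1]                 # drop the zero self-distance
        sigma = knn.mean() if knn.size else min(w, h) / 4.0   # fallback: assumption
        # Pixels inside the circle of radius sigma around the point are set to 1.
        G[(xs - px) ** 2 + (ys - py) ** 2 <= sigma ** 2] = 1
    return G
```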
(2) Generating a prediction segmentation map: the local segmentation branch further comprises 1 3×3 convolution layer; the local feature map F_1 is upsampled 8× and then passed through this 3×3 convolution layer to obtain the prediction segmentation map F_G. The generation of the prediction segmentation map is:

F_G = Conv(Up(F_1))

where Up denotes the upsampling process.
More preferably, the 8× upsampling uses nearest-neighbour interpolation so that the width and height of the prediction segmentation map F_G match those of its learning target (i.e. the local segmentation map G). The 3×3 convolution layer smooths the noise generated by upsampling, yielding a feature map with more stable mathematical properties.
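A minimal sketch of this prediction segmentation head, assuming PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictSegmentation(nn.Module):
    """F_G = Conv(Up(F_1)): 8x nearest-neighbour upsampling followed by a
    3x3 convolution that smooths the interpolation noise."""
    def __init__(self):
        super().__init__()
        self.smooth = nn.Conv2d(2, 2, kernel_size=3, padding=1)

    def forward(self, f1: torch.Tensor) -> torch.Tensor:   # f1: (B, 2, H, W)
        up = F.interpolate(f1, scale_factor=8, mode="nearest")
        return self.smooth(up)                              # F_G: (B, 2, 8H, 8W)
```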
(3) Optimization judgment: a loss function L_G for the local segmentation branch is constructed from the prediction segmentation map F_G and the local segmentation map G, and the parameters of the target counting model are updated by back-propagation according to L_G, giving a parameter-updated target counting model; local features are then re-extracted with the updated model to obtain the optimized local feature map. The loss function L_G is defined in terms of the following quantities: w is a weight; l is a superscript (value 0 or 1); G^l is the tensor formed by the channel labelled l in the local segmentation map; |G^l| is the sum of all numerical elements of G^l; |G| is the sum of all numerical elements in the local segmentation map; F_G^l is the tensor formed by the channel labelled l in the prediction segmentation map; Mean is the average of all numerical elements of a tensor; γ is a hyperparameter.
The prediction segmentation map F_G generated by the local segmentation branch is a pixel-level classification result. To alleviate the sample imbalance between foreground and background classes and reduce its influence on counting accuracy, the invention adds the loss function L_G for the local segmentation branch.
Preferably, the process by which the feature fusion module fuses the optimized local feature map with the global feature map specifically comprises:
(1) Generating a locally enhanced feature map: the feature fusion module comprises a softmax function and a repeat function. After the optimized local feature map is input to the feature fusion module, it is classified by 1 softmax function to obtain 2 tensors of scale H×W, one representing local feature information and the other representing non-local feature information; the tensor representing local feature information is replicated by the repeat function and concatenated to obtain a locally enhanced feature map with the same number of channels as the global feature map F_0;
It should be noted that the softmax classification normalizes each tensor element to a value in [0, 1], which indicates the probability that the corresponding pixel is judged by the network to represent local feature information or non-local feature information; a value closer to 1 indicates a greater probability that the point is judged to represent local feature information.
(2) Feature fusion: the locally enhanced feature map and the global feature map F_0 are multiplied element-wise to obtain the fusion feature map F_2. Further, the fusion feature map F_2 merges the local context supervision information with the global feature information of the target objects, further enhancing the network's ability to recognise the target objects and thus effectively improving the model's counting accuracy.
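The fusion step can be sketched as below; the assumption is that the softmax is taken over the 2-channel dimension and that the channel labelled 1 is the one replicated:

```python
import torch

def fuse(local_map: torch.Tensor, f0: torch.Tensor) -> torch.Tensor:
    """local_map: optimized local feature map (B, 2, H, W); f0: (B, 256, H, W)."""
    probs = torch.softmax(local_map, dim=1)   # two H x W probability tensors
    local_prob = probs[:, 1:2, :, :]          # channel 1: local feature information
    # repeat to match the channel count of F_0, then element-wise multiplication
    enhanced = local_prob.repeat(1, f0.shape[1], 1, 1)
    return f0 * enhanced                      # fusion feature map F_2
```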
Preferably, in step S2 the loss function consists of the loss function of the point regression branch, the loss function of the classification branch and the loss function of the local segmentation branch; specifically, the loss function L is:

L = L_CE + λ_1·L_P + λ_2·L_G

where λ_1 and λ_2 are hyperparameters; L_P is the loss function of the point regression branch; p̂_i is the candidate point coordinate successfully matched to the point annotation coordinate p_i of the i-th target object (i.e. the predicted target coordinate); L_CE is the loss function of the classification branch; ŷ denotes a confidence score, i.e. the probability of being predicted as a target; y denotes a label (value 0 or 1); L_G is the loss function of the local segmentation branch; w is a weight; l is a superscript (value 0 or 1); G^l is the tensor formed by the channel labelled l in the local segmentation map; |G^l| is the sum of all numerical elements of G^l; |G| is the sum of all numerical elements in the local segmentation map; F_G^l is the tensor formed by the channel labelled l in the prediction segmentation map; Mean is the average of all numerical elements of a tensor; γ is a hyperparameter.
The loss function of the point regression branch and the loss function of the classification branch are, respectively, the corresponding loss functions of the P2PNet before improvement.
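As a hedged sketch of how the three terms combine (the forms of L_P and L_CE follow the original P2PNet: Euclidean distance on matched point pairs and cross entropy on the confidence scores):

```python
import torch
import torch.nn.functional as F

def p2p_seg_loss(matched_pred: torch.Tensor, matched_gt: torch.Tensor,
                 logits: torch.Tensor, labels: torch.Tensor,
                 l_g: torch.Tensor, lambda1: float, lambda2: float) -> torch.Tensor:
    """matched_pred/matched_gt: (N, 2) matched point pairs; logits: (M, 2)
    classification scores; labels: (M,) 0/1 targets; l_g: segmentation loss."""
    l_p = ((matched_pred - matched_gt) ** 2).sum(dim=-1).mean()  # Euclidean term
    l_ce = F.cross_entropy(logits, labels)                       # classification term
    return l_ce + lambda1 * l_p + lambda2 * l_g   # L = L_CE + λ1·L_P + λ2·L_G
```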
A second aspect of the present invention provides an image object counting method, the method comprising: acquiring an image to be identified and inputting it into a target counting model to obtain a target counting result for the image; the target counting model is a trained target counting model obtained by the target counting model training method of any embodiment of the first aspect.
Preferably, the image to be identified is a plant seedling image, and the target counting result is the position and number of plant seedlings. More preferably, the image to be identified is a wheat seedling image, and the target count result is the position and number of wheat seedlings.
A third aspect of the present invention provides an electronic device comprising a memory and a processor, the memory having stored thereon a computer program which, when executed by the processor, implements the steps of the target counting model training method of the first aspect and/or the image object counting method of the second aspect.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the target counting model training method of the first aspect and/or the image object counting method of the second aspect.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a target counting model P2P_Seg with enhanced local context supervision information, which solves the problem that the performance of existing counting models is limited by factors such as illumination, occlusion and overlapping. The target counting model P2P_Seg introduces a wheat seedling local segmentation branch on the basis of P2PNet to acquire more local context supervision information of the wheat seedlings, and fuses this local context supervision information with the global information extracted by the base network using an element-wise dot-multiplication mechanism. These improvements to the network structure and feature fusion strengthen the model's ability to extract wheat seedling features, improve its resistance to factors such as illumination, occlusion and overlapping, improve its robustness, and effectively avoid false counts and missed counts of wheat seedlings. In one embodiment, the target counting model P2P_Seg obtained by the method achieves an MAE of 5.86 and an RMSE of 7.68, reductions of 0.74 and 1.78 respectively compared with the P2PNet before improvement; compared with the other existing counting models CSRNet, CANet, SCAR, BL and DM-Count, P2P_Seg has the smallest MAE and RMSE counting errors and more accurate counting performance.
(2) The target counting model P2P_Seg can predict the number of wheat seedling plants more accurately, solving the time- and labour-consuming problems of traditional manual seedling counting; at the same time it can predict the positions of the wheat seedlings, providing effective supporting information for downstream tasks such as planting planning and fertile-farmland cultivation, which is more beneficial to actual agricultural production.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
FIG. 1 is a flow chart of a sample image set acquisition process in an embodiment of the invention;
FIG. 2 is a flow chart of a sample image preprocessing process in an embodiment of the invention;
FIG. 3 is a representation of wheat seedling annotation images of different density levels in an embodiment of the invention;
FIG. 4 is a schematic diagram of a network structure of the object counting model P2P_Seg according to the present invention;
FIG. 5 is a schematic diagram of a network structure of a local feature extraction module according to the present invention;
FIG. 6 is a schematic diagram of a network architecture of a feature fusion module according to the present invention;
FIG. 7 shows the seedling counting results of the target counting model P2P_Seg of the invention and the existing models CSRNet, CANet, SCAR, BL, DM-Count and P2PNet on the same wheat dataset; in the figure, column (a) is the annotated image, i.e. the point annotation result used directly as the ground truth for P2PNet and P2P_Seg; column (b) is the density image generated from the point annotations, used as the ground truth for the density-map-based counting models; columns (c) to (g) are the counting results of CSRNet, CANet, SCAR, BL and DM-Count respectively; columns (h) and (i) are the counting results of P2PNet and P2P_Seg respectively.
Detailed Description
The present invention will be further described in detail below with reference to the accompanying drawings by way of examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Example 1
The embodiment provides a target counting model training method based on enhanced local context supervision information, which comprises the following steps:
S1, acquiring a sample image set which, as shown in FIG. 1, comprises a plurality of sample images containing target objects and a point annotation result corresponding to each sample image; the point annotation result of a sample image is the position information corresponding to the annotation points of the target objects; the sample image set is randomly divided in proportion into a training set, a validation set and a test set. The specific steps are as follows:
firstly, a sample image is acquired, and the specific process is as follows:
(1) Data acquisition: wheat at the seedling stage at the test site was photographed and sampled in October 2021 using a HONOR 20 PRO mobile device (48-megapixel camera, BSI CMOS sensor, aperture f/2.2); the sampling standard was the target counting area marked by a red trapezoidal frame. After sampling, 317 wheat seedling images (resolution 4000×3000 pixels) were collected in total; after removing images with blurred quality or serious occlusion, 295 images were screened out as the initial experimental images of the invention.
The test site is a wheat trial at the Modern Agricultural Research and Development Base of Henan Province (35°0′28″ N, 113°41′48″ E, altitude 97 m). The trial used a completely randomized block design; the sowing date was 15 October 2021 and the sampling date was 12 November 2021, with 400 plots in total, each with an area of 36 m².
(2) Preprocessing: the preprocessing step mainly comprises black-filling the non-target counting areas outside the red trapezoidal frame and eliminating redundancy, as shown in FIG. 2. To avoid being affected by wheat seedlings in the non-target areas, the non-target areas are black-filled using a preprocessing tool (see stage (1)). Because the subsequent random-cropping data augmentation could otherwise produce patches dominated by non-target counting regions and interfere with the counting results of the target region, the non-target regions are removed to the greatest extent (see stage (2)). The final experimental images (i.e. sample images) of the invention are obtained through the above preprocessing steps.
Then, the point annotation result of each sample image is obtained; the specific process is as follows:
because the wheat seedlings are small in shape and easy to shade, overlap and the like, the frame marking is very difficult to use, so that a point marking method (one point marking represents the point coordinates of the corresponding wheat seedlings) with low cost and convenience is adopted. Meanwhile, an online efficient marking tool developed based on HTML5, javascript and Python is adopted for marking the data set, and the marking tool supports two types of label forms of points and surrounding frames, so that not only can the image be marked in a blocking manner, but also the blocking area can be marked in a flexible amplifying manner. For the areas with relatively dense, shielding and serious overlapping in the wheat seedling image, the tool can be used for amplifying and marking the areas, so that the marking speed and quality are effectively improved. The marked area is the root and stem part (stem part close to the ground) of the wheat seedling with relatively obvious characteristics, and is convenient for network model identification.
Using this method, the 295 initial experimental images were point-annotated and 32237 wheat seedlings were annotated in total. A single image contains at most 321 and at least 18 annotated points, with about 109 points per wheat seedling image on average. Wheat seedling annotation images of different density grades are shown in FIG. 3.
Finally, the wheat seedling dataset (sample image set) consisting of the 295 initial experimental images and their corresponding point annotation results was randomly divided into a training set, a validation set and a test set in a 6:1:3 ratio; the training set, validation set and test set contain 177, 29 and 89 wheat seedling images respectively. The dataset division results are shown in Table 1.
TABLE 1 Wheat seedling dataset partitioning results

Data set        Number of images    Wheat seedling total (plants)
Training set    177                 19622
Validation set  29                  2849
Test set        89                  9766
Total           295                 32237
It should be noted that, before model training, the training sample images are augmented by random cropping and random rotation: each image is randomly cropped into four patches of 128×128 pixels; the cropped images are then augmented by random rotation with probability 0.5.
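A sketch of this augmentation for the image tensor alone (the point annotations must be cropped and rotated identically, which is omitted here; the 90° rotation angle is an assumption, since the text does not specify it):

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(image: torch.Tensor, size: int = 128, n_patches: int = 4):
    """image: (C, H, W) tensor; returns four randomly cropped, possibly rotated patches."""
    patches = []
    for _ in range(n_patches):
        top = random.randint(0, image.shape[-2] - size)
        left = random.randint(0, image.shape[-1] - size)
        patch = TF.crop(image, top, left, size, size)
        if random.random() < 0.5:
            patch = TF.rotate(patch, 90)   # rotation angle assumed
        patches.append(patch)
    return patches
```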
S2, inputting a sample image from the training set into the pre-constructed target counting model P2P_Seg for counting to obtain a target counting result of the sample image, wherein the target counting result is the position information of the target objects obtained from the target counting model; constructing a loss function from the target counting result of the sample image and the point annotation result of the sample image, and updating the parameters of the target counting model by back-propagation according to the loss function to obtain a trained target counting model.
The target counting model P2P_Seg is obtained by adding a local segmentation branch and a feature fusion module between the base network and the two branches (the point regression branch and the classification branch) of the point-to-point network P2PNet, which uses the classification framework VGG16_bn as its base network. The local segmentation branch comprises a local feature extraction module, which performs local feature extraction on the global feature map extracted by the base network to obtain a local feature map; the local segmentation branch performs optimization judgment on the local feature map in combination with the point annotation result of the sample image to obtain an optimized local feature map; the feature fusion module fuses the optimized local feature map with the global feature map to obtain a fusion feature map, which is then used as the input of the point regression branch and the classification branch.
To train the network more fully, three different loss functions are designed for the point regression branch, the classification branch and the wheat seedling local segmentation branch: the loss function L_P of the point regression branch, the loss function L_CE of the classification branch and the loss function L_G of the local segmentation branch. Euclidean distance is used to optimise the point regression branch, and a cross-entropy loss function is used for the classification branch, so the total loss function L is:

L = L_CE + λ_1·L_P + λ_2·L_G

where λ_1 is a hyperparameter, set to 0.002 in the experiment; λ_2 is a hyperparameter, set to 0.005 in the experiment; L_P is the loss function of the point regression branch; p̂_i is the candidate point coordinate successfully matched to the point annotation coordinate p_i of the i-th wheat seedling (i.e. the predicted wheat seedling coordinate); L_CE is the loss function of the classification branch; ŷ denotes a confidence score, i.e. the probability of being predicted as a wheat seedling; y denotes a label (value 0 or 1); L_G is the loss function of the local segmentation branch; w is a weight; l is a superscript (value 0 or 1); G^l is the tensor formed by the channel labelled l in the local segmentation map; |G^l| is the sum of all numerical elements of G^l; |G| is the sum of all numerical elements in the local segmentation map; F_G^l is the tensor formed by the channel labelled l in the prediction segmentation map; Mean is the average of all numerical elements of a tensor; γ is a hyperparameter, set to 2 in the experiment.
It should be noted that the machine used in the experiment was configured with an Intel(R) Core(TM) i7-10600 CPU and an NVIDIA GeForce RTX 3090 GPU with 24 GB of video memory. The experiments used PyTorch as the deep learning framework; the training batch size was set to 8, the number of training epochs to 1000, and the learning rate was fixed at 0.0001 with the Adam algorithm used for optimisation. The base network was pre-trained on ImageNet and trained with a smaller learning rate of 0.00001.
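A sketch of an optimizer setup matching these settings; `model.backbone` is a hypothetical attribute standing for the pretrained VGG16_bn parameters:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    backbone = list(model.backbone.parameters())     # pretrained base network
    backbone_ids = {id(p) for p in backbone}
    rest = [p for p in model.parameters() if id(p) not in backbone_ids]
    return torch.optim.Adam([
        {"params": backbone, "lr": 1e-5},   # smaller rate for pretrained layers
        {"params": rest, "lr": 1e-4},       # fixed base learning rate
    ])
```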
The method comprises the following specific steps:
s21, constructing a target counting model P2P_Seg: in order to reduce the influence of factors such as illumination, shielding, overlapping and the like on wheat seedling counting, the invention improves P2PNet, introduces a wheat seedling local segmentation branch to obtain a local feature map, blends the local feature map into a global feature map to strengthen local context supervision information of the wheat seedlings, and proposes a target counting model P2P_Seg for the wheat seedlings, which is used for strengthening the local context supervision information, wherein the whole structure is shown in figure 4 (the specific improved structure is described in detail in the following steps).
S22, inputting sample images in a training set into a pre-constructed target counting model for counting, wherein the process specifically comprises the following steps:
S221, extracting the global features of the wheat seedling sample image using the VGG16_bn base network to obtain the 256×H×W global feature map F_0 (see module (1)).
S222, the wheat seedling local segmentation branch generates a local feature map F_1 and performs optimization judgment on it in combination with the point annotation result of the sample image input to the target counting model, so as to extract local context supervision information of the wheat seedlings and obtain an optimized local feature map (see module (2)). The wheat seedling local segmentation branch aims to extract local context supervision information of the wheat seedling rootstalk and serves two purposes. First, it concentrates the model's attention on the point-annotated target area of the wheat seedling rootstalk and ignores interference from noise such as illumination-induced shadows and field weeds. Second, when the positions of the wheat seedling annotation points are occluded by debris such as soil clods, it provides more contextual reference information and enlarges the model's recognition range. The key technologies in the wheat seedling local segmentation branch comprise the wheat seedling local feature extraction module and the design for generating the wheat seedling local segmentation map. Specifically:
(1) Generating the local feature map F_1: the wheat seedling local feature extraction module is an important component of the wheat seedling local segmentation branch and aims to generate the local feature map F_1; it is shown in FIG. 5. The local feature extraction module comprises 3 structurally identical, sequentially connected dimension-reduction convolution modules and 1 3×3 convolution layer; the dimension-reduction convolution modules compress the channels of the input feature map. The 256×H×W global feature map F_0 is passed through the first dimension-reduction convolution module to obtain a 128×H×W once-reduced feature map with unchanged width and height and half the channels; this is passed through the remaining 2 dimension-reduction convolution modules in turn, yielding a 64×H×W twice-reduced feature map and a 32×H×W thrice-reduced feature map. The thrice-reduced feature map is processed by the 3×3 convolution layer to obtain the 2×H×W local feature map F_1 with unchanged width and height and 2 channels. Here it is specified that the channel labelled 1 in the feature map represents local information and the channel labelled 0 represents non-local information, corresponding to the superscript l of the loss function L_G below. The 2-channel local feature map is the concatenation of the feature maps representing local and non-local information; it represents not only local high-level semantic features but also the local context supervision information emphasized by the invention.
In summary, the generation of the local feature map F_1 can be expressed as:

F_1 = Conv(f(F_0))

where f denotes processing by the 3 successive dimension-reduction convolution modules and Conv denotes convolution.
The dimension-reduction convolution module consists of 2 3×3 convolution layers alternating with 2 ReLU functions, where the output of the 1st convolution layer, after nonlinear activation by the 1st ReLU function, is connected to the 2nd convolution layer through a residual connection. Further, the ReLU function improves the nonlinear expression capability of the network model, and the residual connection reduces the risk of model overfitting.
The local feature map F_1 serves two purposes. First, F_1 is fused with the global feature map F_0 to merge the local context supervision information of the wheat seedlings with the global information. Second, F_1 is upsampled by 8× nearest-neighbour interpolation and passed through a 3×3 convolution layer to generate the prediction segmentation map F_G, so that the wheat seedling local segmentation branch can be optimised during the network training stage.
(2) The process by which the local segmentation branch performs optimization judgment on the local feature map in combination with the point annotation result of the sample image specifically comprises the following steps:
1) Generating the local segmentation map: the wheat seedling local segmentation map is the result image of extracting wheat seedling local context supervision information from the point annotations, and it serves as the learning target of the wheat seedling local segmentation branch. In this way, on top of directly using the point annotations as a learning target, the counting model indirectly extracts more local context information from them, makes fuller use of the ground truth, and exerts a stronger supervisory effect on the network model.
Specifically, circular regions centred on the annotated point coordinates with radius σ are generated according to the point annotation result of the sample image; the pixels inside and outside the circles are then binarised, assigning pixel value 1 to pixels inside a circle and 0 otherwise; the local segmentation map G is finally obtained. That is, given a wheat seedling image annotated with N points (the annotated positions are at the wheat seedling rootstalk, and P = {p_i | i ∈ {1, ..., N}} denotes the point annotation coordinates of all wheat seedlings in the image, where p_i = (x_i, y_i) is the point annotation coordinate of the i-th wheat seedling), N circular regions centred on p_i with radius σ are generated; pixel values inside the circles are 1 and outside are 0, giving the wheat seedling local segmentation map G. The size of each wheat seedling's rootstalk target area is determined by the circle radius σ. The generation of the local segmentation map G and the circle radius σ is as follows:

G(p) = 1 if there exists p_i ∈ P with ||p − p_i|| ≤ σ, and G(p) = 0 otherwise

σ = (1/r(w,h)) · Σ_{a∈R(w,h)} (1/K) · Σ_{k=1}^{K} d_{k,a}

where p is a pixel position in the local segmentation map; p_i denotes the coordinates of the i-th annotation point; P = {p_i | i ∈ {1, ..., N}} denotes the coordinates of all annotation points in the image; R(w,h) is a rectangular area of width w and height h centred on annotation point p_i (w and h are hyperparameters); r(w,h) is the number of annotation points contained in the rectangular area; a is any annotation point in the region R(w,h); K is the number of annotation points nearest to a (a hyperparameter); d_{k,a} is the Euclidean distance between the k-th such annotation point and a.
The local segmentation map is a binary image in which the regions with pixel value 1 are the wheat seedling rootstalk target regions of interest to the invention (i.e. the local context supervision information regions), and the regions with pixel value 0 are non-rootstalk regions.
2) Generating the prediction segmentation map: the local segmentation branch further comprises 1 3×3 convolution layer; the local feature map F_1 is upsampled 8× and then passed through this 3×3 convolution layer to obtain the prediction segmentation map F_G. The generation of the prediction segmentation map is:

F_G = Conv(Up(F_1))

where Up denotes the upsampling process.
More preferably, the 8× upsampling uses nearest-neighbour interpolation so that the width and height of the prediction segmentation map F_G match those of its learning target (i.e. the local segmentation map G). The 3×3 convolution layer smooths the noise generated by upsampling, yielding a feature map with more stable mathematical properties.
3) Optimization judgment: the loss function L_G of the local segmentation branch is constructed from the prediction segmentation map F_G and the local segmentation map G, and the parameters of the target counting model are updated by back-propagation according to L_G, giving a parameter-updated target counting model; local features are then re-extracted with the updated model to obtain the optimized local feature map. The loss function L_G is defined in terms of the following quantities: w is a weight; l is a superscript (value 0 or 1); G^l is the tensor formed by the channel labelled l in the local segmentation map; |G^l| is the sum of all numerical elements of G^l; |G| is the sum of all numerical elements in the local segmentation map; F_G^l is the tensor formed by the channel labelled l in the prediction segmentation map; Mean is the average of all numerical elements of a tensor; γ is a hyperparameter.
The prediction segmentation map F_G generated by the local segmentation branch is a pixel-level classification result. To alleviate the sample imbalance between foreground and background classes and reduce its influence on counting accuracy, the invention adds the loss function L_G for the local segmentation branch.
S223, the element-wise dot-multiplication mechanism of the feature fusion module (see FIG. 6) fuses the global information and the local context information of the wheat seedlings to generate the fusion feature map F_2 (see module (3)). Specifically:
(1) Generating the locally enhanced feature map: the feature fusion module comprises a softmax function and a repeat function. After the 2×H×W optimized local feature map is input to the feature fusion module, it is classified by 1 softmax function to obtain 2 tensors of scale H×W, one representing local feature information and the other representing non-local feature information; the tensor representing local feature information is replicated 256 times by the repeat function and concatenated to obtain a 256×H×W locally enhanced feature map with the same number of channels as the global feature map F_0;
It should be noted that the softmax classification normalizes each tensor element to a value in [0, 1], which indicates the probability that the corresponding pixel is judged by the network to represent local feature information or non-local feature information; a value closer to 1 indicates a greater probability that the point is judged to represent local feature information.
(2) Feature fusion: the locally enhanced feature map and the global feature map F_0 are multiplied element-wise to obtain the fusion feature map F_2. Further, the fusion feature map F_2 merges the local context supervision information with the global feature information of the target objects, further enhancing the network's ability to recognise the target objects and thus effectively improving the model's counting accuracy.
S224, predicting the candidate point coordinates of the wheat seedlings and their corresponding confidence scores through the point regression branch and the classification branch, respectively (see module (4)).
Further, the point regression branch predicts M candidate point coordinates and, correspondingly, the classification branch generates M confidence scores. In the training stage, the one-to-one matching strategy proposed by P2PNet is first used to match the candidate point coordinates generated by the network with the annotated point coordinates one to one; the N candidate point coordinates successfully matched to annotated points are the predicted wheat seedling position coordinates and their confidence score labels are 1, while the remaining candidate points are classified as background points with confidence score labels of 0.
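The matching step can be sketched with SciPy's Hungarian solver; the cost weighting below is an assumption in the spirit of P2PNet's one-to-one matching, not the patent's exact formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_candidates(pred_pts, pred_scores, gt_pts, dist_weight=0.05):
    # Cost favors candidates that are close to an annotated point and confident.
    # pred_pts: (M, 2), pred_scores: (M,), gt_pts: (N, 2), with M >= N.
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    cost = dist_weight * dists - pred_scores[:, None]
    rows, cols = linear_sum_assignment(cost)   # one-to-one Hungarian assignment
    labels = np.zeros(len(pred_pts))           # unmatched candidates: background (0)
    labels[rows] = 1.0                         # matched candidates: positives (1)
    return rows, cols, labels
```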
S3, the trained target counting model is verified on the verification set and then tested on the test set in sequence to obtain an optimal target counting model.
Performance test:
1. Influence of the local segmentation map on the target counting result
To investigate the effect of the local segmentation map on the target counting result, the inventors compared the mean absolute error (MAE) and root mean square error (RMSE) of the target counting model obtained in example 1 of the present invention with those of the target counting model obtained in comparative example 1 on the same wheat seedling dataset. MAE measures the counting accuracy of the network; the smaller the MAE, the closer the predicted number of wheat seedlings is to the true value. RMSE measures the stability of the network; the smaller the value, the stronger the stability and the better the robustness. The results are shown in Table 2.
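For reference, both metrics reduce to a few lines over per-image counts (a sketch, not code from the patent):

```python
import numpy as np

def mae_rmse(pred_counts, true_counts):
    # MAE gauges counting accuracy; RMSE gauges stability/robustness.
    err = np.asarray(pred_counts, float) - np.asarray(true_counts, float)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```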
The target counting model obtained in comparative example 1 is substantially the same as that of example 1, except that in step S222 (2) the circle-domain radius is no longer the σ of example 1: the per-point mean nearest-neighbor distance d̄_a = (1/K)·Σ_k d_{k,a} is no longer averaged over the annotation points in R(w,h); instead, d̄_a is used directly as the circle-domain radius to obtain the local segmentation map, from which the target counting model is then obtained.
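The two radius choices can be contrasted in a short sketch; for brevity the neighborhood is taken over all annotation points rather than the R(w,h) rectangle, and K is a hyperparameter:

```python
import numpy as np

def mean_knn_dist(a, pts, K):
    # Mean Euclidean distance from annotation point a to its K nearest neighbors.
    d = np.sort(np.linalg.norm(pts - a, axis=1))
    return d[1:K + 1].mean()                    # d[0] == 0 is a itself

def sigma_example1(pts, K=3):
    # Example 1: average the per-point mean k-NN distance into one radius.
    return np.mean([mean_knn_dist(a, pts, K) for a in pts])

def sigma_comparative1(pts, K=3):
    # Comparative example 1: use each point's mean k-NN distance directly.
    return np.array([mean_knn_dist(a, pts, K) for a in pts])
```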
TABLE 2 Influence of the local wheat seedling segmentation map on the counting results
No. MAE RMSE
Example 1 5.86 7.68
Comparative example 1 6.56 8.08
As can be seen from Table 2, using the wheat seedling local segmentation map obtained in example 1 of the present invention as the learning target of the local segmentation branch yields a more accurate counting effect, indicating that the size of the wheat seedling root target region (i.e., the local context supervision information region) is very important for the counting performance of the model.
2. Influence of different target counting models on wheat seedling counting result
To investigate the effect of different target counting models on the target counting result, the inventors compared the target counting model obtained in example 1 of the present invention with the existing models CSRNet, CANet, SCAR, BL, DM-Count, and P2PNet in terms of MAE and RMSE on the same wheat seedling dataset. The results are shown in Table 3 and FIG. 7.
TABLE 3 influence of different target count models on wheat seedling count results
Model MAE RMSE
CSRNet 26.98 31.71
CANet 34.25 41.19
SCAR 21.24 27.11
BL 6.62 9.45
DM-Count 6.54 9.97
P2PNet 6.60 9.46
P2P_Seg 5.86 7.68
As shown in Table 3, the MAE of the target counting model P2P_Seg obtained by the present invention is 5.86 and the RMSE is 7.68, decreases of 0.74 and 1.78, respectively, compared with the unimproved P2PNet. Meanwhile, compared with the other counting methods, both counting errors of P2P_Seg are also the smallest. These results indicate that the target counting model obtained by the invention improves the network's ability to recognize wheat seedlings by enhancing local context supervision information, thereby reducing counting errors and improving the accuracy and stability of wheat seedling counting.
In fig. 7, the darker the color in the density map, the greater the wheat seedling density. As visualized by the density maps, the existing models CSRNet, CANet, SCAR, BL, DM-Count, and P2PNet count wheat seedlings poorly and their accuracy leaves room for improvement; moreover, the generated density maps cannot directly identify seedling positions and thus provide little supporting information for downstream tasks. The last two columns are the predictions of P2PNet and P2P_Seg, whose outputs are more intuitive wheat seedling coordinates. Because P2P_Seg introduces the local segmentation branch to enhance local context supervision information, its predictions are closer to the true values and its counting error is smaller when counting wheat seedling images affected by occlusion, overlap, illumination, and similar factors (such as the images in rows 3-5). Moreover, from row 1 to row 6, the wheat seedlings in the images gradually change from sparse to dense, and noise such as dead leaves and illumination-induced shadows is present, posing no small challenge to existing counting networks in identifying wheat seedlings. However, the P2P_Seg presented herein concentrates attention on the local root and stem of the wheat seedlings by enhancing local context supervision information, so that the network ignores such noise as far as possible, thereby remarkably improving the accuracy of wheat seedling counting. Meanwhile, when processing wheat seedling images of different densities, the target counting model P2P_Seg obtained by the invention achieves the best counting results and exhibits better generalization performance.
Example 2
An image target counting method, the method comprising: acquiring an image to be identified, and inputting the image to be identified into a target counting model to obtain a target counting result of the image to be identified; the target counting model is the trained target counting model obtained by the target counting model training method of embodiment 1.
The image to be identified is a wheat seedling image, and the target counting result is the position and the number of the wheat seedlings.
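A hedged usage sketch of this embodiment follows; the checkpoint path, the (points, scores) output convention, and the 0.5 threshold are all assumptions for illustration:

```python
import torch
from PIL import Image
import torchvision.transforms as T

# Assumes the full model object was serialized to this (hypothetical) file.
model = torch.load("p2p_seg.pth", map_location="cpu").eval()
img = T.ToTensor()(Image.open("wheat.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    points, scores = model(img)        # candidate coordinates and confidences
keep = scores > 0.5                    # assumed confidence threshold
print("seedling count:", int(keep.sum()))
print("seedling positions:", points[keep])
```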
Example 3
An electronic device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the target counting model training method of embodiment 1 or the image target counting method of embodiment 2.
Example 4
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the object count model training method as described in embodiment 1 or the image object count method as described in embodiment 2.
In conclusion, the invention effectively overcomes the defects in the prior art and has high industrial utilization value. The above-described embodiments are provided to illustrate the gist of the present invention, but are not intended to limit the scope of the present invention. It will be understood by those skilled in the art that various modifications and equivalent substitutions may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. The target counting model training method based on the enhanced local context supervision information is characterized by comprising the following steps of:
s1, acquiring a sample image set, wherein the sample image set comprises a plurality of sample images containing target objects and point labeling results corresponding to each sample image; the point marking result of the sample image is the position information corresponding to the marking point of the target object; randomly dividing a sample image set into a training set, a verification set and a test set according to a proportion;
s2, inputting a sample image in a training set into a pre-constructed target counting model for counting to obtain a target counting result of the sample image, wherein the target counting result is the position information of a target object obtained based on the target counting model; constructing a loss function according to a target counting result of the sample image and a point labeling result of the sample image, and updating parameters of a target counting model by adopting back propagation according to the loss function to obtain a trained target counting model; the target counting model is obtained by adding a local segmentation branch and a feature fusion module between a base network of the P2PNet and two branches of a point regression branch and a classification branch on the basis of a point-to-point network P2PNet taking a positioning classification framework VGG16 as a base network;
And S3, sequentially passing the trained target counting model through the verification set for verification and the test set for testing to obtain an optimal target counting model.
2. The object count model training method of claim 1, wherein said local segmentation branch includes a local feature extraction module; the local feature extraction module is used for carrying out local feature extraction on the global feature map extracted by the basic network to obtain a local feature map; the local segmentation branches are combined with the point labeling results of the sample images to conduct optimization judgment on the local feature images, and the optimized local feature images are obtained; the feature fusion module is used for carrying out fusion processing on the optimized local feature map and the global feature map to obtain a fusion feature map, and then taking the fusion feature map as the input of the point regression branch and the classification branch.
3. The target counting model training method according to claim 2, wherein the local feature extraction module comprises 3 dimension-reduction convolution modules of identical structure connected in sequence and 1 3×3 convolution layer; the dimension-reduction convolution module is used for compressing the channels of the input feature map; after the global feature map is input into a dimension-reduction convolution module for channel compression, a once-dimension-reduced feature map with unchanged width and height and halved channels is obtained; the once-dimension-reduced feature map is then input into the remaining 2 dimension-reduction convolution modules for channel compression, successively obtaining a twice-dimension-reduced feature map and a thrice-dimension-reduced feature map; and the thrice-dimension-reduced feature map is processed by the 3×3 convolution layer to obtain a local feature map with unchanged width and height and 2 channels.
4. The training method of the object counting model according to claim 3, wherein the process of optimizing and judging the local feature map by combining the local segmentation branch with the point labeling result of the sample image is specifically as follows:
(1) Generating a local segmentation map: firstly, generating a circle domain with marked point coordinates as a circle center and sigma as a radius according to a point marking result of the sample image; then binarizing the pixel points inside and outside the circle, and assigning the pixel value of the pixel point inside the circle as 1, otherwise, assigning the pixel value as 0; finally, a local segmentation graph G is obtained; the generation process of the local segmentation graph G and the circle domain radius sigma is as follows:
G(p) = 1, if ‖p − p_i‖_2 ≤ σ for some annotation point p_i ∈ P; otherwise G(p) = 0
σ = (1 / |R(w,h)|) · Σ_{a∈R(w,h)} d̄_a,  with d̄_a = (1/K) · Σ_{k=1}^{K} d_{k,a}
wherein p is a pixel position in the local segmentation map; p_i denotes the coordinates of the i-th annotation point; P = {p_i | i ∈ {1, ..., N}} denotes the coordinates of all annotation points; R(w,h) is a rectangular region of width w and height h centered on the annotation point p_i (w and h are hyperparameters); |R(w,h)| is the number of annotation points contained in the rectangular region; a is any annotation point in the region R(w,h); K is the number of nearest annotation points to a that are considered (a hyperparameter); d_{k,a} is the Euclidean distance between the k-th nearest annotation point and a;
(2) Generating a prediction segmentation map: the local segmentation branch further comprises 1 3×3 convolution layer; the local feature map is subjected to 8-fold upsampling and then input into the 3×3 convolution layer for convolution processing to obtain a prediction segmentation map F_G;
(3) Optimization judgment: a loss function L_G of the local segmentation branch is constructed from the prediction segmentation map F_G and the local segmentation map G; according to the loss function L_G, the parameters of the target counting model are updated by back propagation to obtain a parameter-updated target counting model; local features are re-extracted with the parameter-updated target counting model to obtain the optimized local feature map; the loss function L_G is:
L_G = − Σ_{l∈{0,1}} w^l · mean( G^l · (1 − F_G^l)^γ · log F_G^l ),  where w^l = 1 − |G^l| / |G| and the products are element-wise;
wherein w^l is the weight of class l; l is the channel superscript (value 0 or 1); G^l is the tensor formed by the channel with superscript l in the local segmentation map; |G^l| is the sum of all numerical elements contained in tensor G^l; |G| is the sum of all numerical elements in the local segmentation map; F_G^l is the tensor formed by the channel with superscript l in the prediction segmentation map; mean is the average of all numerical elements contained in a tensor; and γ is a hyperparameter.
5. The training method of the object counting model according to claim 4, wherein the feature fusion module is configured to fuse the optimized local feature map with the global feature map specifically:
(1) Generating a local strengthening characteristic diagram: the feature fusion module comprises a softmax function and a repeat function; after the optimized local feature map is input into a feature fusion module, classifying the optimized local feature map by using 1 softmax function to obtain 2 tensors with the scale of H multiplied by W, wherein one tensor represents local feature information and the other tensor represents non-local feature information; copying tensors representing local feature information through repeat functions, and splicing to obtain local enhancement feature graphs with the same number as the channels of the global feature graph;
(2) Feature fusion: and performing element-by-element point multiplication on the local enhanced feature map and the global feature map to obtain a fusion feature map.
6. The method according to claim 5, wherein in step S2 the loss function L is composed of a point regression branch loss function, a classification branch loss function, and the local segmentation branch loss function L_G; the loss function L is specifically:
L = L_CE + λ_1·L_P + λ_2·L_G
wherein λ_1 and λ_2 are hyperparameters; L_P is the loss function of the point regression branch; L_CE is the loss function of the classification branch; and L_G is the loss function of the local segmentation branch.
7. An image object counting method, characterized in that the method comprises: acquiring an image to be identified, and inputting the image to be identified into a target counting model to obtain a target counting result of the image to be identified; the target counting model is a trained target counting model obtained by training the target counting model training method according to any one of claims 1-6.
8. The image object counting method according to claim 7, wherein the image to be identified is a young plant image, and the object counting result is the positions and number of the young plants.
9. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, wherein the processor, when executing the computer program, implements the object counting model training method of any one of claims 1-6 and/or the image object counting method of claim 7 or 8.
10. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the object counting model training method of any one of claims 1-6 and/or the image object counting method of claim 7 or 8.
CN202310238457.5A 2023-03-13 2023-03-13 Target counting model training method based on enhanced local context supervision information Pending CN116486254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310238457.5A CN116486254A (en) 2023-03-13 2023-03-13 Target counting model training method based on enhanced local context supervision information

Publications (1)

Publication Number Publication Date
CN116486254A true CN116486254A (en) 2023-07-25

Family

ID=87212769

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination