CN110399868B - Coastal wetland bird detection method


Info

Publication number
CN110399868B
CN110399868B (application number CN201810354126.7A)
Authority
CN
China
Prior art keywords
area
foreground
size
bird
feature
Legal status
Active
Application number
CN201810354126.7A
Other languages
Chinese (zh)
Other versions
CN110399868A (en)
Inventor
邹月娴 (Zou Yuexian)
关文婕 (Guan Wenjie)
Current Assignee
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Application filed by Peking University Shenzhen Graduate School
Priority to CN201810354126.7A
Publication of CN110399868A
Application granted
Publication of CN110399868B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a coastal wetland bird detection method. Using a convolutional neural network, the method fuses, through feature fusion, detail information useful for locating small-size targets and high-level semantic information useful for recognition into a single high-resolution feature map; obtains regions of interest through a region-of-interest generation network; obtains foreground regions through a region object network; and then screens the regions of interest against the foreground regions, thereby detecting and recognizing coastal wetland birds. The method addresses the poor detection performance on the many small-size birds in the distant view that limits existing coastal wetland bird detection, and can greatly improve the detection success rate and accuracy for small-size targets.

Description

Coastal wetland bird detection method
Technical Field
The invention relates to target detection technology in computer vision and to coastal wetland bird protection, and in particular to a coastal wetland bird detection method.
Background
Coastal wetlands are habitats for birds. The distribution, abundance, and biodiversity of birds are related to abiotic environmental factors of the wetland ecosystem, such as elevation, soil humidity, nitrogen gradient, and landscape indices, as well as to the biotic integrity of the ecosystem. Bird ecological parameters are therefore often used as evaluation indicators for reserve site selection and for ecosystem integrity and health, and an effective, simple method for evaluating bird diversity is crucial for timely understanding of the quality of, and changes in, the wetland ecological environment. However, existing coastal wetland bird monitoring still relies on the traditional working mode of long-term stakeouts, hidden observation, and periodic nest checks, and the resulting bird data are poor in continuity, credibility, and timeliness. Automatically detecting birds with target detection technology from computer vision, and counting their numbers and species over long periods, automates and digitizes the recording of bird activity; it can greatly reduce labor costs, provides a scientific method for coastal wetland protection and restoration, and has important application value.
With the development of deep learning, target detection algorithms based on deep learning have performed well in many applications. In these algorithms, a convolutional neural network extracts from the original image deep semantic information that reflects the essence of the image, and this information is then classified to obtain the final detection result. In the coastal wetland bird detection task, to avoid disturbing the birds, data acquisition equipment is usually placed far from the birds' habitat, so the collected videos and pictures contain many small-size bird targets in the distant view. Existing target detection technology performs poorly on small targets and is prone to missed detections; its detection of small-size birds in the distant view is poor, making it difficult to apply to the coastal wetland bird detection task.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a coastal wetland bird detection method that addresses the poor detection performance on the many small-size birds in the distant view in existing coastal wetland bird detection, and that can greatly improve the detection success rate and accuracy for small-size targets.
The method adopts the following technical scheme:
a coastal wetland bird detection method utilizes a convolutional neural network to fuse detail information beneficial to small-size target positioning and high-level semantic information beneficial to identification into a high-resolution feature map through feature fusion; obtaining an interested area through an interested area generation network; obtaining a foreground area through a regional object network, and further screening out an interested area in the foreground area; therefore, the detection success rate and the accuracy rate of detecting small and medium-sized targets by the coastal wetland birds are improved; the method comprises the following steps:
A. Through feature fusion, obtain a high-resolution feature map of the whole picture that contains both high-level semantic information and detail information, implemented as follows:
A1. Input the coastal wetland bird picture into a convolutional neural network and obtain a feature map for each of four stages of convolution operations;
A2. Select the third-stage feature map and the fourth-stage feature map from step A1 for feature fusion, obtaining a high-resolution feature map of the whole picture that contains both high-level semantic information and detail information;
B. Use a region-of-interest generation network to extract a number of regions of interest, implemented as follows:
B1. According to the ratio 1/n of the size of the high-resolution feature map obtained in step A to the size of the original image, generate a number of candidate boxes with different sizes and aspect ratios every n pixels in the original image, and establish a mapping between the high-resolution feature map and the candidate boxes;
B2. From the high-resolution feature map obtained in step A, compute the region-of-interest generation network (the network structure is shown in FIG. 3) to obtain, for each position, the score of the candidate box being predicted as foreground (containing a bird target) and the candidate box's translation-scaling parameters.
B3. Apply the translation-scaling parameters to each candidate box to obtain the regions of interest containing bird targets in the picture; a sketch of this decoding step follows.
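A minimal sketch of step B3, assuming the standard box parameterization used with region-proposal-style networks, in which t = (tx, ty, tw, th) shifts the box center and log-scales its width and height; the patent does not spell out the parameterization, so this choice is illustrative:

    import numpy as np

    def apply_translation_scaling(boxes, t):
        """boxes: (N, 4) candidate boxes [x1, y1, x2, y2]; t: (N, 4) predicted
        translation-scaling parameters (tx, ty, tw, th) -- assumed form."""
        w = boxes[:, 2] - boxes[:, 0]
        h = boxes[:, 3] - boxes[:, 1]
        cx = boxes[:, 0] + 0.5 * w
        cy = boxes[:, 1] + 0.5 * h
        cx = cx + t[:, 0] * w              # translate the box center
        cy = cy + t[:, 1] * h
        w = w * np.exp(t[:, 2])            # scale width and height
        h = h * np.exp(t[:, 3])
        return np.stack([cx - 0.5 * w, cy - 0.5 * h,
                         cx + 0.5 * w, cy + 0.5 * h], axis=1)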
B4. Training is performed with a supervised learning method; during training, the classification results of the candidate boxes at all positions and the translation-scaling results of the candidate boxes are evaluated with a cross-entropy loss function and a SmoothL1 loss function. The loss function is shown by the following equation:
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)    (Equation 1)
wherein i is the candidate-box index value, denoting the i-th candidate box; p_i is the predicted probability that the object in the candidate box is a bird; the ground-truth label p_i^* is 1 if the candidate box contains a bird, and 0 otherwise; t_i is the predicted translation-scaling parameter of the candidate box, and t_i^* is the ground-truth translation-scaling parameter. L_{cls} is a cross-entropy loss function that evaluates the difference between the predicted probability and the ground-truth label. The formula is as follows:
L_{cls}(p_i, p_i^*) = -[p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i)]    (Equation 2)
L_{reg} is the SmoothL1 loss function that evaluates the difference between the predicted and ground-truth translation-scaling parameters. L_{reg} is given by:
L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*), \quad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}    (Equation 3)
In addition, N_{cls} and N_{reg} are normalization parameters, and \lambda is a balance parameter that balances the two parts of the loss function.
C. Use a region object network to select a number of foreground regions containing target birds, implemented as follows:
C1. From the high-resolution feature map obtained in step A, compute the region object network (the network structure is shown in FIG. 4) to obtain an object map. Each pixel value of the object map lies in (0, 1) and indicates the predicted probability that the corresponding object region contains a bird target (foreground) rather than background; object regions with a probability value greater than 0.5 are taken as foreground object regions.
C2. Determine the size of the object regions: according to the ratio 1/n of the size of the high-resolution feature map obtained in step A to the size of the original image, divide the original image into object regions every n pixels, and establish a mapping between the high-resolution feature map and the object regions;
C3. During training, an object region is considered a foreground region if the area of its overlap with a ground-truth foreground object exceeds 70% of the object region's own area, and a background region otherwise; a sketch of this rule follows the equation below. The loss function evaluating the foreground/background prediction of the object regions is shown by the following formula:
L(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)    (Equation 4)
wherein i is the object-region index value, denoting the i-th object region; p_i is the predicted probability value of the object region obtained in C1; p_i^* is the ground-truth label of the object region (1 for foreground, 0 for background); N_{cls} denotes the number of object regions in the image; and L_{cls} is the cross-entropy function of Equation 2.
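A minimal sketch of the C3 assignment rule above, checking whether the intersection of an object region with a ground-truth foreground box exceeds 70% of the region's own area (plain axis-aligned boxes assumed; the helper name is illustrative):

    def is_foreground(region, gt_box, thresh=0.7):
        """region, gt_box: boxes [x1, y1, x2, y2]; returns True when the
        intersection covers more than `thresh` of the region's own area."""
        ix1 = max(region[0], gt_box[0])
        iy1 = max(region[1], gt_box[1])
        ix2 = min(region[2], gt_box[2])
        iy2 = min(region[3], gt_box[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = (region[2] - region[0]) * (region[3] - region[1])
        return inter / area > thresh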
D. Combine the foreground regions of interest obtained in step B3 with the foreground object regions obtained in step C1, retaining only the regions of interest located at foreground-region positions; one illustrative way to do this is sketched below.
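A minimal sketch of the step D screening, under the assumption that a region of interest is kept when the object region containing its center is foreground; the patent states only that regions of interest at foreground positions are retained, so the center-based rule and the stride value n = 16 are illustrative:

    import numpy as np

    def filter_rois(rois, fg_mask, n=16):
        """rois: (N, 4) boxes [x1, y1, x2, y2] in image coordinates;
        fg_mask: (H, W) boolean foreground object map on the feature grid;
        n: image-to-feature-map size ratio denominator (assumed value)."""
        cx = ((rois[:, 0] + rois[:, 2]) / 2 / n).astype(int)
        cy = ((rois[:, 1] + rois[:, 3]) / 2 / n).astype(int)
        cy = np.clip(cy, 0, fg_mask.shape[0] - 1)
        cx = np.clip(cx, 0, fg_mask.shape[1] - 1)
        return rois[fg_mask[cy, cx]]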
E. According to the mapping relation between the regions of interest obtained in step D in the input coastal wetland bird picture and the high-resolution feature map obtained in step A, find the feature boxes corresponding to the regions of interest in the high-resolution feature map and unify them to a fixed size.
F. Pass the feature boxes through several convolution layers and pooling layers to obtain fixed-size feature vectors; from these feature vectors, compute the score of being predicted as a bird and the translation-scaling parameters of each region of interest, and obtain the final identification boxes from those translation-scaling parameters.
G. Apply non-maximum suppression to all recognition results (classification scores and identification boxes) obtained in step F to generate the final coastal wetland bird target detection and recognition results, thereby recognizing the coastal wetland bird targets.
Compared with the prior art, the invention has the following beneficial effects:
In a convolutional network, high-level feature maps contain rich high-level semantic information, which helps classify objects, but small-size targets occupy few pixels in such feature maps and are therefore hard to recognize and locate. By introducing feature fusion, the method fuses the detail information useful for locating small-size bird targets in the distant view of the coastal wetland and the high-level semantic information useful for bird recognition into one feature map, which can greatly improve the detection success rate and accuracy for small-size bird targets in the distant view. In addition, the method obtains foreground (coastal wetland bird) regions through the region object network and screens the regions of interest against them, greatly reducing redundant regions of interest in background areas, alleviating the imbalance between the numbers of background and foreground regions of interest, and improving the generalization ability of the model.
Drawings
FIG. 1 is a flow block diagram of the coastal wetland bird detection method.
FIG. 2 is a schematic diagram of a feature fusion process in an embodiment of the invention;
wherein F′ is the lower-level feature, whose channel dimension is expanded to 1024 by a 1×1 convolution; F is the higher-level feature, whose spatial size is doubled by a 2×2 deconvolution; F_fuse is the final output feature, obtained by adding the transformed F′ and F point to point and applying a 1×1 convolution for fusion.
FIG. 3 is a schematic diagram of the region-of-interest generation network structure in an embodiment of the invention;
wherein (a) is the high-resolution feature map obtained in step A; (b) is the intermediate feature obtained from (a) by a 3×3 convolution; (c) is the classification score (foreground/background) and the 4 translation-scaling parameters of each candidate box, obtained from (b) by two separate 1×1 convolutions; num_anchors is the number of candidate boxes generated at each pixel in step B1; conv denotes a convolution operation.
FIG. 4 is a schematic diagram of the region object network structure in an embodiment of the invention;
wherein (a) is the high-resolution feature map; (b) is the foreground/background object map; (c) is the object map; conv denotes a convolution operation; ReLU is the linear rectification activation function, f(x) = max(0, x).
Detailed Description
The invention will be further described below by way of embodiments, with reference to the accompanying drawings, without in any way limiting the scope of the invention.
The invention provides a coastal wetland bird detection method that uses a convolutional neural network and, through feature fusion, fuses detail information useful for locating small-size targets and high-level semantic information useful for recognition into a high-resolution feature map; obtains regions of interest through a region-of-interest generation network; obtains foreground regions through a region object network, and then screens the regions of interest against the foreground regions; thereby improving the detection success rate and accuracy for small-size targets in coastal wetland bird detection.
FIG. 1 is a flow block diagram of the coastal wetland bird detection method according to an exemplary embodiment of the invention. The method can run on a PC (personal computer) or on mobile terminal devices such as mobile phones and tablet computers, without limitation.
In this embodiment, an RGB 3-channel image of any size is used as input; the image may be a frame captured from a video or a photograph, which is not limited here. As shown in FIG. 1, the embodiment of the invention comprises the following steps:
A. Input the image to be detected and, through a feature fusion module, obtain a feature map of the whole image containing both high-level semantic information and detail information, implemented as follows:
A1. The input coastal wetland bird picture to be detected undergoes four stages of convolution operations, yielding a feature map for each stage.
Convolution operations that generate feature maps of the same size are said to belong to the same stage of convolution; that is, the feature maps generated at different stages have different sizes. The convolution operations of each stage are defined by the particular convolutional neural network structure; the invention does not prescribe the structure of each stage in detail.
It should be noted that different features can be obtained through different combinations of convolution operations. In one possible implementation, the feature maps can be extracted with a convolutional neural network such as a Deep Residual Network (ResNet), which is not limited here. A convolution operation is mathematically a nonlinear mapping, and different convolution parameters yield different computation results; the features obtained by a convolution operation correspond to its computation results. In a convolutional neural network, the parameters of each convolution operation are learned through back-propagation training, and the learned parameters differ across network structures.
A2. A group of feature maps F from the fourth stage and a group of feature maps F′ from the third stage of step A1 are selected as the input of the feature fusion module.
A3. FIG. 2 is a schematic diagram of the feature fusion module. As shown in FIG. 2, the channel dimension of F′ is first expanded to match that of F, then the spatial size of F is expanded to match that of F′; the two groups of feature maps, now identical in size and dimension, are added point to point, and fusion processing yields the final output feature map F_fuse. The feature map dimensions are determined by the particular network structure.
In one possible implementation, the dimension of F′ may be expanded by a convolution operation using a convolution kernel of size 1×1, the size of F may be expanded by a deconvolution operation using a convolution kernel of size 2×2, and the fusion processing after addition may be implemented by a convolution operation using a convolution kernel of size 1×1.
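A minimal sketch of this fusion module, assuming PyTorch and example channel counts (512 for the stage-3 F′ and 1024 for the stage-4 F; the actual dimensions depend on the backbone network):

    import torch.nn as nn

    class FeatureFusion(nn.Module):
        def __init__(self, low_channels=512, high_channels=1024):
            super().__init__()
            # 1x1 convolution expands the channel dimension of F' to match F
            self.lateral = nn.Conv2d(low_channels, high_channels, kernel_size=1)
            # 2x2 deconvolution doubles the spatial size of F to match F'
            self.upsample = nn.ConvTranspose2d(high_channels, high_channels,
                                               kernel_size=2, stride=2)
            # 1x1 convolution applied after the point-to-point addition
            self.post = nn.Conv2d(high_channels, high_channels, kernel_size=1)

        def forward(self, f_low, f_high):
            fused = self.lateral(f_low) + self.upsample(f_high)  # point-to-point add
            return self.post(fused)  # F_fuse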
B. A region-of-interest generation network is used to extract a number of regions of interest; the network structure is shown in FIG. 3. The implementation is as follows:
B1. According to the ratio 1/n of the size of the feature map F_fuse obtained in step A to the size of the original image, a number of candidate boxes with different sizes and aspect ratios are generated every n pixels in the original image.
As a possible implementation, the candidate boxes may take 4 scales {32², 64², 128², 512²}, and the aspect ratios of the candidate boxes may be {1:1, 1:2, 1:0.5}.
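A minimal sketch of this candidate-box generation, assuming a stride of n = 16 pixels (the value of n is not fixed by the patent) and the scales and aspect ratios listed above:

    import numpy as np

    def generate_candidate_boxes(img_h, img_w, n=16,
                                 scales=(32, 64, 128, 512),
                                 ratios=(1.0, 2.0, 0.5)):
        boxes = []
        for cy in range(n // 2, img_h, n):      # one set of boxes every n pixels
            for cx in range(n // 2, img_w, n):
                for s in scales:
                    for r in ratios:            # r = height : width
                        w = s / np.sqrt(r)
                        h = s * np.sqrt(r)
                        boxes.append([cx - w / 2, cy - h / 2,
                                      cx + w / 2, cy + h / 2])
        return np.array(boxes, dtype=np.float32)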
B2. According to the mapping relation between the candidate boxes in the original image and the feature map, the feature boxes corresponding to the candidate boxes are found in the high-resolution feature map F_fuse with high-level semantic information obtained in step A; the background/foreground classification scores of the feature boxes are evaluated with a cross-entropy loss function and the translation-scaling parameters of the regions of interest with a SmoothL1 loss function, so as to obtain the regions of interest.
the loss function is shown by the following equation:
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)    (Equation 1)
wherein i is the candidate-box index value, denoting the i-th candidate box; p_i is the predicted probability that the object in the candidate box is a bird; the ground-truth label p_i^* is 1 if the candidate box contains a bird, and 0 otherwise; t_i is the predicted translation-scaling parameter of the candidate box, and t_i^* is the ground-truth translation-scaling parameter. L_{cls} is a cross-entropy loss function that evaluates the difference between the predicted probability and the ground-truth label:
L_{cls}(p_i, p_i^*) = -[p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i)]    (Equation 2)
L_{reg} is the SmoothL1 loss function that evaluates the difference between the predicted and ground-truth translation-scaling parameters:
L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*), \quad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}    (Equation 3)
In addition, N_{cls} and N_{reg} are normalization parameters, and \lambda is a balance parameter that balances the two parts of the loss function.
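A minimal sketch of Equation 1, assuming PyTorch tensors: scores holds the predicted foreground probabilities p_i, labels the ground-truth p_i* in {0, 1}, and deltas and targets the predicted and ground-truth translation-scaling parameters t_i and t_i*:

    import torch.nn.functional as F

    def rpn_loss(scores, labels, deltas, targets, lam=1.0):
        n_cls = scores.numel()                    # normalization N_cls
        n_reg = max(int(labels.sum().item()), 1)  # normalization N_reg
        l_cls = F.binary_cross_entropy(scores, labels.float(),
                                       reduction='sum') / n_cls
        fg = labels.bool()                        # regression only where p_i* = 1
        l_reg = F.smooth_l1_loss(deltas[fg], targets[fg],
                                 reduction='sum') / n_reg
        return l_cls + lam * l_reg                # lam is the balance parameter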
C. A region object network is used to select a number of foreground regions containing objects, implemented as follows:
C1. According to the ratio 1/n of the size of the feature map F_fuse obtained in step A to the size of the original image, the original image is divided into object regions every n pixels;
C2. According to the mapping relation from the object regions in the original image to the feature map, the feature blocks corresponding to the object regions are found in the feature map obtained in step A, and the background/foreground classification scores of these feature blocks are evaluated with a cross-entropy loss function, yielding object regions classified as foreground or background;
The loss function evaluating the foreground/background prediction of the object regions is shown by the following formula:
L(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)    (Equation 4)
wherein i is the object-region index value, denoting the i-th object region; p_i is the predicted probability value of the object region obtained in C1; p_i^* is the ground-truth label of the object region (1 for foreground, 0 for background); N_{cls} denotes the number of object regions in the image; and L_{cls} is the cross-entropy function of Equation 2.
In a possible implementation, feature learning of the object regions may be implemented by adding a convolution operation with a 1×1 kernel. It should be noted that, depending on the actual situation, the feature extraction may be chosen flexibly, including different convolution structures or hand-crafted features (HOG features, Haar features); the invention is not limited in this respect.
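A minimal sketch of such a 1×1-convolution object-map head, assuming PyTorch; a sigmoid keeps each output pixel in (0, 1) as the foreground probability of its n×n object region, and regions scoring above 0.5 are kept as foreground:

    import torch
    import torch.nn as nn

    class ObjectMapHead(nn.Module):
        def __init__(self, in_channels=1024):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

        def forward(self, f_fuse):
            prob = torch.sigmoid(self.conv(f_fuse))  # object map, values in (0, 1)
            return prob, prob > 0.5                  # map and foreground mask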
D. The results of step B and step C are combined, and the regions of interest in foreground regions are retained; then, according to the mapping relation from the regions of interest in the input data to the feature map obtained in step A, the feature boxes corresponding to the regions of interest are found in the feature map and fixed to the same size.
In one possible implementation, these feature boxes may be fixed to a uniform 7×7 size. The feature boxes then pass through several convolution layers and pooling layers, and the bird-target score and the candidate-box translation-scaling parameters are computed; the number of convolution and pooling layers is determined by the basic network structure. The final bird identification box is obtained by applying the translation-scaling parameters to the candidate box. Targets scoring above 0.5 are considered bird targets.
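A minimal sketch of unifying the region-of-interest features to the 7×7 size, using torchvision's roi_align as one plausible pooling operator; the patent does not name a specific operator, and spatial_scale = 1/n (n = 16 assumed) maps image coordinates onto the feature map:

    from torchvision.ops import roi_align

    def extract_roi_features(f_fuse, rois, n=16):
        """f_fuse: (1, C, H, W) fused feature map; rois: (N, 4) float boxes
        [x1, y1, x2, y2] in image coordinates."""
        return roi_align(f_fuse, [rois], output_size=(7, 7),
                         spatial_scale=1.0 / n)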
E. Non-maximum suppression is applied to the bird identification boxes obtained in step D to obtain the final bird position and category recognition results, thereby locating and recognizing the coastal wetland bird targets.
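A minimal sketch of this final step, using torchvision's nms; the IoU threshold of 0.5 is an assumed value, while the 0.5 score threshold comes from the embodiment above:

    from torchvision.ops import nms

    def final_detections(boxes, scores, iou_thresh=0.5, score_thresh=0.5):
        keep = scores > score_thresh              # bird targets score above 0.5
        boxes, scores = boxes[keep], scores[keep]
        keep = nms(boxes, scores, iou_thresh)     # drop overlapping duplicates
        return boxes[keep], scores[keep]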
By combining the feature fusion module, this embodiment fuses high-level semantic information and low-level fine-grained information into the final feature map, which can greatly improve the detection of small-size bird targets in the distant view. By combining the region object network, regions of interest in background areas are eliminated, which reduces the number of regions of interest and improves the generalization ability of the model. Detection results on example images show that small-size birds in the distant view are detected well.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of this disclosure and the appended claims. Therefore, the invention should not be limited by the disclosure of the embodiments, but should be defined by the scope of the appended claims.

Claims (6)

1. A coastal wetland bird detection method, which uses a convolutional neural network and, through feature fusion, fuses detail information useful for locating small-size targets and high-level semantic information useful for recognition into a high-resolution feature map; obtains regions of interest through a region-of-interest generation network; obtains foreground regions through a region object network; and further screens the regions of interest within the foreground regions, thereby improving the detection success rate and accuracy for small-size targets in coastal wetland bird detection; the method comprising the following steps:
A. Obtaining, through feature fusion, a high-resolution feature map of the whole picture containing both high-level semantic information and detail information, by specifically performing the following operations:
A1. Inputting the coastal wetland bird picture into a convolutional neural network and obtaining a feature map for each of four stages of convolution operations;
A2. Selecting the third-stage feature map and the fourth-stage feature map from step A1 for feature fusion, obtaining a high-resolution feature map of the whole picture containing both high-level semantic information and detail information;
B. Extracting a number of regions of interest with a region-of-interest generation network, by specifically performing operations B1-B4:
B1. Setting the ratio of the size of the high-resolution feature map obtained in step A to the size of the original image to 1/n, generating a number of candidate boxes with different sizes and aspect ratios every n pixels in the original image, and establishing a mapping between the high-resolution feature map and the candidate boxes;
B2. From the high-resolution feature map obtained in step A, computing the region-of-interest generation network to obtain, for each position, the score of the candidate box being predicted as foreground and the candidate box's translation-scaling parameters, the foreground referring to the bird target;
B3. Applying the translation-scaling parameters to each candidate box to obtain the regions of interest containing bird targets in the picture;
B4. Training with a supervised learning method and, during training, evaluating the classification results of the candidate boxes at each position and the translation-scaling results of the candidate boxes with a cross-entropy loss function and a SmoothL1 loss function;
C. Selecting a number of foreground regions containing target birds with a region object network, by specifically performing the following operations:
C1. From the high-resolution feature map obtained in step A, computing the region object network to obtain an object map, wherein the pixel value of each pixel in the object map represents the predicted probability that the corresponding object region contains a bird target, foreground if it does and background otherwise; the predicted probability value lies in (0, 1), and object regions with a probability value greater than 0.5 are foreground object regions;
C2. Determining the size of the object regions: dividing the original image into object regions every n pixels according to the ratio 1/n of the size of the high-resolution feature map obtained in step A to the size of the original image, and establishing a mapping between the high-resolution feature map and the object regions;
C3. During training, setting an area ratio; if the ratio of the area of the overlap between an object region and a foreground target to the area of the object region exceeds the set area ratio, the object region is considered a foreground region, and otherwise a background region;
D. Combining the foreground regions of interest obtained in step B3 with the foreground object regions obtained in step C1, retaining the regions of interest at foreground-region positions;
E. According to the mapping relation between the regions of interest at foreground-region positions obtained in step D in the input coastal wetland bird picture and the high-resolution feature map obtained in step A, finding the feature boxes corresponding to the regions of interest in the high-resolution feature map and unifying them to a fixed size;
F. Obtaining fixed-size feature vectors by passing the feature boxes through several convolution layers and a pooling layer of a convolutional neural network, computing from the feature vectors the score of being predicted as a bird and the translation-scaling parameters of each region of interest, and obtaining the final identification boxes from those translation-scaling parameters;
G. Applying non-maximum suppression to the classification scores and identification boxes obtained in step F to obtain the coastal wetland bird target detection and recognition results, thereby recognizing the coastal wetland bird targets.
2. The coastal wetland bird detection method of claim 1, wherein the loss function of step B4 is expressed as Equation 1:
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)    (Equation 1)
wherein i is the candidate-box index value, denoting the i-th candidate box; p_i is the probability of predicting that the object within the candidate box is a bird; the ground-truth label p_i^* is set to 1 if the candidate box contains a bird, and to 0 otherwise; t_i is the predicted translation-scaling parameter of the candidate box; t_i^* is the ground-truth translation-scaling parameter; L_{cls} is a cross-entropy loss function used to evaluate the difference between the predicted probability and the ground-truth label, with the formula:
L_{cls}(p_i, p_i^*) = -[p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i)]    (Equation 2)
L_{reg} is a SmoothL1 loss function used to evaluate the difference between the predicted translation-scaling parameter and the ground-truth translation-scaling parameter;
L_{reg} is expressed as Equation 3:
L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*), \quad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}    (Equation 3)
N_{cls} and N_{reg} are normalization parameters, and \lambda is a balance parameter that balances the two parts of the loss function.
3. The coastal wetland bird detection method of claim 1, wherein in the training of step C3 the area ratio is set to 70%, and the prediction loss function evaluating the foreground or background of an object region is expressed as Equation 4:
L(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)    (Equation 4)
wherein i is the object-region index value, denoting the i-th object region; p_i is the predicted probability value of the object region obtained in C1;
p_i^* is the ground-truth label of the object region, 1 for foreground and 0 for background; L_{cls} is the cross-entropy function, with the formula:
L_{cls}(p_i, p_i^*) = -[p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i)]    (Equation 2)
N_{cls} denotes the number of object regions in the image.
4. The coastal wetland bird detection method of claim 1, wherein the feature fusion in step A is specifically:
taking a group of feature maps F from the fourth stage and a group of feature maps F′ from the third stage of step A1 as the input of the feature fusion;
expanding the channel dimension of F′ to match that of F, and then expanding the size of F to match that of F′;
adding the two processed feature maps, identical in size and dimension, point to point, and then performing fusion processing to obtain the fused output feature map F_fuse;
the feature map dimensions being specifically determined by the network structure.
5. The coastal wetland bird detection method of claim 4, wherein the dimension of F′ is expanded by a convolution operation using a convolution kernel of size 1×1; the size of F is expanded by a deconvolution operation using a convolution kernel of size 2×2; and the fusion processing after the addition is implemented by a convolution operation using a convolution kernel of size 1×1.
6. The coastal wetland bird detection method of claim 1, wherein the sizes of the candidate boxes are {32², 64², 128², 512²} and the aspect ratios of the candidate boxes are {1:1, 1:2, 1:0.5}.
CN201810354126.7A 2018-04-19 2018-04-19 Coastal wetland bird detection method Active CN110399868B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810354126.7A | 2018-04-19 | 2018-04-19 | Coastal wetland bird detection method (CN110399868B, en)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810354126.7A | 2018-04-19 | 2018-04-19 | Coastal wetland bird detection method (CN110399868B, en)

Publications (2)

Publication Number | Publication Date
CN110399868A (en) | 2019-11-01
CN110399868B (en) | 2022-09-09

Family

ID=68319502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810354126.7A Active CN110399868B (en) 2018-04-19 2018-04-19 Coastal wetland bird detection method

Country Status (1)

Country Link
CN (1) CN110399868B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091127A (en) * 2019-12-16 2020-05-01 腾讯科技(深圳)有限公司 Image detection method, network model training method and related device
CN113076860B (en) * 2021-03-30 2022-02-25 南京大学环境规划设计研究院集团股份公司 Bird detection system under field scene
CN114594106A (en) * 2022-03-08 2022-06-07 苏州菲利达铜业有限公司 Real-time monitoring method and system for copper pipe electroplating process

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599827A (en) * 2016-12-09 2017-04-26 浙江工商大学 Small target rapid detection method based on deep convolution neural network
CN106845430A (en) * 2017-02-06 2017-06-13 东华大学 Pedestrian detection and tracking based on acceleration region convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354159B2 (en) * 2016-09-06 2019-07-16 Carnegie Mellon University Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599827A (en) * 2016-12-09 2017-04-26 浙江工商大学 Small target rapid detection method based on deep convolution neural network
CN106845430A (en) * 2017-02-06 2017-06-13 东华大学 Pedestrian detection and tracking based on acceleration region convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邹月娴 (Zou Yuexian) et al., "图像分类卷积神经网络的特征选择模型压缩方法" (Feature selection model compression method for image classification convolutional neural networks), 《控制理论与应用》 (Control Theory & Applications), 2017-06-30, full text *

Also Published As

Publication Number | Publication Date
CN110399868A (en) | 2019-11-01

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN108399362B (en) Rapid pedestrian detection method and device
CN109284670B (en) Pedestrian detection method and device based on multi-scale attention mechanism
Zhuo et al. Cloud classification of ground-based images using texture–structure features
CN111178183B (en) Face detection method and related device
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN110110755B (en) Pedestrian re-identification detection method and device based on PTGAN region difference and multiple branches
CN110399868B (en) Coastal wetland bird detection method
Chen et al. Remote sensing image quality evaluation based on deep support value learning networks
CN112801047B (en) Defect detection method and device, electronic equipment and readable storage medium
CN113205507B (en) Visual question answering method, system and server
CN113435407B (en) Small target identification method and device for power transmission system
CN112396053A (en) Method for detecting object of all-round fisheye image based on cascade neural network
CN116503399B (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN113887472A (en) Remote sensing image cloud detection method based on cascade color and texture feature attention
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
CN116805387B (en) Model training method, quality inspection method and related equipment based on knowledge distillation
CN114332457A (en) Image instance segmentation model training method, image instance segmentation method and device
Ke et al. Haze removal from a single remote sensing image based on a fully convolutional neural network
CN110334703B (en) Ship detection and identification method in day and night image
CN116977859A (en) Weak supervision target detection method based on multi-scale image cutting and instance difficulty
CN111815677A (en) Target tracking method and device, terminal equipment and readable storage medium
CN116612272A (en) Intelligent digital detection system for image processing and detection method thereof
CN116310323A (en) Aircraft target instance segmentation method, system and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant