CN113191222A - Underwater fish target detection method and device


Info

Publication number
CN113191222A
CN113191222A (application CN202110406987.7A)
Authority
CN
China
Prior art keywords
image
target detection
feature map
module
feature
Prior art date
Legal status
Granted
Application number
CN202110406987.7A
Other languages
Chinese (zh)
Other versions
CN113191222B (en)
Inventor
陈英义
张倩
李道亮
秦瀚翔
于辉辉
孙博洋
刘慧慧
李少波
魏晓华
杨玲
Current Assignee
China Agricultural University
Original Assignee
China Agricultural University
Priority date
Filing date
Publication date
Application filed by China Agricultural University filed Critical China Agricultural University
Priority to CN202110406987.7A priority Critical patent/CN113191222B/en
Publication of CN113191222A publication Critical patent/CN113191222A/en
Application granted granted Critical
Publication of CN113191222B publication Critical patent/CN113191222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V20/40: Scenes; scene-specific elements in video content
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06F18/25: Fusion techniques
    • G06N3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N3/08: Neural networks; learning methods
    • G06V10/40: Extraction of image or video features
    • G06V2201/07: Target detection
    • Y02A40/81: Aquaculture, e.g. of fish


Abstract

The invention provides an underwater fish target detection method and device. The method comprises: inputting an image to be detected into a feature extraction model within a target detection model, and obtaining feature maps at a plurality of different scales from the feature maps output by the layers of the feature extraction model; and inputting the multi-scale feature maps into the target detection model to output a target detection result for the image. The image to be detected contains fish targets at a plurality of different scales. On one hand, extracting feature maps at several scales allows the features of fish targets of every scale to be fully represented, effectively alleviating the loss of small fish targets during detection and the failure to detect large fish targets whose feature information is incomplete. On the other hand, performing detection on the combination of multi-scale feature maps makes the detection result more accurate.

Description

Underwater fish target detection method and device
Technical Field
The invention relates to the technical field of target detection, and in particular to an underwater fish target detection method and device.
Background
At present, fish farming dominates aquaculture. To ensure yield, it is necessary to estimate the number of farmed fish and to monitor their growth state. In addition, to prevent the sudden death of individual fish from seriously affecting the growth of the others, target fish exhibiting abnormal conditions must be tracked individually.
As underwater robots and cameras continue to develop, research on underwater video and images is becoming a hotspot in many fields. At present, machine learning algorithms are mainly applied to underwater videos or images to detect fish targets, so as to monitor the growth state of underwater fish, count them, and track individuals.
However, fish farming is highly intensive, so a single fish target occupies only a small pixel area in an image, producing many small targets. The apparent size of a fish in the image also depends on its distance from the underwater camera: a nearby fish occupies a large pixel area while a distant fish occupies a small one, so fish sizes in the image differ greatly. Consequently, when the prior art is used for fish target detection, small fish targets are easily lost during detection, and large fish targets may go undetected because their feature information is incomplete, resulting in low detection accuracy.
Disclosure of Invention
The invention provides an underwater fish target detection method and device to overcome the defect in the prior art that, when fish targets in an image differ greatly in size, it is difficult to obtain accurate target detection results, so that accurate detection results are obtained even when fish sizes in the image vary widely.
The invention provides a method for detecting underwater fish targets, which comprises the following steps:
inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model;
inputting the feature maps with different scales into the target detection model, and outputting a target detection result of the image to be detected;
the target detection model is obtained by training with a sample image as a sample and a target detection label of the sample image as a sample label; the image to be detected comprises images of a plurality of fish targets with different scales.
According to the underwater fish target detection method provided by the invention, the feature extraction model comprises a deconvolution module and a down-sampling module;
correspondingly, the inputting the image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model, includes:
sequentially passing the image to be detected through each down-sampling module to obtain a feature map output by the last down-sampling module;
sequentially passing the feature map output by the last downsampling module through each deconvolution module to obtain the feature map output by each deconvolution module;
taking the feature map output by each deconvolution module as the feature map of the image to be detected;
the characteristic graphs output by the deconvolution modules are characteristic graphs with different scales, the down-sampling module is used for down-sampling the input of the down-sampling module, and the deconvolution module is used for deconvolution the input of the deconvolution module.
According to the underwater fish target detection method provided by the invention, the step of sequentially passing the feature map output by the last down-sampling module through each deconvolution module to obtain the feature map output by each deconvolution module comprises the following steps:
starting from the feature map output by the last down-sampling module, sequentially fusing the feature map output by each deconvolution module with the feature map output by the down-sampling module corresponding to that deconvolution module;
inputting the fusion result into the deconvolution module immediately following, and obtaining the feature map output by that following deconvolution module; wherein each deconvolution module is associated with a down-sampling module in advance.
According to the underwater fish target detection method provided by the invention, sequentially fusing, starting from the feature map output by the last down-sampling module, the feature map output by each deconvolution module with the feature map output by the corresponding down-sampling module comprises the following steps:
for any deconvolution module, performing deconvolution on the feature map output by the deconvolution module immediately before the deconvolution module, and performing first convolution operation on the feature map after deconvolution to obtain a feature map after the first convolution operation;
respectively performing second convolution operation and third convolution operation on the feature map output by the down-sampling module corresponding to the deconvolution module to obtain a feature map after the second convolution operation and a feature map after the third convolution operation;
fusing the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation;
the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation have the same channel number, and the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation have the same size.
According to the underwater fish target detection method provided by the invention, the fusion of the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation comprises the following steps:
performing a dot product operation on the feature map after the first convolution operation and the feature map after the second convolution operation, and fusing the result of the dot product operation with the feature map after the third convolution operation.
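The fusion step described above can be sketched as follows. The patent specifies an element-wise (dot) product of the first two maps followed by a fusion with the third; the final combination operator is left open in the text, so the addition used here is one plausible choice, and all layer shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

def fuse(prev_out, down_feat, deconv, conv1, conv2, conv3):
    """Hypothetical fusion step: deconvolve the preceding deconvolution
    module's output and apply a first convolution; apply second and third
    convolutions to the corresponding down-sampling module's feature map;
    take the element-wise product of the first two and combine with the
    third (addition chosen here as one plausible fusion)."""
    a = conv1(deconv(prev_out))   # feature map after the first convolution
    b = conv2(down_feat)          # feature map after the second convolution
    c = conv3(down_feat)          # feature map after the third convolution
    # All three maps share channel count and spatial size, as required.
    return a * b + c

ch = 32
deconv = nn.ConvTranspose2d(ch, ch, 2, stride=2)
conv1, conv2, conv3 = (nn.Conv2d(ch, ch, 3, padding=1) for _ in range(3))
prev_out = torch.randn(1, ch, 8, 8)     # from the preceding deconvolution module
down_feat = torch.randn(1, ch, 16, 16)  # from the matching down-sampling module
fused = fuse(prev_out, down_feat, deconv, conv1, conv2, conv3)
print(fused.shape)  # torch.Size([1, 32, 16, 16])
```

The deconvolution upsamples the 8 × 8 map to 16 × 16 so that the element-wise operations with the down-sampling branch are well defined.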
According to the underwater fish target detection method provided by the invention, before the image to be detected is input into the feature extraction model in the target detection model and a plurality of feature maps with different scales of the image to be detected are obtained from the feature maps output by each layer of the feature extraction model, the method further comprises the following steps:
preprocessing the image to be detected;
the preprocessing comprises image enhancement and/or geometric transformation of the image to be detected.
According to the underwater fish target detection method provided by the invention, the loss function of the target detection model is the Focal Loss function.
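Focal Loss down-weights easy examples so training focuses on hard ones, which suits dense fish scenes where most anchors are easy background. A standard binary formulation is sketched below; the alpha and gamma values are the commonly used defaults, not values stated in the patent:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary Focal Loss: alpha-balanced cross-entropy scaled by (1 - p_t)^gamma,
    so well-classified examples contribute little to the gradient."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.0, -1.0])   # one confident positive, one uncertain negative
targets = torch.tensor([1.0, 0.0])
loss = focal_loss(logits, targets)
print(float(loss) > 0)  # True
```

With gamma set to 0 and alpha to 0.5 this reduces to (half of) ordinary balanced cross-entropy, which is a handy sanity check.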
The invention also provides an underwater fish target detection device, comprising:
the acquisition module is used for inputting an image to be detected into a feature extraction model in a target detection model and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model;
the target detection module is used for inputting the feature maps with different scales into a target detection model and outputting a target detection result of the image to be detected;
the target detection model is obtained by training with a sample image as a sample and a target detection label of the sample image as a sample label; the image to be detected comprises images of a plurality of fish targets with different scales.
The invention also provides electronic equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of any one of the underwater fish target detection methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the underwater fish target detection method as described in any one of the above.
According to the underwater fish target detection method and device provided by the invention, on one hand, the image to be detected is input into the feature extraction model of the target detection model, where multi-scale deconvolution operations produce feature maps at several different scales, so that the features of fish targets of every scale are fully represented; this effectively alleviates the loss of small fish targets during detection and the failure to detect large fish targets whose features are incomplete. On the other hand, performing target detection on the image by combining the multi-scale feature maps makes the detection result more accurate.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for detecting underwater fish targets according to the present invention;
FIG. 2 is a schematic structural diagram of a target detection model in the underwater fish target detection method provided by the invention;
FIG. 3 is a second schematic structural diagram of a target detection model in the underwater fish target detection method provided by the present invention;
FIG. 4 is a third schematic structural diagram of a target detection model in the underwater fish target detection method provided by the present invention;
FIG. 5 is a schematic structural diagram of an underwater fish target detection device provided by the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The underwater fish target detection method of the present invention is described below with reference to fig. 1, and includes: step 101, inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model;
the image to be detected is an image of a fish target with a plurality of different scales, which needs to be subjected to target detection. The source of the image to be detected can be an image of the underwater fish target acquired in real time or an image of the underwater fish target acquired in advance. The present embodiment does not specifically limit the source of the image to be detected.
Optionally, the image to be detected in this embodiment is an image captured by an underwater image acquisition device in an industrialized intensive aquaculture scene. The underwater image acquisition device may be a camera or a robot; this embodiment is not specifically limited thereto.
Before an image to be detected is input into a feature extraction model in a target detection model, the target detection model needs to be trained first. When the target detection model is trained, a data set needs to be constructed.
The data construction method in the present embodiment will be described below by taking an example of constructing a data set using an image of underwater fish taken by a camera.
According to the embodiment, an underwater video acquisition device is built in the field culture pond, and underwater fish images are shot in the culture pond in advance through the underwater video acquisition device to construct a data set.
Preferably, the underwater video capture device in this embodiment includes a support and an underwater camera. The height and the angle of the underwater camera in water are adjusted by the support, so that the fish targets can be shot in the visual field of the underwater camera as much as possible.
The size and shape of the culture pond used for collecting the data set can be selected or set according to actual requirements. For example, the culture pond is a cylindrical pond with a diameter of 1.8 meters and a height of 1 meter.
In addition, this embodiment does not limit the number, species or body length of the fish in the culture pond. For example, the number of fish in the pond is 300, the species is Sparus punctatus fry, and the body length is 7-8 cm.
The period of collecting sample data can be set according to actual requirements, such as one month or two months.
In the data acquisition stage, in order to obtain more diversified data and enrich the data set, this embodiment collects underwater fish images under multiple underwater environments. The underwater environments include natural daylight, artificial light sources at night, and camera light illumination at night; this embodiment is not specifically limited thereto.
In this manner, a large number of underwater fish videos are collected. Image frames are then extracted from the videos with video processing software, and the frames are annotated with an image annotation tool to obtain target detection labels for each frame. The annotation tool may be LabelImg or the like.
In addition, the embodiment also eliminates the image frames with serious object ghosting caused by high-speed movement of fishes in water and the image frames with few fish objects or difficult labeling so as to obtain high-quality effective image frames to form a data set.
The distribution density of fish targets in each image frame may be medium or high. The numbers of image frames in the data set, and of medium-density and high-density frames, may vary; this embodiment does not specifically limit them. For example, the data set includes 725 image frames: 291 frames of medium density and 434 frames of high density. The data set may contain any number of fish targets, for example 32499; this embodiment is not specifically limited thereto.
The size of the image frames may also be set in practice, for example 768 × 1024.
In addition, the number of fishes in the image frames with medium density or high density can also be set according to the actual requirement, for example, the number of the fish targets in each image frame with medium density is 30-60, and the number of the fish targets in each image frame with high density is 60-90.
In the actual simulation process, 80% of the image frames can be selected as training samples, namely sample images, and 10% of the image frames can be selected as verification samples, so that the usability and the performance of the target detection model can be verified. In addition, 10% of the image frames can be used as a test set, i.e. the image to be detected.
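The 80/10/10 split described above can be sketched as follows; the helper name and shuffle seed are illustrative assumptions:

```python
import random

def split_dataset(frames, seed=0):
    """Shuffle the image frames and split them 80% train / 10% val / 10% test,
    as in the example in the text."""
    frames = frames[:]                       # avoid mutating the caller's list
    random.Random(seed).shuffle(frames)
    n = len(frames)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (frames[:n_train],                # training samples (sample images)
            frames[n_train:n_train + n_val], # verification samples
            frames[n_train + n_val:])        # test set (images to be detected)

train, val, test = split_dataset(list(range(725)))  # 725 frames as in the example
print(len(train), len(val), len(test))  # 580 72 73
```

With 725 frames the integer truncation gives 580/72/73, so every frame is assigned exactly once.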
In summary, in the data set construction process, firstly, an underwater fish video is acquired by using an underwater camera, and then, a picture of a fish target is extracted and labeled by using video editing software and an image labeling tool, so as to construct a data set and provide a data basis for training a target detection model.
In the training process of the target detection model, the training strategy may use the stochastic gradient descent (SGD) optimizer, among others, to optimize the target detection model. The hyper-parameters of the target detection model can be set according to actual requirements; for example, the learning rate, batch_size and number of epochs are set to 0.001, 2 and 30, respectively.
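A minimal training-loop sketch with the quoted hyper-parameters (learning rate 0.001, batch size 2, 30 epochs). The stand-in model, random data, and MSE loss are placeholders for the actual detection network and its Focal Loss:

```python
import torch
import torch.nn as nn

# Placeholder model: a single conv layer stands in for the detection network.
model = nn.Conv2d(3, 1, 3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # SGD optimizer
loss_fn = nn.MSELoss()                                     # placeholder loss

for epoch in range(30):                    # epoch = 30
    x = torch.randn(2, 3, 16, 16)          # batch_size = 2 (random stand-in data)
    target = torch.zeros(2, 1, 16, 16)
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()
print(loss.item() >= 0)  # True
```

Swapping in the real model, data loader, and Focal Loss leaves the loop structure unchanged.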
Then, the image to be detected is input into a feature extraction model of the trained target detection model, and after the image to be detected is subjected to multiple convolution in the feature extraction model, multiple deconvolution operations are carried out to obtain a plurality of feature maps with different scales. Wherein, each feature map comprises the features of the fish targets with different scales.
In this embodiment, multi-scale feature extraction on the image to be detected yields feature maps at several different scales that retain the features of fish targets of every scale, effectively avoiding the loss of small fish targets during detection and the failure to detect large fish targets whose feature information is incomplete.
Step 102, inputting the feature maps with different scales into the target detection model, and outputting a target detection result of the image to be detected; the target detection model is obtained by training with a sample image as a sample and a target detection label of the sample image as a sample label; the image to be detected comprises images of a plurality of fish targets with different scales.
The target detection model comprises a feature extraction model, a classification model and a regression model. The classification model is used for identifying the fish targets in the image to be detected, and the regression model is used for obtaining the bounding box of each fish target in the image to be detected.
The characteristic diagrams with different scales can be sequentially input into the target detection model, the target detection result of each characteristic diagram is output, and the target detection results of the characteristic diagrams with different scales are superposed to obtain the target detection result of the image to be detected.
Or inputting a plurality of feature maps with different scales into the target detection model, and outputting a target detection result of the image to be detected.
In this embodiment, the target detection algorithm is implemented in the Python language or the C language.
Preferably, when the target detection algorithm is implemented in Python, it may be built on the PyTorch framework. The hardware may be configured according to actual requirements, for example two 2080 Ti GPUs (Graphics Processing Units).
In addition, this embodiment uses precision (Precision), recall (Recall) and mAP (mean Average Precision) as metrics to verify the effectiveness of the underwater fish target detection method. The specific calculation formulas are as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{MAP} = \frac{1}{N}\sum_{k=1}^{N} P(k)\,\Delta r(k)$$
wherein TP denotes samples whose label is positive and whose classification result is positive, FP denotes samples whose label is negative but whose classification result is positive, and FN denotes samples whose label is positive but whose classification result is negative; N is the number of images to be detected, P(k) is the precision value of the k-th image to be detected, and Δr(k) is the change in recall between the (k-1)-th and the k-th image to be detected.
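The three formulas translate directly into code. The counts and precision/recall sequences below are made-up illustrations, and the MAP helper follows the text's own definition of N, P(k) and Δr(k):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def mean_average_precision(precisions, recall_deltas):
    """(1/N) * sum over k of P(k) * Δr(k), per the definition in the text."""
    n = len(precisions)
    return sum(p * dr for p, dr in zip(precisions, recall_deltas)) / n

print(precision_recall(8, 2, 2))                       # (0.8, 0.8)
print(mean_average_precision([1.0, 0.5], [0.5, 0.5]))  # 0.375
```

With 8 true positives, 2 false positives, and 2 false negatives, both precision and recall come out to 0.8.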
In order to optimize investment across aquaculture production links, intensive aquaculture with high stocking density and large-scale production is becoming the standard mode of aquaculture for maximizing economic and environmental benefits. Today, high technologies such as big data, the Internet of Things, cloud computing and artificial intelligence, together with intelligent equipment, are deeply integrated with modern agriculture. They are increasingly applied in the production, processing, transportation and sale links of fishery breeding, greatly improving production and operation efficiency. Among them, target detection methods from computer vision are widely applied to automatic identification and classification, production state monitoring and other aspects of aquaculture, owing to their speed, objectivity and high precision.
The underwater fish target detection method of this embodiment is compared with existing target detection methods, such as Faster R-CNN (Faster Regions with Convolutional Neural Network features), RetinaNet, SSD (Single Shot MultiBox Detector), DSSD (Deconvolutional Single Shot Detector), YOLOv3 (You Only Look Once, version 3) and YOLOv5 (version 5), to verify its effectiveness. As can be seen from Table 1, the detection results of this embodiment achieve the highest precision, recall and mean average precision, so the target detection result is more accurate.
TABLE 1 Results of experimental analyses

Model           MAP      Recall   Precision
Faster RCNN     88.9%    97.3%    61.2%
RetinaNet       89.6%    94.6%    50.2%
SSD             86.3%    92.7%    50.1%
DSSD            85.7%    84.2%    70.8%
YOLOv3          90.4%    95.1%    61.3%
YOLOv5          94.7%    96.4%    73.1%
This example    95.2%    99.2%    78.1%
In the embodiment, multi-scale feature extraction is performed on the image to be detected through the target detection model, feature maps of multiple scales are obtained, and a target detection result of the image to be detected can be accurately obtained by combining the feature maps of multiple different scales.
On one hand, the image to be detected is input into the feature extraction model of the target detection model, where multi-scale deconvolution operations produce feature maps at several different scales, so that the features of fish targets of every scale are fully represented; this effectively alleviates the loss of small fish targets during detection and the failure to detect large fish targets whose features are incomplete. On the other hand, performing target detection on the image by combining the multi-scale feature maps makes the detection result more accurate.
On the basis of the above embodiment, the feature extraction model in this embodiment includes a deconvolution module and a downsampling module; correspondingly, the inputting the image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model, includes: sequentially passing the image to be detected through each down-sampling module to obtain a feature map output by the last down-sampling module; sequentially passing the feature map output by the last downsampling module through each deconvolution module to obtain the feature map output by each deconvolution module; taking the feature map output by each deconvolution module as the feature map of the image to be detected; the characteristic graphs output by the deconvolution modules are characteristic graphs with different scales, the down-sampling module is used for down-sampling the input of the down-sampling module, and the deconvolution module is used for deconvolution the input of the deconvolution module.
The feature extraction model is a deconvolutional neural network structure, with a ResNet (residual network) containing multiple convolutional layers as the backbone. ResNet improves the representational ability of shallow layers through residual block structures containing multi-layer networks, effectively avoiding the network degradation problem.
Preferably, the convolutional layer is a void convolution. Due to the fact that the size of the image to be detected is large, under the condition of limited resources, a larger receptive field can be provided by adopting the cavity convolution, and information loss caused by pooling operation can be effectively reduced. That is, a larger perceived field of view can be provided without losing image information. In addition, the cavity convolution can reduce parameters and accurately determine the position of the target.
Here, the receptive field is the size of the region of the original image that a pixel in each layer's feature map is mapped to. The larger the receptive field, the more global information and higher-level semantic features the feature map contains; conversely, the smaller the receptive field, the more local and detailed features it contains.
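As a rough sketch of the dilated-convolution idea (the kernel size, channel counts and dilation rate below are illustrative assumptions, not values fixed by this embodiment), a 3 × 3 convolution with dilation 2 covers a 5 × 5 receptive field while keeping the 3 × 3 parameter count and, with matching padding, an unchanged spatial size:

```python
import torch
import torch.nn as nn

# A k x k convolution with dilation d has an effective kernel size of
# k_eff = k + (k - 1) * (d - 1); with k=3, d=2 that is 5x5, while the
# parameter count stays at 3x3. Padding d*(k-1)//2 keeps the spatial
# size unchanged, so no information is discarded by striding or pooling.
dilated = nn.Conv2d(in_channels=3, out_channels=64,
                    kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 3, 32, 32)   # a toy "image to be detected"
y = dilated(x)
print(tuple(y.shape))           # (1, 64, 32, 32)
```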
Note that the convolutional layers may alternatively be ordinary convolutions.
The feature extraction model may include a plurality of downsampling modules and a plurality of deconvolution modules, and the structure of the feature extraction model is not limited in this embodiment.
For example, the number of downsampling modules in the feature extraction model is 6, which are conv3, conv5, conv6, conv7, conv8 and conv9 in the DSSD network.
Optionally, one or more downsampling layers may be included in the downsampling module, and the present embodiment is not limited to the number of downsampling layers in the downsampling module. The down-sampling layers in the down-sampling modules may be the same or different.
For any downsampling layer in a downsampling module, a pooling operation may be applied to the input feature map to reduce its size. The pooling operation may be max pooling, average pooling, or the like.
The downsampling module further includes one or more convolution layers, and the present embodiment is not limited to the number of convolution layers in the downsampling module. The convolution kernel size may be the same or different for each convolution layer. The convolution operation may learn the input feature map and output a feature map with deep feature information.
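A hypothetical downsampling module along these lines (layer counts, channel sizes and the choice of max pooling are assumptions, since the embodiment leaves them open) could combine convolutions that learn deeper features with a pooling layer that halves the spatial size:

```python
import torch
import torch.nn as nn

# Sketch of one downsampling module: convolutions extract deeper
# features, then pooling halves the spatial resolution.
down_module = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),  # could also be AvgPool2d
)

x = torch.randn(1, 64, 56, 56)
y = down_module(x)
print(tuple(y.shape))   # (1, 128, 28, 28)
```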
As shown in fig. 2 and fig. 3, the image to be detected may sequentially pass through a plurality of down-sampling modules, and feature extraction may be performed on the image to be detected by each down-sampling module until the image passes through the last down-sampling module, so as to obtain a feature map output by the last down-sampling module.
After the feature map output by the last downsampling module is obtained, the feature map output by the last downsampling module may be used as an input of the first deconvolution module arranged in the feature extraction model to obtain the feature map output by the deconvolution module. And then inputting the feature map output by the deconvolution module into an immediately next deconvolution module, and acquiring the feature map output by the immediately next deconvolution module until all the feature maps pass through the deconvolution modules to acquire a plurality of feature maps with different scales.
Alternatively, the feature map output by the last downsampling module and the feature output by the downsampling module corresponding to the first deconvolution module arranged in the feature extraction model may be used together as the input of the deconvolution module, and the feature map output by the deconvolution module may be acquired. And then inputting the feature map output by the deconvolution module and the feature map output by the downsampling module corresponding to the next deconvolution module immediately behind into the next deconvolution module together to obtain the feature map output by the next deconvolution module until all the deconvolution modules are passed through.
One or more deconvolution layers may be included in the deconvolution module, and this embodiment is not limited to the number of deconvolution layers in the deconvolution module. The deconvolution layers in the deconvolution modules may be the same or different.
For any deconvolution layer, a deconvolution operation can be used to enlarge or restore the input feature map.
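A minimal sketch of such a deconvolution (transposed-convolution) layer that doubles the spatial size of its input; the channel counts are assumed, not specified by this embodiment:

```python
import torch
import torch.nn as nn

# A 2x2 transposed convolution with stride 2 maps an H x W input to
# 2H x 2W, enlarging/restoring the feature map.
deconv = nn.ConvTranspose2d(in_channels=256, out_channels=256,
                            kernel_size=2, stride=2)

x = torch.randn(1, 256, 7, 7)
y = deconv(x)
print(tuple(y.shape))   # (1, 256, 14, 14)
```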
In this embodiment, multi-scale feature extraction is performed on the input feature map by a plurality of deconvolution modules, so that the extracted feature maps contain the rich scale features of the image to be detected, effectively improving the accuracy of target detection.
On the basis of the foregoing embodiment, in this embodiment, sequentially passing the feature map output by the last downsampling module through each deconvolution module to obtain the feature map output by each deconvolution module includes: fusing the feature map output by each deconvolution module with the feature map output by the downsampling module corresponding to that deconvolution module; and inputting the fusion result into the immediately following deconvolution module to obtain the feature map it outputs. The correspondence between deconvolution modules and downsampling modules is established in advance.
Specifically, assume the total number of deconvolution modules and downsampling modules in the feature extraction model is N, with N a positive integer. Numbering the downsampling and deconvolution modules consecutively from front to back in the feature extraction model, the downsampling module corresponding to the deconvolution module numbered i is the one numbered N − i.
For the deconvolution module immediately following any given deconvolution module: after the feature map output by the last downsampling module has passed through the given deconvolution module, the feature map it outputs is fused with the feature map output by its corresponding downsampling module, and the fusion result is input into the immediately following deconvolution module to obtain that module's output feature map.
In this embodiment, the feature map output by a deconvolution module is fused with the feature map output by its corresponding downsampling module, and the fused feature map is used as the input of the next deconvolution module. Context information is thus fully utilized, and the fused feature map contains rich shallow and deep features, which effectively reduces the loss of feature information in the image to be detected and improves the accuracy of target detection.
On the basis of the foregoing embodiment, the fusion in this embodiment includes, for any deconvolution module: deconvolving the feature map output by the immediately preceding deconvolution module and applying a first convolution operation to the deconvolved feature map, obtaining the feature map after the first convolution operation; applying a second convolution operation and a third convolution operation, respectively, to the feature map output by the downsampling module corresponding to the deconvolution module, obtaining the feature maps after the second and third convolution operations; and fusing the feature maps after the first, second and third convolution operations. These three feature maps have the same number of channels and the same size.
Specifically, mutual occlusion between targets easily occurs in high-density images; once a target is occluded, the feature information that can be extracted from it is reduced and the extraction of other features may be interfered with, so missed detections and false detections occur frequently. When computer-vision methods analyze underwater fish images acquired in dense scenes, the complexity of the problem grows exponentially with the number of fish targets and their mutual interference, so the detection results cannot meet actual requirements.
To solve the above problem, in this embodiment, on one hand, the feature map output by the deconvolution module is fused with the feature map output by its corresponding downsampling module, so that low-level and high-level features are fully combined; on the other hand, features are enriched by expanding the dimension of the feature map, so that most information is retained without noticeable overhead.
Optionally, for any deconvolution module, a deconvolution operation is first applied to the feature map output by the immediately preceding deconvolution module; a first convolution operation is then applied to the deconvolved feature map. The size of the feature map after the first convolution operation is a predetermined multiple, such as 2 times, of the feature map before deconvolution.
And meanwhile, performing a second convolution operation on the feature map output by the downsampling module corresponding to the deconvolution module, and further extracting the features in the feature map output by the downsampling module corresponding to the deconvolution module. The sizes of the feature maps before and after the second convolution operation are unchanged.
And in order to obtain additional context information, performing a third convolution operation on the feature map output by the down-sampling module corresponding to the deconvolution module. The size and the number of channels of the feature map after the third convolution operation are the same as those of the feature map after the first convolution operation and the feature map after the second convolution operation.
The sizes of convolution kernels of the deconvolution operation, the first convolution operation, the second convolution operation and the third convolution operation can be set according to actual requirements. The times of the first convolution operation, the second convolution operation and the third convolution operation can also be set according to actual requirements.
And then, fusing the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation.
The fusion may concatenate the feature maps after the first, second and third convolution operations directly; or first fuse the feature maps after the first and second convolution operations and then concatenate the result with the feature map after the third convolution operation; or first fuse the feature maps after the first and third convolution operations and then concatenate the result with the feature map after the second convolution operation. In all cases the dimension of the feature map is expanded to obtain rich features of the image to be detected.
This embodiment fully combines low-level and high-level features and enriches them through the expanded dimension, retaining most of the information without noticeable overhead; it effectively addresses the low precision, missed detections and frequent false detections that prior-art target detection suffers when fish targets occlude each other severely and small targets abound in dense scenes.
As shown in fig. 4, this is an example in the present embodiment. The feature map output by the deconvolution module has size W × H × 512; it first undergoes a 2 × 2 × 256 deconvolution operation and then a 3 × 3 × 256 first convolution operation, giving a feature map of 2W × 2H × 256. The feature map output by the corresponding downsampling module has size 2W × 2H × 512; after two successive 3 × 3 × 256 second convolution operations, it becomes 2W × 2H × 256. The same downsampled feature map also undergoes a 1 × 1 × 256 third convolution operation, again giving a feature map of 2W × 2H × 256.
On the basis of the foregoing embodiment, fusing the feature maps after the first, second and third convolution operations in this embodiment includes: performing an element-wise product operation on the feature map after the first convolution operation and the feature map after the second convolution operation, and fusing the result with the feature map after the third convolution operation.
Specifically, an element-wise product operation (Eltwise) is performed on the feature map after the first convolution operation and the feature map after the second convolution operation; the resulting feature map has the same size and number of channels as its two inputs.
Then, the feature map after the element-wise product and the feature map after the third convolution operation are fused with a concatenation operation (Concatenate). The fused feature map has twice the number of channels of either input and the same spatial size as both.
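The fusion described above (deconvolution path, two-convolution lateral path, 1 × 1 context path, element-wise product and concatenation, with the sizes of the Fig. 4 example) might be sketched in PyTorch as follows; the padding choices and placement of activations are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Sketch of the Fig. 4 fusion: channel counts follow the example
    in the text; padding and activation placement are assumed."""

    def __init__(self, deep_ch=512, lateral_ch=512, mid_ch=256):
        super().__init__()
        # deep path: 2x2 deconv (stride 2) doubles W and H, then a 3x3 conv
        self.deconv = nn.ConvTranspose2d(deep_ch, mid_ch, 2, stride=2)
        self.conv1 = nn.Conv2d(mid_ch, mid_ch, 3, padding=1)
        # lateral path: two 3x3 convs keep the spatial size
        self.conv2 = nn.Sequential(
            nn.Conv2d(lateral_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1))
        # context path: a 1x1 conv for additional context information
        self.conv3 = nn.Conv2d(lateral_ch, mid_ch, 1)

    def forward(self, deep, lateral):
        a = self.conv1(self.deconv(deep))    # 2W x 2H x 256
        b = self.conv2(lateral)              # 2W x 2H x 256
        c = self.conv3(lateral)              # 2W x 2H x 256
        fused = a * b                        # element-wise product (Eltwise)
        return torch.cat([fused, c], dim=1)  # concatenation -> 2W x 2H x 512

deep = torch.randn(1, 512, 8, 8)       # output of the preceding deconv module
lateral = torch.randn(1, 512, 16, 16)  # output of the matching downsampling module
out = FusionBlock()(deep, lateral)
print(tuple(out.shape))   # (1, 512, 16, 16)
```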
According to the embodiment, the low-level features and the high-level features are fully combined, and more detailed features are fused, so that the obtained feature map contains abundant feature information of the image to be detected, and even under the conditions that the shielding between targets is serious or small targets are numerous in a dense scene, the target detection result can be accurately obtained.
On the basis of the foregoing embodiments, in this embodiment, before the inputting the image to be detected into the feature extraction model in the target detection model, and acquiring a plurality of feature maps of different scales of the image to be detected from the feature maps output from each layer of the feature extraction model, the method further includes: preprocessing the image to be detected; the preprocessing comprises image enhancement and/or geometric transformation of the image to be detected.
In particular, underwater fish target detection faces great challenges because the underwater environment is complicated. When an underwater image acquisition device captures images of underwater fish targets, color deviation caused by light absorption, blurred details caused by forward scattering of light, and low contrast or even severe distortion caused by backward scattering degrade the quality of images acquired by underwater optical imaging, so the expected effect cannot be achieved and the target detection results become inaccurate. Image-enhancement preprocessing of the image to be detected therefore plays an important role in accurately detecting underwater fish targets.
To alleviate these problems, in this embodiment the image to be detected is enhanced with a dehazing network (AOD-Net) before being input into the feature extraction model of the target detection model. This effectively improves the contrast of the image, reduces the blurring caused by excessive white haze during underwater imaging, improves the quality of the image to be detected, and reduces the influence of blur on the fish target detection results.
Optionally, AOD-Net is a deep-learning method based on the atmospheric scattering model. It folds the two parameters of global atmospheric light and medium transmissivity into a single parameter that is estimated directly, so errors are neither accumulated nor amplified. Images enhanced by AOD-Net have clear contours and rich colors, without over-restoration.
In the embodiment, one or more of the mean value, the variance, the mean gradient and the image entropy of the image are adopted to compare the images to be detected before and after the AOD-Net is enhanced so as to evaluate the quality of the images to be detected before and after the enhancement.
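A minimal sketch of these four quality indicators for a grayscale image; the exact definitions of mean gradient and entropy used by the embodiment are not given, so the common forms are assumed:

```python
import numpy as np

def quality_metrics(img):
    """Mean, variance, mean gradient and entropy of a grayscale uint8
    image -- the indicators used here to compare the image before and
    after AOD-Net enhancement."""
    img = img.astype(np.float64)
    mean, var = img.mean(), img.var()
    # mean gradient: average magnitude of horizontal/vertical differences
    gx = np.diff(img, axis=1)
    gy = np.diff(img, axis=0)
    mean_grad = (np.abs(gx).mean() + np.abs(gy).mean()) / 2.0
    # image entropy from the normalized 256-bin histogram
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()
    return mean, var, mean_grad, entropy

# A horizontal gradient ramp: every gray level appears equally often.
ramp = np.tile(np.arange(256, dtype=np.uint8), (256, 1))
m, v, g, h = quality_metrics(ramp)
print(m, g, h)   # 127.5 0.5 8.0
```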
In addition, before the image to be detected is input into the feature extraction model in the target detection model, the image to be detected can be subjected to geometric transformation so as to enrich the features of the image to be detected.
During training of the target detection model, image enhancement and geometric transformation may likewise be applied to the sample images; geometric transformation of the sample images yields a rich sample data set and effectively avoids overfitting of the model.
Optionally, the geometric transformation includes flipping, rotating, scaling, and the like, and the embodiment is not limited to the manner of geometric transformation.
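A few of these geometric transformations can be sketched with NumPy (flips and right-angle rotations only; scaling would require an interpolation routine from, e.g., OpenCV or Pillow and is omitted here):

```python
import numpy as np

def augment(img):
    """Yield simple geometric variants of an image array."""
    yield img
    yield np.fliplr(img)      # horizontal flip
    yield np.flipud(img)      # vertical flip
    yield np.rot90(img, k=1)  # rotate 90 degrees
    yield np.rot90(img, k=2)  # rotate 180 degrees

img = np.arange(12).reshape(3, 4)
views = list(augment(img))
print(len(views))   # 5
```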
The embodiment can effectively improve the contrast of the image to be detected by carrying out image enhancement on the image to be detected and the sample image, reduce the phenomenon of image blurring caused by too many white masks during underwater imaging and further improve the accuracy of a target detection result. In addition, the sample set can be expanded by carrying out geometric transformation on the sample image so as to effectively avoid the over-fitting phenomenon of the target detection model, improve the performance of the target detection model and further improve the accuracy of the target detection result.
On the basis of the above embodiments, the loss function of the target detection model in this embodiment is the Focal Loss function.
Specifically, in a highly dense environment a single fish target occupies only a small pixel area of the image, so there are many small targets and much redundant background. That is, positive samples are few and negative samples are many, causing class imbalance.
As shown in fig. 2, to address this problem, the present embodiment uses the Focal Loss function as the loss function to balance the positive and negative samples.
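A sketch of the binary Focal Loss; the α and γ values below are the common defaults from the focal-loss literature, not settings stated by this embodiment:

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    p: predicted foreground probabilities, y: 0/1 labels."""
    p_t = torch.where(y == 1, p, 1 - p)
    alpha_t = torch.where(y == 1, torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()

# An easy, confidently classified negative (p=0.1, y=0) contributes far
# less loss than a hard positive (p=0.1, y=1), which counters the flood
# of easy background samples:
easy = focal_loss(torch.tensor([0.1]), torch.tensor([0]))
hard = focal_loss(torch.tensor([0.1]), torch.tensor([1]))
print(easy.item() < hard.item())   # True
```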
In addition, a non-maximum suppression (NMS) method is adopted to optimize the target detection model.
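Greedy non-maximum suppression can be sketched as follows; the (x1, y1, x2, y2) box format and the 0.5 overlap threshold are illustrative assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it by more
    than `thresh`, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2]
```

The second box overlaps the first with IoU ≈ 0.68 and is suppressed, while the distant third box survives.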
The underwater fish target detection device provided by the present invention is described below, and the underwater fish target detection device described below and the underwater fish target detection method described above may be referred to in correspondence to each other.
As shown in fig. 5, the present embodiment provides an underwater fish target detection apparatus, which includes an obtaining module 501 and a target detection module 502, wherein:
the obtaining module 501 is configured to input an image to be detected into a feature extraction model in a target detection model, and obtain a plurality of feature maps of different scales of the image to be detected from feature maps output by each layer of the feature extraction model;
the image to be detected is an image of a fish target with a plurality of different scales, which needs to be subjected to target detection. The source of the image to be detected can be an image of the underwater fish target acquired in real time or an image of the underwater fish target acquired in advance. The present embodiment does not specifically limit the source of the image to be detected.
Optionally, the image to be detected in this embodiment is an image acquired by the underwater image acquisition device in a factory intensive culture scene. The underwater image acquisition device may be a camera or a robot, and the embodiment is not particularly limited thereto.
Before an image to be detected is input into a feature extraction model in a target detection model, the target detection model needs to be trained first. When the target detection model is trained, a data set needs to be constructed.
The data construction method in the present embodiment will be described below by taking an example of constructing a data set using an image of underwater fish taken by a camera.
According to the embodiment, an underwater video acquisition device is built in the field culture pond, and underwater fish images are shot in the culture pond in advance through the underwater video acquisition device to construct a data set.
Preferably, the underwater video capture device in this embodiment includes a support and an underwater camera. The height and the angle of the underwater camera in water are adjusted by the support, so that the fish targets can be shot in the visual field of the underwater camera as much as possible.
The shape and size of the culture pond for collecting the data set can be selected or set according to actual requirements.
Further, the present embodiment is not limited to the number, breed, and length of fish in the culture pond.
The period of collecting sample data can be set according to actual requirements.
In the data acquisition stage, to obtain more diversified data and enrich the data set, this embodiment collects images of underwater fish under multiple underwater environments, including natural daytime illumination, artificial light sources at night, and camera light sources at night; this example does not specifically limit the environments.
In this way, a large number of underwater fish videos are collected. Image frames are then extracted from the videos with video-processing software and annotated with an image labeling tool to obtain the target detection labels of the frames. The labeling tool may be LabelImg or the like.
In addition, the embodiment also eliminates the image frames with serious object ghosting caused by high-speed movement of fishes in water and the image frames with few fish objects or difficult labeling so as to obtain high-quality effective image frames to form a data set.
Wherein, the distribution density of the fish targets in each image frame can be medium density or high density. The number of data sets, medium-density or high-density image frames may be plural, and this embodiment does not specifically limit this. The number of fish targets in the data set may be plural, and this embodiment is not particularly limited thereto.
The size of the image frame may also be set according to practice.
In addition, the number of fishes in the image frame with medium density or high density can be set according to actual requirements.
In the actual simulation process, 80% of the image frames may be selected as training samples (i.e., sample images) and 10% as validation samples to verify the usability and performance of the target detection model; the remaining 10% may serve as the test set, i.e., the images to be detected.
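The 80/10/10 split described above might be implemented as follows; the shuffling and fixed seed are assumptions for illustration:

```python
import random

def split_dataset(frames, seed=0):
    """80/10/10 train/validation/test split of the labeled image frames."""
    frames = list(frames)
    random.Random(seed).shuffle(frames)
    n = len(frames)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (frames[:n_train],
            frames[n_train:n_train + n_val],
            frames[n_train + n_val:])

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))   # 80 10 10
```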
In summary, in the data set construction process, firstly, an underwater fish video is acquired by using an underwater camera, and then, a picture of a fish target is extracted and labeled by using video editing software and an image labeling tool, so as to construct a data set and provide a data basis for training a target detection model.
During training, the target detection model can be optimized with stochastic gradient descent (SGD) or a similar strategy. The hyper-parameters of the target detection model can be set according to actual requirements.
Then, the image to be detected is input into a feature extraction model of the trained target detection model, and after the image to be detected is subjected to multiple convolution in the feature extraction model, multiple deconvolution operations are carried out to obtain a plurality of feature maps with different scales. Wherein, each feature map comprises the features of the fish targets with different scales.
In this embodiment, multi-scale feature extraction of the image to be detected yields feature maps of several different scales that retain the features of fish targets of every scale, effectively avoiding both the loss of small fish targets during detection and the difficulty of detecting larger fish targets from incomplete features.
The target detection module is used for inputting the feature maps with different scales into the target detection model and outputting a target detection result of the image to be detected; the target detection model is obtained by training with a sample image as a sample and a target detection label of the sample image as a sample label; the image to be detected comprises images of a plurality of fish targets with different scales.
The target detection model comprises a feature extraction model, a classification model and a regression model. The classification model is used for identifying the fish targets in the image to be detected, and the regression model is used for obtaining the bounding box of each fish target in the image to be detected.
The characteristic diagrams with different scales can be sequentially input into the target detection model, the target detection result of each characteristic diagram is output, and the target detection results of the characteristic diagrams with different scales are superposed to obtain the target detection result of the image to be detected.
Or inputting a plurality of feature maps with different scales into the target detection model, and outputting a target detection result of the image to be detected.
In this embodiment, the target detection algorithm is implemented by using python language or C language.
Preferably, when implementing the target detection algorithm using Python language, the target detection algorithm may be implemented based on a Pytorch framework. Wherein, the hardware equipment can be set according to actual requirements.
In addition, this embodiment uses precision (Precision), recall (Recall) and mAP (mean average precision) as metrics to verify the effectiveness of the underwater fish target detection method. The formulas are as follows:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

mAP = Σ_{k=1}^{N} p(k) · Δr(k)
wherein TP denotes samples whose label is positive and whose classification result is positive, FP denotes samples whose label is negative but whose classification result is positive, and FN denotes samples whose label is positive but whose classification result is negative; N is the number of images to be detected, p(k) is the precision value for the k-th image to be detected, and Δr(k) is the change in recall from the (k−1)-th image to the k-th image.
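Under these definitions, precision, recall and the AP summation can be sketched as follows; the numeric inputs in the usage lines are made-up examples:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Sum of p(k) * (r(k) - r(k-1)) over the detection list, matching
    the summation form above (r(0) is taken as 0)."""
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

p, r = precision_recall(tp=8, fp=2, fn=2)
print(p, r)                                       # 0.8 0.8
print(average_precision([1.0, 0.5], [0.5, 1.0]))  # 0.75
```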
In order to optimize the investment of aquaculture production links, intensive aquaculture for implementing high-density aquaculture and mass production is becoming a standard mode of aquaculture to obtain maximum economic and environmental benefits. Nowadays, high and new technologies such as big data, internet of things, cloud computing, artificial intelligence and the like, and intelligent equipment are deeply integrated with modern agriculture. The method has more and more applications in production, processing, transportation, sale and other links in the fishery breeding process, and the production and operation efficiency is greatly improved. Among them, computer vision technology is widely used in the fields of automatic identification and classification, production state monitoring, etc. in aquaculture because of its advantages of rapidness, objectivity and high precision.
The underwater fish target detection method of this embodiment is compared with existing computer-vision techniques to verify its effectiveness. As can be seen from Table 1, this embodiment achieves the highest precision, recall and mean average precision on the images to be detected, so its target detection results are more accurate.
In the embodiment, multi-scale feature extraction is performed on the image to be detected through the target detection model, feature maps of multiple scales are obtained, and a target detection result of the image to be detected can be accurately obtained by combining the feature maps of multiple different scales.
On one hand, the image to be detected is input into the feature extraction model of the target detection model, where multi-scale deconvolution operations produce a plurality of feature maps at different scales, so that fish targets of every scale in the image are fully represented; this effectively alleviates both the loss of small fish targets during detection and the difficulty of detecting larger fish targets from incomplete features. On the other hand, performing target detection with the combination of feature maps at different scales makes the detection result more accurate.
On the basis of the above embodiment, the feature extraction model in this embodiment includes deconvolution modules and downsampling modules. Correspondingly, the obtaining module is specifically configured to: pass the image to be detected sequentially through the downsampling modules to obtain the feature map output by the last downsampling module; pass that feature map sequentially through the deconvolution modules to obtain the feature map output by each deconvolution module; and take the feature maps output by the deconvolution modules as the feature maps of the image to be detected. The feature maps output by the different deconvolution modules have different scales; each downsampling module downsamples its input, and each deconvolution module deconvolves its input.
On the basis of the foregoing embodiment, the obtaining module in this embodiment is further configured to: as the feature map output by the last downsampling module passes sequentially through the deconvolution modules, fuse the feature map output by each deconvolution module with the feature map output by the downsampling module corresponding to that deconvolution module; and input the fusion result into the immediately following deconvolution module to acquire the feature map it outputs; wherein each deconvolution module is associated in advance with a downsampling module.
On the basis of the above embodiment, the present embodiment further includes a fusion module specifically configured to: for any deconvolution module, deconvolve the feature map output by the immediately preceding deconvolution module, and perform a first convolution operation on the deconvolved feature map to obtain the feature map after the first convolution operation; perform a second convolution operation and a third convolution operation, respectively, on the feature map output by the downsampling module corresponding to the deconvolution module, to obtain the feature maps after the second and third convolution operations; and fuse the feature maps after the first, second and third convolution operations, wherein these three feature maps have the same number of channels and the same size.
On the basis of the foregoing embodiment, the fusion module in this embodiment is further configured to perform a dot product operation on the feature map after the first convolution operation and the feature map after the second convolution operation, and to fuse the result of the dot product operation with the feature map after the third convolution operation.
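A minimal NumPy sketch of this fusion step follows. The 1×1 convolutions standing in for the first, second and third convolution operations, the random weights, and element-wise addition as the final fusion are assumptions made for illustration; the embodiment specifies only that the dot product (element-wise) of the first two maps is fused with the third:

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution over a (C, H, W) feature map: mixes channels only,
    preserving spatial size — so all three branches stay the same shape."""
    return np.einsum('oc,chw->ohw', weight, x)

def fuse(deconv_map, down_map, seed=0):
    """Fuse a deconvolution-module output with the output of its
    corresponding downsampling module, as described in the embodiment."""
    rng = np.random.default_rng(seed)
    c = deconv_map.shape[0]
    w1, w2, w3 = (rng.standard_normal((c, c)) for _ in range(3))
    f1 = conv1x1(deconv_map, w1)  # first convolution (on the deconvolved map)
    f2 = conv1x1(down_map, w2)    # second convolution (on the downsampled map)
    f3 = conv1x1(down_map, w3)    # third convolution (on the downsampled map)
    # dot product (element-wise) of f1 and f2, then fused with f3;
    # addition is an assumed fusion choice for this sketch
    return f1 * f2 + f3

rng = np.random.default_rng(1)
out = fuse(rng.standard_normal((4, 8, 8)), rng.standard_normal((4, 8, 8)))
print(out.shape)  # (4, 8, 8)
```

Because all three branch outputs share the same channel count and spatial size, the element-wise product and the final fusion are well defined, as the embodiment requires.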
On the basis of the above embodiments, the present embodiment further includes a preprocessing module specifically configured to: preprocessing the image to be detected; the preprocessing comprises image enhancement and/or geometric transformation of the image to be detected.
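As a hedged sketch of the preprocessing module, the following NumPy snippet uses min-max contrast stretching as the image enhancement and a horizontal flip as the geometric transformation; these particular operations are illustrative assumptions, since the embodiment does not fix which enhancement or transformation is applied:

```python
import numpy as np

def preprocess(image, enhance=True, flip=True):
    """Preprocess the image to be detected: optional contrast stretch
    (image enhancement) and/or horizontal flip (geometric transformation)."""
    out = image.astype(float)
    if enhance:
        lo, hi = out.min(), out.max()
        if hi > lo:  # avoid division by zero on constant images
            out = (out - lo) / (hi - lo) * 255.0
    if flip:
        out = out[:, ::-1]
    return out

img = np.array([[0, 10], [20, 40]])
print(preprocess(img))  # values stretched to [0, 255], columns reversed
```

Contrast stretching is a common choice for murky underwater footage, and geometric transformations double as training-time augmentation.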
On the basis of the above embodiments, the loss function of the target detection model in this embodiment is a Focal Loss function.
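For reference, the binary form of the Focal Loss can be written as FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); a NumPy sketch follows, where alpha = 0.25 and gamma = 2 are the commonly used defaults rather than values specified by this patent:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary Focal Loss. The (1 - p_t)**gamma factor down-weights
    well-classified examples so training focuses on hard targets
    (e.g. small or blurred fish)."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)        # numerical stability
    p_t = np.where(y == 1, p, 1.0 - p)      # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

easy = focal_loss(np.array([0.99]), np.array([1]))  # confident, correct
hard = focal_loss(np.array([0.10]), np.array([1]))  # badly misclassified
print(easy < hard)  # True: hard examples dominate the loss
```

This property is why Focal Loss suits dense detection with many easy background regions and few fish targets.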
Fig. 6 illustrates a physical structure diagram of an electronic device. As shown in Fig. 6, the electronic device may include: a processor (processor) 601, a communication interface (Communications Interface) 602, a memory (memory) 603 and a communication bus 604, wherein the processor 601, the communication interface 602 and the memory 603 communicate with one another via the communication bus 604. The processor 601 may invoke logic instructions in the memory 603 to perform the underwater fish target detection method, the method comprising: inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps of different scales of the image to be detected from the feature maps output by each layer of the feature extraction model; and inputting the feature maps of different scales into the target detection model and outputting a target detection result of the image to be detected; wherein the target detection model is obtained by training with a sample image as a sample and the target detection label of the sample image as the sample label, and the image to be detected comprises images of a plurality of fish targets of different scales.
In addition, the logic instructions in the memory 603 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the underwater fish target detection method provided by the above methods, the method comprising: inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model; inputting the feature maps with different scales into the target detection model, and outputting a target detection result of the image to be detected; the target detection model is obtained by training with a sample image as a sample and a target detection label of the sample image as a sample label; the image to be detected comprises images of a plurality of fish targets with different scales.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the underwater fish target detection method provided above, the method comprising: inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model; inputting the feature maps with different scales into the target detection model, and outputting a target detection result of the image to be detected; the target detection model is obtained by training with a sample image as a sample and a target detection label of the sample image as a sample label; the image to be detected comprises images of a plurality of fish targets with different scales.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An underwater fish target detection method is characterized by comprising the following steps:
inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model;
inputting the feature maps with different scales into the target detection model, and outputting a target detection result of the image to be detected;
the target detection model is obtained by training with a sample image as a sample and a target detection label of the sample image as a sample label; the image to be detected comprises images of a plurality of fish targets with different scales.
2. The underwater fish target detection method of claim 1, wherein the feature extraction model includes a deconvolution module and a downsampling module;
correspondingly, the inputting the image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model, includes:
sequentially passing the image to be detected through each down-sampling module to obtain a feature map output by the last down-sampling module;
sequentially passing the feature map output by the last downsampling module through each deconvolution module to obtain the feature map output by each deconvolution module;
taking the feature map output by each deconvolution module as the feature map of the image to be detected;
wherein the feature maps output by the deconvolution modules are feature maps of different scales, the down-sampling module is used for down-sampling the input thereof, and the deconvolution module is used for deconvolving the input thereof.
3. The underwater fish target detection method according to claim 2, wherein the step of sequentially passing the feature map output by the last downsampling module through each deconvolution module to obtain the feature map output by each deconvolution module comprises:
fusing, as the feature map output by the last downsampling module passes sequentially through the deconvolution modules, the feature map output by each deconvolution module with the feature map output by the downsampling module corresponding to that deconvolution module;
inputting the fusion result into a deconvolution module which is close to the back of each deconvolution module, and acquiring a feature map output by the back deconvolution module; wherein the deconvolution module is pre-associated with the downsampling module.
4. The underwater fish target detection method according to claim 3, wherein the fusing of the feature map output by each deconvolution module with the feature map output by the downsampling module corresponding to that deconvolution module comprises:
for any deconvolution module, performing deconvolution on the feature map output by the deconvolution module immediately before the deconvolution module, and performing first convolution operation on the feature map after deconvolution to obtain a feature map after the first convolution operation;
respectively performing second convolution operation and third convolution operation on the feature map output by the down-sampling module corresponding to the deconvolution module to obtain a feature map after the second convolution operation and a feature map after the third convolution operation;
fusing the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation;
the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation have the same channel number, and the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation have the same size.
5. The underwater fish target detection method according to claim 4, wherein the fusing the feature map after the first convolution operation, the feature map after the second convolution operation, and the feature map after the third convolution operation includes:
and performing dot product operation on the feature map after the first convolution operation and the feature map after the second convolution operation, and fusing the feature map after the dot product operation and the feature map after the third convolution operation.
6. The underwater fish target detection method according to any one of claims 1 to 5, wherein, before the image to be detected is input into the feature extraction model in the target detection model and the plurality of feature maps of different scales of the image to be detected are acquired from the feature maps output by each layer of the feature extraction model, the method further comprises:
preprocessing the image to be detected;
the preprocessing comprises image enhancement and/or geometric transformation of the image to be detected.
7. The underwater fish target detection method as claimed in any one of claims 1 to 5, wherein the loss function of the target detection model is a Focal Loss function.
8. An underwater fish target detection device, comprising:
the acquisition module is used for inputting an image to be detected into a feature extraction model in a target detection model and acquiring a plurality of feature maps with different scales of the image to be detected from feature maps output by each layer of the feature extraction model;
the target detection module is used for inputting the feature maps with different scales into a target detection model and outputting a target detection result of the image to be detected;
the target detection model is obtained by training with a sample image as a sample and a target detection label of the sample image as a sample label; the image to be detected comprises images of a plurality of fish targets with different scales.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the underwater fish target detection method as claimed in any one of claims 1 to 7 are implemented when the program is executed by the processor.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the underwater fish target detection method according to any one of claims 1 to 7.
CN202110406987.7A 2021-04-15 2021-04-15 Underwater fish target detection method and device Active CN113191222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110406987.7A CN113191222B (en) 2021-04-15 2021-04-15 Underwater fish target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110406987.7A CN113191222B (en) 2021-04-15 2021-04-15 Underwater fish target detection method and device

Publications (2)

Publication Number Publication Date
CN113191222A true CN113191222A (en) 2021-07-30
CN113191222B CN113191222B (en) 2024-05-03

Family

ID=76977258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110406987.7A Active CN113191222B (en) 2021-04-15 2021-04-15 Underwater fish target detection method and device

Country Status (1)

Country Link
CN (1) CN113191222B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240686A (en) * 2022-02-24 2022-03-25 深圳市旗扬特种装备技术工程有限公司 Smart fishery monitoring system
CN114419364A (en) * 2021-12-24 2022-04-29 华南农业大学 Intelligent fish sorting method and system based on deep feature fusion
CN117522951A (en) * 2023-12-29 2024-02-06 深圳市朗诚科技股份有限公司 Fish monitoring method, device, equipment and storage medium
CN118038259A (en) * 2024-04-10 2024-05-14 华南农业大学 Method, system, equipment and medium for identifying death of fish cultured in fishpond

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423745A (en) * 2017-03-27 2017-12-01 浙江工业大学 Fish activity classification method based on neural network
CN110263732A (en) * 2019-06-24 2019-09-20 京东方科技集团股份有限公司 Multiscale target detection method and device
CN110533029A (en) * 2019-08-02 2019-12-03 杭州依图医疗技术有限公司 Determine the method and device of target area in image
CN111476190A (en) * 2020-04-14 2020-07-31 上海眼控科技股份有限公司 Target detection method, apparatus and storage medium for unmanned driving
CN111611861A (en) * 2020-04-22 2020-09-01 杭州电子科技大学 Image change detection method based on multi-scale feature association
CN111612008A (en) * 2020-05-21 2020-09-01 苏州大学 Image segmentation method based on convolution network
CN111898659A (en) * 2020-07-16 2020-11-06 北京灵汐科技有限公司 Target detection method and system
CN112308856A (en) * 2020-11-30 2021-02-02 深圳云天励飞技术股份有限公司 Target detection method and device for remote sensing image, electronic equipment and medium
CN112364855A (en) * 2021-01-14 2021-02-12 北京电信易通信息技术股份有限公司 Video target detection method and system based on multi-scale feature fusion
CN112381030A (en) * 2020-11-24 2021-02-19 东方红卫星移动通信有限公司 Satellite optical remote sensing image target detection method based on feature fusion
WO2021031066A1 (en) * 2019-08-19 2021-02-25 中国科学院深圳先进技术研究院 Cartilage image segmentation method and apparatus, readable storage medium, and terminal device
CN112528782A (en) * 2020-11-30 2021-03-19 北京农业信息技术研究中心 Underwater fish target detection method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵琰; 刘荻; 赵凌君: "Infrared dim and small target detection in complex environment based on YOLOv3", 航空兵器 (Aero Weaponry), no. 06, pages 29-34 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419364A (en) * 2021-12-24 2022-04-29 华南农业大学 Intelligent fish sorting method and system based on deep feature fusion
CN114240686A (en) * 2022-02-24 2022-03-25 深圳市旗扬特种装备技术工程有限公司 Smart fishery monitoring system
CN117522951A (en) * 2023-12-29 2024-02-06 深圳市朗诚科技股份有限公司 Fish monitoring method, device, equipment and storage medium
CN117522951B (en) * 2023-12-29 2024-04-09 深圳市朗诚科技股份有限公司 Fish monitoring method, device, equipment and storage medium
CN118038259A (en) * 2024-04-10 2024-05-14 华南农业大学 Method, system, equipment and medium for identifying death of fish cultured in fishpond

Also Published As

Publication number Publication date
CN113191222B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN113191222B (en) Underwater fish target detection method and device
Fernandes et al. Deep Learning image segmentation for extraction of fish body measurements and prediction of body weight and carcass traits in Nile tilapia
Mohamed et al. Msr-yolo: Method to enhance fish detection and tracking in fish farms
Labao et al. Cascaded deep network systems with linked ensemble components for underwater fish detection in the wild
Tseng et al. Detecting and counting harvested fish and identifying fish types in electronic monitoring system videos using deep convolutional neural networks
CN108686978B (en) ARM-based fruit category and color sorting method and system
CN112598713A (en) Offshore submarine fish detection and tracking statistical method based on deep learning
Yu et al. Counting method for cultured fishes based on multi-modules and attention mechanism
Liu et al. Deep learning based research on quality classification of shiitake mushrooms
CN110598658A (en) Convolutional network identification method for sow lactation behaviors
CN107948586A (en) Trans-regional moving target detecting method and device based on video-splicing
CN111339902A (en) Liquid crystal display number identification method and device of digital display instrument
Yu et al. Non-contact weight estimation system for fish based on instance segmentation
Isa et al. CNN transfer learning of shrimp detection for underwater vision system
Hou et al. Detection and localization of citrus fruit based on improved You Only Look Once v5s and binocular vision in the orchard
CN114596584A (en) Intelligent detection and identification method for marine organisms
Wang et al. SE-COTR: A novel fruit segmentation model for green apples application in complex orchard
CN116778482B (en) Embryo image blastomere target detection method, computer equipment and storage medium
Mirani et al. Object recognition in different lighting conditions at various angles by deep learning method
CN109684953A (en) The method and device of pig tracking is carried out based on target detection and particle filter algorithm
Zhang et al. Fully automatic system for fish biomass estimation based on deep neural network
Wang et al. Biological characters identification for hard clam larva based on the improved YOLOX-s
Chan et al. Real-time Detection of Aquarium Fish Species Using YOLOv4-tiny on Raspberry Pi 4
Zhang et al. Msgnet: multi-source guidance network for fish segmentation in underwater videos
Wu et al. Super-resolution fusion optimization for poultry detection: a multi-object chicken detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant