CN114241003A - All-weather lightweight high-real-time sea surface ship detection and tracking method - Google Patents

All-weather lightweight high-real-time sea surface ship detection and tracking method

Info

Publication number
CN114241003A
CN114241003A (application CN202111531772.4A); granted publication CN114241003B
Authority
CN
China
Prior art keywords
network
feature
real
channel
target
Prior art date
Legal status
Granted
Application number
CN202111531772.4A
Other languages
Chinese (zh)
Other versions
CN114241003B (en)
Inventor
王德全
王小勇
宋文雯
吴戈
Current Assignee
Suzhou Apqi Internet Of Things Technology Co ltd
Chengdu Apuqi Technology Co ltd
Original Assignee
Suzhou Apqi Internet Of Things Technology Co ltd
Chengdu Apuqi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Apqi Internet Of Things Technology Co ltd, Chengdu Apuqi Technology Co ltd filed Critical Suzhou Apqi Internet Of Things Technology Co ltd
Priority to CN202111531772.4A priority Critical patent/CN114241003B/en
Publication of CN114241003A publication Critical patent/CN114241003A/en
Application granted granted Critical
Publication of CN114241003B publication Critical patent/CN114241003B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an all-weather lightweight high-real-time sea surface ship detection and tracking method, which comprises: performing adaptive feature fusion of the collected visible-light and infrared images with channel attention in a shallow network; processing the fused features through a lightweight-optimized detection network to locate and identify targets; and embedding the initial features and the network detection results into a twin (Siamese) network, where channel attention combined with separable convolution realizes lightweight tracking. The method effectively improves the stability of ship detection and tracking in all-weather open-sea environments while guaranteeing high real-time performance when monitoring high-resolution, large-range sea areas.

Description

All-weather lightweight high-real-time sea surface ship detection and tracking method
Technical Field
The invention relates to the technical field of computer-vision multi-modal target detection and tracking, and in particular to an all-weather lightweight high-real-time sea surface ship detection and tracking method.
Background
Target detection based on sea surface visible-light and infrared images obtained by image sensors is widely applied to maritime defense and safety, and a detection method combining visible-light and infrared images gives the detector better adaptability to illumination changes. The detection area is limited to the visual range from the shore or from a fixed point at sea; both sea surface motion carriers and offshore buoys serve as good system deployment sites.
Traditional sea surface target detection methods for video images mainly include weak-classifier cascades based on the boosting framework and target matching based on support vector machines. These methods have limited feature expression capability and cannot cope with the varying shapes of different ships on the sea surface, or with the changing sizes and angles at which ships appear in video. Meanwhile, under weather conditions such as rainstorms, strong wind and direct sunlight, the imaging quality of visible-light and infrared images degrades to different degrees. For these reasons, missed and false ship detections are severe.
Target detection methods based on deep learning use complex feature extraction networks and can acquire rich features to identify targets. Common detection frameworks adopt deep backbone networks such as VGG, DenseNet and MobileNet, and the feature pyramids they include are also computationally heavy. On the other hand, processing both visible-light and infrared images introduces additional computation. Under such conditions, when the detection network uses a large input size to obtain high-resolution detection, it cannot run in real time even on a high-compute server, and the time cost is worse when deployed on general edge devices. Therefore, monitoring a large sea area in an open environment and detecting and tracking the ships in it with high real-time performance, robustness and stability is a technical problem the prior art has yet to solve.
Target tracking algorithms fall into two broad categories: generative models and discriminative models. Generative methods describe target appearance with a generative model and then search candidate targets to minimize reconstruction error. Discriminative methods distinguish target from background by training a classifier. In recent years, discriminative methods represented by correlation filtering and deep learning have come to dominate target tracking. Correlation filtering is fast, but its feature extraction capability is inferior to deep learning; deep learning methods are computationally heavy and often lack real-time performance. How to improve tracking accuracy while guaranteeing speed is a problem to be solved.
Disclosure of Invention
The invention aims to provide an all-weather lightweight high-real-time sea surface ship detection and tracking method to solve the above problems.
To solve these technical problems, the invention provides an all-weather lightweight high-real-time sea surface ship detection and tracking method, which comprises: performing adaptive feature fusion of the collected visible-light and infrared images with channel attention in a shallow network; processing the fused features through a lightweight-optimized detection network to locate and identify targets; and embedding the initial features and the network detection results into a twin network, where channel attention combined with separable convolution realizes lightweight tracking.
Further, the collected visible-light and infrared images are normalized to a fixed size and input into the shallow network, where convolution modules perform downsampling; the two groups of target features extracted by the downsampling convolutions are concatenated (concat) along the channel dimension, the resulting target feature vector is taken as the input of channel attention, weights are computed, and a one-dimensional vector is output.
Further, the convolution module comprises a convolution layer of 3 × 3 convolution kernels, a BatchNorm layer and a Scale layer that accelerate training convergence and stability, and a rectified linear unit (ReLU) activation function that converts the linear transformation into a nonlinear one.
Further, channel attention adaptively assigns weights to the concatenated target features across channels: the channel attention module first obtains a one-dimensional feature value per channel through a Global Average Pooling layer; the correlation between channels is then computed by two 1 × 1 convolution layers, and the resulting one-dimensional vector, i.e. the weight of each channel, is multiplied onto the original channel features by the Scale layer.
Further, the detection network undergoes lightweight processing: the deep convolutional network is optimized, multi-scale target features are extracted, a multi-scale feature pyramid is constructed, and targets are located and identified.
Further, the deep convolution layers of the optimized deep convolutional network comprise a plurality of lightweight convolution modules with stride 1 and a plurality of convolution layers with stride 2, arranged alternately in sequence.
Further, a multi-scale feature pyramid is constructed from the multi-scale features for target feature computation; it is a five-layer feature pyramid whose feature layers halve in size layer by layer.
Furthermore, each scale's feature layer is connected to a spatial attention module, and all feature layers are normalized to one resolution through bilinear-interpolation upsampling layers and concat-spliced along channels; the features are then compressed into a two-dimensional spatial feature with one channel, the receptive field is enlarged with two dilated convolutions, the channel count is expanded back to match the previous input, and the weight at each spatial position is multiplied with the previous feature input to weight the target feature vector before network training.
Further, during training, the feature point closest to the center of the ground-truth box (GT box) is selected for multi-task loss computation, outputting the probability of the true class and the distances of the bounding box's four edges regressed relative to the feature point; the classification loss is computed with a softmax function that maps a k-dimensional real-valued vector to values in the range 0 to 1 for classification.
Further, the multi-scale normalized features of the tracking search frame and the template frame are extracted separately, their channel features are each enhanced through a channel attention module, and a candidate region and the template region are extracted according to the detection result of the detection network for correlation computation, split into a classification branch and a regression branch that each perform depthwise cross-correlation; if the detection score indicates the tracking target is lost, the tracking algorithm re-detects it with a global search strategy: starting from the top-left corner of the image, local search regions enlarged relative to the previous one are cropped and the whole image is searched in order, with a horizontal step of half the target length and a vertical step of half the target width.
The invention has the following beneficial effects: the all-weather lightweight high-real-time sea surface ship detection and tracking method uses a channel attention mechanism to dynamically adjust the weights of the visible-light and infrared images in the channels, realizing adaptive fusion under environmental illumination change, and uses a spatial attention mechanism so that the detection network focuses on sea surface ship regions and ignores the monotonous sea surface background. Meanwhile, lightweight techniques optimize the deep-learning convolutional-neural-network framework for detecting and tracking ships in video images, effectively guaranteeing the real-time performance and stability of ship detection and tracking over large sea areas.
In order to meet the requirement of monitoring a large-range sea area, redundancy removal of a lightweight convolution channel is carried out on the premise of high-resolution network input, the calculation amount is reduced, and the real-time performance is improved. And embedding the network detection result into a twin network, and realizing light tracking by utilizing the attention of a channel and the cooperation of separable convolution. The method effectively improves the stability of detecting and tracking the ship in all-weather open sea surface environment, and simultaneously ensures high real-time performance of monitoring high-resolution large-range sea areas.
Drawings
FIG. 1 is a schematic diagram of adaptive feature fusion of visible and infrared images of an all-weather lightweight high-real-time sea vessel detection and tracking method.
FIG. 2 is a schematic diagram of a lightweight convolution channel redundancy elimination module of an all-weather lightweight high-real-time marine vessel detection and tracking method.
FIG. 3 is a schematic diagram of a deep network feature space attention module of an all-weather lightweight high-real-time marine vessel detection and tracking method.
FIG. 4 is a schematic diagram of a lightweight feature pyramid network of an all-weather lightweight high-real-time marine vessel detection and tracking method.
FIG. 5 is a schematic diagram of the lightweight twin tracking network embedded in the detection network of the all-weather lightweight high-real-time marine vessel detection and tracking method.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For brevity, the following description omits technical common knowledge of those skilled in the art, and the method is described in detail with reference to FIGS. 1 to 5.
The all-weather lightweight high-real-time sea surface ship detection and tracking method comprises performing adaptive feature fusion of the collected visible-light and infrared images with channel attention in a shallow network. The visible-light and infrared images are fused in the shallow network rather than in a deep network, which reduces the computation of extracting features from the two modalities separately.
Specifically, the collected visible-light and infrared images are each normalized to a fixed size of 640 × 640 and input into the shallow network, where convolution modules perform downsampling separately on each stream; the two groups of target features extracted by the downsampling convolutions are concat-spliced along the channel dimension, the resulting target feature vector is taken as the input of channel attention, weights are computed, and a one-dimensional vector is output.
The convolution module comprises a convolution layer of 3 × 3 convolution kernels, a BatchNorm layer and a Scale layer that accelerate training convergence and stability, and a rectified linear unit (ReLU) activation function that converts the linear transformation into a nonlinear one.
Channel attention adaptively assigns weights to the concatenated target features across channels: the channel attention module first obtains a one-dimensional feature value per channel through a Global Average Pooling layer; the correlation between channels is then computed by two 1 × 1 convolution layers, and the resulting one-dimensional vector, i.e. the weight of each channel, is multiplied onto the original channel features by the Scale layer.
The method adopts channel attention in the shallow network to adaptively fuse the visible-light and infrared images, then extracts target features through a deep convolutional neural network; each convolution layer extracts features with a series of 3 × 3 convolution kernels, with a stride of 1 or 2. A stride of 2 downsamples the feature map, halving its resolution. Considering the demand for fast computation, three consecutive stride-2 convolution layers are placed in the shallow network right after the image input, achieving fast downsampling that reduces the image resolution and the computation.
In addition, in the design of the invention, the visible-light and infrared images are fused in the shallow network rather than in a deep network, reducing the computation of extracting features from the two modalities separately. The features extracted from the two image streams are concat-combined along channels and weighted with a channel attention mechanism; as the ambient illumination changes, the imaging quality of the visible-light and infrared images and their contribution to recognition change continuously, and adaptive channel fusion is realized through the changing weights.
Then the fused features are processed by the lightweight-optimized detection network to locate and identify targets. To meet the demand of monitoring a large sea area, the detection network undergoes lightweight optimization: channel redundancy is removed with lightweight convolution under high-resolution network input, reducing computation and improving real-time performance.
The detection network is lightweighted: the deep convolutional network is optimized, multi-scale target features are extracted, a multi-scale feature pyramid is constructed, and targets are located and identified. The deep convolution layers of the optimized network comprise a plurality of lightweight convolution modules with stride 1 and a plurality of convolution layers with stride 2, arranged alternately in sequence.
A multi-scale feature pyramid is constructed from the multi-scale features for target feature computation; it is a five-layer pyramid whose feature layers halve in size layer by layer, preferably 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5.
A spatial attention module is attached to each scale's feature layer, and all feature layers are normalized to one resolution through bilinear-interpolation upsampling layers and concat-spliced along channels; the features are then compressed into a two-dimensional spatial feature with one channel, the receptive field is enlarged with two dilated convolutions, the channel count is expanded back to match the previous input, and the weight at each spatial position is multiplied with the previous feature input to weight the target feature vector before network training.
During training, the feature point closest to the center of the ground-truth box is selected for multi-task loss computation, outputting the probability of the true class and the distances of the bounding box's four edges regressed relative to the feature point; the classification loss is computed with a softmax function that maps a k-dimensional real-valued vector to values in the range 0 to 1 for classification.
Specifically, multi-task loss is computed on the spliced feature map, with each ground-truth box matched to the feature point closest to its center; compared with the traditional approach of setting a large number of anchors to match ground-truth boxes, the computation is reduced 4 to 9 times. The classification loss uses a softmax function that maps a k-dimensional real-valued vector to values in 0 to 1 and classifies by their magnitude, while the bounding box directly regresses the 4 distances (top, bottom, left, right) from the corresponding feature point to the target box boundary.
The method uses the initial part of the detection network to extract the multi-scale normalized feature maps of the search frame and the template frame required for tracking, then enhances the effective channel features through channel attention module M1. A candidate region and the template region are provided according to the detection result of the detection network for correlation computation, and the classification branch and the regression branch each perform depthwise cross-correlation.
In practice, high-resolution network input is a necessary requirement for monitoring a large sea area and detecting small targets such as ships in it. However, once the input size of a deep neural network grows, the computation increases sharply and seriously hurts the real-time performance of the algorithm.
The invention applies lightweight optimization to the backbone network and the feature pyramid network of the detection framework, reducing computation while preserving the detection rate. The backbone removes channel redundancy from the traditional convolution layer structure with cheap linear operations; the feature pyramid attaches a spatial attention module at the input of each feature layer, weighting foreground objects to suppress the sea surface background, so redundant computation is removed without losing the network's detection precision.
The initial features and the network detection results are embedded into a twin network, and channel attention combined with separable convolution realizes lightweight tracking. This effectively improves the stability of ship detection and tracking in all-weather open-sea environments while guaranteeing high real-time performance when monitoring high-resolution, large-range sea areas.
The multi-scale normalized features of the tracking search frame and the template frame can be extracted separately and their channel features each enhanced through a channel attention module; the candidate region and the template region are extracted according to the detection result of the detection network and input into the region proposal network for correlation computation, which is split into a classification branch and a regression branch that each perform depthwise cross-correlation.
If the detection score indicates the tracking target is lost, the tracking algorithm re-detects it with a global search strategy: starting from the top-left corner of the image, local search regions enlarged relative to the previous one are cropped and the whole image is searched in order, with a horizontal step of half the target length and a vertical step of half the target width.
The twin network has two inputs, the detection frame and the template frame: the initially calibrated target image serves as the template frame, while the local search region in the detection frame is centered on the target position of the previous frame and cropped from the current frame at four times the target size. Multi-scale normalized feature maps are extracted through the same network as the detection network; the two feature streams pass through channel attention module M1 into the subsequent region proposal network, whose classification branch and regression branch each compute the correlation between template-frame and detection-frame features via depthwise convolution. This operation does not compute correlation between channels, greatly reducing computation, while the channel attention module M1 inserted beforehand highlights the effective channels.
In a specific implementation of the fast adaptive feature fusion of visible-light and infrared images, one 3-channel frame is read simultaneously from the visible-light sensor and from the infrared sensor, and each image is resized to 640 × 640 before being input into the network.
First, three consecutive stride-2 convolutions perform fast downsampling, quickly reducing the feature resolution by a factor of 2³ = 8; the output resolutions after the three convolutions are 320 × 320, 160 × 160 and 80 × 80, with 16, 32 and 64 output channels respectively. This setting reduces the computation at the high-resolution input end as much as possible while still meeting the need for low-abstraction shallow feature extraction, and quickly transitions to the high-level feature extraction stage, where the resolution is small and the computation relatively light.
In the fusion stage, the features extracted by the fast downsampling convolutions of the visible and infrared images are first concat-spliced: the feature resolution is 80 × 80 and each stream has 64 channels, giving 128 channels after splicing. This 128-channel feature vector serves as the input of channel attention, which computes the relative weight of each channel and outputs a one-dimensional vector of length 128.
The specific operations are: Global Average Pooling per channel, a 1 × 1 convolution with 64 output channels, a ReLU activation, a 1 × 1 convolution with 128 output channels, and a Sigmoid activation. These five steps compute the relative weight of each channel; the weights are multiplied onto the 128-channel feature vector by Scale, outputting a 128-channel feature vector that adaptively emphasizes the visible-light or infrared channels, realizing adaptive fusion.
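As a concrete reading of this fusion pipeline, the following PyTorch sketch wires together the fast downsampling stems and the channel attention gate described above; the BatchNorm placement in the stem and the module names (AdaptiveFusion, conv_bn_relu) are illustrative assumptions, not the patent's own code.

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride):
    # basic convolution module: 3x3 conv + BatchNorm (+ Scale) + ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class AdaptiveFusion(nn.Module):
    """Shallow-network fusion of visible and infrared inputs via channel attention."""
    def __init__(self):
        super().__init__()
        # fast downsampling stem: three stride-2 convs, 640x640 -> 80x80,
        # with 16/32/64 output channels as described in the text
        def stem():
            return nn.Sequential(conv_bn_relu(3, 16, 2),
                                 conv_bn_relu(16, 32, 2),
                                 conv_bn_relu(32, 64, 2))
        self.vis_stem, self.ir_stem = stem(), stem()
        # channel attention over the 128 concatenated channels:
        # GAP -> 1x1 conv (64) -> ReLU -> 1x1 conv (128) -> Sigmoid
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(128, 64, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 1), nn.Sigmoid(),
        )

    def forward(self, visible, infrared):
        f = torch.cat([self.vis_stem(visible), self.ir_stem(infrared)], dim=1)
        return f * self.attn(f)  # Scale: per-channel reweighting

fused = AdaptiveFusion()(torch.randn(1, 3, 640, 640), torch.randn(1, 3, 640, 640))
print(fused.shape)  # torch.Size([1, 128, 80, 80])
```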
Processing the fused features through the lightweight-optimized detection network to locate and identify targets comprises lightweighting the high-resolution detection network and performing target locating and classification with the lightweight feature pyramid network.
The adaptively fused feature output serves as the input of the following fully convolutional network, whose deep layers extract the features used to build the multi-scale feature pyramid, considering the required pyramid sizes of 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5.
The deep convolution layers comprise lightweight convolution modules with stride 1 and several convolution layers with stride 2, arranged alternately in sequence, in the following specific form:
three cascaded M2 convolution modules (3 × 3 convolution kernels, stride 1, channel number N = 128);
a convolution layer of 3 × 3 convolution kernels, stride 2, 256 channels;
three cascaded M2 convolution modules (3 × 3 convolution kernels, stride 1, N = 256);
a convolution layer of 3 × 3 convolution kernels, stride 2, 256 channels;
one M2 convolution module (3 × 3 convolution kernels, stride 1, N = 128);
a convolution layer of 3 × 3 convolution kernels, stride 2, 256 channels;
one M2 convolution module (3 × 3 convolution kernels, stride 1, N = 128);
a convolution layer of 3 × 3 convolution kernels, stride 2, 256 channels.
The M2 lightweight convolution module replaces the traditional convolution layer and avoids redundant channel computation in each convolution operation; its channel number N is smaller than that of a traditional convolution, yet the feature extraction capability is not lost, greatly saving computation.
The specific structure is shown in FIG. 2: the convolution with C input channels, N output channels and 3 × 3 kernels occupies the main computation, so it is set as a depthwise (depth-separable) convolution with the group parameter equal to the channel number N. The correlation of feature vectors between channels is then not computed, forming a cheap linear operation, while the cross-stage connection preserves the diversity of feature abstraction levels within the same feature vector.
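FIG. 2 itself is not reproduced here, so the following sketch is only one plausible reading of the M2 module, in the spirit of GhostNet-style cheap operations: a dense primary convolution produces part of the channels, a depthwise 3 × 3 convolution (groups equal to its channel count) produces the rest as a cheap linear operation, and a cross-stage concat joins them. The split ratio and all layer choices are assumptions.

```python
import torch
import torch.nn as nn

class M2Block(nn.Module):
    """Plausible sketch of the M2 lightweight convolution module (stride 1).
    Half the output comes from a dense 3x3 conv, half from a cheap depthwise
    3x3 conv applied to it; cross-stage concat keeps mixed abstraction levels."""
    def __init__(self, c_in, n_out):
        super().__init__()
        primary = n_out // 2  # assumes n_out is even
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, primary, 3, padding=1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            # depthwise: groups == channels, so no cross-channel correlation
            nn.Conv2d(primary, n_out - primary, 3, padding=1,
                      groups=primary, bias=False),
            nn.BatchNorm2d(n_out - primary), nn.ReLU(inplace=True))

    def forward(self, x):
        p = self.primary(x)
        return torch.cat([p, self.cheap(p)], dim=1)  # cross-stage concat

y = M2Block(128, 128)(torch.randn(1, 128, 80, 80))
print(y.shape)  # torch.Size([1, 128, 80, 80])
```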
In target locating and classification with the lightweight feature pyramid network, feature pyramids are built from feature layers of sizes 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5 to compute feature vectors of targets of different sizes; the high-resolution feature layers correspond to small targets and the low-resolution layers to large targets.
It should be noted that, since sea surface ship targets are relatively small, the feature resolution is only kept down to 5 × 5 and smaller levels are discarded to save computation.
The last feature layer at each of the 5 feature sizes in the network is selected as an input feature layer and connected to an M3 spatial attention module, improving the representation of target features against the monotonous sea surface background; all feature layers are then normalized to a uniform 80 × 80 resolution through computationally cheap bilinear-interpolation upsampling layers and concat-spliced along channels. Each layer's input features before splicing have 256 channels, and after splicing the channels are compressed (16 per layer × 5 layers = 80); targets of all sizes are classified and located on this feature vector.
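A minimal sketch of this pyramid-head layout follows, assuming each level is first reduced to 16 channels so that concatenating 5 levels yields 80 channels at 80 × 80; the 1 × 1 reduction convolutions and the align_corners choice are assumptions (the M3 spatial attention module of the next paragraph would sit in front of each level).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightFPNHead(nn.Module):
    """Sketch: upsample 5 pyramid levels to a uniform 80x80 by bilinear
    interpolation and concat-splice them along channels (16 x 5 = 80)."""
    def __init__(self, in_ch=256, reduced=16, levels=5):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(in_ch, reduced, 1) for _ in range(levels))

    def forward(self, feats):  # spatial sizes: 80, 40, 20, 10, 5
        out = [F.interpolate(r(f), size=(80, 80), mode='bilinear',
                             align_corners=False)
               for r, f in zip(self.reduce, feats)]
        return torch.cat(out, dim=1)  # 80 channels at 80x80

feats = [torch.randn(1, 256, s, s) for s in (80, 40, 20, 10, 5)]
print(LightFPNHead()(feats).shape)  # torch.Size([1, 80, 80, 80])
```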
The M3 spatial attention module takes an H × W × N feature as input and compresses it through a 1 × 1 convolution into a two-dimensional H × W map with one channel (height × width); two dilated convolutions (dilation 3, 3 × 3 kernels) then compute spatial feature correlation to enlarge the receptive field, and a final 1 × 1 convolution restores the channel count to match the input vector. Weighting of the feature vector is achieved by multiplying the input vector with the spatial attention vector via Scale.
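Read literally, the M3 module can be sketched as below; the final Sigmoid normalization is an assumption the text does not state explicitly.

```python
import torch
import torch.nn as nn

class M3SpatialAttention(nn.Module):
    """Sketch of M3: compress H x W x N to one channel with a 1x1 conv, enlarge
    the receptive field with two dilated 3x3 convs (dilation 3), restore N
    channels with a 1x1 conv, and rescale the input feature (Scale)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 1, 1),                              # N -> 1
            nn.Conv2d(1, 1, 3, padding=3, dilation=3), nn.ReLU(inplace=True),
            nn.Conv2d(1, 1, 3, padding=3, dilation=3), nn.ReLU(inplace=True),
            nn.Conv2d(1, channels, 1),                              # 1 -> N
            nn.Sigmoid(),  # assumed: squash the weights into (0, 1)
        )

    def forward(self, x):
        return x * self.body(x)  # weight every spatial position

print(M3SpatialAttention(256)(torch.randn(1, 256, 40, 40)).shape)
```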
During training, the feature point closest to the center of the ground-truth box is selected for multi-task loss computation, outputting the probability p_t of the true class and the distances t = (t_l, t_r, t_t, t_b) of the bounding box's four edges regressed relative to the feature point, where the subscripts l, r, t and b denote the distances from the feature point to the left, right, top and bottom of the target box.
The loss function is expressed as: L(p, k*, t, t*) = L_conf(p, k*) + α·L_loc(t, t*),
where k* is the true class and α is a weight.
The classification loss L_conf adopts the focal loss: FL(p_t) = -(1 - p_t)^γ · log(p_t).
The localization loss is L_loc(t, t*) = Σ_{i∈{l,r,t,b}} smooth_L1(t_i - t_i*),
where smooth_L1(x) = 0.5x² if |x| < 1, and |x| - 0.5 otherwise.
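A sketch of these losses under the stated formulas (the γ and α values are assumptions; the text does not fix them):

```python
import torch

def focal_loss(p_t, gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t), p_t = softmax prob. of the true class
    return -((1.0 - p_t) ** gamma) * torch.log(p_t)

def smooth_l1(x):
    # smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5
    return torch.where(x.abs() < 1, 0.5 * x ** 2, x.abs() - 0.5)

def multitask_loss(p_t, t, t_star, alpha=1.0):
    # L = L_conf + alpha * L_loc, summed over the four edge distances (l, r, t, b)
    return focal_loss(p_t).mean() + alpha * smooth_l1(t - t_star).sum(-1).mean()

loss = multitask_loss(torch.rand(8).clamp(0.01, 0.99),
                      torch.randn(8, 4), torch.randn(8, 4))
```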
In the lightweight twin tracking network embedded in the detection network, the classification branch and the regression branch each perform depthwise cross-correlation, i.e. a depth-separable operation, between the features of the template frame and the detection frame, written as:
A_cls(w×h×2k) = [φ(x)]_cls ⋆ [φ(z)]_cls,
A_reg(w×h×4k) = [φ(x)]_reg ⋆ [φ(z)]_reg,
where [φ(x)]_cls is the feature vector of the classification-branch search frame, [φ(z)]_cls is the feature vector of the classification-branch template frame, [φ(x)]_reg is the feature vector of the position-regression-branch search frame, and [φ(z)]_reg is the feature vector of the position-regression-branch template frame.
A_cls contains 2k channel vectors, each point of which represents a positive or negative activation, classified with the softmax loss. A_reg contains 4k channel vectors, each point representing the distances t_l, t_r, t_t, t_b between a feature point and the left, right, top and bottom edges of the ground-truth box; its loss function is the smooth L1 loss, with the same formula as in the detection part above. Meanwhile, the parameters of channel attention module M1 remain consistent with the first part above.
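Depthwise cross-correlation treats the template-frame feature as a per-channel convolution kernel over the search-frame feature, which is exactly why no inter-channel correlation is computed. A sketch follows (the feature sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Correlate each search-frame channel with the matching template-frame
    channel only (groups = channels), as in the twin tracking network."""
    b, c, h, w = search.shape
    x = search.reshape(1, b * c, h, w)
    kernel = template.reshape(b * c, 1, *template.shape[2:])
    out = F.conv2d(x, kernel, groups=b * c)  # one kernel per channel
    return out.reshape(b, c, out.shape[2], out.shape[3])

resp = depthwise_xcorr(torch.randn(1, 256, 31, 31), torch.randn(1, 256, 7, 7))
print(resp.shape)  # torch.Size([1, 256, 25, 25])
```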
If the detection score indicates the tracking target is lost, the tracking algorithm re-detects it with a global search strategy: starting from the top-left corner of the image, local search regions enlarged relative to the previous one are cropped and the whole image is searched in order, with a horizontal step of half the target length and a vertical step of half the target width.
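The re-detection scan can be sketched as a window generator; the enlargement factor of the search window relative to the target is an assumption (the text only says the region is enlarged):

```python
def global_search_windows(img_w, img_h, tgt_w, tgt_h, scale=2.0):
    """Slide an enlarged local search window over the whole image from the
    top-left corner: horizontal step = half the target length, vertical
    step = half the target width."""
    win_w, win_h = int(tgt_w * scale), int(tgt_h * scale)
    step_x, step_y = max(1, tgt_w // 2), max(1, tgt_h // 2)
    for y in range(0, max(1, img_h - win_h + 1), step_y):
        for x in range(0, max(1, img_w - win_w + 1), step_x):
            yield x, y, win_w, win_h  # crop (x, y, w, h) fed to the tracker

windows = list(global_search_windows(1920, 1080, 80, 40))
```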
The invention performs computation with the Caffe (Convolutional Architecture for Fast Feature Embedding) framework: the shallow part of the backbone network consists of the fast downsampling convolutions shown in FIG. 1, the deep part may be a 10-to-20-layer fully convolutional network depending on how the task balances deployment depth between precision and speed, and all convolution layers in the network are replaced by the lightweight convolution channel scheme shown in FIG. 2.
All convolution operations in the network consist of 4 parts by default: a convolution layer of 3 × 3 kernels convolving the previous layer's input feature vector with stride 1 or 2; a BatchNorm layer realizing normalization, computing new means and variances with a moving average during training (parameter use_global_stats set to false) and forcibly using the means and variances stored in the model during testing (use_global_stats set to true); a Scale layer realizing translation and scaling (parameter bias_term set to true); and a rectified linear unit ReLU activation function converting the linear transformation into a nonlinear one.
The formula is expressed as follows:
Xn=g(WTXn-1+bn),g(x)=ReLU(x)=max(0,x),
where n denotes the layer index, X the image features, W the convolution kernel weights, and g(x) the activation function, for which the rectified linear unit (ReLU) is adopted. Compared with a linear function, ReLU has stronger expressive power, especially in deep networks; although non-linear, its gradient over the non-negative interval is constant, so the vanishing-gradient problem does not arise and the model's convergence rate stays stable.
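In PyTorch terms (a rough equivalent of the Caffe layers named above, offered as an assumption-laden sketch), Caffe's BatchNorm + Scale pair maps to a single affine nn.BatchNorm2d, and the train/eval modes reproduce the use_global_stats switch:

```python
import torch.nn as nn

# model.train(): batch statistics + moving-average update (use_global_stats: false)
# model.eval():  the stored statistics are used forcibly (use_global_stats: true)
# affine=True supplies the Scale layer's scaling and shift (bias_term: true)
def conv_block(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out, affine=True),
        nn.ReLU(inplace=True),  # X_n = g(W^T X_{n-1} + b_n), g(x) = max(0, x)
    )
```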
The infrared and visible-light sensors are installed with binocular parallel optical registration; since ship targets are far from the camera, accurate alignment can be achieved. The images are calibrated, the camera parameters and image scale factors are acquired, and the offset of the infrared image relative to the visible-light image is computed. After cropping, the infrared and visible-light images overlap completely; this is conventional image registration and fusion and is not described further.
The network training of the invention adopts the following method:
1. training and testing data
The training set consists of two parts, where the pictures for pre-training the network come from the COCO dataset. COCO (Microsoft Common Objects in Context) is a large, rich object detection and segmentation dataset. Targeting scene understanding, it is mainly captured from complex daily scenes, with the targets in the images position-calibrated through accurate segmentation. It provides 80 categories, over 330,000 images of which more than 200,000 are labeled, and over 1.5 million object instances across the whole dataset.
After pre-training is finished, the detection tracker is trained.
Sea surface vessel data: sea surface video data collected by sensors deployed on the coast and on sea surface motion carriers are sliced into datasets of picture frames. Video frames are cut at intervals from multiple groups of videos collected in different periods, climates and illumination conditions, and the ship targets are manually labeled with an annotation tool. This finally yields 50,000 pairs of visible-light and infrared annotated images, in which more than 7,000 ships appear. 4/5 of the labeled ship data is extracted as training data for fine-tuning the detection network, and the remaining 1/5 of the images serve as the test set.
The OTB dataset is a video dataset of 100 sequences and a common benchmark for evaluating tracking performance. Eleven challenging visual tracking attributes are manually marked in the test sequences, including illumination change, scale change, occlusion, deformation, motion blur, fast motion, in-plane and out-of-plane rotation, out-of-view target motion, background clutter and low resolution.
The dataset contains 25% grayscale sequences, which helps the network training in the present invention where infrared images serve as input. The video sequences are split into single-frame images stored in different folders, and the position (ground-truth box) of the target object is manually annotated in a corresponding file. Thirty sequences extracted from the sea surface ship data are added; each sequence is divided into 5 segments, 4 used as training data and the remaining 1 as test data.
2. Training strategy
The detection network first uses images from the COCO dataset to pre-train the backbone for classification so that the network learns target feature representation parameters; the visible-light and infrared shallow network parts take the same image as input. Training runs on a server with 4 GTX 1080 Ti GPUs, with an initial learning rate of 0.001, 130,000 iterations, a batch size of 16, and stochastic gradient descent (SGD) as the optimizer.
Then fine-tuning is performed on the basis of the pre-trained model with the labeled ship dataset, so that the target detection network learns better localization and representation capability for the specific target of ships. The tracking network is likewise pre-trained on the COCO dataset and then fine-tuned with video sequences of the tracking dataset, so that the network learns continuous motion models of different objects; the hyper-parameters are adjusted and model performance optimized according to the evaluation results on the test set.
3. Deploying
The detection and tracking equipment designed by the invention is deployed on a server and on low-power edge devices respectively, with configuration parameters shown in Table 1. When deployed on server equipment, the model can be used directly; when deployed to low-power devices, the model can be quantized with ncnn, replacing floating-point operations with integer ones, which speeds up model inference, reduces computation memory and further increases computation speed.
Table 1: configuration parameters of the server and low-power edge deployments (the table appears as an image in the original publication and is not reproduced here).
In the all-weather lightweight high-real-time sea surface ship detection and tracking method, image frames acquired by the video sensor are normalized to a fixed size as input of the feature extraction network, and a high-resolution input size is chosen to ensure small ship targets are detected over a large monitoring range. Several consecutive stride-2 convolution layers realize fast downsampling, so the convolutional network does not waste large amounts of computation extracting shallow features at the high-resolution stage.
The visible and infrared images are separately passed through the fast convolutions; after the feature map size is reduced 8 times relative to the network input, the two sets of features are concat-superposed along the channel, and merging at this shallow stage of the network greatly reduces computation.
Meanwhile, channel attention adaptively assigns weights to the spliced features across channels: the channel attention module first obtains the one-dimensional feature values of all channels through a Global Average Pooling layer, then computes the inter-channel relations through two 1 × 1 convolution layers, the 1 × 1 convolutions replacing the traditional fully connected layer to reduce computation and increase speed. The two activation functions are ReLU and Sigmoid respectively; the resulting one-dimensional vector, i.e. each channel's weight, is multiplied onto the original channel features through a Scale layer, dynamically strengthening the more important feature channels, e.g. increasing the weight of the infrared feature channels when the ambient light darkens, realizing dynamic fusion of the visible-light and infrared images.
After dynamic fusion, the fused features pass through a series of stride-1 lightweight convolution modules, interleaved with a few stride-2 convolution layers for downsampling, replacing the conventional 3 × 3 convolution layers. DwConv is a depthwise separable convolution, a cheap linear operation with remarkably small computation; the remaining channels are directly concat-combined cross-stage, reducing redundant channel computation, and the small channel count of each convolution layer keeps the computation light.
Moreover, a 5-layer feature pyramid is constructed from multi-scale features, each feature layer halving in size. The traditional feature pyramid network (FPN) splices each scale's feature layer many times, costing large parameter counts and computation time. The invention instead upsamples each layer to a uniform size and splices them into a channel-reduced feature map; to avoid losing precision, a spatial attention module M3 is attached to the feature layers of different scales.
In summary, the invention uses a channel attention mechanism to dynamically adjust the weights of the visible-light and infrared images in the channels, realizing adaptive fusion under environmental illumination change, and uses a spatial attention mechanism so that the detection network focuses on sea surface ship regions and ignores the monotonous sea surface background. Meanwhile, lightweight techniques optimize the deep-learning convolutional-neural-network framework for detecting and tracking ships in video images, effectively guaranteeing the real-time performance and stability of ship detection and tracking over large sea areas.
In order to meet the requirement of monitoring a large-range sea area, redundancy removal of a lightweight convolution channel is carried out on the premise of high-resolution network input, the calculation amount is reduced, and the real-time performance is improved. And embedding the network detection result into a twin network, and realizing light tracking by utilizing the attention of a channel and the cooperation of separable convolution. The method effectively improves the stability of detecting and tracking the ship in all-weather open sea surface environment, and simultaneously ensures high real-time performance of monitoring high-resolution large-range sea areas.
The all-weather lightweight high-real-time sea surface ship detection and tracking method adopts a target detection framework based on a lightweight convolutional neural network to detect and track ships in video images, ensuring real-time performance under high-resolution video input and achieving monitoring of a wider sea area; meanwhile, adaptive fusion of visible-light and infrared images gives the detector better stability under illumination changes. The lightweight high-performance detection tracker can be deployed on a near-shore server or on low-power edge equipment of a marine motion carrier for real-time monitoring.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An all-weather lightweight high-real-time sea surface ship detection and tracking method, characterized by comprising the following steps:
S1, performing adaptive feature fusion of the collected visible-light and infrared images with channel attention in a shallow network;
S2, processing the fused features through a lightweight-optimized detection network to locate and identify targets;
S3, embedding the initial features and the network detection results into a twin network, and realizing lightweight tracking with channel attention combined with separable convolution.
2. The all-weather lightweight high-real-time marine vessel inspection and tracking method of claim 1, wherein: in step S1, the collected visible light and infrared images are normalized and converted into fixed size, and then input to a lower-layer network, and then convolved by a convolution module to realize downsampling;
and performing concat splicing on the two groups of target features extracted by the downsampling convolution on the channel, taking the obtained target feature vector as the input of the attention of the channel, calculating the weight, and outputting a one-dimensional vector.
3. The all-weather lightweight high-real-time marine vessel inspection and tracking method of claim 2, wherein: the convolution module comprises a convolution layer of 3 × 3 convolution kernels, a BatchNorm layer and a Scale layer that accelerate training convergence and stability, and a rectified linear unit (ReLU) activation function that converts the linear transformation into a nonlinear one.
4. The all-weather lightweight high-real-time marine vessel inspection and tracking method of claim 2 or 3, wherein: channel attention adaptively assigns weights to the concatenated target features across channels; the channel attention module obtains a one-dimensional feature value per channel through a Global Average Pooling layer; the correlation between channels is then computed by two 1 × 1 convolution layers, and the resulting one-dimensional vector, i.e. the weight of each channel, is multiplied onto the original channel features by the Scale layer.
5. The all-weather lightweight high-real-time marine vessel inspection and tracking method of claim 1, wherein: in step S2, the detection network is subjected to lightweight processing, the deep convolutional network is optimized, multi-scale target features are extracted, a multi-scale feature pyramid is constructed, and target positioning and identification are performed.
6. The all-weather lightweight high-real-time marine vessel inspection and tracking method of claim 5, wherein: the deep network convolutional layer of the optimized deep convolutional network comprises a plurality of lightweight convolutional modules with the step length of 1 and a plurality of convolutional layers with the step length of 2, and the lightweight convolutional modules and the convolutional layers are sequentially arranged in a staggered mode.
7. The all-weather lightweight high-real-time marine vessel inspection and tracking method of claim 5, wherein: and constructing a multi-scale feature pyramid by using the multi-scale features to calculate the target features, wherein the multi-scale feature pyramid is a five-layer feature pyramid, and the sizes of all feature layers are sequentially reduced by 2 times.
8. The all-weather lightweight high-real-time marine vessel detection and tracking method according to any one of claims 5 to 7, characterized in that: each scale's feature layer is connected to a spatial attention module, and all feature layers are normalized to one resolution through bilinear-interpolation upsampling layers and concat-spliced along channels; the features are then compressed into a two-dimensional spatial feature with one channel, the receptive field is enlarged with two dilated convolutions, the channel count is expanded back to match the previous input, and the weight at each spatial position is multiplied with the previous feature input to weight the target feature vector before network training.
9. The all-weather lightweight high-real-time marine vessel inspection and tracking method of claim 8, wherein: during training, the feature point closest to the center of the ground-truth box is selected for multi-task loss computation, outputting the probability of the true class and the distances of the bounding box's four edges regressed relative to the feature point; the classification loss is computed with a softmax function that maps a k-dimensional real-valued vector to values in the range 0 to 1 for classification.
10. The all-weather lightweight high-real-time marine vessel inspection and tracking method of claim 1, wherein: in step S3, the multi-scale normalized features of the tracking search frame and the template frame are extracted separately, their channel features are each enhanced through a channel attention module, the candidate regions and template regions are extracted according to the detection result of the detection network for correlation computation, split into a classification branch and a regression branch that each perform depthwise cross-correlation; if the detection score indicates the tracking target is lost, the tracking algorithm re-detects it with a global search strategy: starting from the top-left corner of the image, local search regions enlarged relative to the previous one are cropped and the whole image is searched in order, with a horizontal step of half the target length and a vertical step of half the target width.
CN202111531772.4A 2021-12-14 2021-12-14 All-weather lightweight high-real-time sea surface ship detection and tracking method Active CN114241003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111531772.4A CN114241003B (en) 2021-12-14 2021-12-14 All-weather lightweight high-real-time sea surface ship detection and tracking method

Publications (2)

Publication Number Publication Date
CN114241003A (en) 2022-03-25
CN114241003B CN114241003B (en) 2022-08-19

Family

ID=80756241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111531772.4A Active CN114241003B (en) 2021-12-14 2021-12-14 All-weather lightweight high-real-time sea surface ship detection and tracking method

Country Status (1)

Country Link
CN (1) CN114241003B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090279790A1 (en) * 2008-05-09 2009-11-12 Burge Mark J Multispectral iris fusion for enhancement and interoperability
CN108171752A (en) * 2017-12-28 2018-06-15 成都阿普奇科技股份有限公司 A kind of sea ship video detection and tracking based on deep learning
CN111291679A (en) * 2020-02-06 2020-06-16 厦门大学 Target specific response attention target tracking method based on twin network
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111639551A (en) * 2020-05-12 2020-09-08 华中科技大学 Online multi-target tracking method and system based on twin network and long-short term clues
CN111767882A (en) * 2020-07-06 2020-10-13 江南大学 Multi-mode pedestrian detection method based on improved YOLO model
CN112308883A (en) * 2020-11-26 2021-02-02 哈尔滨工程大学 Multi-ship fusion tracking method based on visible light and infrared images
CN112784767A (en) * 2021-01-27 2021-05-11 天津理工大学 Cell example segmentation algorithm based on leukocyte microscopic image
CN113297959A (en) * 2021-05-24 2021-08-24 南京邮电大学 Target tracking method and system based on corner attention twin network
CN113689464A (en) * 2021-07-09 2021-11-23 西北工业大学 Target tracking method based on twin network adaptive multilayer response fusion
CN113628249A (en) * 2021-08-16 2021-11-09 电子科技大学 RGBT target tracking method based on cross-modal attention mechanism and twin structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ye Hua et al., "Moving Target Detection Method Based on Complementary Fusion of Infrared and Visible Images", Infrared Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708236A (en) * 2022-04-11 2022-07-05 徐州医科大学 TSN and SSN based thyroid nodule benign and malignant classification method in ultrasonic image
CN116188999A (en) * 2023-04-26 2023-05-30 南京师范大学 Small target detection method based on visible light and infrared image data fusion
CN116664462A (en) * 2023-05-19 2023-08-29 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116664462B (en) * 2023-05-19 2024-01-19 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM

Also Published As

Publication number Publication date
CN114241003B (en) 2022-08-19

Similar Documents

Publication Publication Date Title
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN114241003B (en) All-weather lightweight high-real-time sea surface ship detection and tracking method
CN107609601B (en) Ship target identification method based on multilayer convolutional neural network
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN112949565B (en) Single-sample partially-shielded face recognition method and system based on attention mechanism
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN112766087A (en) Optical remote sensing image ship detection method based on knowledge distillation
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN111626993A (en) Image automatic detection counting method and system based on embedded FEFnet network
CN109919223B (en) Target detection method and device based on deep neural network
CN112149591B (en) SSD-AEFF automatic bridge detection method and system for SAR image
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
US20230162522A1 (en) Person re-identification method of integrating global features and ladder-shaped local features and device thereof
US8094971B2 (en) Method and system for automatically determining the orientation of a digital image
CN116385958A (en) Edge intelligent detection method for power grid inspection and monitoring
CN114529462A (en) Millimeter wave image target detection method and system based on improved YOLO V3-Tiny
CN116469020A (en) Unmanned aerial vehicle image target detection method based on multiscale and Gaussian Wasserstein distance
CN114782798A (en) Underwater target detection method based on attention fusion
Fan et al. A novel sonar target detection and classification algorithm
CN114882204A (en) Automatic ship name recognition method
CN114037737B (en) Neural network-based offshore submarine fish detection and tracking statistical method
CN115359376A (en) Pedestrian detection method of lightweight YOLOv4 under view angle of unmanned aerial vehicle
CN115100428A (en) Target detection method using context sensing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant