CN117765482B - Garbage identification method and system for garbage enrichment area of coastal zone based on deep learning - Google Patents

Garbage identification method and system for garbage enrichment area of coastal zone based on deep learning

Info

Publication number
CN117765482B
Authority
CN
China
Prior art keywords
network model
improved
garbage
window
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410195184.5A
Other languages
Chinese (zh)
Other versions
CN117765482A (en)
Inventor
于迅
彭士涛
胡健波
何建斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Research Institute for Water Transport Engineering MOT
Original Assignee
Tianjin Research Institute for Water Transport Engineering MOT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Research Institute for Water Transport Engineering MOT filed Critical Tianjin Research Institute for Water Transport Engineering MOT
Priority to CN202410195184.5A
Publication of CN117765482A
Application granted
Publication of CN117765482B
Active legal status
Anticipated expiration legal status

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image processing, and discloses a garbage identification method and system for a coastal zone garbage enrichment area based on deep learning. The method comprises the following steps: acquiring an original image of the coastal zone garbage enrichment area, and preprocessing the original image to divide it into N data sets of different scales; using an improved Swin Transformer layer as the backbone layer of the Mask2Former network model to establish an improved Mask2Former network model; training the improved Mask2Former network model with the N data sets of different scales; and inputting the image to be detected into the trained improved Mask2Former network model for garbage identification. Feature extraction windows of different scales can adapt to targets of different pixel sizes, avoiding wasted computation and overfitting, enlarging the receptive field and improving the segmentation precision of the network model, and thereby improving garbage recognition precision.

Description

Garbage identification method and system for garbage enrichment area of coastal zone based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a garbage identification method and a garbage identification system for a coastal zone garbage enrichment zone based on deep learning.
Background
Coastal zone waste refers to persistent, man-made or processed solid waste in the coastal environment. The amount of waste in coastal zones is increasing dramatically and has become a major concern worldwide because of its significant potential impact on coastal systems, marine life and human health. The increase of garbage in the coastal zone also harms beach-city tourism, seriously damaging a city's image and reducing tourism income. Therefore, it is necessary to monitor coastal zone waste. The sources of coastal zone garbage are complex and can be roughly divided into three types: direct discarding and deposition by people around the coastal zone; transport from surrounding areas through storm drains and coastal runoff; and transport from the marine system to the coast by factors such as wind, waves and tides. At present, monitoring with unmanned aerial vehicle aerial images is a common method in the field of coastal zone garbage monitoring, but a standardized automatic method for processing the aerial images is still required to reduce monitoring costs.
Conventional methods for automatically processing such images include image thresholding and the Random Forest (RF) method, which process the image, extract features, and then use a classification algorithm to determine the presence or absence of an object. However, thresholding typically relies on a selected threshold whose choice has a large impact on the result, and it does not handle complex scenes well, such as images with varied lighting conditions, complex textures, or large variations. Random forests, in turn, have high computational resource requirements on large-scale data sets, making training and inference difficult to apply in many scenarios. In the prior art, Mask2Former with a Swin Transformer as its backbone layer performs the self-attention operation only within a window, which reduces computational complexity. However, the Swin Transformer uses a fixed, limited window size, resulting in a limited receptive field that cannot adapt well to targets of different scales during feature extraction.
Therefore, there is a need for a garbage identification method for a coastal zone garbage enrichment area based on deep learning that can adapt feature extraction to targets of different scales, avoid wasted computation and overfitting, and improve the receptive field and segmentation precision of the network model, thereby improving garbage identification precision.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a garbage identification method and system for a coastal zone garbage enrichment area based on deep learning, which can adapt feature extraction to targets of different scales, avoid wasted computation and overfitting, and improve the receptive field and segmentation precision of the network model, thereby improving garbage identification precision.
The invention provides a garbage identification method of a coastal zone garbage enrichment zone based on deep learning, which comprises the following steps:
S1, acquiring an original image of a garbage enrichment area of a coastal zone;
S2, preprocessing an original image, and dividing the preprocessed original image into N data sets with different scales;
S3, based on the Mask2Former network model, using an improved Swin Transformer layer as the backbone layer of the Mask2Former network model, and establishing an improved Mask2Former network model; the improved Swin Transformer layer comprises M feature extraction windows of different scales, wherein M=N;
S4, training the improved Mask2Former network model by utilizing N data sets with different scales to obtain a trained improved Mask2Former network model;
S5, inputting the image to be detected into a trained improved Mask2Former network model to carry out garbage identification.
Further, S2, preprocessing the original image, dividing the preprocessed original image into N data sets of different scales, and dividing the data set of each scale into a training set and a validation set, includes:
S21, performing image-cutting processing on the original image, labeling the cut original image with its corresponding real label, and taking the cut original image and its corresponding real label as a data group;
S22, dividing the data groups into N data sets of different scales according to the categories of the real labels.
Further, in S22, the data groups are divided into N data sets of different scales according to the category of the real label, wherein the categories of the real labels include: ocean, beach, garbage, vegetation, biology and background.
Further, S22, dividing the data groups into N data sets of different scales according to the category of the real label includes:
wherein N is 3;
taking the data groups whose real-label category is biology as the first data set;
taking the data groups whose real-label categories are garbage and vegetation as the second data set;
taking the data groups whose real-label categories are ocean and beach as the third data set.
Further, after S22, the method further includes:
S23, dividing the data in the data sets of the N scales into a training set and a validation set respectively according to a preset proportion.
Further, S3, based on the Mask2Former network model, using the improved Swin Transformer layer as the backbone layer of the Mask2Former network model, and establishing an improved Mask2Former network model, wherein the improved Swin Transformer layer comprises M feature extraction windows of different scales, and the M feature extraction windows of different scales include:
taking a feature extraction window with a scale within a first scale range as a first window;
taking a feature extraction window with the scale within a second scale range as a second window;
and taking the feature extraction window with the scale within the third scale range as a third window.
Further, S4, training the improved Mask2Former network model with the N data sets of different scales to obtain a trained improved Mask2Former network model, includes:
S41, inputting the data groups in the training sets into the improved Mask2Former network model for training; S41 specifically comprises:
S411, training a first window in the improved Mask2Former network model through the data groups in the training set of the first data set;
S412, training a second window in the improved Mask2Former network model through the data groups in the training set of the second data set;
S413, training a third window in the improved Mask2Former network model through the data groups in the training set of the third data set;
S42, comparing the output of the improved Mask2Former network model with the data groups in the validation set of the corresponding data set, and calculating whether the error value is smaller than a preset value;
S43, if the error value is smaller than a preset value, training the improved Mask2Former network model is completed, and a trained improved Mask2Former network model is obtained; if the error value is not smaller than the preset value, training the improved Mask2Former network model is continued.
The invention also provides a garbage recognition system for a coastal zone garbage enrichment area based on deep learning, which is used for executing the above garbage recognition method for a coastal zone garbage enrichment area based on deep learning; the system comprises:
the image acquisition module is used for acquiring an original image of the garbage enrichment area of the coastal zone;
The image processing module is used for preprocessing an original image and dividing the preprocessed original image into N data sets with different scales;
the network model building module is used for establishing an improved Mask2Former network model by taking an improved Swin Transformer layer as the backbone layer of the Mask2Former network model, based on the Mask2Former network model; the improved Swin Transformer layer comprises M feature extraction windows of different scales, wherein M=N; and for training the improved Mask2Former network model with the N data sets of different scales to obtain a trained improved Mask2Former network model;
the image recognition module is used for inputting the image to be detected into the trained improved Mask2Former network model to carry out garbage recognition.
The embodiment of the invention has the following technical effects:
the invention provides an improved network model based on Mask2Former: a Swin Transformer layer is selected as the backbone layer of the Mask2Former network model and is further improved by integrating multiple shifted windows of different scales into it, and the collected data sets are classified by scale, so that the network model can adapt to targets of different pixel sizes through feature extraction windows of different scales. This avoids wasted computation and overfitting, extracts coastal zone garbage features more accurately, and realizes accurate segmentation of the coastal zone garbage enrichment area.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a garbage identification method of a coastal zone garbage enrichment zone based on deep learning, which is provided by the embodiment of the invention;
FIG. 2 is a schematic diagram of a Mask2Former network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the Swin Transformer layer structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the improved Swin Transformer layer structure according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an improved Mask2Former network model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of feature extraction based on an improved Mask2Former network model according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a garbage recognition system for a coastal zone garbage enrichment area based on deep learning.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the invention, are within the scope of the invention.
Fig. 1 is a flowchart of a garbage identification method for a garbage enrichment area of a coastal zone based on deep learning, which is provided by an embodiment of the present invention, referring to fig. 1, specifically includes:
S1, acquiring an original image of the coastal zone garbage enrichment area.
Specifically, the coastal zone garbage enrichment area is photographed and sampled by an unmanned aerial vehicle to obtain an original image of the coastal zone garbage enrichment area. Only raw images taken over the garbage enrichment area are used, which reduces the complexity of model training caused by varied backgrounds.
Illustratively, the garbage enrichment area may be photographed and sampled using a DJI Phantom 4 Pro+ V2.0 drone, and the pixel size of the sampled original image may be 4864×3648.
S2, preprocessing the original image, and dividing the preprocessed original image into N data sets with different scales.
S21, performing image-cutting processing on the original image, labeling the cut original image with its corresponding real label, and taking the cut original image and its corresponding real label as a data group.
Specifically, to adapt to the input of the deep learning network, the original image is preprocessed. The preprocessing includes: cutting the original image into images with a pixel size of 1024×1024; labeling the cut images with their corresponding real labels; and taking each cut image and its corresponding real label as a data group.
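As an illustration of this step, the following sketch tiles a raw aerial image into non-overlapping 1024×1024 crops; the file layout, function name, and the choice to drop incomplete border tiles are assumptions, not part of the patent:

```python
# Illustrative sketch of the image-cutting step: split a raw aerial image
# into non-overlapping 1024x1024 tiles; border remainders that do not fill
# a whole tile are dropped (an assumption of this sketch).
from pathlib import Path
from PIL import Image

TILE = 1024

def cut_image(src: Path, dst_dir: Path) -> None:
    img = Image.open(src)
    w, h = img.size
    dst_dir.mkdir(parents=True, exist_ok=True)
    for top in range(0, h - TILE + 1, TILE):
        for left in range(0, w - TILE + 1, TILE):
            tile = img.crop((left, top, left + TILE, top + TILE))
            tile.save(dst_dir / f"{src.stem}_{top}_{left}.png")
```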
S22, dividing the data groups into N data sets of different scales according to the categories of the real labels.
Illustratively, the categories of the real labels may include: ocean, beach, garbage, vegetation, biology and background; the names and the number of the real-label categories can be customized according to the actual situation.
Illustratively, N may be set to 3; dividing the data groups into 3 data sets of different scales according to the category of the real label then includes:
taking the data groups whose real-label category is biology as the first data set;
taking the data groups whose real-label categories are garbage and vegetation as the second data set;
taking the data groups whose real-label categories are ocean and beach as the third data set.
S23, dividing the data in the data sets of the N scales into a training set and a validation set respectively according to a preset proportion.
Illustratively, the preset ratio may be set as desired, such as 5:1 or 8:2. Taking a preset ratio of 5:1 as an example: the data in the first data set is divided into a first training set and a first validation set at a ratio of 5:1; the data in the second data set is divided into a second training set and a second validation set at 5:1; the data in the third data set is divided into a third training set and a third validation set at 5:1; and so on, until the data in all N scale data sets are divided into training and validation sets at a ratio of 5:1.
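The grouping and 5:1 split described above can be sketched as follows; the category spellings, tuple layout and fixed seed are illustrative assumptions:

```python
# Illustrative sketch: group (image, mask) pairs into three scale-specific
# data sets by annotation category, then split each 5:1 into training and
# validation subsets.
import random

SCALE_GROUPS = {
    "first": {"biology"},
    "second": {"garbage", "vegetation"},
    "third": {"ocean", "beach"},
}

def split_datasets(samples, ratio=5, seed=0):
    """samples: list of (image_path, label_path, category) tuples."""
    rng = random.Random(seed)
    out = {}
    for name, cats in SCALE_GROUPS.items():
        group = [s for s in samples if s[2] in cats]
        rng.shuffle(group)
        n_val = len(group) // (ratio + 1)   # 5:1 -> one sixth for validation
        out[name] = {"val": group[:n_val], "train": group[n_val:]}
    return out
```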
S3, based on the Mask2Former network model, the improved Swin Transformer layer is used as the backbone layer of the Mask2Former network model, and the improved Mask2Former network model is built.
Specifically, FIG. 2 is a schematic diagram of the Mask2Former network model provided by an embodiment of the present invention, FIG. 3 is a schematic diagram of the Swin Transformer layer provided by an embodiment of the present invention, FIG. 4 is a schematic diagram of the improved Swin Transformer layer provided by an embodiment of the present invention, and FIG. 5 is a schematic diagram of the improved Mask2Former network model provided by an embodiment of the present invention. Referring to FIGS. 2-5, based on the Mask2Former network model, a Swin Transformer layer is selected as the backbone layer of the Mask2Former network model and is further improved: multiple windows of different scales are integrated into the Swin Transformer layer to obtain the improved Swin Transformer layer, from which the improved Mask2Former network model is established. The improved Swin Transformer layer includes M feature extraction windows of different scales, with M=N.
Further, with continued reference to FIG. 4, given the input feature map Y^{L-1}, normalization is first performed by the LayerNorm (LN) layer; the normalized feature map then passes through the TW-MSA (three-window multi-head self-attention) module to obtain the feature map Y^L. The feature extraction formulas are:

$$Y^{L} = \mathrm{TW\text{-}MSA}\big(\mathrm{LN}(Y^{L-1})\big) + Y^{L-1}$$

$$Y^{L+1} = \mathrm{MLP}\big(\mathrm{LN}(Y^{L})\big) + Y^{L}$$

wherein LN is the normalization layer, MLP is the multi-layer perceptron, TW-MSA is the three-window multi-head self-attention module, Y^{L-1} is the input feature map, Y^L is the feature map obtained from Y^{L-1} through the normalization layer and the three-window multi-head self-attention module, and Y^{L+1} is the feature map obtained from Y^L through the normalization layer and the multi-layer perceptron. The multi-layer perceptron (MLP) is a neural network structure composed of multiple fully connected layers, in which every neuron of one layer is connected to all neurons of the next layer, and information propagates forward layer by layer.
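For illustration, a minimal PyTorch-style sketch of this residual block structure follows; it mirrors the two formulas above rather than any code given in the patent, the 4× MLP width is an assumed (Swin-typical) choice, and the TWMSA module is sketched after the next formula:

```python
# Minimal sketch of the improved Swin Transformer block: LayerNorm followed
# by three-window self-attention, then LayerNorm followed by an MLP, each
# wrapped in a residual connection, mirroring the two formulas above.
import torch.nn as nn

class ImprovedSwinBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = TWMSA(dim, num_heads)   # three-window attention, sketched below
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(           # two fully connected layers
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, y):                   # y: (B, H, W, C) feature map Y^{L-1}
        y = self.attn(self.norm1(y)) + y    # Y^L     = TW-MSA(LN(Y^{L-1})) + Y^{L-1}
        y = self.mlp(self.norm2(y)) + y     # Y^{L+1} = MLP(LN(Y^L)) + Y^L
        return y
```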
The TW-MSA module is an improvement on the W-MSA (window multi-head self-attention) module, which restricts the scope of the self-attention operation to a regular window; the feature extraction windows of different scales are distinguished by different superscripts in the formula. The three-window feature extraction formulas are:

$$Z^{L-1} = \mathrm{LN}(Y^{L-1})$$

$$Z^{L} = \mathrm{TW\text{-}MSA}(Z^{L-1}) = \mathrm{W\text{-}MSA}^{w_1}(Z^{L-1}) \oplus \mathrm{W\text{-}MSA}^{w_2}(Z^{L-1}) \oplus \mathrm{W\text{-}MSA}^{w_3}(Z^{L-1})$$

wherein Y^{L-1} is the input feature map, Z^{L-1} is the second feature map obtained by normalizing Y^{L-1}, and Z^L is the third feature map obtained by passing Z^{L-1} through the TW-MSA module, which consists of three W-MSA modules; the superscripts w_1, w_2 and w_3 respectively denote the three feature extraction windows of different scales, and ⊕ denotes the fusion of the parallel branch outputs. The three windows of the TW-MSA module are arranged in parallel, and the three windows of the STW-MSA (shifted three-window multi-head self-attention) module are likewise arranged in parallel.
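A minimal sketch of this parallel three-window attention follows, with window partitioning done in the standard Swin manner; fusing the branch outputs by summation (and omitting relative position bias) are assumptions of this sketch, since the patent states only that the branches run in parallel and their features are fused:

```python
# Sketch of W-MSA over non-overlapping windows and TW-MSA as three parallel
# W-MSA branches with window sizes 5, 7 and 9.
import torch.nn as nn
import torch.nn.functional as F

class WMSA(nn.Module):
    """Window multi-head self-attention: partition the feature map into
    non-overlapping windows and attend within each window independently."""
    def __init__(self, dim: int, num_heads: int, window: int):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                         # x: (B, H, W, C)
        B, H, W, C = x.shape
        w = self.window
        ph, pw = (-H) % w, (-W) % w               # pad to a multiple of w
        x = F.pad(x, (0, 0, 0, pw, 0, ph))
        Hp, Wp = H + ph, W + pw
        # partition into (B * num_windows, w*w, C) token groups
        x = x.view(B, Hp // w, w, Wp // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        x, _ = self.attn(x, x, x)                 # self-attention per window
        # undo the partition and drop the padding
        x = x.view(B, Hp // w, Wp // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, Hp, Wp, C)[:, :H, :W, :]

class TWMSA(nn.Module):
    """Three parallel W-MSA branches whose multi-receptive-field outputs
    are fused; summation is the fusion assumed for this sketch."""
    def __init__(self, dim: int, num_heads: int, windows=(5, 7, 9)):
        super().__init__()
        self.branches = nn.ModuleList(WMSA(dim, num_heads, w) for w in windows)

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)
```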
Illustratively, when N is set to 3, M is also set to 3, and the improved Swin Transformer layer includes three feature extraction windows of different scales, including:
taking a feature extraction window with a scale within a first scale range as a first window;
taking a feature extraction window with the scale within a second scale range as a second window;
and taking the feature extraction window with the scale within the third scale range as a third window.
The first, second and third scale ranges can be set according to the pixel size of the target to be detected: the larger the pixel area of the target, the larger the scale of its feature extraction window. Targets of different scales are better served by feature extraction windows of matching size, as in the mapping sketched below. For example, when the first window extracts features from the first data set, whose category is biology and whose targets are no larger than 2304 pixels, the scale of the first window may be set to 5×5; when the second window extracts features from the second data set, whose categories are garbage and vegetation and whose targets are between 2304 and 16384 pixels, the scale of the second window may be set to 7×7; and when the third window extracts features from the third data set, whose categories are ocean and beach and whose targets are no smaller than 16384 pixels, the scale of the third window may be set to 9×9.
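The threshold logic above can be stated in one function (thresholds taken from the description; the function name is illustrative):

```python
# Illustrative mapping from a target's pixel area to the feature extraction
# window scale used for it (2304 = 48*48 and 16384 = 128*128 pixel areas).
def window_for_target(pixel_area: int) -> int:
    if pixel_area <= 2304:     # biology-scale targets  -> first window
        return 5
    if pixel_area < 16384:     # garbage / vegetation   -> second window
        return 7
    return 9                   # ocean / beach regions  -> third window
```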
Further, with continued reference to fig. 5, the improved Mask2Former network model is further explained. The improved Mask2Former network model first uses a patch partition module to divide the input image into non-overlapping patches, whose size may be set to 8×8, and then converts their dimension through a Linear Embedding layer; these two transforms are implemented by a convolution operation with a kernel size of 8×8, a stride of 8 and an output dimension of 96. Next comes a stack of improved Swin Transformer blocks and Patch Merging layers, where Patch Merging is a 2× downsampling operation accomplished by recombination, concatenation and a linear layer. The Mask2Former decoder is consistent with the original network architecture.
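Under the stated hyperparameters, the input stem can be sketched as follows; realizing patch partition plus linear embedding as one strided convolution follows the description above, while the tensor layout of Patch Merging is an assumption of this sketch:

```python
# Patch partition + linear embedding as one strided convolution, and Patch
# Merging as 2x downsampling via 2x2 regrouping, LayerNorm and a linear layer.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 96, kernel_size=8, stride=8)  # (B,3,1024,1024) -> (B,96,128,128)

class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                  # x: (B, H, W, C) with even H, W
        # regroup each 2x2 spatial neighborhood into the channel dimension
        x = torch.cat([x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :],
                       x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]], dim=-1)
        return self.reduction(self.norm(x))  # (B, H/2, W/2, 2C)
```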
S4, training the improved Mask2Former network model with the N data sets of different scales to obtain a trained improved Mask2Former network model.
S41, inputting the data groups in the training sets into the improved Mask2Former network model for training.
Specifically, training is performed on feature extraction windows of different scales according to data sets of different scales, and S41 specifically includes:
S411, training the first window in the improved Mask2Former network model through the data groups in the training set of the first data set.
Specifically, the data in the training set of the first data set is input into the improved Mask2Former network model, the first window of the improved Mask2Former network model is selected to perform feature extraction on the input data, and the prediction result output by the first window is obtained.
S412, training the second window in the improved Mask2Former network model through the data groups in the training set of the second data set.
Specifically, the data in the training set of the second data set is input into the improved Mask2Former network model, the second window of the improved Mask2Former network model is selected to perform feature extraction on the input data, and the prediction result output by the second window is obtained.
S413, training the third window in the improved Mask2Former network model through the data groups in the training set of the third data set.
Specifically, the data in the training set of the third data set is input into the improved Mask2Former network model, the third window of the improved Mask2Former network model is selected to perform feature extraction on the input data, and the prediction result output by the third window is obtained.
S42, comparing the output of the improved Mask2Former network model with the data groups in the validation set of the corresponding data set, and calculating whether the error value is smaller than a preset value.
Specifically, the prediction result output by the first window is compared with the data groups in the validation set of the first data set; the prediction result output by the second window is compared with the data groups in the validation set of the second data set; and the prediction result output by the third window is compared with the data groups in the validation set of the third data set, to calculate whether the error value is smaller than the preset value.
S43, if the error value is smaller than a preset value, training the improved Mask2Former network model is completed, and a trained improved Mask2Former network model is obtained; if the error value is not smaller than the preset value, training the improved Mask2Former network model is continued.
Specifically, if the error values of the prediction results output by the first, second and third windows are all smaller than the preset value, training of the improved Mask2Former network model is complete and the trained improved Mask2Former network model is obtained; if any feature extraction window has an error value that is not smaller than the preset value, training and optimization of that window continue.
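A hedged sketch of this training routine follows; the optimizer, loss function, `window=` routing argument and pixel-wise validation error are all illustrative assumptions, since the patent does not specify them:

```python
# Sketch: train each window's branch on its scale-specific training set and
# stop once every branch's validation error falls below the preset value.
import torch

@torch.no_grad()
def validate(model, loader, window_id):
    wrong = total = 0
    for images, labels in loader:
        pred = model(images, window=window_id).argmax(dim=1)
        wrong += (pred != labels).sum().item()
        total += labels.numel()
    return wrong / max(total, 1)          # pixel-wise error rate

def train_improved_model(model, train_loaders, val_loaders,
                         preset=0.05, max_epochs=100):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        for window_id, loader in train_loaders.items():   # first/second/third
            for images, labels in loader:
                opt.zero_grad()
                loss = loss_fn(model(images, window=window_id), labels)
                loss.backward()
                opt.step()
        errors = {w: validate(model, v, w) for w, v in val_loaders.items()}
        if all(e < preset for e in errors.values()):      # all three windows pass
            break
    return model
```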
S5, inputting the image to be detected into a trained improved Mask2Former network model to carry out garbage identification.
Specifically, fig. 6 is a schematic diagram of feature extraction based on the improved Mask2Former network model. The image to be detected is input into the trained improved Mask2Former network model, which divides the image into non-overlapping regions and independently performs the multi-head attention operation within each feature extraction window. The feature extraction windows of different scales are connected in parallel and each extracts features from the image to be detected; the feature maps of the multiple receptive fields are then fused to obtain the garbage identification result for the coastal zone garbage enrichment area.
In the embodiment of the invention, an improved network model based on Mask2Former is provided: a Swin Transformer layer is selected as the backbone layer of the Mask2Former network model and is further improved by integrating multiple shifted windows of different scales, and the acquired data sets are classified by scale, so that the network model can adapt to targets of different pixel sizes through feature extraction windows of different scales. This avoids wasted computation and overfitting, extracts coastal zone garbage features more accurately, and realizes accurate segmentation of the coastal zone garbage enrichment area.
Further, the garbage identification effect of the garbage enrichment area of the coastal zone of the improved Mask2Former network model and other network models is evaluated and verified.
The evaluation indexes include the intersection over union (IoU) and the accuracy (Acc). IoU evaluates the percentage of overlap between the ground-truth region of each category and the region classified by the network model, and is calculated as:

$$IoU = \frac{Intersection}{Union}$$

wherein Intersection is the number of pixels common to the ground-truth region and the classified region of each class, and Union is the total number of pixels covered by either the ground-truth region or the classified region. IoU ranges from 0 to 1, where 0 means no overlap and 1 means a completely overlapping segmentation.
The accuracy Acc is calculated as:

$$Acc = \frac{TP + TN}{TP + FP + FN + TN}$$

wherein Acc represents the accuracy, TP represents the number of actual positives predicted as positive, FP the number of actual negatives predicted as positive, FN the number of actual positives predicted as negative, and TN the number of actual negatives predicted as negative.
Illustratively, taking garbage as a positive category, TP represents the number of targets labeled garbage predicted as garbage, FP represents the number of targets labeled non-garbage predicted as garbage, FN represents the number of targets labeled garbage predicted as non-garbage, and TN represents the number of targets labeled non-garbage predicted as non-garbage.
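Both metrics can be computed per class from label maps as in the following sketch (the array layout and the handling of an empty union are assumptions):

```python
# Per-class IoU and Acc from predicted / ground-truth label maps
# (integer numpy arrays of class ids with identical shapes).
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray, cls: int) -> float:
    p, g = pred == cls, gt == cls
    inter = np.logical_and(p, g).sum()        # Intersection
    union = np.logical_or(p, g).sum()         # Union (pixels in either region)
    return float(inter / union) if union else 1.0

def acc(pred: np.ndarray, gt: np.ndarray, cls: int) -> float:
    p, g = pred == cls, gt == cls
    tp = np.logical_and(p, g).sum()           # TP
    tn = np.logical_and(~p, ~g).sum()         # TN
    return float((tp + tn) / pred.size)       # (TP+TN)/(TP+FP+FN+TN)
```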
The validation sets are identified using the HRNet network model and Mask2Former models with different backbone layers; the recognition results are compared and evaluated as shown in Table 1:
TABLE 1 semantic segmentation results of different models on a validation set
In Table 1, mIoU is the mean intersection over union, that is, the average of the IoU values over all categories, and mAcc is the mean accuracy, that is, the average of the accuracy over all categories. Mask2Former (Swin-T-5) is the network model in which a Swin Transformer layer is selected as the backbone layer of the Mask2Former network model and the feature extraction window size is set to 5×5; Mask2Former (Swin-T-9) is the same but with the feature extraction window size set to 9×9; Mask2Former (Swin-S-5) is the network model with more Swin modules selected as the backbone layer and the feature extraction window size set to 5×5; Mask2Former (Three-Swin-T-5,7,9) is the improved Mask2Former network model of this scheme with three feature extraction windows of different scales, whose sizes are set to 5×5, 7×7 and 9×9 respectively.
According to the evaluation results in Table 1, the improved Mask2Former network model provided by this scheme achieves the best recognition precision on the validation sets, indicating that it can extract coastal zone garbage features more accurately and realize accurate segmentation of the coastal zone garbage enrichment area.
Fig. 7 is a schematic structural diagram of a garbage recognition system for a coastal zone garbage enrichment area based on deep learning according to an embodiment of the present invention; the system is used to execute the garbage recognition method for a coastal zone garbage enrichment area based on deep learning of any one of the above embodiments. As shown in fig. 7, the system includes:
the image acquisition module is used for acquiring an original image of the garbage enrichment area of the coastal zone.
Specifically, the image acquisition module may include an unmanned aerial vehicle, and is configured to take a photograph of the coastal zone garbage enrichment region, and obtain an original image of the coastal zone garbage enrichment region.
The image processing module is used for preprocessing the original image and dividing the preprocessed original image into N data sets with different scales.
The network model building module is used for establishing an improved Mask2Former network model by taking an improved Swin Transformer layer as the backbone layer of the Mask2Former network model, based on the Mask2Former network model; the improved Swin Transformer layer comprises M feature extraction windows of different scales, wherein M=N; and for training the improved Mask2Former network model with the N data sets of different scales to obtain the trained improved Mask2Former network model.
The image recognition module is used for inputting the image to be detected into the trained improved Mask2Former network model to carry out garbage recognition.
In the embodiment of the invention, the network model building module, based on the Mask2Former network model, selects a Swin Transformer layer as the backbone layer of the Mask2Former network model and further improves it by integrating multiple shifted windows of different scales into the Swin Transformer layer; the collected data sets are classified by scale through the image processing module, so that the network model can adapt to targets of different pixel sizes through feature extraction windows of different scales, avoiding wasted computation and overfitting, extracting coastal zone garbage features more accurately, and realizing accurate segmentation of the coastal zone garbage enrichment area.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application. As used in this specification, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, or apparatus that includes the element.
It should also be noted that the positional or positional relationship indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the positional or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or element in question must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Unless specifically stated or limited otherwise, the terms "mounted," "connected," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the essence of the corresponding technical solutions from the technical solutions of the embodiments of the present invention.

Claims (5)

1. The garbage identification method for the coastal zone garbage enrichment zone based on deep learning is characterized by comprising the following steps of:
S1, acquiring an original image of a garbage enrichment area of a coastal zone;
S2, preprocessing the original image, and dividing the preprocessed original image into N data sets with different scales;
Wherein N is 3;
taking the data groups whose real-label category is biology as a first data set;
taking the data groups whose real-label categories are garbage and vegetation as a second data set;
taking the data groups whose real-label categories are ocean and beach as a third data set;
S3, based on a Mask2Former network model, using an improved Swin Transformer layer as the backbone layer of the Mask2Former network model, and establishing an improved Mask2Former network model; wherein the improved Swin Transformer layer comprises M feature extraction windows of different scales, and M=N;
The M feature extraction windows with different scales comprise:
taking a feature extraction window with a scale within a first scale range as a first window;
taking a feature extraction window with the scale within a second scale range as a second window;
taking a feature extraction window with the scale within a third scale range as a third window;
given an input feature map Y^{L-1}, normalization is performed by the LN layer, and the normalized feature map passes through the three-window multi-head attention mechanism module to obtain the feature map Y^L; the feature extraction formulas are:

$$Y^{L} = \mathrm{TW\text{-}MSA}\big(\mathrm{LN}(Y^{L-1})\big) + Y^{L-1}$$

$$Y^{L+1} = \mathrm{MLP}\big(\mathrm{LN}(Y^{L})\big) + Y^{L}$$

wherein LN is the normalization layer, MLP is the multi-layer perceptron, TW-MSA is the three-window multi-head attention mechanism module, Y^{L-1} is the input feature map, Y^L is the feature map obtained from Y^{L-1} through the normalization layer and the three-window multi-head attention mechanism module, and Y^{L+1} is the feature map obtained from Y^L through the normalization layer and the multi-layer perceptron;
the three-window multi-head attention mechanism module is an improvement on the window multi-head attention mechanism module, which restricts the scope of the self-attention operation to a regular window; the feature extraction windows of different scales are described by different superscripts in the formula, and the three-window feature extraction formulas are:

$$Z^{L-1} = \mathrm{LN}(Y^{L-1})$$

$$Z^{L} = \mathrm{TW\text{-}MSA}(Z^{L-1}) = \mathrm{W\text{-}MSA}^{w_1}(Z^{L-1}) \oplus \mathrm{W\text{-}MSA}^{w_2}(Z^{L-1}) \oplus \mathrm{W\text{-}MSA}^{w_3}(Z^{L-1})$$

wherein Y^{L-1} is the input feature map, Z^{L-1} is the second feature map obtained by normalizing Y^{L-1}, and Z^L is the third feature map obtained by passing Z^{L-1} through the TW-MSA module, which consists of three W-MSA modules; W-MSA is the window multi-head attention mechanism module, the superscripts w_1, w_2 and w_3 respectively represent the three feature extraction windows of different scales, and ⊕ denotes the fusion of the parallel branch outputs; the three windows of the TW-MSA module are arranged in parallel, and the three windows of the three-scale window movement mechanism module STW-MSA are likewise arranged in parallel;
S4, training the improved Mask2Former network model by utilizing the N data sets with different scales to obtain a trained improved Mask2Former network model;
S41, inputting the data groups in the training sets into the improved Mask2Former network model for training; S41 specifically comprises:
S411, training the first window in the improved Mask2Former network model through the data groups in the training set of the first data set;
S412, training the second window in the improved Mask2Former network model through the data groups in the training set of the second data set;
S413, training the third window in the improved Mask2Former network model through the data groups in the training set of the third data set;
S42, comparing the output of the improved Mask2Former network model with the data groups in the validation set of the corresponding data set, and calculating whether the error value is smaller than a preset value;
S43, if the error value is smaller than a preset value, training the improved Mask2Former network model is completed, and a trained improved Mask2Former network model is obtained; if the error value is not smaller than the preset value, continuing to train the improved Mask2Former network model;
S5, inputting the image to be detected into a trained improved Mask2Former network model to carry out garbage identification.
2. The deep learning-based garbage identification method for a coastal zone garbage enrichment area according to claim 1, wherein S2, preprocessing the original image, dividing the preprocessed original image into N data sets of different scales, and dividing the data set of each scale into a training set and a validation set, comprises:
S21, performing image cutting processing on the original image, performing labeling processing on the original image after image cutting, labeling a corresponding real label of the original image after image cutting, and taking the original image after image cutting and the corresponding real label as a group of data sets;
S22, dividing the data groups into N data sets of different scales according to the category of the real label.
3. The deep learning-based garbage identification method for a coastal zone garbage enrichment area according to claim 2, wherein in S22 the data groups are divided into N data sets of different scales according to the category of the real label, and the categories of the real labels include: ocean, beach, garbage, vegetation, biology and background.
4. The deep learning based coastal zone waste identification method of claim 3, further comprising, after S22:
S23, dividing the data in the data sets of the N scales into a training set and a validation set respectively according to a preset proportion.
5. A deep learning-based garbage identification system for a garbage enrichment zone of a coastal zone for performing the deep learning-based garbage identification method of the garbage enrichment zone of the coastal zone as claimed in any one of the preceding claims 1 to 4, characterized in that the system comprises:
the image acquisition module is used for acquiring an original image of the garbage enrichment area of the coastal zone;
the image processing module is used for preprocessing the original image and dividing the preprocessed original image into N data sets with different scales;
the network model building module is used for establishing an improved Mask2Former network model by taking an improved Swin Transformer layer as the backbone layer of the Mask2Former network model, based on the Mask2Former network model; wherein the improved Swin Transformer layer comprises M feature extraction windows of different scales, and M=N; and for training the improved Mask2Former network model with the N data sets of different scales to obtain a trained improved Mask2Former network model;
the image recognition module is used for inputting the image to be detected into the trained improved Mask2Former network model to carry out garbage recognition.
CN202410195184.5A 2024-02-22 2024-02-22 Garbage identification method and system for garbage enrichment area of coastal zone based on deep learning Active CN117765482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410195184.5A CN117765482B (en) 2024-02-22 2024-02-22 Garbage identification method and system for garbage enrichment area of coastal zone based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410195184.5A CN117765482B (en) 2024-02-22 2024-02-22 Garbage identification method and system for garbage enrichment area of coastal zone based on deep learning

Publications (2)

Publication Number Publication Date
CN117765482A CN117765482A (en) 2024-03-26
CN117765482B true CN117765482B (en) 2024-05-14

Family

ID=90314751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410195184.5A Active CN117765482B (en) 2024-02-22 2024-02-22 Garbage identification method and system for garbage enrichment area of coastal zone based on deep learning

Country Status (1)

Country Link
CN (1) CN117765482B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259809B (en) * 2020-01-17 2021-08-17 五邑大学 Unmanned aerial vehicle coastline floating garbage inspection system based on DANet

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023092813A1 (en) * 2021-11-25 2023-06-01 苏州大学 Swin-transformer image denoising method and system based on channel attention
WO2023154320A1 (en) * 2022-02-08 2023-08-17 Senem Velipasalar Thermal anomaly identification on building envelopes as well as image classification and object detection
CN114648707A (en) * 2022-03-22 2022-06-21 交通运输部天津水运工程科学研究所 Coastline typical garbage rapid positioning and checking method based on unmanned aerial vehicle aerial photography technology
CN115205781A (en) * 2022-06-23 2022-10-18 成都民航空管科技发展有限公司 Transformer-based trans-scale target detection method and system
CN115272887A (en) * 2022-07-20 2022-11-01 广东工业大学 Coastal zone garbage identification method, device and equipment based on unmanned aerial vehicle detection
CN115661507A (en) * 2022-09-22 2023-01-31 北京建筑大学 Building garbage classification method and device based on optimized Swin Transformer network
CN115470828A (en) * 2022-09-23 2022-12-13 华东师范大学 Multi-lead electrocardiogram classification and identification method based on convolution and self-attention mechanism
CN116168197A (en) * 2023-01-28 2023-05-26 北京交通大学 Image segmentation method based on Transformer segmentation network and regularization training
CN117392638A (en) * 2023-10-13 2024-01-12 苏州煋海图科技有限公司 Open object class sensing method and device for serving robot scene
CN117523403A (en) * 2023-11-29 2024-02-06 中国农业科学院农业信息研究所 Method, system, equipment and medium for detecting spot change of residence map

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Masked-attention Mask Transformer for Universal Image Segmentation; Bowen Cheng et al.; arXiv; 2022-06-15; full text *
T-RODNet: Transformer for Vehicular Millimeter-Wave Radar Object Detection; Tiezhen Jiang et al.; IEEE Transactions on Instrumentation and Measurement; 2023-12-31; full text *
Household garbage classification*** based on Swin Transformer; 瞿定垚 et al.; 《电子制作》 (Electronic Production); 2023-01-31; full text *

Also Published As

Publication number Publication date
CN117765482A (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN112270347B (en) Medical waste classification detection method based on improved SSD
CN110033002B (en) License plate detection method based on multitask cascade convolution neural network
CN111368690B (en) Deep learning-based video image ship detection method and system under influence of sea waves
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN109685765B (en) X-ray film pneumonia result prediction device based on convolutional neural network
CN112819748B (en) Training method and device for strip steel surface defect recognition model
CN114255403A (en) Optical remote sensing image data processing method and system based on deep learning
CN116385958A (en) Edge intelligent detection method for power grid inspection and monitoring
CN113887472A (en) Remote sensing image cloud detection method based on cascade color and texture feature attention
CN115937659A (en) Mask-RCNN-based multi-target detection method in indoor complex environment
CN113052215A (en) Sonar image automatic target identification method based on neural network visualization
CN115661649A (en) Ship-borne microwave radar image oil spill detection method and system based on BP neural network
CN112949510A (en) Human detection method based on fast R-CNN thermal infrared image
CN111539931A (en) Appearance abnormity detection method based on convolutional neural network and boundary limit optimization
CN114882204A (en) Automatic ship name recognition method
CN113850151A (en) Method, device, terminal and storage medium for identifying distraction behavior of driver
CN113327253A (en) Weak and small target detection method based on satellite-borne infrared remote sensing image
KR101334858B1 (en) Automatic butterfly species identification system and method, and portable terminal having automatic butterfly species identification function using the same
CN117765482B (en) Garbage identification method and system for garbage enrichment area of coastal zone based on deep learning
CN117333669A (en) Remote sensing image semantic segmentation method, system and equipment based on useful information guidance
CN116844055A (en) Lightweight SAR ship detection method and system
CN113947780B (en) Sika face recognition method based on improved convolutional neural network
CN115546668A (en) Marine organism detection method and device and unmanned aerial vehicle
CN114927236A (en) Detection method and system for multiple target images
CN113095265A (en) Fungal target detection method based on feature fusion and attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant