CN116311001A - Method, device, system, equipment and medium for identifying fish swarm behavior - Google Patents
Method, device, system, equipment and medium for identifying fish swarm behavior
- Publication number
- CN116311001A (application CN202310561907.4A)
- Authority
- CN
- China
- Prior art keywords
- target
- video
- fish
- feature
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/80—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
- Y02A40/81—Aquaculture, e.g. of fish
Abstract
The invention provides a fish swarm behavior recognition method, device, system, equipment and medium, relating to the field of image recognition. The method comprises: acquiring target image features, target audio features and target water quality features of a target video; and inputting the target image features, target audio features and target water quality features into a multi-modal fish swarm behavior recognition model to obtain the target fish swarm behavior corresponding to the target video. The multi-modal fish swarm behavior recognition model is trained on the sample image features, sample audio features and sample water quality features of each sample video, labeled with the sample fish swarm behavior of each sample video. By adopting multi-modal fusion to combine the features of image, audio and water quality data, the invention improves the anti-interference capability of feeding behavior recognition, analyzes feeding behavior from multiple directions and angles, accurately identifies the feeding state of the fish swarm, realizes precise feeding of the fish swarm, and reduces feed waste.
Description
Technical Field
The present invention relates to the field of image recognition, and in particular, to a method, apparatus, system, device, and medium for identifying fish school behaviors.
Background
In the prior art, fish swarm feeding behavior is judged from visual features alone, which are disturbed by illumination, water turbidity, water surface reflection, and aerator or artificial noise. As a result, the feeding behavior cannot be identified accurately, which in turn impairs feeding decisions.
Disclosure of Invention
The invention provides a method, device, system, equipment and medium for identifying fish swarm behaviors, to solve the technical problem of inaccurate identification of fish swarm feeding behavior in the prior art, and provides a multi-modal fusion algorithm that fuses video, audio and water quality parameters, realizing accurate prediction of fish swarm feeding behavior.
In a first aspect, the present invention provides a method for identifying fish school behaviors, including:
acquiring target image characteristics, target audio characteristics and target water quality characteristics of a target video;
inputting the target image characteristics, the target audio characteristics and the target water quality characteristics into a multi-modal fish swarm behavior recognition model, and obtaining target fish swarm behaviors corresponding to the target video, wherein the target fish swarm behaviors are output by the multi-modal fish swarm behavior recognition model;
The multi-modal fish swarm behavior recognition model is obtained by training on the sample image features, sample audio features and sample water quality features of each sample video, with the sample fish swarm behavior of each sample video as the label.
According to the fish swarm behavior recognition method provided by the invention, the method for acquiring the target image characteristics, the target audio characteristics and the target water quality characteristics of the target video comprises the following steps:
cutting the original fish school video according to a preset duration to obtain all target videos;
for each target video, extracting image features of the target video based on a dual-stream model and on a video encoder respectively to obtain a first image feature and a second image feature, and splicing the first image feature and the second image feature to obtain the target image feature of the target video;
performing audio feature extraction on the target video based on a pre-training audio neural network model to obtain target audio features of the target video;
text feature extraction is carried out on water quality data corresponding to a target video based on a text encoder, and target water quality features of the target video are obtained;
the water quality data includes a pH value, a dissolved oxygen value, and a temperature.
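The cutting step above can be sketched as follows. This is a minimal illustration with timestamps in seconds, not the patent's actual implementation; the assumption that a trailing remainder is dropped is ours:

```python
def split_into_clips(total_seconds, clip_seconds=150):
    """Cut a recording into consecutive fixed-length clips.

    A trailing remainder shorter than `clip_seconds` is dropped, on the
    assumption that every target video must have a uniform duration.
    """
    starts = range(0, total_seconds - clip_seconds + 1, clip_seconds)
    return [(s, s + clip_seconds) for s in starts]
```

For example, a 400-second recording yields the clips (0, 150) and (150, 300).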
According to the fish-swarm behavior recognition method provided by the invention, the target image feature, the target audio feature and the target water quality feature are input into a multi-modal fish-swarm behavior recognition model, and the target fish-swarm behavior corresponding to the target video is obtained and output by the multi-modal fish-swarm behavior recognition model, comprising:
inputting the target image features and the target audio features to a first submodule in the multi-mode fish school behavior recognition model, and obtaining a first video fusion feature output by the first submodule; inputting the target image features and the target audio features to a second submodule in the multi-mode fish school behavior recognition model, and obtaining second video fusion features output by the second submodule;
performing feature fusion on the first video fusion feature and the second video fusion feature according to preset weights to obtain target fusion features;
inputting the query embedding feature and the target fusion feature to a query decoder of the first submodule, and acquiring the target fish swarm behavior corresponding to the target video output by the query decoder;
the query embedding feature is generated by embedding a target water quality feature into the target fusion feature.
According to the fish school behavior recognition method provided by the invention, the steps of inputting the target image features and the target audio features into the first submodule in the multi-mode fish school behavior recognition model, and obtaining the first video fusion features output by the first submodule include:
inputting the target image features and the target audio features to a feature enhancement layer of the first sub-module, and acquiring the image enhancement features and the audio enhancement features output by the feature enhancement layer;
and inputting the image enhancement features and the audio enhancement features to a bottleneck attention layer of the first sub-module, and obtaining a first video fusion feature output by the bottleneck attention layer.
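The bottleneck-attention idea above can be sketched as a toy numpy example: cross-modal information must pass through a small set of shared bottleneck tokens. Learned projections and multi-head structure are omitted, and the token counts and dimensions are assumptions, not the patent's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Single-head scaled dot-product attention with no learned weights."""
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

def bottleneck_fuse(img_tokens, aud_tokens, n_bottleneck=4, seed=0):
    """Fuse two modalities through shared bottleneck tokens, so that
    cross-modal information can only flow through the small bottleneck."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((n_bottleneck, img_tokens.shape[-1]))
    b = attend(b, img_tokens)     # bottleneck collects visual context
    b = attend(b, aud_tokens)     # ...then audio context
    return attend(img_tokens, b)  # tokens read back the fused bottleneck
```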
According to the fish swarm behavior recognition method provided by the invention, inputting the target image features and the target audio features to the second submodule in the multi-modal fish swarm behavior recognition model, and obtaining the second video fusion feature output by the second submodule, includes:
inputting the target image features and the target audio features to a multi-level modular co-attention layer in the second sub-module, and obtaining the second video fusion feature output by the multi-level modular co-attention layer;
the multi-level modular co-attention layer is formed by connecting the modular co-attention layer of each level in series;
the modular co-attention layer of each level is composed of a self-attention unit and a guided-attention unit.
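The composition of each level (a self-attention unit followed by a guided, or "leading", attention unit), with the levels connected in series, can be sketched as a weight-free toy example; the structure shown, not the exact parameterization, is the point:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Single-head scaled dot-product attention (illustration only)."""
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

def mca_layer(x, y):
    """One modular co-attention level: a self-attention unit on x, then a
    guided-attention unit where x attends to the other modality y."""
    x = attend(x, x)     # self-attention unit
    return attend(x, y)  # guided-attention unit

def mca_stack(img_tokens, aud_tokens, levels=3):
    """Levels in series: each level refines the image features under
    guidance from the audio features (number of levels is an assumption)."""
    x = img_tokens
    for _ in range(levels):
        x = mca_layer(x, aud_tokens)
    return x
```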
According to the fish swarm behavior recognition method provided by the invention, the feature fusion is carried out on the first video fusion feature and the second video fusion feature according to the preset weight, and the target fusion feature is obtained, which comprises the following steps:
determining a first weight characteristic according to the first weight parameter and the first video fusion characteristic;
determining a second weight characteristic according to the second weight parameter and the second video fusion characteristic;
determining the target fusion feature according to the first weight feature and the second weight feature;
the preset weight comprises the first weight parameter and the second weight parameter;
the first weight parameter and the second weight parameter are determined according to the influence degree of the first sub-module and the second sub-module on the fish swarm behavior recognition result.
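The weighted fusion described above amounts to a weighted sum of the two sub-module outputs. The weight values below are placeholders; per the text they would be determined by each sub-module's influence on the recognition result:

```python
import numpy as np

def weighted_fusion(f1, f2, w1=0.6, w2=0.4):
    """Target fusion feature = w1 * first video fusion feature
                             + w2 * second video fusion feature.

    w1 and w2 are the preset weights; the values here are illustrative.
    """
    return w1 * f1 + w2 * f2
```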
In a second aspect, there is provided a fish school behavior recognition apparatus, comprising:
a first acquisition unit, configured to acquire target image features, target audio features and target water quality features of a target video;
a second acquisition unit, configured to input the target image features, the target audio features and the target water quality features into a multi-modal fish swarm behavior recognition model, and to obtain the target fish swarm behavior corresponding to the target video output by the multi-modal fish swarm behavior recognition model;
the multi-modal fish swarm behavior recognition model is obtained by training on the sample image features, sample audio features and sample water quality features of each sample video, with the sample fish swarm behavior of each sample video as the label.
In a third aspect, there is provided a fish school behavior recognition system, comprising:
the video acquisition equipment is used for acquiring an original fish school video;
the water quality acquisition equipment is used for acquiring water quality data;
the illuminance transmitter is used for acquiring illumination intensity;
the light source is used for supplementing light for the video acquisition equipment;
the system further comprises the fish swarm behavior recognition apparatus described above.
In a fourth aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method for identifying fish school behavior as described in any one of the above when executing the program.
In a fifth aspect, the invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method of fish school behavior identification as described in any of the above.
According to the fish swarm behavior recognition method, device, system, equipment and medium, the target image features, target audio features and target water quality features of the target video are obtained through feature extraction and input into the multi-modal fish swarm behavior recognition model, and the target fish swarm behavior corresponding to the target video is obtained. By fusing the features of image, audio and water quality data, the anti-interference capability of feeding behavior identification is improved and the feeding state of the fish swarm is accurately identified.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a fish school behavior recognition method according to the present invention;
FIG. 2 is a schematic flow chart of acquiring target image characteristics, target audio characteristics and target water quality characteristics of a target video according to the present invention;
fig. 3 is a schematic flow chart of acquiring the target fish swarm behavior corresponding to a target video according to the present invention;
fig. 4 is a schematic flow chart of acquiring a first video fusion feature according to the present invention;
FIG. 5 is a schematic flow chart of the method for acquiring the target fusion feature;
FIG. 6 is a second flow chart of the fish school behavior recognition method according to the present invention;
FIG. 7 is a third flow chart of the fish school behavior recognition method according to the present invention;
FIG. 8 is a schematic structural view of the multi-level modular co-attention layer provided by the present invention;
FIG. 9 is a schematic diagram of a fish school behavior recognition system according to the present invention;
FIG. 10 is a schematic diagram of a connection structure of an illuminance transmitter provided by the present invention;
fig. 11 is a schematic structural diagram of a fish school behavior recognition device provided by the invention;
fig. 12 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Bait is one of the most important variable costs in aquaculture, accounting for more than 50% of the total cost, so bait feeding is critical; overfeeding and underfeeding are common problems in the feeding link of the aquaculture process. Real-time analysis and monitoring of changes in the feeding behavior of fish swarms in the aquaculture water body is an important basis for formulating a scientific bait casting strategy, and can effectively reduce bait waste and avoid water pollution.
Machine vision, combined with specific image preprocessing and enhancement algorithms, is widely applied in fields such as image classification and target recognition owing to its wide applicability and reliable data acquisition. By acquiring pictures of fish swarm feeding, the feeding behavior can be judged through image processing, feature extraction and quantification, and a feeding decision can be made. However, machine-vision-based methods are generally limited to culture environments with clear water and good illumination; they place high demands on the environment and are affected by factors such as water turbidity and water surface reflection. The sound signals of fish feeding are strong and change markedly, so they can provide a basic basis for feeding research; however, acoustic techniques are easily disturbed by aerator and artificial noise, which limits their application in actual production. Sensors for temperature, dissolved oxygen and pH can acquire the water quality parameters of the culture water body and provide necessary information for precise feeding, since changes in water quality directly influence the appetite of fish; however, water quality changes over a long time scale while fish swarm feeding is a short process, so it is very difficult to identify feeding behavior from water quality alone.
In summary, image, audio and water quality parameters can each provide a reference for fish swarm feeding behavior, but no single source of information can identify the feeding behavior accurately, resulting in wasted bait and polluted water.
Fig. 1 is a schematic flow chart of the fish swarm behavior recognition method provided by the present invention. The method includes:
101, acquiring target image features, target audio features and target water quality features of a target video;
102, inputting the target image features, the target audio features and the target water quality features into a multi-modal fish swarm behavior recognition model, and obtaining the target fish swarm behavior corresponding to the target video output by the multi-modal fish swarm behavior recognition model;
The multi-mode fish school behavior recognition model is determined by training with the sample fish school behavior of each sample video according to the sample image characteristics, the sample audio characteristics and the sample water quality characteristics of each sample video.
In step 101, the target video may be a video stream acquired in real time in the underwater environment, cut to a uniform duration so that each determined video is rich in information produced by the feeding state of the fish swarm. Visually, fish swarms cause changes in the water body through constant swimming and feeding; acoustically, feeding produces distinctive sounds such as splashing at the water surface and collisions between fish; and feeding also produces small changes in environmental factors such as the dissolved oxygen value, pH value and temperature of the water body. These environmental changes can likewise serve as features of feeding behavior, so comprehensively using all of these kinds of information allows the feeding behavior of the fish swarm to be reflected, and thus identified, comprehensively and accurately.
Specifically, the target image features describe only the dynamic image content of the target video, with no audio component; the target audio features describe only the sound of the underwater environment, which may include the feeding sounds of the fish swarm, the working noise of the oxygenator, and splashing and collision sounds produced during feeding; and the target water quality features mainly include the dissolved oxygen value, pH value and temperature during the time period of the target video.
In step 102, the multi-modal fish swarm behavior recognition model is obtained by training on the sample image features, sample audio features and sample water quality features of each sample video, labeled with the sample fish swarm behavior of each sample video. After the target image features, target audio features and target water quality features are input into the model, the target fish swarm behavior corresponding to the target video is obtained as the model output; a bait casting strategy can then be determined according to the target fish swarm behavior, effectively reducing bait waste and avoiding water pollution.
The fish swarm feeding behavior recognition model can be built in Python on the PyTorch deep learning framework. The batch size is set to 32, the number of training epochs to 1000, and the learning rate to 0.001; network parameters are optimized with the Adam optimizer, and the weight decay is set to 0.0001 to prevent overfitting. During training, the extracted image features, audio features and water quality features serve as the inputs of the model.
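The hyper-parameters stated above can be collected in a configuration fragment. The patent gives only the values, not the training code, so this is a plain record of the stated settings:

```python
# Training hyper-parameters as stated in the text; the surrounding
# training-loop code is not given in the patent.
TRAIN_CFG = {
    "framework": "PyTorch",
    "batch_size": 32,
    "epochs": 1000,
    "learning_rate": 1e-3,
    "optimizer": "Adam",
    "weight_decay": 1e-4,  # regularization to prevent overfitting
}
```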
The invention discloses a high-precision multi-modal-fusion fish swarm feeding behavior recognition algorithm (Adaptive DMCA-UMT), which improves upon the Unified Multimodal Transformer (UMT) model and markedly raises recognition accuracy.
Compared with prior methods that use only a single video or audio stream, the invention can correlate multiple data types so that their features complement one another; even when the features of some data type are not distinctive enough, good results can still be obtained. The invention therefore offers better recognition performance than single-modality fish swarm behavior recognition.
The invention also uses a waterproof camera device and a water quality probe, under the control of an operation processor, to collect the video stream, audio stream and water quality data of fish swarm feeding. A light source supplements the light for the waterproof camera device: when the underwater light measured by the illuminance transmitter is insufficient, the light source fills in, and the operation processor then identifies the fish swarm feeding behavior with the trained model. The invention can be applied effectively in the aquaculture environment, providing a reliable and accurate technical means for studying and monitoring fish swarm feeding behavior.
According to the fish swarm behavior recognition method, device, system, equipment and medium, the target image features, target audio features and target water quality features of the target video are obtained through feature extraction and input into the multi-modal fish swarm behavior recognition model, and the target fish swarm behavior corresponding to the target video is obtained.
Fig. 2 is a schematic flow chart of acquiring the target image features, target audio features and target water quality features of a target video according to the present invention; the acquisition includes:
1011, cutting the original fish swarm video according to a preset duration to obtain all target videos;
1012, for each target video, extracting image features of the target video based on a dual-stream model and on a video encoder respectively to obtain a first image feature and a second image feature, and splicing the two to obtain the target image feature of the target video;
1013, performing audio feature extraction on the target video based on a pre-trained audio neural network model to obtain the target audio features of the target video;
1014, performing text feature extraction on the water quality data corresponding to the target video based on a text encoder to obtain the target water quality features of the target video;
the water quality data includes a pH value, a dissolved oxygen value, and a temperature.
In step 1011, a waterproof camera device can be used to capture video of fish swarm feeding behavior in the underwater target area, determining the original fish swarm video. Optionally, the original fish swarm video is first preprocessed and uniformly cut into videos of a preset duration, for example 150 seconds; each 150-second video is then a target video. The cutting operation is performed over the whole original fish swarm video so as to obtain all target videos.
In step 1012, for each target video, image features are extracted based on the dual-stream model SlowFast and based on the video encoder, obtaining a first image feature and a second image feature. As an optional embodiment, with the target video set to 150 seconds, the image features corresponding to the dual-stream model SlowFast and to the video encoder are extracted every 2 seconds, and the two features are normalized and spliced into one video vector. All time steps are then traversed, all video vectors are determined, and all video vectors are spliced to obtain the target image feature of the target video.
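The per-step normalize-and-splice operation in step 1012 can be sketched as follows. The feature widths and the use of L2 normalization are assumptions (the text says only "normalization"), not the patent's exact implementation:

```python
import numpy as np

def build_target_image_feature(slowfast_steps, encoder_steps):
    """Normalize each backbone's per-2-second feature, splice the pair
    into one video vector per step, then concatenate all steps into the
    clip-level target image feature.

    slowfast_steps / encoder_steps: lists of 1-D arrays, one per step.
    """
    def l2norm(v):
        return v / (np.linalg.norm(v) + 1e-8)
    per_step = [np.concatenate([l2norm(a), l2norm(b)])
                for a, b in zip(slowfast_steps, encoder_steps)]
    return np.concatenate(per_step)
```

A 150-second clip sampled every 2 seconds yields 75 steps, so with step widths d1 and d2 the result has length 75 * (d1 + d2).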
In step 1013, optionally, the present invention performs audio feature extraction on the target video based on the pre-trained audio neural network model (Pretrained Audio Neural Networks, PANN) to obtain an audio vector of the target video, where the audio vector is the target audio feature.
In step 1014, optionally, the text encoder of the contrastive language-image pre-training model (Contrastive Language-Image Pre-Training, CLIP) performs text feature extraction on the water quality data corresponding to the target video to obtain the target water quality feature of the target video, where the water quality data includes a pH value, a dissolved oxygen value and a temperature. Those skilled in the art understand that, while the target video is obtained, the water quality data in the time period of the target video may be collected with a temperature sensor, a dissolved oxygen sensor and a pH measurement device, so that the target water quality feature of the target video can be extracted.
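Before a CLIP-style text encoder can consume the three readings, they must be rendered as text. Only that prompt construction is sketched here; passing the rendered string through an actual CLIP text encoder (for example via the open_clip or transformers libraries) is left out, and the prompt wording, units and one-decimal formatting are assumptions not taken from the patent:

```python
def water_quality_prompt(ph, dissolved_oxygen, temperature):
    """Render the three water-quality readings as a sentence suitable for a
    CLIP-style text encoder. Wording and units are illustrative assumptions."""
    return (f"water quality: pH {ph:.1f}, "
            f"dissolved oxygen {dissolved_oxygen:.1f} mg/L, "
            f"temperature {temperature:.1f} C")
```

The resulting string would then be tokenized and encoded into the target water quality feature vector.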
Fig. 3 is a schematic flow chart of obtaining the target fish school behavior corresponding to a target video, where the inputting of the target image feature, the target audio feature and the target water quality feature into the multi-modal fish school behavior recognition model to obtain the target fish school behavior corresponding to the target video output by the model includes:
The query embedding feature is generated by embedding a target water quality feature into the target fusion feature.
In step 1021, the first sub-module may be a Unified Multimodal Transformer (UMT). Since the UMT multi-modal model performs strongly in video recognition, with recognition accuracy exceeding that of a single modality, the target image feature and the target audio feature are input to the first sub-module of the multi-modal fish school behavior recognition model, and the first video-audio fusion feature output by the first sub-module is obtained.
The second sub-module comprises a multi-level basic component modularized common attention layer (DMCA). To better fuse the image features and audio features, the DMCA module is introduced in the second sub-module to further refine the video-audio joint feature: the target image feature and the target audio feature are input to the second sub-module of the multi-modal fish school behavior recognition model, and the second video-audio fusion feature output by the second sub-module is obtained.
In step 1022, in order to improve the accuracy of fish feeding behavior recognition, the image features and audio features are further fused: DMCA is introduced on the basis of the UMT model, giving the model DMCA-UMT. However, so that the model can weigh important against unimportant data in the two fused modal streams, increasing the weight of important data and decreasing that of unimportant data, adaptive weights need to be added to the two modal fusions respectively; this improves the anti-interference capability and recognition accuracy of the model for fish feeding behavior. That is, feature fusion is performed on the first video-audio fusion feature and the second video-audio fusion feature according to preset video-audio weights to obtain the target fusion feature.
In step 1023, the target water quality feature is first embedded into the target fusion feature to generate a query embedding feature, and then the query embedding feature and the target fusion feature are input into the query decoder of the first sub-module to obtain the target fish school behavior corresponding to the target video output by the query decoder.
The invention takes the target water quality feature and the target fusion feature as input. A Query Generator computes attention weights between video clips and the query text, determines whether each video clip contains the information described by the text, and predicts a query embedding. Further, a Query Decoder takes the video-audio joint feature and the text-guided moment query as input; that is, the query embedding feature and the target fusion feature are input to the query decoder and decoded, and a prediction head produces the final joint moment retrieval. Moment retrieval is defined as a keypoint detection problem: each moment can be represented by a temporal center and a duration window. The center point can be estimated by predicting a temporal heatmap and extracting its local maxima; the window can then be regressed from the features at the center.
For each real moment retrieval target, its center is $c$ and its window is $d$. The center point is quantized to $\tilde{c} = \lfloor c \rfloor$, and a one-dimensional Gaussian kernel is used to fill the heatmap $H_{t} = \exp\!\left(-\frac{(t-\tilde{c})^{2}}{2\sigma^{2}}\right)$, where $t$ is the time coordinate and $\sigma$ is the window-adaptive standard deviation.
The present invention uses a Gaussian focal loss function to optimize the center point prediction:
$$L_{center} = -\frac{1}{N}\sum_{t}\begin{cases}(1-\hat{H}_{t})^{\alpha}\log\hat{H}_{t}, & H_{t}=1\\ (1-H_{t})^{\beta}\hat{H}_{t}^{\alpha}\log(1-\hat{H}_{t}), & \text{otherwise}\end{cases}\tag{1}$$
In formula (1), $\hat{H}_{t}$ is the predicted heatmap, $N$ is the number of ground-truth center points, and $\alpha$ and $\beta$ denote the weight and penalty exponents, set in practice to 2.0 and 4.0. An L1 loss is optimized for window and offset regression:
$$L_{window} = \left|d - \hat{d}\right|\tag{2}$$
$$L_{offset} = \left|o - \hat{o}\right|\tag{3}$$
In formula (2), $d$ is the window ground-truth value and $\hat{d}$ the window prediction; in formula (3), $o$ is the ground-truth offset and $\hat{o}$ the predicted offset.
The total training loss is a weighted sum of all the losses described above:
$$L = \lambda_{1}L_{center} + \lambda_{2}L_{window} + \lambda_{3}L_{offset}\tag{4}$$
where the coefficients $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ balance the three terms.
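The heatmap construction, the losses of formulas (1)-(4), and the local-maximum decoding described above can be sketched numerically as follows. The CenterNet-style penalty-reduced focal form matches the text's description; the peak threshold and the equal default loss weights are illustrative assumptions:

```python
import numpy as np

def gaussian_heatmap(length, center, sigma):
    """Fill a 1-D heatmap with a Gaussian kernel around the quantized center."""
    t = np.arange(length, dtype=np.float64)
    return np.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))

def gaussian_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced Gaussian focal loss over the temporal heatmap (Eq. 1)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = target == 1.0
    pos_loss = -((1.0 - pred) ** alpha) * np.log(pred) * pos
    neg_loss = -((1.0 - target) ** beta) * (pred ** alpha) * np.log(1.0 - pred) * ~pos
    return (pos_loss.sum() + neg_loss.sum()) / max(pos.sum(), 1)

def l1_loss(truth, pred):
    """L1 regression loss used for the window (Eq. 2) and offset (Eq. 3)."""
    return np.abs(np.asarray(truth) - np.asarray(pred)).mean()

def total_loss(l_center, l_window, l_offset, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses (Eq. 4); equal weights are an assumption."""
    return sum(w * l for w, l in zip(lambdas, (l_center, l_window, l_offset)))

def decode_centers(heatmap, threshold=0.3):
    """Estimate moment centers as thresholded local maxima of the heatmap."""
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    is_peak = (padded[1:-1] >= padded[:-2]) & (padded[1:-1] >= padded[2:])
    return np.flatnonzero(is_peak & (heatmap > threshold))
```

A predicted heatmap identical to the ground truth yields a much smaller focal loss than an uninformative uniform prediction, which is the behavior the optimization relies on.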
Fig. 4 is a schematic flow chart of acquiring a first video and audio fusion feature provided by the present invention, where the inputting the target image feature and the target audio feature to a first sub-module in the multi-mode fish school behavior recognition model acquires the first video and audio fusion feature output by the first sub-module, and includes:
In step 10211, the target image features and target audio features are input to the feature enhancement layer of the first sub-module. The feature enhancement layer comprises a uni-modal encoder for enhancing the global-context features of each modality; it is formed by stacking several encoding layers, each composed of multi-head self-attention and a feed-forward network. In each attention head, for an input $X$ of the image-feature or audio-feature modality, the self-attention is computed as:
$$X' = \mathrm{softmax}\!\left(\frac{(XW_{Q})(XW_{K})^{\top}}{\sqrt{d}}\right)(XW_{V})\,W_{O}\tag{5}$$
In formula (5), $X$ represents the input feature and $X'$ the output feature; $W_{Q}$, $W_{K}$, $W_{V}$ and $W_{O}$ represent the query, key, value and output-matrix linear transformation weights, respectively.
In step 10212, the present invention employs multi-modal learning to achieve holistic feature capture, so after step 10211 a cross-modal encoder captures cross-modal global dependencies. Optionally, the cross-modal encoder is a bottleneck attention layer (Attention Bottlenecks), which can be divided into two stages: feature compression and feature expansion. In the invention there are only two modalities, image features and audio features, and the feature compression process can be expressed as:
$$B' = \mathrm{MHA}\!\left(B,\,[X_{v};X_{a}]\right)\tag{6}$$
In formula (6), $B$ and $B'$ denote the input and output of the bottleneck attention layer, and $X_{v}$ and $X_{a}$ are the image and audio features; the purpose of feature compression is to refine and compress the multi-modal information into the bottleneck attention layer.
After compressing the multi-modal information, the compressed bottleneck feature is further expanded: the invention uses a multi-head attention to transmit the compressed feature back to the image-feature and audio-feature modalities, and the feature expansion process can be expressed as:
$$X_{v}' = \mathrm{MHA}(X_{v}, B')\tag{7}$$
$$X_{a}' = \mathrm{MHA}(X_{a}, B')\tag{8}$$
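The compression-then-expansion flow can be illustrated with a toy single-head attention without learned projections; a real bottleneck attention layer uses full multi-head attention with projection matrices, so this is a structural sketch under that simplifying assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context, dim):
    """Single-head scaled dot-product attention; the learned projection
    matrices of a real multi-head layer are omitted for brevity."""
    weights = softmax(queries @ context.T / np.sqrt(dim))
    return weights @ context

def bottleneck_fusion(x_img, x_aud, bottleneck, dim):
    # Compression: bottleneck tokens attend over both modalities (cf. Eq. 6)
    b = attend(bottleneck, np.concatenate([x_img, x_aud], axis=0), dim)
    # Expansion: each modality attends back to the compressed tokens (cf. Eqs. 7-8)
    return attend(x_img, b, dim), attend(x_aud, b, dim)
```

Because all cross-modal exchange must pass through the few bottleneck tokens, the layer forces the model to distill shared video-audio information before redistributing it.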
optionally, the inputting the target image feature and the target audio feature to the second sub-module in the multi-mode fish school behavior recognition model, and obtaining the second video fusion feature output by the second sub-module includes:
inputting the target image features and the target audio features to a multi-level basic component modularized common attention layer in the second sub-module, and obtaining a second video fusion feature output by the multi-level basic component modularized common attention layer;
the multi-level basic component modularization common attention layer is formed by connecting basic component modularization common attention layers of each level in series;
the basic component modularized common attention layer of each hierarchy is formed by a self-attention unit and a guided attention unit.
Those skilled in the art understand that the basic component modularized common attention layer (MCA) is composed of two attention units: a Self-Attention unit (SA) for intra-modal interaction, and a Guided-Attention unit (GA) for inter-modal interaction. The design principle of both the self-attention unit and the guided-attention unit comes from scaled dot-product attention.
Further, let the input of the scaled dot-product attention be a query $Q$, a key $K$ and a value $V$, all of the same dimension $d$. The dot product of $Q$ and $K$ is computed, divided by $\sqrt{d}$, and passed through a softmax function to obtain the attention weights; the feature $F$ obtained as the weighted sum of $V$ can be expressed as:
$$F = \mathrm{Att}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V\tag{9}$$
In order to further improve the representational capacity of the attended features, the invention introduces multi-head attention, which consists of $h$ parallel heads, each corresponding to an independent scaled dot-product attention function. The output feature $F$ is expressed as:
$$F = \mathrm{MHA}(Q,K,V) = [\mathrm{head}_{1};\dots;\mathrm{head}_{h}]\,W_{O},\qquad \mathrm{head}_{i} = \mathrm{Att}\!\left(QW_{i}^{Q},\,KW_{i}^{K},\,VW_{i}^{V}\right)\tag{10}$$
In formula (10), $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V} \in \mathbb{R}^{d\times d_{h}}$ are the projection matrices of the $i$-th head and $d_{h}$ is the output dimension of each head; to prevent the multi-head attention model from becoming too large, $d_{h} = d/h$ is usually set.
Fig. 8 is a schematic structural diagram of the multi-level basic component modularized common attention layer provided by the present invention, which builds two attention units, a self-attention unit (SA) and a guided-attention unit (GA), on the basis of multi-head attention. The self-attention unit consists of a multi-head attention layer and a feed-forward layer, with residual connection and layer normalization applied to the output of both layers; the guided-attention unit models the inter-modal relation between the inputs X and Y. On this basis, the self-attention unit and the guided-attention unit are combined in a modular fashion to obtain the basic component modularized common attention layer (MCA), and finally several MCA layers are connected in series to form the multi-level basic component modularized common attention layer (DMCA).
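The SA/GA units and their serial stacking can be sketched as follows. The feed-forward layers, residual connections and layer normalization from Fig. 8 are omitted, and single-head attention without learned projections is assumed, so this shows only the connectivity pattern rather than the full block:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def att(q, k, v, dim):
    """Scaled dot-product attention (Eq. 9), single head, no projections."""
    return softmax(q @ k.T / np.sqrt(dim)) @ v

def mca_layer(x, y, dim):
    """One MCA block: a self-attention (SA) unit for intra-modal interaction,
    then a guided-attention (GA) unit in which y guides attention over x."""
    x = att(x, x, x, dim)   # SA unit
    x = att(x, y, y, dim)   # GA unit
    return x

def dmca(x, y, dim, depth=3):
    """DMCA: several MCA layers connected in series."""
    for _ in range(depth):
        x = mca_layer(x, y, dim)
    return x
```

Here x would carry one modality (e.g. image features) while y supplies the guiding modality (e.g. audio features); a symmetric stack with the roles swapped yields the other direction of cross-modal interaction.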
Fig. 5 is a schematic flow chart of obtaining a target fusion feature according to the present invention, wherein the feature fusion is performed on the first audio-video fusion feature and the second audio-video fusion feature according to a preset weight, so as to obtain the target fusion feature, which includes:
the preset weight comprises the first weight parameter and the second weight parameter;
the first weight parameter and the second weight parameter are determined according to the influence degree of the first sub-module and the second sub-module on the fish swarm behavior recognition result.
In step 10221, a first weight feature is determined according to the product of the first weight parameter and the first video fusion feature.
In step 10222, a second weighting characteristic is determined according to the product of the second weighting parameter and the second video fusion characteristic.
In step 10223, the target fusion feature is determined from the sum of the first weight feature and the second weight feature, referring to the following formula:
$$q = w_{1}f_{1} + w_{2}f_{2}\tag{11}$$
In formula (11), $q$ is the target fusion feature, $w_{1}$ the first weight parameter, $w_{2}$ the second weight parameter, $f_{1}$ the first video-audio fusion feature and $f_{2}$ the second video-audio fusion feature.
The invention automatically adjusts the fusion proportion through adaptive weights. The sum of the first weight parameter and the second weight parameter is 1, and both parameters change automatically with model training and optimizer adjustment, so that the proportion of data with great influence on the result is increased and the proportion of data with little influence is decreased. The output features of the multi-level basic component modularized common attention layer and of the cross-modal encoder are thus feature-fused to obtain the target fusion feature.
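The constrained weighted sum of formula (11) can be sketched as follows. Deriving the weight pair from a sigmoid of one trainable logit keeps $w_{1} + w_{2} = 1$ satisfied throughout optimization; this particular parameterization is an illustrative assumption, not prescribed by the text:

```python
import numpy as np

def adaptive_fuse(f1, f2, logit):
    """Adaptive fusion q = w1*f1 + w2*f2 with w1 + w2 = 1 (Eq. 11).
    w1 is the sigmoid of a single trainable logit, so the optimizer can
    freely shift weight between the two video-audio fusion features."""
    w1 = 1.0 / (1.0 + np.exp(-logit))
    return w1 * f1 + (1.0 - w1) * f2
```

A logit of 0 gives an even 0.5/0.5 split; driving the logit up or down lets training emphasize whichever sub-module's features influence the recognition result more.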
FIG. 6 is a second schematic flow chart of the fish school behavior recognition method according to the present invention. Optionally, the multi-modal fish school behavior recognition model is determined by training on the sample image features, sample audio features and sample water quality features of each sample video together with the sample fish school behavior of each sample video. The invention sets the waterproof camera 15 cm below the water surface of the farm and places it at the side of the glass tank in order to obtain a larger field of view of fish school movement; the original fish school feeding videos shot by the waterproof camera are processed to obtain the image and audio data. An electrochemical water quality probe provided by a fully automatic Internet-of-Things recirculating aquaculture system is used to collect the changes of the water quality data over the time period corresponding to each feeding video, determining three kinds of water quality data: pH value, dissolved oxygen value and temperature. The obtained image, audio and water quality data are preprocessed, and data features are extracted from them as the sample image features, sample audio features and sample water quality features of each sample video. The sample fish school behavior corresponding to each sample video is labeled, and the preprocessed data are divided into a training set, a validation set and a test set: 60% of the total data is randomly taken as the training set, 20% as the validation set, and the remaining 20% as the test set. Combining the complex feeding behaviors of fish schools in a recirculating culture pond, a multi-information-fusion fish school feeding behavior recognition algorithm is proposed, and the loss function is optimized with this algorithm.
Initial network parameters are set, the training set and validation set are taken as algorithm input, and the model is trained to generate a trained algorithm model; finally, the trained model is used to recognize and detect the fish school feeding behavior.
Fig. 7 is a third schematic flow chart of the fish school behavior recognition method provided by the invention. Fig. 7 shows that after feature extraction by the video feature extraction module, the audio feature extraction module and the water quality feature extraction module, the image features, audio features and text-represented water quality features are obtained. The image features and audio features are processed by the uni-modal encoder, and the first video-audio fusion feature is output through the cross-modal encoder; the second video-audio fusion feature is output by the multi-level basic component modularized common attention layer. A first weight feature is determined from the product of the first weight parameter $w_{1}$ and the first video-audio fusion feature; a second weight feature is determined from the product of the second weight parameter $w_{2}$ and the second video-audio fusion feature; and the target fusion feature is determined from the sum of the first weight feature and the second weight feature. The target fusion feature and the text-represented water quality feature then pass through the query generator and the query decoder, finally obtaining the target fish school behavior corresponding to the target video output by the multi-modal fish school behavior recognition model.
Fig. 9 is a schematic structural diagram of a fish school behavior recognition system provided by the present invention, and the present invention discloses a fish school behavior recognition system, including:
The video acquisition equipment is used for acquiring an original fish school video;
the water quality acquisition equipment is used for acquiring water quality data;
the illumination transmitter is used for acquiring illumination intensity;
the light source is used for supplementing light for the video acquisition equipment;
the fish school behavior recognition device is also included.
The invention also provides a fish school behavior recognition system, which is a device for recognizing fish school feeding behavior, comprising: a video acquisition device, a water quality acquisition device, a light source, an illuminance transmitter, a memory, a processor, a fish school behavior recognition device, and a fish school feeding behavior recognition program stored on the memory and runnable on the processor; the fish school behavior recognition device is connected to the waterproof camera, the water quality probe, the light source and the illuminance transmitter, respectively.
The video acquisition device can acquire the fish school feeding video in real time under the control of the fish school behavior recognition device and extract the feeding video into a video stream and an audio stream. The water quality acquisition device is a water quality probe, connected to the computer through a communication interface to transmit water quality data to the computer. The light source supplements light for the waterproof camera; the illuminance transmitter senses the ambient light intensity and transmits the light intensity information to the fish school behavior recognition device, which controls the light source switch and the illumination intensity according to that information. The computer sends the video, audio and water quality data to the fish school behavior recognition device, which can perform fish school feeding behavior recognition according to the trained model.
Fig. 10 is a schematic diagram of the connection structure of the illuminance transmitter provided by the invention. The illuminance transmitter comprises an illuminance sensor, a microcontroller and a communication interface; the microcontroller is connected to the illuminance sensor and the communication interface respectively, controls the illuminance sensor to collect data, and transmits the data collected by the illuminance sensor through the communication interface to the processor, which here is the fish school behavior recognition device.
Fig. 11 is a schematic structural diagram of a fish school behavior recognition device according to the present invention. The device includes a first obtaining unit 1 for obtaining the target image feature, the target audio feature and the target water quality feature of the target video; the working principle of the first obtaining unit 1 may refer to the foregoing step 101 and is not repeated herein.
The fish school behavior recognition device further comprises a second acquisition unit 2: the second obtaining unit 2 is configured to obtain the target shoal behavior corresponding to the target video, and the working principle of the second obtaining unit 2 may refer to the foregoing step 102, which is not repeated herein.
The multi-mode fish school behavior recognition model is determined by training with the sample fish school behavior of each sample video according to the sample image characteristics, the sample audio characteristics and the sample water quality characteristics of each sample video.
Fig. 12 is a schematic structural diagram of an electronic device provided by the present invention. As shown in fig. 12, the electronic device may include: processor 110, communication interface (Communications Interface) 120, memory 130, and communication bus 140, wherein processor 110, communication interface 120, memory 130 communicate with each other via communication bus 140. The processor 110 may invoke logic instructions in the memory 130 to perform a fish school behavior recognition method comprising: acquiring target image characteristics, target audio characteristics and target water quality characteristics of a target video; inputting the target image characteristics, the target audio characteristics and the target water quality characteristics into a multi-modal fish swarm behavior recognition model, and obtaining target fish swarm behaviors corresponding to the target video, wherein the target fish swarm behaviors are output by the multi-modal fish swarm behavior recognition model; the multi-mode fish school behavior recognition model is determined by training with the sample fish school behavior of each sample video according to the sample image characteristics, the sample audio characteristics and the sample water quality characteristics of each sample video.
In addition, the logic instructions in the memory 130 may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing a method of fish school behavior recognition provided by the methods described above, the method comprising: acquiring target image characteristics, target audio characteristics and target water quality characteristics of a target video; inputting the target image characteristics, the target audio characteristics and the target water quality characteristics into a multi-modal fish swarm behavior recognition model, and obtaining target fish swarm behaviors corresponding to the target video, wherein the target fish swarm behaviors are output by the multi-modal fish swarm behavior recognition model; the multi-mode fish school behavior recognition model is determined by training with the sample fish school behavior of each sample video according to the sample image characteristics, the sample audio characteristics and the sample water quality characteristics of each sample video.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor is implemented to perform the method of fish school behavior identification provided by the above methods, the method comprising: acquiring target image characteristics, target audio characteristics and target water quality characteristics of a target video; inputting the target image characteristics, the target audio characteristics and the target water quality characteristics into a multi-modal fish swarm behavior recognition model, and obtaining target fish swarm behaviors corresponding to the target video, wherein the target fish swarm behaviors are output by the multi-modal fish swarm behavior recognition model; the multi-mode fish school behavior recognition model is determined by training with the sample fish school behavior of each sample video according to the sample image characteristics, the sample audio characteristics and the sample water quality characteristics of each sample video.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, being located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for identifying fish school behaviors, comprising:
acquiring target image characteristics, target audio characteristics and target water quality characteristics of a target video;
inputting the target image characteristics, the target audio characteristics and the target water quality characteristics into a multi-modal fish swarm behavior recognition model, and obtaining target fish swarm behaviors corresponding to the target video, wherein the target fish swarm behaviors are output by the multi-modal fish swarm behavior recognition model;
the multi-mode fish school behavior recognition model is determined by training with the sample fish school behavior of each sample video according to the sample image characteristics, the sample audio characteristics and the sample water quality characteristics of each sample video.
2. The fish school behavior recognition method according to claim 1, wherein said acquiring the target image feature, the target audio feature, and the target water quality feature of the target video comprises:
cutting the original fish school video according to a preset duration to obtain all target videos;
for each target video, respectively extracting image features of the target video based on a double-flow model and a video encoder to obtain a first image feature and a second image feature, and splicing the first image feature and the second image feature to obtain target image features of the target video;
performing audio feature extraction on the target video based on a pre-training audio neural network model to obtain target audio features of the target video;
text feature extraction is carried out on water quality data corresponding to a target video based on a text encoder, and target water quality features of the target video are obtained;
the water quality data includes a pH value, a dissolved oxygen value, and a temperature.
3. The method for identifying fish-shoal behaviors according to claim 1, wherein the inputting the target image feature, the target audio feature, and the target water quality feature into the multi-modal fish-shoal behavior identification model, obtaining the target fish-shoal behaviors corresponding to the target video output by the multi-modal fish-shoal behavior identification model, includes:
Inputting the target image features and the target audio features to a first submodule in the multi-mode fish school behavior recognition model, and obtaining a first video fusion feature output by the first submodule; inputting the target image features and the target audio features to a second submodule in the multi-mode fish school behavior recognition model, and obtaining second video fusion features output by the second submodule;
performing feature fusion on the first video fusion feature and the second video fusion feature according to preset weights to obtain target fusion features;
inputting the query embedded feature and the target fusion feature to a query decoder of the first submodule, and acquiring target shoal behaviors corresponding to target videos output by the query decoder;
the query embedding feature is generated by embedding a target water quality feature into the target fusion feature.
4. The fish school behavior recognition method according to claim 3, wherein inputting the target image feature and the target audio feature into the first sub-module of the multi-modal fish school behavior recognition model and obtaining the first audio-visual fusion feature output by the first sub-module comprises:
inputting the target image feature and the target audio feature into a feature enhancement layer of the first sub-module, and obtaining the image enhancement feature and the audio enhancement feature output by the feature enhancement layer;
and inputting the image enhancement feature and the audio enhancement feature into a bottleneck attention layer of the first sub-module, and obtaining the first audio-visual fusion feature output by the bottleneck attention layer.
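A bottleneck attention layer, in the commonly used sense, exchanges information between modalities only through a small set of shared bottleneck tokens. The sketch below is a simplified single-head, unparameterized version of that idea (the patent does not disclose its exact layer, so all shapes and the two-pass update order are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(tokens):
    """Unparameterized single-head self-attention over a token stack."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def bottleneck_fusion(img_tokens, aud_tokens, n_bottleneck=4):
    """Fuse two modalities through a small set of shared bottleneck tokens."""
    d = img_tokens.shape[-1]
    bottleneck = np.zeros((n_bottleneck, d))
    # pass 1: the image stream writes into the bottleneck tokens
    out = self_attend(np.vstack([img_tokens, bottleneck]))
    img_out, bottleneck = out[:len(img_tokens)], out[len(img_tokens):]
    # pass 2: the bottleneck carries image information into the audio stream
    out = self_attend(np.vstack([aud_tokens, bottleneck]))
    aud_out, bottleneck = out[:len(aud_tokens)], out[len(aud_tokens):]
    return np.vstack([img_out, aud_out]), bottleneck

rng = np.random.default_rng(1)
img = rng.standard_normal((6, 8))   # enhanced image tokens
aud = rng.standard_normal((5, 8))   # enhanced audio tokens
fused, bn = bottleneck_fusion(img, aud, n_bottleneck=2)
```

The bottleneck keeps the cross-modal exchange cheap: the two modalities never attend to each other directly, only to the few shared tokens.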
5. The fish school behavior recognition method according to claim 3, wherein inputting the target image feature and the target audio feature into the second sub-module of the multi-modal fish school behavior recognition model and obtaining the second audio-visual fusion feature output by the second sub-module comprises:
inputting the target image feature and the target audio feature into a multi-level modular co-attention layer in the second sub-module, and obtaining the second audio-visual fusion feature output by the multi-level modular co-attention layer;
wherein the multi-level modular co-attention layer is formed by connecting the modular co-attention layer of each level in series;
and the modular co-attention layer of each level consists of a self-attention unit and a guided-attention unit.
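Modular co-attention layers of this kind typically apply self-attention within each modality and then guided attention across modalities, with the levels stacked in series. The sketch below is a bare-bones, unparameterized illustration of that structure (depth, the residual connections, and the guidance direction are assumptions; the patent does not disclose these details):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(x):
    """Self-attention unit: tokens attend within their own modality."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def guided_attend(x, y):
    """Guided-attention unit: x queries attend over guiding features y."""
    scores = x @ y.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ y

def mca_level(img, aud):
    """One modular co-attention level: self-attention, then cross-modal guidance."""
    img, aud = self_attend(img), self_attend(aud)
    img = img + guided_attend(img, aud)   # audio guides image
    aud = aud + guided_attend(aud, img)   # image guides audio
    return img, aud

def stacked_mca(img, aud, depth=3):
    """Levels connected in series; output plays the role of the second fusion feature."""
    for _ in range(depth):
        img, aud = mca_level(img, aud)
    return np.vstack([img, aud])

rng = np.random.default_rng(2)
out = stacked_mca(rng.standard_normal((4, 8)), rng.standard_normal((3, 8)), depth=2)
```

Stacking the levels in series lets each modality refine itself before being re-guided by the other, which is the structural point the claim is making.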
6. The fish school behavior recognition method according to claim 3, wherein performing feature fusion on the first audio-visual fusion feature and the second audio-visual fusion feature according to the preset weights to obtain the target fusion feature comprises:
determining a first weighted feature according to a first weight parameter and the first audio-visual fusion feature;
determining a second weighted feature according to a second weight parameter and the second audio-visual fusion feature;
determining the target fusion feature according to the first weighted feature and the second weighted feature;
wherein the preset weights comprise the first weight parameter and the second weight parameter;
and the first weight parameter and the second weight parameter are determined according to the degree of influence of the first sub-module and the second sub-module on the fish school behavior recognition result.
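Claim 6 reduces to simple arithmetic: scale each sub-module's fusion feature by its preset weight and combine. The weight values below are hypothetical placeholders; the patent only says they reflect each sub-module's influence on the recognition result.

```python
import numpy as np

def weighted_fusion(first_feat, second_feat, w1=0.6, w2=0.4):
    """Target fusion feature = w1 * first + w2 * second.
    w1 and w2 (the preset weights) are hypothetical example values."""
    return w1 * first_feat + w2 * second_feat

f1 = np.ones((3, 4))            # first audio-visual fusion feature
f2 = np.full((3, 4), 2.0)       # second audio-visual fusion feature
target = weighted_fusion(f1, f2)  # every entry: 0.6 * 1.0 + 0.4 * 2.0 = 1.4
```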
7. A fish school behavior recognition device, comprising:
a first acquisition unit, configured to acquire the target image feature, the target audio feature, and the target water quality feature of a target video;
a second acquisition unit, configured to input the target image feature, the target audio feature, and the target water quality feature into a multi-modal fish school behavior recognition model, and obtain the target fish school behavior corresponding to the target video output by the multi-modal fish school behavior recognition model;
wherein the multi-modal fish school behavior recognition model is determined by training on the sample image features, sample audio features, and sample water quality features of each sample video, labeled with the sample fish school behavior of each sample video.
8. A fish school behavior recognition system, comprising:
a video acquisition device for acquiring original fish school video;
a water quality acquisition device for acquiring water quality data;
an illumination sensor for acquiring illumination intensity;
a light source for supplementing light for the video acquisition device;
and the fish school behavior recognition device according to claim 7.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the fish school behavior recognition method of any one of claims 1-6.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the fish school behavior recognition method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310561907.4A CN116311001B (en) | 2023-05-18 | 2023-05-18 | Method, device, system, equipment and medium for identifying fish swarm behavior |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310561907.4A CN116311001B (en) | 2023-05-18 | 2023-05-18 | Method, device, system, equipment and medium for identifying fish swarm behavior |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116311001A true CN116311001A (en) | 2023-06-23 |
CN116311001B CN116311001B (en) | 2023-09-12 |
Family
ID=86781913
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310561907.4A Active CN116311001B (en) | 2023-05-18 | 2023-05-18 | Method, device, system, equipment and medium for identifying fish swarm behavior |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116311001B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116843085A (en) * | 2023-08-29 | 2023-10-03 | 深圳市明心数智科技有限公司 | Freshwater fish growth monitoring method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112634202A (en) * | 2020-12-04 | 2021-04-09 | 浙江省农业科学院 | Method, device and system for detecting behavior of polyculture fish shoal based on YOLOv3-Lite |
CN112883861A (en) * | 2021-02-07 | 2021-06-01 | 同济大学 | Feedback type bait casting control method based on fine-grained classification of fish school feeding state |
CN113537106A (en) * | 2021-07-23 | 2021-10-22 | 仲恺农业工程学院 | Fish feeding behavior identification method based on YOLOv5 |
US20210368748A1 (en) * | 2020-05-28 | 2021-12-02 | X Development Llc | Analysis and sorting in aquaculture |
CN115861906A (en) * | 2023-03-01 | 2023-03-28 | 北京市农林科学院信息技术研究中心 | Fish school feeding intensity identification method, device and system and feeding machine |
CN116052064A (en) * | 2023-04-03 | 2023-05-02 | 北京市农林科学院智能装备技术研究中心 | Method and device for identifying feeding strength of fish shoal, electronic equipment and bait casting machine |
- 2023-05-18: application CN202310561907.4A granted as patent CN116311001B (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210368748A1 (en) * | 2020-05-28 | 2021-12-02 | X Development Llc | Analysis and sorting in aquaculture |
CN112634202A (en) * | 2020-12-04 | 2021-04-09 | 浙江省农业科学院 | Method, device and system for detecting behavior of polyculture fish shoal based on YOLOv3-Lite |
CN112883861A (en) * | 2021-02-07 | 2021-06-01 | 同济大学 | Feedback type bait casting control method based on fine-grained classification of fish school feeding state |
CN113537106A (en) * | 2021-07-23 | 2021-10-22 | 仲恺农业工程学院 | Fish feeding behavior identification method based on YOLOv5 |
CN115861906A (en) * | 2023-03-01 | 2023-03-28 | 北京市农林科学院信息技术研究中心 | Fish school feeding intensity identification method, device and system and feeding machine |
CN116052064A (en) * | 2023-04-03 | 2023-05-02 | 北京市农林科学院智能装备技术研究中心 | Method and device for identifying feeding strength of fish shoal, electronic equipment and bait casting machine |
Non-Patent Citations (7)
Title |
---|
GUO QIANG: "Fish feeding behavior detection method based on shape and texture features", Journal of Shanghai Ocean University, 27(2), pages 181 - 189 *
YANG, XINTING: "Deep learning for smart fish farming: applications, opportunities and challenges", Reviews in Aquaculture, pages 66 - 90 *
YUHAO ZENG: "Fish school feeding behavior quantification using acoustic signal and improved Swin Transformer", Computers and Electronics in Agriculture, Volume 204, January 2023 *
HE Xiao; Tuerhongjiang Abudukelimu; HE Huan: "Underwater fish school image enhancement algorithm based on wavelet transform", Computer Technology and Development, No. 09, pages 234 - 237 *
JIANG Wei: "Research on a precise feeding method based on fish behavior in recirculating aquaculture", China Master's Theses Full-text Database (Agricultural Science and Technology), pages 052 - 47 *
CHEN Caiwen; DU Yonggui; ZHOU Chao; SUN Chuanheng: "Fish school feeding behavior recognition technology based on support vector machines", Jiangsu Agricultural Sciences, No. 07, pages 83 - 87 *
CHEN Ming; ZHANG Chongyang; FENG Guofu; CHEN Xi; CHEN Guanqi; WANG Dan: "Evaluation method for fish feeding activity intensity based on feature-weighted fusion", Transactions of the Chinese Society for Agricultural Machinery, No. 02, pages 245 - 253 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116843085A (en) * | 2023-08-29 | 2023-10-03 | 深圳市明心数智科技有限公司 | Freshwater fish growth monitoring method, device, equipment and storage medium |
CN116843085B (en) * | 2023-08-29 | 2023-12-01 | 深圳市明心数智科技有限公司 | Freshwater fish growth monitoring method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN116311001B (en) | 2023-09-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhou et al. | Evaluation of fish feeding intensity in aquaculture using a convolutional neural network and machine vision | |
CN116311001B (en) | Method, device, system, equipment and medium for identifying fish swarm behavior | |
CN115861906B (en) | Method, device and system for identifying feeding strength of fish shoal and bait casting machine | |
CN113592896B (en) | Fish feeding method, system, equipment and storage medium based on image processing | |
JP7006776B2 (en) | Analytical instruments, analytical methods, programs and aquatic organism monitoring systems | |
Zeng et al. | Fish school feeding behavior quantification using acoustic signal and improved Swin Transformer | |
CN113349111A (en) | Dynamic feeding method, system and storage medium for aquaculture | |
CN115115830A (en) | Improved Transformer-based livestock image instance segmentation method | |
Li et al. | Cow individual identification based on convolutional neural network | |
CN115546622A (en) | Fish shoal detection method and system, electronic device and storage medium | |
CN116052064B (en) | Method and device for identifying feeding strength of fish shoal, electronic equipment and bait casting machine | |
CN115578678A (en) | Fish feeding intensity classification method and system | |
Zhou et al. | Deep images enhancement for turbid underwater images based on unsupervised learning | |
Du et al. | Feeding intensity assessment of aquaculture fish using Mel Spectrogram and deep learning algorithms | |
Zhang et al. | A high-precision facial recognition method for small-tailed Han sheep based on an optimised Vision Transformer | |
CN116630080B (en) | Method and system for determining capacity of aquatic product intensive culture feed based on image recognition | |
McLeay et al. | Deep convolutional neural networks with transfer learning for waterline detection in mussel farms | |
CN116895012A (en) | Underwater image abnormal target identification method, system and equipment | |
Jovanović et al. | Splash detection in fish Plants surveillance videos using deep learning | |
CN116206195A (en) | Offshore culture object detection method, system, storage medium and computer equipment | |
CN116798066A (en) | Sheep individual identity recognition method and system based on deep measurement learning | |
CN112749687B (en) | Picture quality and silence living body detection multitasking training method and device | |
CN115170942A (en) | Fish behavior identification method with multilevel fusion of sound and vision | |
Yang et al. | Fish feeding behavior recognition using adaptive dmca-umt algorithm | |
Zhua | A Deep Learning-Based Embedding Framework for Object Detection and Recognition in Underwater Marine Organisms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||