CN117523671A - Group behavior recognition method and system based on deep learning

Info

Publication number: CN117523671A
Application number: CN202311566881.9A
Authority: CN
Inventors: 李岩山; 尉淼淼; 刘恒九
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2023-11-23
Filing date: 2023-11-23
Publication date: 2024-02-06

Abstract

The embodiments of the application belong to the field of video big data analysis and relate to a group behavior recognition method based on deep learning, which comprises the steps of: obtaining an initial video segment, pre-training the initial video segment, extracting image appearance characteristics and spatial position information, and extracting initial individual appearance characteristics and initial individual spatial position information in the image appearance characteristics based on ROIAlign; obtaining a plurality of individual appearance characteristics in the image appearance characteristics, deducing interaction characteristics between the plurality of individual appearance characteristics and the initial individual appearance characteristics, and updating the interaction characteristics, the initial individual appearance characteristics and the initial individual spatial position information according to relation matrix multiplication, an attention mechanism and the bilinear interpolation principle to obtain updated individual appearance characteristics; and encoding a plurality of individual actions prestored in a database into one-hot vectors, and embedding the one-hot vectors into a latent space to obtain semantic features. The application also provides a group behavior recognition system based on deep learning.

Description

Group behavior recognition method and system based on deep learning

Technical Field

The present application relates to the field of video big data analysis, and in particular, to a group behavior recognition method, system, computer device and computer readable storage medium based on deep learning.

Background

Existing group behavior recognition methods typically enhance the visual representation of individuals by introducing relational reasoning into graph networks or transformers. They establish relationships based primarily on the visual representation or location of the individuals, which is not entirely consistent with the semantic-level relationships between individuals in a group activity. Some practitioners introduce additional knowledge, such as action labels, to establish semantic relationships in the group activity. Introducing additional knowledge improves group activity recognition performance, but the spatial relationship between a candidate instance and its surrounding instances is ignored, so the semantic relationships established for the group activity are one-sided and the final inference result is inaccurate.

Disclosure of Invention

The embodiments of the application aim to provide a group behavior recognition method, system, computer device and computer readable storage medium based on deep learning, so as to solve the technical problem that the semantic relationships established for group activities are one-sided because existing group behavior recognition methods ignore the spatial relationship between a candidate instance and its surrounding instances.

In order to solve the above technical problems, the embodiments of the present application provide a group behavior recognition method based on deep learning, which adopts the following technical scheme: the method comprises the following steps:

obtaining an initial video segment, pre-training the initial video segment, extracting image appearance characteristics and spatial position information, and extracting initial individual appearance characteristics and initial individual spatial position information in the image appearance characteristics based on the ROI alignment;

obtaining a plurality of individual appearance characteristics in the image appearance characteristics, deducing interaction characteristics between the plurality of individual appearance characteristics and initial individual appearance characteristics, and updating the interaction characteristics, the initial individual appearance characteristics and initial individual space position information according to a relation matrix multiplication and attention mechanism and bilinear interpolation principle to obtain updated individual appearance characteristics;

encoding a plurality of individual actions prestored in a database into one-hot vectors, embedding the one-hot vectors into a latent space to obtain semantic features, obtaining new coordinates corresponding to the prestored individual feature images in the latent space through interpolation sampling, and obtaining target semantic features;

performing dot product embedded matching on the target semantic features and updated individual appearance features, and obtaining a target feature image through network learning;

and transmitting the target characteristic image to the MLP, obtaining the category of each individual in the initial video segment through a softmax function, and classifying the group behaviors in the initial video segment based on the category of each individual.

Further, the step of obtaining an initial video segment, pre-training the initial video segment, extracting image appearance features and spatial position information, and extracting initial individual appearance features and initial individual spatial position information in the image appearance features based on the ROI alignment includes:

obtaining an initial video segment, extracting image appearance characteristics by taking an expanded three-dimensional network pre-trained on a group behavior recognition data set as a backbone, and providing spatial position information by adopting two-dimensional position coding.

Further, after the step of obtaining the initial video segment, extracting the image appearance feature by using the pre-trained expanded three-dimensional network as backbone on the group behavior recognition data set, and providing the spatial position information by adopting the two-dimensional position coding, the method further comprises:

extracting refined visual features of the individuals from the image appearance features according to the body part bounding box of each individual in the initial video segment based on the ROI alignment;

the extracted refined visual features of the individual are encoded as initial individual appearance features based on the full connection layer and the ReLU activation function.

Further, the step of obtaining a plurality of individual appearance characteristics in the image appearance characteristics, deducing interaction characteristics between the plurality of individual appearance characteristics and initial individual appearance characteristics, updating the interaction characteristics, the initial individual appearance characteristics and initial individual spatial position information according to a relation matrix multiplication and attention mechanism and bilinear interpolation principle, and obtaining updated individual appearance characteristics comprises the following steps:

and carrying out region division on the image appearance characteristics in the initial video segment, and deducing interaction characteristics among a plurality of individuals in each region in the image appearance characteristics.

Further, the step of performing region division on the image appearance characteristics in the initial video segment, and deducing interaction characteristics among a plurality of individuals in each region of the image appearance characteristics comprises the following steps:

aggregating initial individual appearance characteristics and individual appearance characteristics adjacent to the initial individual appearance characteristics in the image appearance characteristics according to the relation matrix multiplication and the attention mechanism to obtain updated appearance characteristics;

based on bilinear interpolation principle, the updated appearance characteristic is interpolated into the initial individual space position information to obtain the updated individual appearance characteristic, wherein the updated individual appearance characteristic comprises the relation between the individual and the surrounding individual.

Further, the step of performing dot product embedding matching on the target semantic features and the updated individual appearance features and obtaining a target feature image through network learning includes:

and carrying out dot product embedding and matching similarity on the target semantic features and the updated individual appearance features, encoding the spatial position information of each individual, and acquiring a target feature image with highest similarity through network learning based on a multi-head attention mechanism.

In order to solve the above technical problem, the embodiment of the present application further provides a group behavior recognition system based on deep learning, where the system includes:

the pre-training module is used for obtaining an initial video fragment, pre-training the initial video fragment, extracting image appearance characteristics and space position information, and extracting initial individual appearance characteristics and initial individual space position information in the image appearance characteristics based on the ROI alignment;

the updating module is used for obtaining a plurality of individual appearance characteristics in the image appearance characteristics, deducing interaction characteristics between the plurality of individual appearance characteristics and initial individual appearance characteristics, updating the interaction characteristics, the initial individual appearance characteristics and initial individual space position information according to a relation matrix multiplication and attention mechanism and bilinear interpolation principle, and obtaining updated individual appearance characteristics;

the coding module is used for coding a plurality of individual actions prestored in the database into one-hot vectors, embedding the one-hot vectors into the latent space to obtain semantic features, obtaining new coordinates corresponding to the prestored individual feature images in the latent space through interpolation sampling, and obtaining target semantic features;

the dot product embedding module is used for carrying out dot product embedding matching on the target semantic features and the updated individual appearance features, and obtaining a target feature image through network learning;

the classification module is used for transmitting the target characteristic image to the MLP, obtaining the category of each individual in the initial video segment through a softmax function, and classifying the group behaviors in the initial video segment based on the category of each individual.

Further, the pre-training module includes:

a pre-training unit for obtaining initial video segments, extracting image appearance characteristics by using an expanded three-dimensional network pre-trained on a group behavior recognition data set as a backbone, and providing spatial position information by adopting two-dimensional position coding;

An extraction unit for extracting refined visual features of the individuals from the image appearance features according to the body part bounding box of each individual in the initial video segment based on the ROI alignment;

and the encoding unit is used for encoding the extracted refined visual characteristics of the individual into initial individual appearance characteristics based on the full connection layer and the ReLU activation function.

In order to solve the above technical problems, the embodiments of the present application further provide a computer device, which adopts the following technical solution: the computer device comprises a memory and a processor, wherein the memory stores computer readable instructions, and the processor, when executing the computer readable instructions, implements the steps of the group behavior recognition method based on deep learning as described above.

In order to solve the above technical problems, embodiments of the present application further provide a computer readable storage medium, which adopts the following technical solutions: the computer readable storage medium has stored thereon computer readable instructions which when executed by a processor implement the steps of the deep learning based group behavior identification method as described above.

According to the method, an initial video segment is obtained, the initial video segment is pre-trained, image appearance characteristics and spatial position information are extracted, and initial individual appearance characteristics and initial individual spatial position information in the image appearance characteristics are extracted based on the ROI alignment; obtaining a plurality of individual appearance characteristics in the image appearance characteristics, deducing interaction characteristics between the plurality of individual appearance characteristics and initial individual appearance characteristics, and updating the interaction characteristics, the initial individual appearance characteristics and initial individual space position information according to a relation matrix multiplication and attention mechanism and bilinear interpolation principle to obtain updated individual appearance characteristics; encoding a plurality of individual actions prestored in a database into one-hot vectors, embedding the one-hot vectors into a latent space to obtain semantic features, obtaining new coordinates corresponding to the prestored individual feature images in the latent space through interpolation sampling, and obtaining target semantic features; performing dot product embedded matching on the target semantic features and updated individual appearance features, and obtaining a target feature image through network learning; and transmitting the target characteristic image to the MLP, obtaining the category of each individual in the initial video segment through a softmax function, and classifying the group behaviors in the initial video segment based on the category of each individual.
The method has the advantages that the semantic relationships established in group activities become more global, the accuracy of the final inference result is improved, the group behavior recognition capability is better, and the missing interaction information between participants and the group is compensated for.

Drawings

For a clearer description of the solution in the present application, a brief description will be given below of the drawings that are needed in the description of the embodiments of the present application, it being obvious that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.

FIG. 1 is a flow chart of one embodiment of a deep learning based group behavior recognition method;

FIG. 2 is a block diagram of one embodiment of a dynamic dual-stream self-attention group behavior recognition system;

FIG. 3 is a schematic diagram of one embodiment of a dynamic dual-stream self-attention group behavior identification system;

FIG. 4 is a schematic diagram of an embodiment of a computer device according to the application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.

FIG. 1 shows a flow chart of one embodiment of the deep learning based group behavior recognition method according to the present application. The group behavior recognition method based on deep learning comprises the following steps:

step S1, obtaining an initial video segment, pre-training the initial video segment, extracting image appearance characteristics and spatial position information, and extracting initial individual appearance characteristics and initial individual spatial position information in the image appearance characteristics based on ROI alignment;

specifically, step S1 includes: obtaining an initial video segment, extracting image appearance characteristics by taking an expanded three-dimensional network pre-trained on a group behavior recognition data set as a backbone, and providing spatial position information by adopting two-dimensional position coding; extracting refined visual features of the individuals from the image appearance features according to the body part bounding box of each individual in the initial video segment based on the ROI alignment; and encoding the extracted refined visual features of the individual into initial individual appearance features based on the full connection layer and the ReLU activation function.

The expanded three-dimensional network is the Inflated 3D ConvNet (I3D), which inflates the convolution and pooling kernels of a very deep image classification network from 2D to 3D so that spatio-temporal features can be learned seamlessly. After being trained on Kinetics, I3D achieves 80.9% and 98.0% accuracy on the benchmark data sets HMDB-51 and UCF-101, respectively. I3D is built mainly on the best-performing image network architecture: its convolution and pooling kernels are inflated from 2D to 3D and its parameters are reused, which finally yields a very deep spatio-temporal classification network. With Inception-V1 with BN, pre-trained on the initial video segment, as the backbone, five behavior classification networks were built, four of which follow previous papers, specifically:

ConvNet+LSTM: image appearance characteristics are extracted from each frame and then pooled over the whole video, or an LSTM is added on top of the per-frame features.

Improved C3D: it has more parameters than a two-dimensional convolution network; the large number of parameters is a drawback, and pre-training is not possible. A variation of C3D is implemented with 8 convolutional layers, 5 pooling layers and 2 fully connected layers at the top. The input to the model is a 16-frame segment with 112x112 pixels per frame. BN layers are added after all convolutional and fully connected layers, and the temporal stride of the first pooling layer is changed from 1 to 2 to reduce memory usage and allow larger batches.

Two-stream network: since the LSTM only captures the convolved information of the higher layers, the information of the lower layers, which is also important in some cases, is missed. The input is an RGB frame and 10 stacked optical-flow frames; the optical-flow input has twice as many channels as optical-flow frames (a horizontal x channel and a vertical y channel per frame).

New two-stream network: using Inception-V1, the later fusion part is changed to 3D convolution and 3D pooling, followed by a fully connected layer for classification. The input to the network is 5 consecutive RGB frames sampled 10 frames apart, and the network is trained end to end.

Two-stream inflated 3D convolution: the 2D convolutional base model is inflated into a 3D convolutional base model by adding a time dimension to the convolution kernels and the pooling kernels. Although 3D convolution can learn temporal features directly, adding optical flow improves performance.

I3D inflation works as follows: if a 2D filter is N×N, the corresponding 3D filter is N×N×N. To initialize I3D, parameters are bootstrapped from the model pre-trained on the initial video segment: images are converted into (boring) videos by copying them repeatedly into a video sequence. The 3D model is then implicitly pre-trained on the initial video segment by satisfying the boring-video fixed point, that is, the pooled activations on a boring video should be the same as those on the original single-image input. This is achieved by repeating the weights of the 2D filters N times along the time dimension and rescaling them by dividing by N, which ensures that the convolution filter responses are the same.
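The inflation rule above can be checked numerically. The following is a minimal sketch, not the applicant's implementation: the function names and the 3×3 example values are illustrative assumptions, and only a single filter position is evaluated rather than a full convolution.

```python
# Sketch of I3D-style filter inflation (names and values are illustrative assumptions).

def inflate_filter(w2d):
    """Turn an N x N filter into N x N x N: repeat it N times along time, divide by N."""
    n = len(w2d)
    return [[[value / n for value in row] for row in w2d] for _ in range(n)]


def response_2d(w2d, patch):
    """Filter response on one N x N image patch (sum of element-wise products)."""
    return sum(w * p for w_row, p_row in zip(w2d, patch) for w, p in zip(w_row, p_row))


def response_3d(w3d, clip):
    """Filter response on an N-frame clip of N x N patches."""
    return sum(response_2d(w_t, frame) for w_t, frame in zip(w3d, clip))


w2d = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 4.0], [1.0, 5.0, 9.0], [2.0, 6.0, 5.0]]

w3d = inflate_filter(w2d)
boring_clip = [patch for _ in range(len(w3d))]  # the same image repeated in time

r2d = response_2d(w2d, patch)
r3d = response_3d(w3d, boring_clip)
print(r2d, r3d)  # the two responses agree up to floating-point rounding
```

Because every temporal slice of the inflated filter carries 1/N of the original weights, summing over N identical frames reproduces the 2D response, which is the fixed-point property described above.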

Except for C3D, all models use pre-trained Inception-V1 as the base network. The fully connected layer, BN operations and ReLU activation functions are applied to the convolutional layers other than the last one, so as to encode the extracted features into a visual representation of the individual, which later modules use for individual relationship inference. In all cases, the video models were trained with standard SGD with momentum set to 0.9 for 110k steps on the Kinetics dataset, and the learning rate was reduced by a factor of 10 when the validation loss saturated. The learning rate hyperparameters were tuned on the validation set of Kinetics. During training, the original frames are spatially resized to 256×256 and then randomly cropped to 224×224. Temporally, the starting frame is chosen as early as possible so that a sufficient number of optical-flow frames is available. Short videos are looped so that they satisfy the network's input size. Random left-right flipping is also applied during training. During testing, a 224×224 center crop is taken, all frames of the whole video are fed in, and the resulting scores are averaged.
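The spatial and temporal preprocessing described above can be sketched as follows. This is a minimal illustration under assumptions: frames are plain nested lists already resized to 256×256, the helper names are hypothetical, and cutting the looped clip to an exact length is an assumption, since the text only states that short videos are looped.

```python
import random

# Sketch of the crop, flip and loop steps (helper names are hypothetical).


def loop_to_length(frames, length):
    """Loop a short clip until it has `length` frames (exact cut is an assumption)."""
    if not frames:
        raise ValueError("clip must contain at least one frame")
    repeats = -(-length // len(frames))  # ceiling division
    return (frames * repeats)[:length]


def crop(frame, top, left, size):
    """Cut a size x size window out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def random_crop_flip(frame, size, rng):
    """Training-time augmentation: random crop, then random left-right flip."""
    height, width = len(frame), len(frame[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    out = crop(frame, top, left, size)
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    return out


def center_crop(frame, size):
    """Test-time crop around the frame center."""
    height, width = len(frame), len(frame[0])
    return crop(frame, (height - size) // 2, (width - size) // 2, size)


frame = [[(r, c) for c in range(256)] for r in range(256)]  # stand-in for a resized frame
train_view = random_crop_flip(frame, 224, random.Random(0))
test_view = center_crop(frame, 224)
```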

ROI alignment (ROIAlign) is a feature alignment method used in object detection that maps regions of interest (ROIs) of different sizes to feature maps of the same size, thereby facilitating subsequent feature extraction and classification. The basic principle of the ROI alignment is to divide each ROI into a plurality of small grids, and then interpolate within each grid to obtain the corresponding feature value. Specifically, the implementation steps of the ROI alignment are as follows: 1. dividing each ROI evenly into a plurality of small grids; 2. for each grid, calculating its position and size and mapping it onto the feature map; 3. for each grid mapped onto the feature map, calculating the feature value of the grid by bilinear interpolation; 4. concatenating the feature values of all grids to obtain the feature representation of the corresponding ROI. In summary, ROI alignment is an effective feature alignment method that can improve the accuracy and efficiency of target detection. In practical applications, the performance of the ROI alignment can be tuned by adjusting parameters such as the grid size and the interpolation method.
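The four steps above can be sketched on a single-channel feature map. This is a simplified illustration, not the applicant's implementation: it samples one point at the center of each grid cell (an assumption; practical ROIAlign implementations usually average several sample points per cell), and the function names and box format `(x_min, y_min, x_max, y_max)` are hypothetical.

```python
import math

# Simplified ROI alignment on one feature channel (one sample per cell is an assumption).


def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap (list of rows) at the real-valued position (y, x)."""
    height, width = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fmap, box, out_h, out_w):
    """Map a box of any size to a fixed out_h x out_w grid of feature values."""
    x_min, y_min, x_max, y_max = box
    cell_h = (y_max - y_min) / out_h  # step 1: divide the ROI evenly
    cell_w = (x_max - x_min) / out_w
    return [
        [
            # steps 2 and 3: locate each cell center on the map and interpolate there
            bilinear(fmap, y_min + (i + 0.5) * cell_h, x_min + (j + 0.5) * cell_w)
            for j in range(out_w)
        ]
        for i in range(out_h)  # step 4: collect all cell values
    ]


fmap = [[10.0 * y + x for x in range(6)] for y in range(6)]
pooled = roi_align(fmap, (1.0, 1.0, 4.0, 3.0), 2, 2)
print(pooled)
```

Boxes of different sizes all yield the same `out_h` by `out_w` output, which is what allows the subsequent fully connected layer to take a fixed-size input.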

S2, obtaining a plurality of individual appearance characteristics in the image appearance characteristics, deducing interaction characteristics between the plurality of individual appearance characteristics and initial individual appearance characteristics, and updating the interaction characteristics, the initial individual appearance characteristics and initial individual space position information according to a relation matrix multiplication and attention mechanism and bilinear interpolation principle to obtain updated individual appearance characteristics;

specifically, step S2 includes: dividing the image appearance characteristics in the initial video segment into areas, and deducing interaction characteristics among a plurality of individuals in each area in the image appearance characteristics; aggregating initial individual appearance characteristics and individual appearance characteristics adjacent to the initial individual appearance characteristics in the image appearance characteristics according to the relation matrix multiplication and the attention mechanism to obtain updated appearance characteristics; based on bilinear interpolation principle, the updated appearance characteristic is interpolated into the initial individual space position information to obtain the updated individual appearance characteristic, wherein the updated individual appearance characteristic comprises the relation between the individual and the surrounding individual.

In relation matrix multiplication, each row of the weight matrix is multiplied by each column of the word vector matrix, where a column of the word vector matrix holds one dimension of the different word vectors. Matrix multiplication therefore corresponds to a weighted summation process: each resulting word vector is a new representation obtained by weighted summation, and the weight matrix is obtained by computing similarities and normalizing them.

The attention mechanism gives a neural network the ability to focus on a subset of its features. Focused attention is task-dependent attention that is deliberately directed at certain features for a predetermined purpose. The attention mechanism (Attention Mechanism) in deep learning imitates the human visual and cognitive systems and allows a neural network to focus on the relevant parts when processing input data. By introducing an attention mechanism, the neural network can automatically learn to focus selectively on the important information in the input, which improves the performance and generalization capability of the model. Specifically, the attention mechanism calculates the similarity between a feature and the other features and normalizes the similarities into attention weights; the output of the attention mechanism is obtained by summing the features weighted by their corresponding attention weights.
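The combination of relation matrix multiplication and attention described above can be sketched as follows. This is a minimal illustration under assumptions: similarity is taken to be a plain dot product, the normalization is a softmax, there are no learned projections, and the function names and feature values are hypothetical.

```python
import math

# Sketch of attention-weighted aggregation (dot-product similarity is an assumption).


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relation_matrix(features):
    """Row i holds the attention weights of individual i over all individuals."""
    return [softmax([dot(fi, fj) for fj in features]) for fi in features]


def aggregate(features):
    """Multiply the relation matrix by the feature matrix (a weighted summation)."""
    relation = relation_matrix(features)
    dim = len(features[0])
    return [
        [sum(w * f[d] for w, f in zip(row, features)) for d in range(dim)]
        for row in relation
    ]


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # hypothetical individual features
relation = relation_matrix(features)
updated = aggregate(features)
```

Each updated feature is a mixture of all individuals' features, weighted more heavily toward those that are most similar to it.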

Bilinear interpolation mainly addresses the scaling of image size, although some information is lost during scaling. It is essentially a weighting algorithm that predicts the value of an interpolation point from the linear relationship between two points. Specifically, the coordinates of the two known points are expressed in matrix form, an interpolation point is determined, the distances between the interpolation point and the two known points are calculated, and the value of the interpolation point is calculated according to the conventional linear relationship. The conventional linear relationship is as follows: if the x-axis represents the position of a pixel along a certain axis and the y-axis represents the pixel value at that position, the pixel value at an intermediate position can be derived from its distances to the known pixels.
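Applied to image scaling, the linear relationship above is used twice, once along each axis. The following is a minimal sketch for a single-channel image; mapping the corner pixels of the output onto the corner pixels of the input is an assumption made for simplicity, and the function names are hypothetical.

```python
# Sketch of bilinear image resizing (corner-aligned sampling is an assumption).


def lerp(a, b, t):
    """Linear interpolation between two known values; t is the relative distance from a."""
    return a + (b - a) * t


def resize_bilinear(img, new_h, new_w):
    """Resize a single-channel image (list of rows) to new_h x new_w."""
    height, width = len(img), len(img[0])
    out = []
    for i in range(new_h):
        y = i * (height - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, height - 1)
        ty = y - y0
        row = []
        for j in range(new_w):
            x = j * (width - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, width - 1)
            tx = x - x0
            top = lerp(img[y0][x0], img[y0][x1], tx)     # interpolate along x, upper row
            bottom = lerp(img[y1][x0], img[y1][x1], tx)  # interpolate along x, lower row
            row.append(lerp(top, bottom, ty))            # interpolate along y
        out.append(row)
    return out


small = [[0.0, 10.0], [20.0, 30.0]]
large = resize_bilinear(small, 3, 3)
print(large)
```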

Step S3, encoding a plurality of individual actions prestored in a database into one-hot vectors, embedding the one-hot vectors into a latent space to obtain semantic features, and obtaining new coordinates corresponding to the prestored individual feature images in the latent space through interpolation sampling to obtain target semantic features;

Interpolation sampling is a method of inferring unknown points on a continuous function or curve from a given set of discrete sampling points. The optimal sampling point is determined by searching the latent space for the optimal set of sample points and interpolating over that set; the interpolated optimal sampling point gives the new coordinates corresponding to the pre-stored individual feature map.
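The one-hot encoding and latent-space embedding in step S3 can be sketched as follows. This is only an illustration: the action labels and the embedding table values are hypothetical, and in practice the table would be learned during training. The interpolation sampling step is not covered here because the text does not specify its details.

```python
# Sketch of one-hot encoding and embedding (labels and table values are hypothetical).

ACTIONS = ["standing", "walking", "spiking"]

# One row per action; a learned table would replace these made-up numbers.
EMBEDDING_TABLE = [
    [0.2, 0.1, 0.0, 0.4],
    [-0.5, 0.3, 0.8, 0.1],
    [0.7, -0.2, 0.5, 0.9],
]


def one_hot(index, size):
    """Vector of zeros with a single 1 at the position of the action."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(vector, table):
    """Multiply a one-hot row vector by the table, which selects one table row."""
    dim = len(table[0])
    return [sum(v * row[d] for v, row in zip(vector, table)) for d in range(dim)]


walking = one_hot(ACTIONS.index("walking"), len(ACTIONS))
semantic_feature = embed(walking, EMBEDDING_TABLE)
print(walking, semantic_feature)
```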

S4, carrying out dot product embedding matching on the target semantic features and the updated individual appearance features, and obtaining a target feature image through network learning;

Dot-product embedding matching is performed on the target semantic features and the updated individual appearance features to measure their similarity, the spatial position information of each individual is encoded, and the target feature image with the highest similarity is obtained through network learning based on a multi-head attention mechanism. Specifically, the similarity between the target semantic features and the updated individual appearance features is calculated by dot product, and the two are embedded and matched according to that similarity. The multi-head attention mechanism mainly allows the system to attend to the correlations between different parts of the whole input: h sets (typically h=8) of independently learned linear projections are used to transform the queries, keys and values; the h sets of transformed queries, keys and values are fed in parallel into attention pooling; and the h attention-pooling outputs are concatenated and transformed by another learnable linear projection to produce the final output.
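The multi-head attention computation described above can be sketched as follows. This is a minimal illustration under assumptions: the individual appearance features are used as queries and the semantic features as keys and values (the text does not fix this assignment), the dot products are scaled by the square root of the head dimension as in the standard formulation, and the projection matrices are fixed hypothetical values rather than learned parameters.

```python
import math

# Sketch of multi-head attention with h = 2 heads (all matrices are hypothetical).


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Dot-product similarity, softmax normalization, weighted sum of values."""
    scale = math.sqrt(len(q[0]))  # scaling is an assumption (standard practice)
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(queries, keys_values, heads, w_out):
    """heads: list of (w_q, w_k, w_v) projections; w_out: final output projection."""
    outputs = [
        attention(matmul(queries, w_q), matmul(keys_values, w_k), matmul(keys_values, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the per-head outputs for each query, then project once more.
    concat = [[x for out in outputs for x in out[i]] for i in range(len(queries))]
    return matmul(concat, w_out)


identity2 = [[1.0, 0.0], [0.0, 1.0]]
double2 = [[2.0, 0.0], [0.0, 2.0]]
identity4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

individual_feats = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical updated appearance features
semantic_feats = [[2.0, 3.0]]                # hypothetical target semantic feature
heads = [(identity2, identity2, identity2), (identity2, identity2, double2)]

fused = multi_head_attention(individual_feats, semantic_feats, heads, identity4)
print(fused)
```

With a single semantic feature every attention weight equals 1, so each output row is simply the concatenation of the two heads' projected values; with several semantic features the weights would favor those with the highest dot-product similarity.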

And S5, transmitting the target characteristic image to the MLP, obtaining the category of each individual in the initial video segment through a softmax function, and classifying the group behaviors in the initial video segment based on the category of each individual.

The softmax function normalizes a vector, that is, it normalizes the similarities into a normalized weight matrix, in which a larger weight indicates a higher similarity. Further, the normalized weight matrix produced by the softmax function is multiplied by the word vector matrix: each row of the weight matrix is multiplied by each column of the word vector matrix, where a column of the word vector matrix holds one dimension of the different word vectors. Matrix multiplication corresponds to a weighted summation process: each resulting word vector is a new representation obtained by weighted summation, and the weight matrix is obtained by computing similarities and normalizing them.
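The classification in step S5 can be sketched as a small MLP followed by softmax. This is an illustration under assumptions: the MLP has one hidden layer with ReLU, the weights are fixed hypothetical values rather than learned ones, and the group behavior is derived from the individual categories by majority vote, which is an assumption because the text does not state the aggregation rule.

```python
import math
from collections import Counter

# Sketch of per-individual classification and group labeling (weights are hypothetical).


def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vector, weights, bias):
    """Fully connected layer; each row of `weights` produces one output unit."""
    return [sum(x * w for x, w in zip(vector, row)) + b for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def mlp_classify(feature, w1, b1, w2, b2):
    """Return the most probable category index and the full probability vector."""
    hidden = relu(linear(feature, w1, b1))
    probs = softmax(linear(hidden, w2, b2))
    return max(range(len(probs)), key=probs.__getitem__), probs


def group_label(individual_labels):
    """Majority vote over individual categories (an assumed aggregation rule)."""
    return Counter(individual_labels).most_common(1)[0][0]


w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]

target_features = [[2.0, 0.5], [0.1, 1.5], [3.0, 1.0]]  # hypothetical per-individual features
labels = [mlp_classify(f, w1, zeros, w2, zeros)[0] for f in target_features]
print(labels, group_label(labels))
```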

According to this embodiment, an initial video segment is obtained and pre-trained, image appearance features and spatial position information are extracted, and initial individual appearance features and initial individual spatial position information are extracted from the image appearance features based on ROIAlign; a plurality of individual appearance features in the image appearance features are obtained, interaction features between the plurality of individual appearance features and the initial individual appearance features are inferred, and the interaction features, the initial individual appearance features and the initial individual spatial position information are updated according to relation matrix multiplication, the attention mechanism and the bilinear interpolation principle to obtain updated individual appearance features; a plurality of individual actions pre-stored in a database are encoded into one-hot vectors, the one-hot vectors are embedded into a latent space to obtain semantic features, and new coordinates corresponding to the pre-stored individual feature map in the latent space are obtained through interpolation sampling to obtain target semantic features; dot product embedding matching is performed on the target semantic features and the updated individual appearance features, and a target feature image is obtained through network learning; and the target feature image is transmitted to the MLP, the category of each individual in the initial video segment is obtained through a softmax function, and the group behaviors in the initial video segment are classified based on the category of each individual.
This increases the global scope of the semantic relations captured in group activities, improves the accuracy of the final inference result, strengthens the group behavior recognition capability, and supplements the interaction information between participants and the group.
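The step of encoding pre-stored individual actions as one-hot vectors and embedding them into a latent space can be sketched as follows. This is an illustration under stated assumptions only: the action labels, the embedding matrix values and the function names are hypothetical, and in a trained network the embedding matrix would be learned rather than fixed.

```python
def one_hot(index, num_classes):
    """Return a vector with 1.0 at `index` and 0.0 everywhere else."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


def embed(one_hot_vec, embedding_matrix):
    """Multiply a one-hot row vector by a (num_classes x dim) matrix.

    The product simply selects the matrix row of the active class.
    """
    dim = len(embedding_matrix[0])
    return [
        sum(h * row[k] for h, row in zip(one_hot_vec, embedding_matrix))
        for k in range(dim)
    ]


# Hypothetical action labels and embedding values (assumptions, not from the source).
ACTIONS = ["standing", "walking", "jumping"]
EMBEDDING = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]]

walking = one_hot(ACTIONS.index("walking"), len(ACTIONS))
semantic_feature = embed(walking, EMBEDDING)
```

Here `semantic_feature` is the second row of the hypothetical embedding matrix, which is the latent-space representation of the "walking" action in this toy setup.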

With further reference to fig. 2-3, as an implementation of the method shown in fig. 1 described above, the present application provides an embodiment of a deep learning-based group behavior recognition system, which corresponds to the method embodiment shown in fig. 1.

As shown in fig. 2-3, the group behavior recognition system 800 based on deep learning according to the present embodiment includes: a pre-training module 201, an updating module 202, an encoding module 203, a dot product embedding module 204, and a classification module 205. Wherein:

the pre-training module 201 is configured to obtain an initial video segment, pre-train the initial video segment, extract image appearance features and spatial position information, and extract initial individual appearance features and initial individual spatial position information from the image appearance features based on ROIAlign;

the updating module 202 is configured to obtain a plurality of individual appearance features in the image appearance features, infer interaction features between the plurality of individual appearance features and the initial individual appearance features, update the interaction features, the initial individual appearance features and the initial individual spatial position information according to a relation matrix multiplication and attention mechanism and bilinear interpolation principle, and obtain updated individual appearance features;

the encoding module 203 is configured to encode a plurality of individual actions pre-stored in the database into one-hot vectors, embed the one-hot vectors into the latent space to obtain semantic features, obtain new coordinates corresponding to the pre-stored individual feature map in the latent space through interpolation sampling, and obtain target semantic features;

the dot product embedding module 204 is configured to perform dot product embedding matching on the target semantic feature and the updated individual appearance feature, and obtain a target feature image through network learning;

the classification module 205 is configured to transmit the target feature image to the MLP, obtain a class of each individual in the initial video segment through a softmax function, and classify the group behaviors in the initial video segment based on the class of each individual.
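The final classification step performed by the classification module can be sketched as a small MLP followed by softmax for each individual. The sketch rests on several assumptions that are not stated in the source: the weights in `PARAMS` are hypothetical placeholders for learned parameters, and the group label is taken as the majority vote over the individual labels purely for illustration, since the text only says that group behavior is classified based on the category of each individual.

```python
import math
from collections import Counter


def softmax(logits):
    """Convert logits into class probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def classify_individual(feature, params):
    """Two-layer MLP followed by softmax; returns the most probable class index."""
    hidden = relu(linear(feature, params["w1"], params["b1"]))
    probs = softmax(linear(hidden, params["w2"], params["b2"]))
    return max(range(len(probs)), key=probs.__getitem__)


def classify_group(features, params):
    """Label every individual, then take the majority label (an assumed rule)."""
    labels = [classify_individual(f, params) for f in features]
    return labels, Counter(labels).most_common(1)[0][0]


# Hypothetical identity weights stand in for learned parameters.
PARAMS = {
    "w1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[1.0, 0.0], [0.0, 1.0]], "b2": [0.0, 0.0],
}
features = [[2.0, 0.5], [1.5, 0.1], [0.2, 3.0]]
labels, group_label = classify_group(features, PARAMS)
```

With these placeholder weights, two of the three individuals fall into class 0, so the assumed majority rule assigns class 0 to the group.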

In this embodiment, the pre-training module 201 obtains an initial video segment, pre-trains the initial video segment, extracts image appearance features and spatial position information, and extracts initial individual appearance features and initial individual spatial position information from the image appearance features based on ROIAlign; the updating module 202 obtains a plurality of individual appearance features in the image appearance features, infers interaction features between the plurality of individual appearance features and the initial individual appearance features, and updates the interaction features, the initial individual appearance features and the initial individual spatial position information according to relation matrix multiplication, the attention mechanism and the bilinear interpolation principle to obtain updated individual appearance features; the encoding module 203 encodes a plurality of individual actions pre-stored in a database into one-hot vectors, embeds the one-hot vectors into a latent space to obtain semantic features, and obtains new coordinates corresponding to the pre-stored individual feature map in the latent space through interpolation sampling to obtain target semantic features; the dot product embedding module 204 performs dot product embedding matching on the target semantic features and the updated individual appearance features, and obtains a target feature image through network learning; the classification module 205 transmits the target feature image to the MLP, obtains the category of each individual in the initial video segment through a softmax function, and classifies the group behaviors in the initial video segment based on the category of each individual.
This increases the global scope of the semantic relations captured in group activities, improves the accuracy of the final inference result, strengthens the group behavior recognition capability, and supplements the interaction information between participants and the group.

In some optional implementations of the present embodiment, the pre-training module 201 includes:

the pre-training unit is used for obtaining an initial video segment, extracting image appearance features by using an inflated three-dimensional (I3D) network pre-trained on the group behavior recognition data set as the backbone, and providing spatial position information by means of two-dimensional position encoding.
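A two-dimensional position encoding can take several forms, and the source does not say which one is used. The following sketch assumes the common sinusoidal scheme, in which half of the channels encode the horizontal coordinate and the other half the vertical coordinate; the function names and the base constant 10000 are assumptions made for illustration.

```python
import math


def encode_1d(position, dim):
    """Sinusoidal encoding of one coordinate into `dim` values (dim must be even)."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        out.append(math.sin(position * freq))
        out.append(math.cos(position * freq))
    return out


def encode_2d(x, y, dim):
    """Concatenate the encodings of x and y, using dim / 2 channels for each."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return encode_1d(x, half) + encode_1d(y, half)


# Encode the hypothetical grid position (x=3, y=5) into 8 channels.
code = encode_2d(3, 5, 8)
```

The resulting vector can be added to or concatenated with the appearance feature at the same location so that the network has access to where each feature came from.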

In some optional implementations of the present embodiment, the updating module 202 includes:

the inference unit is used for carrying out region division on the image appearance characteristics in the initial video segment and inferring interaction characteristics among a plurality of individuals in each region in the image appearance characteristics;

the aggregation unit is used for aggregating initial individual appearance characteristics and individual appearance characteristics adjacent to the initial individual appearance characteristics in the image appearance characteristics according to the relation matrix multiplication and the attention mechanism to obtain updated appearance characteristics;

and the updating unit is used for interpolating the updated appearance characteristics into the initial individual space position information based on the bilinear interpolation principle to obtain updated individual appearance characteristics, wherein the updated individual appearance characteristics comprise the relation between the individual and surrounding individuals.
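The bilinear interpolation principle used by the updating unit can be illustrated by sampling a feature map at a continuous position. This is a generic sketch rather than the specific procedure of the application: the single-channel grid, the coordinate convention (x indexes columns, y indexes rows) and the function name are assumptions.

```python
def bilinear_sample(feature_map, x, y):
    """Sample a single-channel map at continuous coordinates (x=column, y=row)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp the query point to the valid coordinate range.
    x = min(max(x, 0.0), width - 1)
    y = min(max(y, 0.0), height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


# Hypothetical 2 x 2 feature map.
grid = [[0.0, 10.0], [20.0, 30.0]]
center = bilinear_sample(grid, 0.5, 0.5)
```

Sampling at the center of the four cells returns their average, and sampling exactly on a grid point returns that cell's value unchanged.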

In some alternative implementations of the present embodiment, the dot product embedding module 204 includes:

the dot product embedding unit is used for performing dot product embedding on the target semantic features and the updated individual appearance features to match their similarity, encoding the spatial position information of each individual, and obtaining the target feature image with the highest similarity through network learning based on a multi-head attention mechanism.
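The similarity matching at the core of the dot product embedding unit can be sketched as follows. The sketch covers only the dot product and the selection of the highest score; the position encoding, the multi-head attention mechanism and the network learning mentioned in the text are omitted, and all names and feature values are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match(individual_features, semantic_features):
    """For each individual, find the semantic feature with the largest dot product.

    Returns a list of (best_index, best_score) pairs, one per individual.
    """
    matches = []
    for feat in individual_features:
        scores = [dot(feat, sem) for sem in semantic_features]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Hypothetical semantic features (one per action) and individual features.
semantic = [[1.0, 0.0], [0.0, 1.0]]
individuals = [[0.9, 0.1], [0.2, 0.8]]
matches = match(individuals, semantic)
best_indices = [index for index, _ in matches]
```

In this toy example the first individual matches the first semantic feature and the second individual matches the second, because those pairs have the largest dot products.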

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by computer readable instructions stored in a computer readable storage medium, and that the instructions, when executed, may carry out the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of the steps is not strictly limited and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same time or sequentially, but may be performed at different times, in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring to fig. 4 specifically, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.

The computer device 3 comprises a memory 31, a processor 32 and a network interface 33 communicatively connected to each other via a system bus. It should be noted that only the computer device 3 with components 31-33 is shown in fig. 4, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing in accordance with predetermined or stored instructions, the hardware of which includes, but is not limited to, microprocessors, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.

The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory 31 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 31 may be an internal storage unit of the computer device 3, such as a hard disk or an internal memory of the computer device 3. In other embodiments, the memory 31 may also be an external storage device of the computer device 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the computer device 3. Of course, the memory 31 may also comprise both an internal storage unit of the computer device 3 and an external storage device. In this embodiment, the memory 31 is generally used to store the operating system and the various application software installed on the computer device 3, such as the computer readable instructions of the group behavior recognition method based on deep learning. Further, the memory 31 may be used to temporarily store various types of data that have been output or are to be output.

The processor 32 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 32 is typically used to control the overall operation of the computer device 3. In this embodiment, the processor 32 is configured to execute computer readable instructions stored in the memory 31 or process data, for example, execute computer readable instructions of the group behavior recognition method based on deep learning.

The network interface 33 may comprise a wireless network interface or a wired network interface, which network interface 33 is typically used for establishing a communication connection between the computer device 3 and other electronic devices.

From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk or optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present application.

It is apparent that the embodiments described above are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the present application but do not limit its patent scope. The present application may be embodied in many different forms; these embodiments are provided so that the present disclosure will be understood more thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in the foregoing embodiments may be modified, or some of their technical features may be replaced by equivalents. All equivalent structures made using the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the protection scope of the present application.

Claims

1. A group behavior recognition method based on deep learning, the method comprising the steps of:

2. The method of claim 1, wherein the steps of obtaining an initial video segment, pre-training the initial video segment, extracting image appearance features and spatial location information, and extracting initial individual appearance features and initial individual spatial location information from the image appearance features based on ROIAlign comprise:

3. The deep learning based group behavior recognition method of claim 2, wherein the step of obtaining an initial video segment, extracting image appearance features by using an inflated three-dimensional network pre-trained on a group behavior recognition data set as the backbone, and providing spatial location information by means of two-dimensional position encoding further comprises:

4. The method of claim 1, wherein the step of obtaining a plurality of individual appearance features in the image appearance features, deducing interaction features between the plurality of individual appearance features and the initial individual appearance features, updating the interaction features, the initial individual appearance features and the initial individual spatial location information according to a relation matrix multiplication and attention mechanism and bilinear interpolation principle, and obtaining updated individual appearance features comprises:

5. The method of claim 4, wherein the step of region-dividing the image appearance features in the initial video segment to infer interaction features between the plurality of individuals within each region of the image appearance features comprises:

6. The deep learning-based group behavior recognition method according to claim 1, wherein the step of performing dot product embedded matching on the target semantic features and the updated individual appearance features and obtaining the target feature image through network learning comprises the steps of:

7. A deep learning-based group behavior recognition system, the system comprising:

8. The deep learning based group behavior recognition system of claim 7, wherein the pre-training module comprises:

9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the deep learning based group behavior recognition method of any one of claims 1 to 6.

10. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the deep learning based group behavior recognition method of any one of claims 1 to 6.