CN115346275A - Dual-branch human body behavior prediction method, device and equipment based on optical flow and graph

Dual-branch human body behavior prediction method, device and equipment based on optical flow and graph

Info

Publication number: CN115346275A
Application number: CN202211012592.XA
Authority: CN (China)
Legal status: Pending
Applicant/Assignee: Shenzhen Sunwin Intelligent Co., Ltd.
Inventors: 胡懋成, 王秋阳, 周婧雯, 汪玉冰, 郑博超, 凤阳
Other languages: Chinese (zh)

Classifications

    • G06V40/20 Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06N3/08 Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/764 Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V10/774 Image or video recognition or understanding using pattern recognition or machine learning; generating sets of training patterns, e.g. bagging or boosting

Abstract

The embodiment of the invention discloses a dual-branch human body behavior prediction method, device and equipment based on optical flow and a graph, wherein the method comprises: acquiring image data in a detection area; performing frame cutting on the image data to obtain multiple frames of still pictures; and inputting the multiple frames of still pictures, as a picture sequence, into a human behavior prediction model for processing to obtain a human behavior prediction result. The invention combines the instantaneous optical flow information of human behavior with long-sequence information in the time dimension, and uses a graph convolution network to learn the local body information of pedestrians in the space dimension. Combining spatial and temporal information improves prediction accuracy. Iteratively refining the output optical flow clearly distinguishes moving pedestrians from the background, and the predicted optical flow direction and speed also provide the model with supervision information for predicting the human behavior category.

Description

Dual-branch human body behavior prediction method, device and equipment based on optical flow and graph
Technical Field
The invention relates to the technical field of computer vision, and in particular to a dual-branch human body behavior prediction method, device and equipment based on optical flow and graph.
Background
With the rapid development of society, human behavior prediction has become a research hotspot and a difficult problem in current industry and academia, and it has important application value in everyday life. At present there are several ways to predict human behavior:
the first method comprises the steps of firstly carrying out image processing on a collected visible light image and an infrared image to obtain a tracking target area, then detecting whether a target area to be tracked comprises a pedestrian or not, tracking the pedestrian when the target area comprises the pedestrian, detecting the edge of the pedestrian in the tracking process to obtain a pedestrian to-be-identified area from the target area to be tracked, and inputting the pedestrian to-be-identified area into an identification model to obtain a pedestrian behavior identification result. The method is based on an infrared mode to track the target, equipment needs extra cost, and the pedestrian target is judged by extracting the characteristics of the image based on a traditional direction gradient histogram mode, the pedestrian target is seriously influenced by illumination and has poor effect, and the method only performs behavior identification based on the appearance contour of a person to ignore space-time fusion information and has poor identification effect.
In the second method, a 3D convolutional neural network is trained with a multi-task deep learning approach: a fixed number of consecutive frames of background video, organized by various human behavior attributes, are taken as the network input, and the recognition task is completed once the 3D convolutional neural network is trained. This method recognizes human behavior only at fixed positions and only for a single person, which is a major limitation. In addition, behavior recognition based on 3D convolution is inefficient, and the trained model is easily and severely affected by background factors, so the recognition effect is poor.
In the third method, a video segment is acquired and feature extraction and dimension reduction are performed on its image-frame sequence; the reduced feature vectors are then encoded, and the reduced feature vectors are expanded to obtain a preset number of expanded feature vectors. The expanded feature vectors and the encoded feature vectors are input into a three-layer single-layer decoder for decoding, and the last layer of decoded feature vectors is input into a single-layer fully connected feedforward network to compute a number of predicted values. The predicted values are input into a logistic regression network to obtain corresponding prediction probabilities, and the category with the maximum probability is selected as the human behavior category of the rectangular frame corresponding to the last layer of decoded feature vectors. This method first uses spatial information and then extracts temporal features from the spatial features of different time periods; spatial information is easily lost during temporal feature extraction, and the spatial and temporal features are difficult to fuse, which reduces the accuracy of action recognition.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a dual-branch human body behavior prediction method, device and equipment based on optical flow and graph.
To achieve this purpose, the invention adopts the following technical scheme:
In a first aspect, a dual-branch human behavior prediction method based on optical flow and graph comprises:
acquiring image data in a detection area;
performing frame cutting processing on the image data to obtain a plurality of frames of static pictures;
and inputting the multiple frames of static pictures into the human behavior prediction model in a picture sequence mode for processing to obtain a human behavior prediction result.
The further technical scheme is as follows: the method for processing the multi-frame static pictures input into the human behavior prediction model in the form of picture sequences to obtain the human behavior prediction result comprises the following steps:
inputting a plurality of static pictures into a tracking model in a picture sequence mode for processing to obtain a character image with an id index;
inputting the figure image with the id index into an optical flow model for processing to obtain a two-dimensional optical flow characteristic diagram;
carrying out feature fusion on the character image with the id index and the two-dimensional optical flow feature map to obtain a fusion feature map;
and inputting the fusion characteristic graph into a graph model for processing to obtain a human behavior prediction result.
The further technical scheme is as follows: the method for inputting the multiple frames of static pictures into the tracking model in the form of picture sequences to be processed so as to obtain the character image with the id index comprises the following steps:
inputting a plurality of static pictures into a tracking model in a picture sequence mode to detect different human body target frames;
and giving indexes id to the detected different human body target frames to obtain the person image with the id index.
The further technical scheme is as follows: the method for inputting the human image with the id index into the optical flow model to be processed to obtain the two-dimensional optical flow feature map comprises the following steps:
performing feature extraction on the figure images with the id indexes of the front frame and the rear frame by adopting a cavity convolution check, and processing the figure images through a ReLU activation layer and a Max Pooling layer to obtain a first feature map and a second feature map;
carrying out full-pixel correlation processing on the first feature map and the second feature map to obtain a correlation feature map;
constructing optical flow information of the 0 th cycle;
performing a Warp operation on the second feature map by using the optical flow information of the 0 th cycle to obtain a Warp result of the 1 st cycle;
combining the results of the two frames of the figure images with the id indexes obtained by AsymOFMM processing of the occlusion mask and the Warp result of the 1 st cycle to obtain a first combination characteristic;
inputting a result obtained by merging the first merging characteristic, the first characteristic graph and the correlation characteristic graph into a coding network for processing to obtain optical flow information of the 1 st cycle;
performing a Warp operation on the second feature map by using the optical flow information of the 1-time cycle to obtain a Warp result of the 2 nd cycle;
combining the result obtained by carrying out AsymOFMM processing on the two frames of figure images with the id indexes in the front frame and the back frame with the occlusion mask with the Warp result of the 2 nd cycle to obtain a second combination characteristic;
the result obtained by merging the second merging characteristic with the first characteristic diagram and the correlation characteristic diagram is input into a coding network for processing to obtain optical flow information of the 2 nd circulation;
according to the set cycle number, obtaining optical flow information of a first stage after completing all cycles;
and inputting the optical flow information of the first stage into ConvGRU for processing to obtain a final two-dimensional optical flow feature map.
The further technical scheme is as follows: the constructing optical flow information of the 0 th cycle comprises the following steps:
merging the first feature map, the second feature map and the correlation feature map to obtain a merged feature map;
and inputting the merged feature map into an encoding network for processing to obtain optical flow information of the 0 th cycle.
The further technical scheme is as follows: the method for performing feature fusion on the character image with the id index and the two-dimensional optical flow feature map to obtain a fusion feature map comprises the following steps:
extracting edge features from the character image with the id index by using a Sobel operator to obtain an initial edge feature graph;
performing edge sharpening on the initial edge feature map by using Laplacian filtering to refine the edge feature map;
after the refined edge feature map is subjected to 3x3 convolution processing, performing downsampling processing by using Average Pooling to obtain a downsampled feature map;
adjusting the down-sampling feature map into an adjustment feature map with the same size as the original map by using a bilinear difference value;
and performing feature fusion on the two-dimensional optical flow feature map and the adjustment feature map to obtain a fusion feature map.
The further technical scheme is as follows: the method for inputting the fusion characteristic diagram into the diagram model to be processed to obtain the human behavior prediction result comprises the following steps:
splicing the fused feature graph corresponding to each frame with the original graph to obtain a spliced graph;
dividing the mosaic picture into 3 pictures of different Patch sizes;
dividing the mosaic into equal-Size Patches in corresponding quantity according to the Size of the Patches Size to obtain 3 Patches blocks;
carrying out feature transformation on the 3 Patches blocks to obtain 3 groups of Graph nodes;
respectively carrying out similarity calculation on the 3 Patches blocks by using a perceptual hash algorithm based on RGB information in the mosaic to form a dynamic adjacency matrix of the 3 Patches blocks;
inputting the dynamic adjacency matrixes of the 3 Patches blocks into a 12-layer graph convolution network and time convolution network superposition block for node learning and feature updating, and summarizing features of the 12 th-layer graph convolution network and time convolution network superposition block through a GlobalPooling layer to obtain 512-dimensional feature vectors of each sequence;
inputting the 512-dimensional feature vectors of each sequence into full-connection layer processing to generate a preliminary prediction probability of the human behavior category of each patch block;
adopting a consensus function to process the generated preliminary prediction probability of the human behavior category of each patch block to obtain consensus prediction;
and inputting the consensus prediction into a SoftMax function for processing so as to obtain a final human body behavior prediction result.
In a second aspect, an optical flow and graph-based dual-branch human behavior prediction device includes:
an acquisition unit configured to acquire image data in a detection area;
the frame cutting processing unit is used for carrying out frame cutting processing on the image data to obtain multiple frames of static pictures;
and the prediction unit is used for inputting the multi-frame static pictures into the human behavior prediction model in the form of picture sequences for processing so as to obtain a human behavior prediction result.
In a third aspect, a computer device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the optical flow and graph-based dual-branch human behavior prediction method steps as described above.
In a fourth aspect, a computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the steps of the optical flow and graph based dual-branch human behavior prediction method described above.
Compared with the prior art, the invention has the following beneficial effects: the invention combines the instantaneous optical flow information of human behavior with long-sequence information in the time dimension, and uses a graph convolution network to learn the local body information of pedestrians in the space dimension. Combining spatial and temporal information improves prediction accuracy. Iteratively refining the output optical flow clearly distinguishes moving pedestrians from the background, and the predicted optical flow direction and speed also provide the model with supervision information for predicting the human behavior category.
The foregoing description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the description, and in order to make the above and other objects, features and advantages of the present invention more apparent, preferred embodiments are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a dual-branch human behavior prediction method based on an optical flow and a graph according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for predicting human body behavior based on optical flow and graph according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of an apparatus for dual branch human behavior prediction based on optical flow and graph according to an embodiment of the present invention;
fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a dual-branch human behavior prediction method based on optical flow and graph according to an embodiment of the present invention; fig. 2 is a flowchart of a dual-branch human behavior prediction method based on optical flow and graph according to an embodiment of the present invention, the dual-branch human behavior prediction method based on optical flow and graph is applied in a server, and the method is executed by application software installed in the server.
As shown in fig. 2, the dual-branch human body behavior prediction method based on optical flow and graph comprises the following steps: S10-S30.
And S10, acquiring image data in the detection area.
A monitoring device, such as a surveillance camera, is used to capture video data (image data) within the area to be detected. Such monitoring equipment is common on the market, and this application places no limitation on it.
And S20, performing frame cutting processing on the image data to obtain multiple frames of static pictures.
The captured image data is cut into frame-by-frame RGB still pictures, which serve as the input data of the human behavior prediction model.
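As a concrete illustration of this frame-cutting step, the following is a minimal sketch assuming OpenCV is used for decoding; the function name and the in-memory frame list are illustrative choices, not details from the patent.

```python
import cv2

def cut_frames(video_path):
    """Cut a captured video into frame-by-frame RGB still pictures."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame_bgr = cap.read()   # OpenCV decodes frames in BGR order
        if not ok:
            break
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames                    # picture sequence for the prediction model
```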
And S30, inputting the multiple frames of static pictures into the human behavior prediction model in the form of picture sequences for processing to obtain a human behavior prediction result.
In an embodiment, step S30 specifically includes the following steps: S301-S304.
S301, inputting a plurality of frames of static pictures into a tracking model in a picture sequence mode for processing to obtain a person image with an id index.
In this embodiment, a BoT-SORT model is used as the tracking model, with a yolov7 model as the pedestrian detector in BoT-SORT. To reduce id switching caused by people occluding one another, not only human-body features are used as matching references; optical flow features between front and rear frames are added as well. The human-body features obtained from the pedestrian detector are fused with the optical flow features as auxiliary information for judging whether the pedestrian indexes in the front and rear frames match. Optical flow feature extraction mainly uses the Farneback dense optical flow method, which enriches the reference information the model uses to predict the person index id; this is especially suitable for fast-moving human bodies and improves the accuracy of target tracking.
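For reference, the Farneback dense optical flow mentioned above is available in OpenCV as cv2.calcOpticalFlowFarneback; the sketch below uses common default parameters, which the patent does not specify.

```python
import cv2

def farneback_flow(prev_rgb, next_rgb):
    """Dense per-pixel (dx, dy) flow between two consecutive frames."""
    prev_gray = cv2.cvtColor(prev_rgb, cv2.COLOR_RGB2GRAY)
    next_gray = cv2.cvtColor(next_rgb, cv2.COLOR_RGB2GRAY)
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```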
In an embodiment, step S301 specifically includes the following steps: S3011-S3012.
S3011, inputting multiple frames of static pictures into a tracking model in a picture sequence mode to detect different human body target frames.
In this embodiment, the obtained still pictures (a frame-by-frame video stream) are input into the tracking model as a picture sequence, so that different human body target frames can be detected. A target frame is represented by the picture position coordinates (x_min, y_min) and (x_max, y_max), that is, the coordinates of the target frame's upper-left corner and lower-right corner relative to the picture.
S3012, giving index id to the detected different human body target frames to obtain the person image with the id index.
In this embodiment, each detected human body target frame is given an index id; the same index id represents the same person, and different persons are video-tracked in real time based on their different id indexes. The input images of the optical flow model and the graph model are the id-indexed single-person images obtained by cropping the original pictures (i.e., the frame-by-frame RGB still pictures cut from the image data captured by the monitoring device) according to the target frames detected by the tracking model.
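A minimal sketch of this cropping step follows; the (x_min, y_min, x_max, y_max) box layout matches the description above, while the helper function itself is an illustrative assumption.

```python
def crop_person(frame_rgb, box, track_id):
    """Crop an id-indexed single-person image from a tracked target frame."""
    x_min, y_min, x_max, y_max = box
    return track_id, frame_rgb[y_min:y_max, x_min:x_max]
```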
S302, inputting the character image with the id index into the optical flow model for processing to obtain a two-dimensional optical flow feature map.
In an embodiment, step S302 specifically includes the following steps: S3021-S30292.
S3021, performing feature extraction on the id-indexed person images of the front and rear frames with a dilated convolution kernel, and processing them through a ReLU activation layer and a Max Pooling layer to obtain a first feature map and a second feature map.
In this embodiment, a dilated convolution kernel with dilation rate 2 and size 3×3 is used to extract features from the id-indexed person images I_1 and I_2 of the front and rear frames output by the tracking model; the feature maps F_1 and F_2 are then obtained through a ReLU activation layer and a Max Pooling layer.
S3022, performing full-pixel correlation processing on the first feature map and the second feature map to obtain a correlation feature map.
In this embodiment, according to the set number of cycles N, the optical flow model cyclically reuses each output optical flow Flow_i (i ∈ N) to perform a Warp operation on F_2. Since no optical flow information is available in the 0th cycle, a full-pixel correlation operation is performed directly on F_1 and F_2 to obtain the correlation feature map F_match.
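The full-pixel correlation can be sketched as an all-pairs similarity between the two feature maps, as below; this follows the common all-pairs correlation formulation (PyTorch assumed) rather than a verified detail of the patent.

```python
import torch

def full_pixel_correlation(f1, f2):
    """All-pairs correlation of F_1 and F_2, each of shape (B, C, H, W)."""
    b, c, h, w = f1.shape
    f1 = f1.reshape(b, c, h * w)
    f2 = f2.reshape(b, c, h * w)
    # (B, H*W, H*W): every pixel of F_1 against every pixel of F_2
    corr = torch.einsum('bci,bcj->bij', f1, f2) / c ** 0.5
    return corr.reshape(b, h * w, h, w)   # F_match
```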
S3023, constructing the optical flow information of the 0th cycle.
In an embodiment, step S3023 specifically includes the following steps: S30231-S30232.
S30231, merging the first feature map, the second feature map and the correlation feature map to obtain a merged feature map.
S30232, inputting the merged feature map into the coding network for processing to obtain the optical flow information of the 0th cycle.
In this embodiment, for S30231 and S30232: in the 0th cycle there is no available optical flow, so F_1, F_2 and F_match are merged directly to obtain a new feature map F_0_concat (denoting the concat result of the 0th cycle), which is input into a coding network consisting of an encoder and a decoder for optical flow estimation. The encoder consists of 5 Convolution + BatchNorm + ReLU layers, 2 residual layers and 3 Max Pooling layers; the decoder consists of 4 deconvolution layers and 3 upsampling steps, and fuses low-level feature information from the encoder. Nearest-neighbor interpolation is used for upsampling. After the decoding layer, the optical flow output Flow_0 of the 0th cycle is obtained.
S3024, performing a Warp operation on the second feature map using the optical flow information of the 0th cycle to obtain the Warp result of the 1st cycle.
S3025, combining the result obtained by processing the id-indexed person images of the front and rear frames through the occlusion-mask module AsymOFMM with the Warp result of the 1st cycle to obtain a first combination feature.
S3026, inputting the result obtained by merging the first combination feature with the first feature map and the correlation feature map into the coding network for processing to obtain the optical flow information of the 1st cycle.
S3027, performing a Warp operation on the second feature map using the optical flow information of the 1st cycle to obtain the Warp result of the 2nd cycle.
S3028, combining the result obtained by processing the id-indexed person images of the front and rear frames through the occlusion-mask module AsymOFMM with the Warp result of the 2nd cycle to obtain a second combination feature.
S3029, inputting the result obtained by merging the second combination feature with the first feature map and the correlation feature map into the coding network for processing to obtain the optical flow information of the 2nd cycle.
S30291, obtaining the first-stage optical flow information after completing all cycles, according to the set number of cycles.
S30292, inputting the first-stage optical flow information into the ConvGRU for processing to obtain the final two-dimensional optical flow feature map.
For S3024-S30292: in this embodiment, Flow_0 is not output directly as the result; instead, Flow_0 is used to perform the Warp operation on the preceding F_2, obtaining I_1_warp, which is recorded as the Warp result of the 1st cycle. Then I_1 and I_2 are passed through the asymmetric occlusion-mask feature matching module (AsymOFMM) to generate an occlusion mask, which is combined with I_1_warp to obtain the masked image I_1_mask. As described above, in the 0th cycle F_0_concat is obtained from the feature maps F_1 and F_2 of the id-indexed person images of the front and rear frames together with F_match, while in each subsequent cycle I_i_mask (i ∈ N) is merged with the first feature map and the correlation feature map. In this way each cycle iteratively reuses the optical flow output of the previous cycle. According to the set number of cycles, the first-stage optical flow Flow_out1 is output once all cycles are completed. Then, following the idea of iterative updating, a ConvGRU recurrent neural network is added to the optical flow model: the first-stage optical flow Flow_out1 and I_1, as context information, are input into the ConvGRU, which outputs the final optical flow estimation result Flow_output. Flow_output is a two-dimensional optical flow feature map containing both velocity and direction features.
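The loop structure can be summarized with the schematic sketch below (PyTorch assumed). The warp() helper uses the standard backward-warping idiom with F.grid_sample; encoder(), asym_ofmm() and conv_gru() are placeholders for the coding network, the AsymOFMM module and the ConvGRU, whose internals the passage above only outlines, and channel bookkeeping is omitted.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp feature map feat (B,C,H,W) by flow (B,2,H,W)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing='ij')
    coords = torch.stack((xs, ys)).float().unsqueeze(0) + flow   # (B,2,H,W)
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                  # map to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                 # (B,H,W,2)
    return F.grid_sample(feat, grid, align_corners=True)

def iterative_flow(f1, f2, corr, i1, i2, encoder, asym_ofmm, conv_gru, n_cycles):
    flow = encoder(torch.cat((f1, f2, corr), dim=1))             # Flow_0
    for _ in range(n_cycles):
        f2_warp = warp(f2, flow)                                 # Warp result
        i_mask = asym_ofmm(i1, i2, f2_warp)                      # occlusion-masked merge
        flow = encoder(torch.cat((i_mask, f1, corr), dim=1))     # Flow_i
    return conv_gru(flow, i1)                                    # Flow_output
```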
And S303, carrying out feature fusion on the person image with the id index and the two-dimensional optical flow feature map to obtain a fusion feature map.
In an embodiment, step S303 specifically includes the following steps: S3031-S3035.
S3031, extracting edge features from the person image with the id index using a Sobel operator to obtain an initial edge feature map.
S3032, performing edge sharpening on the initial edge feature map using Laplacian filtering to obtain a refined edge feature map.
S3033, performing a 3×3 convolution on the refined edge feature map and then downsampling using Average Pooling to obtain a downsampled feature map.
S3034, adjusting the downsampled feature map into an adjusted feature map with the same size as the original picture using bilinear interpolation.
S3035, performing feature fusion on the two-dimensional optical flow feature map and the adjusted feature map to obtain a fusion feature map.
For S3031-S3035: in this embodiment, the Sobel operator is applied to image I to extract edge features, obtaining the feature map I_s. Laplacian filtering is then applied to I_s for edge sharpening, obtaining a refined edge feature map I_l with more distinct edge information. I_l is passed through a 3×3 convolution and then downsampled using Average Pooling to obtain F_a. Bilinear interpolation is then used to adjust F_a into a feature map F_b with the same size as the original image. The two-dimensional optical flow feature map output by the optical flow model is fused with F_b to obtain a new feature map. Preprocessing each frame in a group of time-frame sequences and fusing the optical flow features in this way yields a feature sequence F_1, F_2, ..., F_N, i.e., the fusion feature maps that form the input of the graph model.
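A sketch of this edge branch follows, assuming OpenCV for the Sobel and Laplacian operators and PyTorch for pooling and resizing; the untrained 3×3 convolution layer is a placeholder for the trained one.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def edge_branch(person_rgb, flow_map, conv3x3=None):
    """flow_map: (1, 2, H, W) two-dimensional optical flow feature map."""
    gray = cv2.cvtColor(person_rgb, cv2.COLOR_RGB2GRAY).astype(np.float32)
    i_s = cv2.Sobel(gray, cv2.CV_32F, 1, 0) + cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    i_l = i_s - cv2.Laplacian(i_s, cv2.CV_32F)          # Laplacian sharpening
    x = torch.from_numpy(i_l)[None, None]               # (1, 1, H, W)
    conv3x3 = conv3x3 or nn.Conv2d(1, 1, 3, padding=1)  # placeholder 3x3 conv
    x = F.avg_pool2d(conv3x3(x), 2)                     # Average Pooling -> F_a
    h, w = gray.shape
    f_b = F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
    return torch.cat((flow_map, f_b), dim=1)            # fusion feature map
```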
S304, inputting the fusion feature map into the graph model for processing to obtain a human behavior prediction result.
In this embodiment, the graph model uses a Feature-GCN model. The Feature-GCN model converts the feature-map sequence into a Graph representation and inputs it into a graph convolution network and a temporal convolution network to extract features of different dimensions, thereby predicting the human behavior category.
In an embodiment, step S304 specifically includes the following steps: S3041-S3049.
S3041, splicing the fused feature map corresponding to each frame with the original map to obtain a spliced map.
S3042, dividing the spliced map into 3 graphs of different Patch Sizes.
S3043, dividing the spliced map into a corresponding number of equal-size Patches according to the Patch Size to obtain 3 Patch blocks.
S3044, performing feature transformation on the 3 Patch blocks to obtain 3 groups of Graph nodes.
S3045, performing similarity calculation on the 3 Patch blocks with a perceptual hash algorithm based on the RGB information in the spliced map, to form dynamic adjacency matrices for the 3 Patch blocks.
S3046, inputting the dynamic adjacency matrices of the 3 Patch blocks into the 12-layer stack of graph-convolution and temporal-convolution blocks for node learning and feature updating, and summarizing the features of the 12th layer through a Global Pooling layer to obtain a 512-dimensional feature vector for each sequence.
S3047, inputting the 512-dimensional feature vector of each sequence into a fully connected layer to generate a preliminary prediction probability of the human behavior category for each patch block.
S3048, processing the generated preliminary prediction probabilities of the human behavior category of each patch block with a consensus function to obtain a consensus prediction.
S3049, inputting the consensus prediction into a SoftMax function for processing to obtain the final human behavior prediction result.
For S3041-S3049: in this embodiment, the fusion feature map F corresponding to each frame (where F denotes any such feature map) is spliced with the original picture to obtain F_concat, which is input into the graph model. Considering the differing sizes of human objects in the images, F_concat is divided into graphs of 3 different Patch Sizes (P = 9, 16, 25), and F_concat is then divided by size P into a corresponding number of equal-size Patches, giving 3 Patch blocks per frame (3N blocks for N frames), called P_s, P_m and P_l. The letters correspond to small, middle and large and indicate the size of the patch block; P = 9, for example, means the block contains only 9 Patches, a small number. After feature transformation of each Patch, 3 groups of features are obtained: X = [x_s1, ..., x_s9], X = [x_m1, ..., x_m16] and X = [x_l1, ..., x_l25]. The feature-bearing Patches are then represented as nodes of a Graph, giving 3 groups of nodes: V = [v_s1, ..., v_s9], V = [v_m1, ..., v_m16] and V = [v_l1, ..., v_l25]. Since a time-frame sequence must be predicted, dynamic temporal information has to be taken into account, so each node of the Graph has two types of edges: connections to other Patches in the same frame, and connections to the Patches at the same position in the previous and next frames, which represent its neighbor nodes in the time dimension. Subsequently, a perceptual hash algorithm performs similarity calculation on the Patches based on the RGB information in F_concat: a fingerprint string is generated for each Patch, and the fingerprints of different Patches are compared. For each Patch, the N Patches with the closest similarity are taken as its neighbor nodes to construct the adjacency matrix. On this basis, 3 initial dynamic adjacency matrices A_s, A_m and A_l are constructed, with sizes 9x9, 16x16 and 25x25 respectively; the adjacency matrices are dynamically updated in each subsequent layer's feature update. In the first few layers the graph model therefore tends to choose neighbor nodes according to low-level features such as color, while in the last few layers the features extracted by the graph model carry stronger semantic information. From the features X, the nodes V and the adjacency matrices A, the 3 Graphs G_s, G_m and G_l are constructed. After batching, G_s, G_m, G_l and the adjacency matrices A_s, A_m, A_l are input into the 12-layer stack of graph-convolution and temporal-convolution blocks for node learning and feature updating; that is, each layer of the overall network has two sub-networks, the overall network has 12 layers, and the number of output channels changes every three layers: 64, 128, 256, 512. Taking the first layer as an example, after one graph convolution layer the feature maps G_s_s, G_s_m and G_s_l are obtained, representing features in the spatial dimension. In each GCN layer, a 1x1 convolution with stride 1 and padding 0 is used; its purpose is to change the number of channels.
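The dynamic adjacency construction can be illustrated with the hedged sketch below, where a simple average hash stands in for the perceptual hash (the patent does not spell out its exact variant) and each Patch keeps its n most similar Patches as neighbor nodes.

```python
import cv2
import numpy as np

def fingerprint(patch_rgb, size=8):
    """Average-hash stand-in for the perceptual hash fingerprint string."""
    g = cv2.resize(patch_rgb.mean(axis=2).astype(np.float32), (size, size))
    return (g > g.mean()).flatten()                  # boolean fingerprint bits

def build_adjacency(patches, n_neighbors=3):
    fps = np.stack([fingerprint(p) for p in patches])
    dist = (fps[:, None] != fps[None, :]).sum(-1)    # pairwise Hamming distance
    a = np.zeros_like(dist, dtype=np.float32)
    for i, row in enumerate(dist):
        for j in np.argsort(row)[1:n_neighbors + 1]: # skip self at distance 0
            a[i, j] = a[j, i] = 1.0
    return a                                         # e.g. 9x9 for A_s
```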
Subsequently, based on the GCN principle, the Einstein summation convention (einsum) is used to matrix-multiply the corresponding features X with the adjacency matrix A, aggregating the neighbor information of each node to update the node features. The feature maps G_s_s, G_s_m and G_s_l are then input into the temporal convolution network to extract features in the time dimension.
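One spatial graph-convolution step can then be sketched as follows: neighbor features are aggregated through the dynamic adjacency matrix with an einsum, and a 1x1 convolution changes the channel count, matching the description above (the row normalization is an added convention, not a stated detail).

```python
import torch
import torch.nn as nn

class SpatialGCNLayer(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1)   # 1x1 Conv, stride 1

    def forward(self, x, a):
        """x: (B, C, T, V) node features; a: (V, V) dynamic adjacency."""
        a = a / a.sum(dim=-1, keepdim=True).clamp(min=1.0)  # normalize rows
        x = torch.einsum('bctv,vw->bctw', x, a)             # aggregate neighbors
        return self.proj(x)                                 # change channels
```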
In addition, 3 convolution kernels of size 6x1 with stride 1 are applied to each feature map to extract temporal information, followed by a Max Pooling layer, giving the feature maps G_T_s, G_T_m and G_T_l, which represent features in the time dimension. After all 12 layers, the output of the last temporal convolution layer is summarized through a Global Pooling layer to obtain a 512-dimensional feature vector for each sequence, which is finally input into a fully connected layer. At this point each original patch block generates its own preliminary prediction probability over the behavior categories. A consensus function is then used to generate a consensus prediction, which is input into the SoftMax function to obtain the final human behavior prediction result. There are mainly 5 human behavior categories: running, walking, jumping, falling and standing. The graph model finally predicts the pedestrian's action as one of these 5 categories.
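The temporal branch and the final classification can be sketched as below; the 6x1 temporal convolution, Max Pooling, Global Pooling to 512 dimensions and fully connected layer follow the description, while plain averaging is assumed for the consensus function, whose exact form the passage does not fix.

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, c_in=512, n_classes=5):       # 5 behavior categories
        super().__init__()
        self.tcn = nn.Conv2d(c_in, c_in, kernel_size=(6, 1), stride=1)
        self.pool = nn.MaxPool2d(kernel_size=(2, 1))
        self.fc = nn.Linear(c_in, n_classes)

    def forward(self, x):                            # x: (B, 512, T, V)
        x = self.pool(self.tcn(x))                   # temporal features G_T
        x = x.mean(dim=(2, 3))                       # Global Pooling -> (B, 512)
        return self.fc(x)                            # per-patch-block logits

def consensus_predict(logits_per_block):
    """Average the preliminary predictions of all patch blocks, then SoftMax."""
    consensus = torch.stack(logits_per_block).mean(dim=0)
    return torch.softmax(consensus, dim=-1)          # final prediction
```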
In addition, the loss functions used by the human behavior prediction model include a detection loss function for target tracking (mainly the target detection loss), an optical flow loss function, and a graph-model-based behavior recognition loss function, where:
for the loss function of the object detection, the loss function of yolov7 is used, which is mainly composed of classification loss, localization loss and confidence loss functions, that is,
Loss=Loss classification +Loss Positioning +Loss Confidence level
For the optical flow loss function, all loop units produce outputs that form an optical flow prediction sequence Flow_0, Flow_1, Flow_2, ..., Flow_N; the total loss is the sum of the losses between the predicted output of each loop block and the ground truth. The optical flow loss function here uses the average end-point error (EPE) as the training loss, which represents the mean per-pixel Euclidean distance between the predicted flow vector and the ground truth.
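A minimal sketch of this EPE loss, summed over the outputs of all loop units, is given below (PyTorch assumed).

```python
import torch

def epe_loss(flow_preds, flow_gt):
    """flow_preds: list of (B, 2, H, W) outputs Flow_0..Flow_N; flow_gt: ground truth."""
    total = flow_preds[0].new_zeros(())
    for pred in flow_preds:
        # mean per-pixel Euclidean distance to the ground-truth flow
        total = total + torch.norm(pred - flow_gt, p=2, dim=1).mean()
    return total
```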
For the graph-model-based behavior recognition loss function, model training uses the cross-entropy classification loss.
The invention combines the instantaneous optical flow information of human behavior with long-sequence information in the time dimension, and uses a graph convolution network to learn the local body information of pedestrians in the space dimension. Combining spatial and temporal information improves prediction accuracy. Iteratively refining the output optical flow clearly distinguishes moving pedestrians from the background, and the predicted optical flow direction and speed also provide the model with supervision information for predicting the human behavior category.
Fig. 3 is a schematic block diagram of a dual-branch human behavior prediction apparatus 100 based on optical flow and graph according to an embodiment of the present invention. In correspondence to the above-mentioned method for predicting human body behavior based on optical flow and graph, the present invention further provides an apparatus 100 for predicting human body behavior based on optical flow and graph. The optical-flow and map-based dual-branch human behavior prediction apparatus 100 includes units and modules for performing the above-described optical-flow and map-based dual-branch human behavior prediction method, and may be configured in a server.
As shown in fig. 3, the apparatus 100 for predicting a two-branch human behavior based on an optical flow and a graph includes:
an acquisition unit 110 is configured to acquire image data in the detection area.
The frame-cutting processing unit 120 is configured to perform frame-cutting processing on the image data to obtain multiple frames of still pictures.
The prediction unit 130 is configured to input multiple frames of still pictures into the human behavior prediction model in the form of picture sequences to be processed, so as to obtain a human behavior prediction result.
In one embodiment, the prediction unit 130 includes:
The first processing module is used for inputting the multiple frames of still pictures into the tracking model as a picture sequence for processing to obtain a person image with an id index.
And the second processing module is used for inputting the person image with the id index into the optical flow model for processing to obtain a two-dimensional optical flow feature map.
And the feature fusion module is used for performing feature fusion on the person image with the id index and the two-dimensional optical flow feature map to obtain a fusion feature map.
And the third processing module is used for inputting the fusion characteristic diagram into the graph model for processing so as to obtain a human body behavior prediction result.
In one embodiment, the first processing module comprises:
and the detection sub-module is used for inputting the multiple frames of static pictures into the tracking model in a picture sequence mode so as to detect different human body target frames.
And the giving sub-module is used for giving index ids to the detected different human body target frames so as to obtain the person image with the id index.
In one embodiment, the second processing module comprises:
and the feature extraction sub-module is used for extracting features of the front and rear frames of character images with the id indexes by adopting a hole convolution kernel, and processing the character images through a ReLU activation layer and a Max Pooling layer to obtain a first feature map and a second feature map.
And the correlation processing submodule is used for carrying out full-pixel correlation processing on the first feature map and the second feature map to obtain a correlation feature map.
And the construction submodule is used for constructing the optical flow information of the 0 th cycle.
And the first Warp processing sub-module is used for performing a Warp operation on the second feature map using the optical flow information of the 0th cycle to obtain the Warp result of the 1st cycle.
And the first combining submodule is used for combining the result obtained by processing the id-indexed person images of the front and rear frames through the occlusion-mask module AsymOFMM with the Warp result of the 1st cycle to obtain a first combination feature.
And the first merging processing submodule is used for inputting a result obtained by merging the first merging characteristic, the first characteristic diagram and the correlation characteristic diagram into the coding network for processing so as to obtain optical flow information of the 1 st cycle.
And the second Warp processing submodule is used for performing a Warp operation on the second feature map using the optical flow information of the 1st cycle to obtain the Warp result of the 2nd cycle.
And the second combining submodule is used for combining the result obtained by processing the id-indexed person images of the front and rear frames through the occlusion-mask module AsymOFMM with the Warp result of the 2nd cycle to obtain a second combination feature.
And the second merging processing submodule is used for inputting a result obtained by merging the second merging feature with the first feature map and the correlation feature map into the coding network for processing so as to obtain optical flow information of the 2 nd cycle.
And the circulation processing submodule is used for obtaining the optical flow information of the first stage after completing all circulation according to the set circulation times.
And the centralized processing submodule is used for inputting the optical flow information of the first stage into the ConvGRU for processing so as to obtain a final two-dimensional optical flow feature map.
In one embodiment, the building submodule comprises:
and the merging submodule is used for merging the first feature map, the second feature map and the correlation feature map to obtain a merged feature map.
And the coding submodule is used for inputting the merged feature map into a coding network for processing so as to obtain optical flow information of the 0 th cycle.
In one embodiment, the feature fusion module comprises:
and the edge feature extraction submodule is used for extracting edge features from the character image with the id index by adopting a Sobel operator so as to obtain an initial edge feature graph.
And the edge sharpening submodule is used for sharpening the edge of the initial edge feature map by using Laplacian filtering and refining the edge feature map.
And the downsampling processing submodule is used for performing 3x3 convolution processing on the refined edge feature map and performing downsampling processing by using Average Pooling to obtain a downsampled feature map.
And the size adjusting sub-module is used for adjusting the downsampled feature map into an adjusted feature map with the same size as the original picture using bilinear interpolation.
And the feature fusion submodule is used for performing feature fusion on the two-dimensional optical flow feature map and the adjustment feature map to obtain a fusion feature map.
In one embodiment, the third processing module comprises:
and the splicing submodule is used for splicing the fusion characteristic graph corresponding to each frame with the original graph to obtain a spliced graph.
And the first dividing module is used for dividing the mosaic into 3 different patterns of Patch Size.
And the second division submodule is used for dividing the mosaic into corresponding quantity of equal-Size Patches according to the Size of the Patch Size so as to obtain 3 Patches blocks.
And the feature transformation submodule is used for carrying out feature transformation on the 3 Patches blocks to obtain 3 groups of Graph nodes.
And the feature transformation submodule is used for respectively carrying out similarity calculation on the 3 Patches blocks by using a perceptual hash algorithm based on the RGB information in the splicing map so as to form a dynamic adjacency matrix of the 3 Patches blocks.
And the summarizing submodule is used for inputting the dynamic adjacent matrixes of the 3 Patches blocks into the 12-layer graph convolution network and time convolution network superposition block for node learning and feature updating, and summarizing features of the 12-layer graph convolution network and time convolution network superposition block through a Global Poolling layer to obtain 512-dimensional feature vectors of each sequence.
And the fully connected processing submodule is used for inputting the 512-dimensional feature vector of each sequence into a fully connected layer to generate a preliminary prediction probability of the human behavior category for each patch block.
And the consensus prediction submodule is used for processing the generated preliminary prediction probability of the human behavior class of each patch block by adopting a consensus function to obtain consensus prediction.
And the final processing submodule is used for inputting the consensus prediction into a SoftMax function for processing so as to obtain a final human behavior prediction result.
The above-described dual-branch human behavior prediction apparatus based on optical flow and graph may be implemented in the form of a computer program, which may be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 700 may be a server, where the server may be an independent server or a server cluster composed of a plurality of servers.
As shown in fig. 4, the computer device includes a memory, a processor and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the dual-branch human behavior prediction method based on optical flow and graph as described above are implemented.
The computer device 700 may be a terminal or a server. The computer device 700 includes a processor 720, memory, and a network interface 750, which are connected by a system bus 710, where the memory may include non-volatile storage media 730 and internal memory 740.
The non-volatile storage medium 730 may store an operating system 731 and computer programs 732. The computer program 732, when executed, causes the processor 720 to perform any of the dual-branch human behavior prediction methods based on optical flow and maps.
The processor 720 is used to provide computing and control capabilities, supporting the operation of the overall computer device 700.
The internal memory 740 provides an environment for the execution of the computer program 732 in the non-volatile storage medium 730, and when the computer program 732 is executed by the processor 720, the processor 720 can execute any method for predicting the dual-branch human behavior based on the optical flow and the graph.
The network interface 750 is used for network communication such as sending assigned tasks and the like. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with aspects of the present application, and is not intended to limit the computing device 700 to which aspects of the present application may be applied, and that a particular computing device 700 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components. Wherein the processor 720 is configured to execute the program code stored in the memory to perform the following steps:
the double-branch human body behavior prediction method based on the optical flow and the graph comprises the following steps:
acquiring image data in a detection area;
performing frame cutting processing on the image data to obtain multiple frames of static pictures;
and inputting the multiple frames of static pictures into the human behavior prediction model in a picture sequence mode for processing to obtain a human behavior prediction result.
In one embodiment: the method for processing the multi-frame static pictures input into the human behavior prediction model in the form of picture sequences to obtain the human behavior prediction result comprises the following steps:
inputting a plurality of static pictures into a tracking model in a picture sequence mode for processing to obtain a character image with an id index;
inputting the character image with the id index into an optical flow model for processing to obtain a two-dimensional optical flow characteristic diagram;
carrying out feature fusion on the character image with the id index and the two-dimensional optical flow feature map to obtain a fusion feature map;
and inputting the fusion characteristic graph into a graph model for processing to obtain a human behavior prediction result.
In one embodiment: the method for inputting the multiple frames of static pictures into the tracking model in the form of picture sequences to be processed so as to obtain the person image with the id index comprises the following steps:
inputting a plurality of static pictures into a tracking model in a picture sequence mode to detect different human body target frames;
and giving indexes id to the detected different human body target frames to obtain the person image with the id index.
In one embodiment: inputting the person images with id indexes into the optical flow model for processing to obtain the two-dimensional optical flow feature map comprises the following steps:
performing feature extraction on the person images with id indexes of two adjacent frames using dilated convolution kernels, and processing the results through a ReLU activation layer and a Max Pooling layer to obtain a first feature map and a second feature map;
computing the all-pairs pixel correlation between the first feature map and the second feature map to obtain a correlation feature map;
constructing the optical flow information of the 0th cycle;
performing a Warp operation on the second feature map using the optical flow information of the 0th cycle to obtain the Warp result of the 1st cycle;
combining the occlusion-mask result obtained by AsymOFMM processing of the two adjacent person images with id indexes with the Warp result of the 1st cycle to obtain a first combined feature;
inputting the result of merging the first combined feature with the first feature map and the correlation feature map into an encoding network for processing to obtain the optical flow information of the 1st cycle;
performing a Warp operation on the second feature map using the optical flow information of the 1st cycle to obtain the Warp result of the 2nd cycle;
combining the occlusion-mask result obtained by AsymOFMM processing of the two adjacent person images with id indexes with the Warp result of the 2nd cycle to obtain a second combined feature;
inputting the result of merging the second combined feature with the first feature map and the correlation feature map into the encoding network for processing to obtain the optical flow information of the 2nd cycle;
repeating in this way until the set number of cycles is completed, thereby obtaining the first-stage optical flow information;
and inputting the first-stage optical flow information into a ConvGRU for processing to obtain the final two-dimensional optical flow feature map (a sketch of the warp-and-encode loop follows these steps).
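By way of illustration only, the warp-and-encode loop may be sketched in PyTorch as follows; the correlation feature map corr, the AsymOFMM occlusion-mask feature mask_feat and the two encoding networks are treated as given tensors/modules, since their internals are not reproduced here:

import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp a feature map feat (N,C,H,W) with a flow field (N,2,H,W)."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys)).float()           # (2,H,W), x channel first
    coords = base.unsqueeze(0) + flow              # absolute sampling positions
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0  # normalise to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)           # (N,H,W,2) for grid_sample
    return F.grid_sample(feat, grid, align_corners=True)

def refine_flow(feat1, feat2, corr, mask_feat, init_encoder, encoder, n_cycles):
    """init_encoder builds the 0th-cycle flow; encoder refines it every cycle."""
    flow = init_encoder(torch.cat([feat1, feat2, corr], dim=1))  # 0th cycle
    for _ in range(n_cycles):
        warped = warp(feat2, flow)                 # Warp result of the next cycle
        merged = torch.cat([mask_feat, warped, feat1, corr], dim=1)
        flow = encoder(merged)                     # optical flow of that cycle
    return flow  # first-stage optical flow, subsequently refined by the ConvGRU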
In one embodiment: constructing the optical flow information of the 0th cycle comprises the following steps:
merging the first feature map, the second feature map and the correlation feature map to obtain a merged feature map;
and inputting the merged feature map into the encoding network for processing to obtain the optical flow information of the 0th cycle.
In one embodiment: performing feature fusion on the person images with id indexes and the two-dimensional optical flow feature map to obtain the fusion feature map comprises the following steps:
extracting edge features from the person image with an id index using the Sobel operator to obtain an initial edge feature map;
performing edge sharpening on the initial edge feature map using Laplacian filtering to obtain a refined edge feature map;
applying a 3x3 convolution to the refined edge feature map, and then downsampling it with Average Pooling to obtain a downsampled feature map;
resizing the downsampled feature map to the size of the original image using bilinear interpolation to obtain an adjusted feature map;
and performing feature fusion on the two-dimensional optical flow feature map and the adjusted feature map to obtain the fusion feature map (a sketch of this edge branch follows these steps).
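By way of illustration only, the edge branch may be sketched with OpenCV as follows; the additive Laplacian sharpening, the box kernel for the 3x3 convolution, the concatenation-style fusion, and the assumption that flow_feat is an H x W x C array are all choices of this sketch:

import cv2
import numpy as np

def edge_fusion_branch(person_img, flow_feat):
    """Sobel -> Laplacian sharpening -> 3x3 conv -> pooling -> bilinear resize -> fuse."""
    gray = cv2.cvtColor(person_img, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    edges = cv2.magnitude(gx, gy)                      # initial edge feature map
    sharp = edges - cv2.Laplacian(edges, cv2.CV_32F)   # Laplacian edge sharpening
    conv = cv2.filter2D(sharp, -1, np.ones((3, 3), np.float32) / 9.0)  # 3x3 conv
    h, w = conv.shape
    down = cv2.resize(conv, (w // 2, h // 2),
                      interpolation=cv2.INTER_AREA)    # average-pooling-like downsampling
    up = cv2.resize(down, (flow_feat.shape[1], flow_feat.shape[0]),
                    interpolation=cv2.INTER_LINEAR)    # bilinear resize to flow map size
    return np.concatenate([flow_feat, up[..., None]], axis=-1)  # fusion feature map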
In one embodiment: inputting the fusion feature map into the graph model for processing to obtain the human behavior prediction result comprises the following steps:
stitching the fusion feature map corresponding to each frame with the original image to obtain a stitched map;
setting 3 different Patch sizes for the stitched map;
dividing the stitched map into the corresponding number of equal-size Patches according to each Patch size to obtain 3 groups of Patch blocks;
performing feature transformation on the 3 groups of Patch blocks to obtain 3 groups of Graph nodes;
computing the similarity within each of the 3 groups of Patch blocks using a perceptual hash algorithm based on the RGB information of the stitched map, so as to form a dynamic adjacency matrix for each group;
inputting the dynamic adjacency matrices of the 3 groups of Patch blocks into 12 stacked graph convolution network and temporal convolution network blocks for node learning and feature updating, and aggregating the features of the 12th block through a Global Pooling layer to obtain a 512-dimensional feature vector for each sequence;
inputting the 512-dimensional feature vector of each sequence into a fully connected layer to generate a preliminary prediction probability of the human behavior category for each Patch block;
processing the generated preliminary prediction probabilities of the human behavior categories of the Patch blocks with a consensus function to obtain a consensus prediction;
and inputting the consensus prediction into a SoftMax function for processing to obtain the final human behavior prediction result (an illustrative adjacency construction follows these steps).
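By way of illustration only, the dynamic adjacency construction may be sketched with a standard perceptual hash; the 8x8 DCT fingerprint and the choice of similarity as 1 minus the normalised Hamming distance are assumptions of this sketch, the application naming only the perceptual hash algorithm:

import cv2
import numpy as np

def phash(patch, hash_size=8):
    """Perceptual hash of an RGB patch: sign pattern of the low-frequency DCT."""
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (32, 32)).astype(np.float32)
    dct = cv2.dct(small)[:hash_size, :hash_size]  # keep the low frequencies
    return (dct > np.median(dct)).flatten()       # binary fingerprint

def dynamic_adjacency(patch_blocks):
    """Edge weight = 1 - normalised Hamming distance between patch hashes."""
    hashes = [phash(p) for p in patch_blocks]
    n = len(hashes)
    adj = np.zeros((n, n), np.float32)
    for i in range(n):
        for j in range(n):
            adj[i, j] = 1.0 - float(np.mean(hashes[i] != hashes[j]))
    return adj  # one dynamic adjacency matrix per group of Patch blocks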
It should be understood that, in the embodiments of the present application, the processor 720 may be a Central Processing Unit (CPU); the processor 720 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those skilled in the art will appreciate that the configuration of the computer device 700 depicted in FIG. 4 is not intended to limit the computer device 700, which may include more or fewer components than those shown, combine certain components, or arrange the components differently.
In another embodiment of the present invention, a computer-readable storage medium is provided. The computer readable storage medium may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the optical flow and graph-based dual-branch human behavior prediction method disclosed by the embodiments of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and other divisions are possible in actual implementation; units having the same function may be grouped into one unit; a plurality of units or components may be combined or integrated into another system; and some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program code.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. The double-branch human body behavior prediction method based on the optical flow and the graph is characterized by comprising the following steps of:
acquiring image data in a detection area;
performing frame cutting processing on the image data to obtain a plurality of frames of static pictures;
and inputting the multi-frame static pictures into a human behavior prediction model in a picture sequence mode for processing to obtain a human behavior prediction result.
2. The optical flow and graph-based dual-branch human behavior prediction method of claim 1, wherein the step of inputting multiple frames of still pictures into the human behavior prediction model in the form of a picture sequence for processing to obtain the human behavior prediction result comprises:
inputting the multiple frames of still pictures into a tracking model in the form of a picture sequence for processing to obtain person images with id indexes;
inputting the person images with id indexes into an optical flow model for processing to obtain a two-dimensional optical flow feature map;
performing feature fusion on the person images with id indexes and the two-dimensional optical flow feature map to obtain a fusion feature map;
and inputting the fusion feature map into a graph model for processing to obtain a human behavior prediction result.
3. The optical flow and graph-based dual-branch human behavior prediction method of claim 2, wherein the step of inputting multiple frames of still pictures into the tracking model in the form of a picture sequence for processing to obtain person images with id indexes comprises:
inputting the multiple frames of still pictures into the tracking model in the form of a picture sequence to detect the different human body target frames;
and assigning an index id to each detected human body target frame to obtain the person images with id indexes.
4. The optical flow and graph-based dual-branch human behavior prediction method according to claim 2, wherein the step of inputting the person images with id indexes into an optical flow model for processing to obtain a two-dimensional optical flow feature map comprises:
performing feature extraction on the person images with id indexes of two adjacent frames using dilated convolution kernels, and processing the results through a ReLU activation layer and a Max Pooling layer to obtain a first feature map and a second feature map;
computing the all-pairs pixel correlation between the first feature map and the second feature map to obtain a correlation feature map;
constructing the optical flow information of the 0th cycle;
performing a Warp operation on the second feature map using the optical flow information of the 0th cycle to obtain the Warp result of the 1st cycle;
combining the occlusion-mask result obtained by AsymOFMM processing of the two adjacent person images with id indexes with the Warp result of the 1st cycle to obtain a first combined feature;
inputting the result of merging the first combined feature with the first feature map and the correlation feature map into an encoding network for processing to obtain the optical flow information of the 1st cycle;
performing a Warp operation on the second feature map using the optical flow information of the 1st cycle to obtain the Warp result of the 2nd cycle;
combining the occlusion-mask result obtained by AsymOFMM processing of the two adjacent person images with id indexes with the Warp result of the 2nd cycle to obtain a second combined feature;
inputting the result of merging the second combined feature with the first feature map and the correlation feature map into the encoding network for processing to obtain the optical flow information of the 2nd cycle;
repeating in this way until the set number of cycles is completed, thereby obtaining the first-stage optical flow information;
and inputting the first-stage optical flow information into a ConvGRU for processing to obtain the final two-dimensional optical flow feature map.
5. The optical flow and graph-based dual-branch human behavior prediction method according to claim 4, wherein the constructing the optical flow information of the 0th cycle comprises:
merging the first feature map, the second feature map and the correlation feature map to obtain a merged feature map;
and inputting the merged feature map into the encoding network for processing to obtain the optical flow information of the 0th cycle.
6. The optical flow and graph-based dual-branch human behavior prediction method according to claim 2, wherein the performing feature fusion on the person images with id indexes and the two-dimensional optical flow feature map to obtain a fusion feature map comprises:
extracting edge features from the person image with an id index using the Sobel operator to obtain an initial edge feature map;
performing edge sharpening on the initial edge feature map using Laplacian filtering to obtain a refined edge feature map;
applying a 3x3 convolution to the refined edge feature map, and then downsampling it with Average Pooling to obtain a downsampled feature map;
resizing the downsampled feature map to the size of the original image using bilinear interpolation to obtain an adjusted feature map;
and performing feature fusion on the two-dimensional optical flow feature map and the adjusted feature map to obtain the fusion feature map.
7. The optical flow and graph-based dual-branch human behavior prediction method according to claim 2, wherein the inputting the fusion feature map into a graph model for processing to obtain a human behavior prediction result comprises:
stitching the fusion feature map corresponding to each frame with the original image to obtain a stitched map;
setting 3 different Patch sizes for the stitched map;
dividing the stitched map into the corresponding number of equal-size Patches according to each Patch size to obtain 3 groups of Patch blocks;
performing feature transformation on the 3 groups of Patch blocks to obtain 3 groups of Graph nodes;
computing the similarity within each of the 3 groups of Patch blocks using a perceptual hash algorithm based on the RGB information of the stitched map, so as to form a dynamic adjacency matrix for each group;
inputting the dynamic adjacency matrices of the 3 groups of Patch blocks into 12 stacked graph convolution network and temporal convolution network blocks for node learning and feature updating, and aggregating the features of the 12th block through a Global Pooling layer to obtain a 512-dimensional feature vector for each sequence;
inputting the 512-dimensional feature vector of each sequence into a fully connected layer to generate a preliminary prediction probability of the human behavior category for each Patch block;
processing the generated preliminary prediction probabilities of the human behavior categories of the Patch blocks with a consensus function to obtain a consensus prediction;
and inputting the consensus prediction into a SoftMax function for processing to obtain the final human behavior prediction result.
8. A double-branch human body behavior prediction device based on optical flow and graph is characterized by comprising:
an acquisition unit configured to acquire image data in a detection area;
the frame cutting processing unit is used for carrying out frame cutting processing on the image data to obtain multiple frames of static pictures;
and the prediction unit is used for inputting the multi-frame static pictures into the human behavior prediction model in the form of picture sequences for processing so as to obtain a human behavior prediction result.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the optical flow and graph-based dual-branch human behavior prediction method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the steps of the optical flow and graph-based dual-branch human behavior prediction method of any one of claims 1 to 7.
CN202211012592.XA 2022-08-23 2022-08-23 Double-branch human body behavior prediction method, device and equipment based on optical flow and graph Pending CN115346275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211012592.XA CN115346275A (en) 2022-08-23 2022-08-23 Double-branch human body behavior prediction method, device and equipment based on optical flow and graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211012592.XA CN115346275A (en) 2022-08-23 2022-08-23 Double-branch human body behavior prediction method, device and equipment based on optical flow and graph

Publications (1)

Publication Number Publication Date
CN115346275A true CN115346275A (en) 2022-11-15

Family

ID=83953181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211012592.XA Pending CN115346275A (en) 2022-08-23 2022-08-23 Double-branch human body behavior prediction method, device and equipment based on optical flow and graph

Country Status (1)

Country Link
CN (1) CN115346275A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935492A (en) * 2023-08-07 2023-10-24 北京邮电大学 Human body action prediction method and device based on graph relation interactive learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination