CN114821766A - Behavior identification method based on space-time convolution and time sequence feature fusion - Google Patents

Behavior identification method based on space-time convolution and time sequence feature fusion

Info

Publication number
CN114821766A
CN114821766A (application CN202210229686.6A)
Authority
CN
China
Prior art keywords
features
motion
time
convolution
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210229686.6A
Other languages
Chinese (zh)
Inventor
李宏亮
黄俊强
董建伟
盛一航
任子奕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210229686.6A
Publication of CN114821766A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention provides a behavior recognition method based on spatio-temporal convolution and time-series feature fusion, addressing the problem of insufficient feature extraction when behavior recognition relies on a single data modality. The method first obtains video stream data and motion data from an inertial sensor, then extracts high-level spatio-temporal semantic features from the video stream through spatio-temporal convolution; in parallel, deep time-series motion features are extracted from the motion data stream. The high-level spatio-temporal semantic features and the deep motion features are fused, the fused features are mapped to output values by a multi-layer perceptron (MLP), and Softmax applied to the output values completes behavior recognition and classification. Because recognition is based on fusing the two kinds of features, the method overcomes the information shortage of mainstream algorithms that rely on a single feature; a self-attention module captures the motion features of key moments, improving the network's accuracy in recognizing abnormal and sudden behaviors.

Description

Behavior identification method based on space-time convolution and time sequence feature fusion
Technical Field
The invention relates to a multi-modal feature-fusion behavior recognition technique and belongs to the field of deep learning.
Background
With the development of electronic and computer technologies, smart wearable devices are becoming steadily more intelligent and practical. Several high-tech companies have released portable smart glasses, such as Apple's and Google's smart glasses. These devices let people record daily life from a first-person perspective and log daily activity data through inertial sensors. Such data has potential value that is hard to estimate: it can be used to improve quality of life and, against the background of global aging, ease the social pressure on elderly people who lack care and companionship. Behavior recognition, a hot topic in artificial intelligence, can record and recognize behaviors and warn of abnormal ones, serving both nursing and emergency-alert purposes.
At present, behavior recognition algorithms based on deep learning and neural networks are widely used. The 3D convolutional neural network based on spatio-temporal convolution is an important branch that extracts features from video, while using a recurrent neural network (RNN) to extract motion features from inertial sensor data is another branch of behavior recognition.
Video data is complex multi-dimensional data with both spatial and temporal dimensions; the high-level abstract information of daily actions is often contained in its temporal structure, and video records both the motion background and the overall action of the human body. Using a spatio-temporal 3D convolutional network, deep semantic information can be extracted from the video stream without losing the temporal information of the motion. A 3D convolutional neural network is usually an expansion of a 2D convolutional network structure such as ResNet or Inception with an added time dimension, which improves its ability to capture the temporal characteristics of motion. Inertial sensor data comprises the angles and accelerations along the mover's three axes, collected by a gyroscope and an accelerometer; it forms continuous T x 6 time-series information that emphasizes the changes in the physical quantities of limb movement. A recurrent neural network (RNN) can extract the temporal change features of an action and thereby recognize different limb movements.
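For concreteness, the following minimal sketch (assuming PyTorch) illustrates the two data forms discussed above: a video clip shaped (batch, channels, frames, height, width) passed through a 3D convolution, and a T x 6 inertial sequence passed through a recurrent layer. The tensor sizes and layer widths are illustrative assumptions, not values taken from the invention.

```python
# Minimal sketch (assumed PyTorch): a 3D convolution over a video clip and an
# LSTM over a T x 6 inertial sequence. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

video = torch.randn(1, 3, 16, 112, 112)      # (batch, channels, frames, H, W)
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
print(conv3d(video).shape)                    # torch.Size([1, 64, 16, 112, 112])

imu = torch.randn(1, 50, 6)                   # (batch, T, 6): gyroscope + accelerometer
rnn = nn.LSTM(input_size=6, hidden_size=128, batch_first=True)
out, _ = rnn(imu)
print(out.shape)                              # torch.Size([1, 50, 128])
```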
In current mainstream algorithms, behavior recognition is completed by extracting video stream features with a 3D convolutional network alone, and the useless background information contained in a video is an obstacle to extracting action features. Given current graphics-card computing power, a 3D convolutional network can only sample 16 or 32 frames from the thousands of frames in a video, so it cannot cover all moments, and the temporal information of an action cannot be completely extracted from a long video, which increases the difficulty of behavior recognition. A few studies use an RNN to extract action time-series change features from inertial sensor data and complete action recognition, but inertial sensor data contains only the changes in the physical quantities of limb movement without any background information, so actions with similar variation and rhythm are difficult to distinguish well.
Disclosure of Invention
The technical problem the invention aims to solve is insufficient feature extraction when behavior recognition relies on a single data modality. It provides a method that jointly extracts motion features and useful background features from video data and inertial sensor data, using a neural network of hybrid structure to improve the accuracy of behavior recognition.
The technical solution adopted by the invention to solve this problem is as follows: a behavior identification method based on space-time convolution and time-series feature fusion, comprising the following steps:
1) acquiring video stream data and a motion data stream from an inertial sensor;
2) extracting global spatial features of the frame images from the video stream data, sending them to a pooling layer for feature compression, and feeding the compressed global spatial features into a 3D convolutional network to extract high-level spatio-temporal semantic features based on spatio-temporal convolution; meanwhile, feeding the motion data stream into a two-layer bidirectional BiLSTM, combining the hidden-layer features at all time steps to extract limb motion features, inputting the limb motion features into a two-head self-attention mechanism to strengthen the motion information of key moments with weights, and outputting deep time-series motion features after a fully connected feed-forward network and normalization;
3) fusing the high-level spatio-temporal semantic features and the deep motion features to obtain fused features, inputting the fused features into a multi-layer perceptron (MLP) for mapping to obtain output values, and completing behavior recognition and classification by applying Softmax to the output values.
The advantage of the method is that high-level spatio-temporal semantic features and deep motion features are jointly extracted from the video stream data and the inertial sensor stream data by the spatio-temporal convolutional network and the time-series recurrent network; behavior recognition is completed by fusing the two kinds of features, which overcomes the information shortage of mainstream algorithms that rely on a single feature, while the self-attention module captures the motion features of key moments, improving the network's accuracy in recognizing abnormal and sudden behaviors.
Drawings
FIG. 1 is a flow chart of an example embodiment;
FIG. 2 is a schematic diagram of a 3D convolution module;
FIG. 3 is a process diagram of a fusion module.
Detailed Description
The embodiment is implemented mainly on a Linux platform, and network training is completed on a TITAN X graphics card. A behavior recognition dataset mixing video and inertial sensor data needs to be constructed: the invention adopts smart glasses, develops a program that remotely acquires the video data and inertial sensor data over Socket network connections, and finally records a daily-behavior dataset in a head-mounted manner.
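A minimal sketch of the Socket-based remote acquisition idea follows. The glasses-side address, port, and message protocol are hypothetical assumptions; the embodiment states only that a Socket-based program remotely acquires the video and inertial sensor data.

```python
# Minimal sketch of a Socket client that requests one inertial sample from the
# glasses. HOST, PORT and the "GET_IMU" request format are hypothetical.
import socket

HOST, PORT = "192.168.1.50", 9000   # hypothetical address of the smart glasses

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((HOST, PORT))
    s.sendall(b"GET_IMU\n")          # hypothetical request for one T x 6 sample row
    data = s.recv(1024)              # e.g. "gx,gy,gz,ax,ay,az" as bytes
    print(data.decode().strip())
```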
The implementation of behavior recognition mainly comprises 3 steps:
1. Down-sample, crop, and apply data augmentation to the input video to obtain the video stream data; meanwhile, filter, remove outliers from, and normalize the input inertial sensor data to obtain the motion data stream.
2. Extract global spatial features of the frame images from the video stream data through a 3D convolution, send them to a pooling layer for feature compression, and feed the compressed global spatial features into a 3D convolutional network to extract high-level spatio-temporal semantic features; meanwhile, feed the motion data stream into a vertically stacked two-layer bidirectional BiLSTM, combine the motion changes at all time steps, and extract limb motion features from its hidden layers; send these features into a two-head self-attention module to strengthen the motion information of key moments with weights, pass them through a fully connected feed-forward network (FFN), which makes attention training on long sequences more stable, and finally output deep time-series motion features through LayerNorm layer normalization.
3. Fuse the high-level spatio-temporal semantic features and the deep motion features, input the fused features into a multi-layer perceptron (MLP) to complete feature classification, and finally recognize the behavior through Softmax.
The network of this embodiment, shown in FIG. 1, mainly comprises the following parts: a video branch network that extracts deep spatio-temporal semantic features from the video, a motion-sensor branch network that extracts action time-series change features, and a fusion network module that fuses the two branch features; behavior recognition training is finally completed on the combined feature map.
The video branch network is implemented as follows:
the first step is as follows: sampling 32 frames of images from a video through a random frame sampling algorithm, clipping the images to 224x224 size, wherein the clipping method comprises center clipping, random aspect ratio clipping, and then randomly turning the images horizontally, vertically and randomly.
The second step: input the resulting data stream into a 64-channel 1x7x7 convolution to obtain global spatial features, with a convolution stride of 1 in the time dimension and 2x2 in the spatial dimensions; then input the global spatial features into a 1x3x3 max pooling layer for feature compression, yielding the compressed global spatial features.
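A minimal sketch of this stem with the stated kernel, stride, and pooling sizes; the padding values and the BatchNorm/ReLU layers are conventional assumptions not specified in the text.

```python
# Minimal sketch of the stem: 64-channel 1x7x7 convolution with stride (1, 2, 2),
# then 1x3x3 max pooling. Padding, BatchNorm and ReLU are assumptions.
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
)

clip = torch.randn(1, 3, 32, 224, 224)   # (batch, C, T, H, W)
print(stem(clip).shape)                   # torch.Size([1, 64, 32, 56, 56])
```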
The third step: sending the compressed global space features into a 3D convolutional network to obtain high-level space-time semantic features, wherein the 3D convolutional network can be a 3D Resnet or a 3D increment junctionStructure of the organization. In this embodiment, 4 3D residual structure Rsenet group modules are adopted, and one 3D Rsenet group module is a 3D residual structure formed by 1x1x1 convolution, 1x3x3 convolution and 1x1x1 convolution as shown in fig. 2. And taking the high-level space-time semantic features as video path features.
The motion sensor branch network is implemented as follows:
the first step is as follows: and (5) processing the gyroscope and acceleration sensor data of T x 6 by filtering, and sampling noise and abnormal values by the sensor.
The second step: send the filtered motion data stream into a vertically stacked two-layer bidirectional BiLSTM with a hidden-layer feature dimension of 256, and extract the limb motion features from the hidden layer by combining the hidden-layer features at all time steps.
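A minimal sketch of the two-layer bidirectional LSTM with hidden size 256; the per-time-step outputs (forward and backward states concatenated, 512 dimensions) serve as the limb motion features, and the sequence length is illustrative.

```python
# Minimal sketch of the two-layer bidirectional LSTM over the cleaned sensor stream.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=6, hidden_size=256, num_layers=2,
                 batch_first=True, bidirectional=True)

imu = torch.randn(1, 200, 6)          # (batch, T, 6) filtered motion data stream
motion_feats, _ = bilstm(imu)         # (batch, T, 512): 256 per direction
print(motion_feats.shape)
```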
The third step: as shown in fig. 2, the obtained limb movement characteristics are sent to a double-head Self-attention module, the weighted limb movement characteristics at the key moment are scored, and then a fully-connected feedforward network FFN is passed by following a residual error structure. The network is more stable to long sequence attention training, and deep motion characteristics are output through LayerNorm layer normalization. The deep motion features are used as sensor path features.
Wherein the attention weights are computed with the scaled dot-product formula
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V,
where Q, K and V are the query, key and value matrices projected from the limb motion features and d_k is the key dimension.
The calculated weight score matrix strengthens the ability to capture the limb motion features of key moments.
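A minimal sketch of the two-head self-attention block with the residual FFN and final LayerNorm described above, built on the scaled dot-product weights; the feature dimension (512, matching the concatenated bidirectional hidden states) and the FFN width are assumptions.

```python
# Minimal sketch: two-head self-attention with residual connections, an FFN, and
# a final LayerNorm producing the deep motion features.
import torch
import torch.nn as nn

class TwoHeadAttentionBlock(nn.Module):
    def __init__(self, dim=512, ffn_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=2, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, T, dim) BiLSTM features
        a, _ = self.attn(x, x, x)      # scaled dot-product weights emphasize key moments
        x = x + a                      # residual connection around the attention
        x = x + self.ffn(x)            # residual fully connected feed-forward network
        return self.norm(x)            # LayerNorm outputs the deep motion features

feats = torch.randn(1, 200, 512)
print(TwoHeadAttentionBlock()(feats).shape)   # torch.Size([1, 200, 512])
```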
The fusion network module is implemented as shown in FIG. 3: the feature dimensions of the video-path features and the sensor-path features are matched through separate 1x1 convolutions, the two feature maps are fused through an embedding method, the fused features are mapped to output values by a multi-layer perceptron (MLP) network, and the behavior recognition result is finally generated by applying Softmax to the output values.
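A minimal sketch of this fusion head; the channel widths, the global temporal pooling, the concatenation used as the embedding-style fusion, and the class count are all assumptions.

```python
# Minimal sketch of the fusion module: 1x1 convolutions match dimensions, the two
# paths are concatenated, and an MLP with Softmax yields class probabilities.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, video_ch=2048, sensor_ch=512, dim=512, num_classes=10):
        super().__init__()
        self.video_proj = nn.Conv1d(video_ch, dim, kernel_size=1)    # 1x1 conv, video path
        self.sensor_proj = nn.Conv1d(sensor_ch, dim, kernel_size=1)  # 1x1 conv, sensor path
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, v, s):                   # v: (B, video_ch, Tv), s: (B, sensor_ch, Ts)
        v = self.video_proj(v).mean(dim=2)     # global pooling over time -> (B, dim)
        s = self.sensor_proj(s).mean(dim=2)
        logits = self.mlp(torch.cat([v, s], dim=1))
        return torch.softmax(logits, dim=1)    # behavior class probabilities

v = torch.randn(1, 2048, 8)      # video-path features
s = torch.randn(1, 512, 200)     # sensor-path features
print(FusionHead()(v, s).shape)  # torch.Size([1, 10])
```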
In this embodiment, the Adam gradient descent method is adopted for updating and training the network parameters, and the learning rate is first increased and then decreased following a cosine schedule to adjust the parameter update step.
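A minimal sketch of this schedule using Adam with a linear warm-up followed by cosine decay; the warm-up length, base learning rate, and epoch count are assumptions.

```python
# Minimal sketch: Adam optimizer with a learning rate that warms up linearly and
# then follows a cosine decay. Call scheduler.step() once per epoch.
import math
import torch

model = torch.nn.Linear(10, 5)                     # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup, total = 5, 100                             # epochs (assumed values)
def lr_lambda(epoch):
    if epoch < warmup:
        return (epoch + 1) / warmup                # linear warm-up
    t = (epoch - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * t))       # cosine decay toward zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```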

Claims (5)

1. A behavior identification method based on space-time convolution and time series feature fusion is characterized by comprising the following steps:
1) acquiring video stream data and a motion data stream from an inertial sensor;
2) extracting global spatial features of the frame images from the video stream data, sending them to a pooling layer for feature compression, and feeding the compressed global spatial features into a 3D convolutional network to extract high-level spatio-temporal semantic features based on space-time convolution; meanwhile, feeding the motion data stream into a two-layer bidirectional BiLSTM, combining the hidden-layer features at all time steps to extract limb motion features, inputting the limb motion features into a two-head self-attention mechanism to strengthen the motion information of key moments with weights, and outputting deep time-series motion features after a fully connected feed-forward network and normalization;
3) fusing the high-level spatio-temporal semantic features and the deep motion features to obtain fused features, inputting the fused features into a multi-layer perceptron (MLP) for mapping to obtain output values, and completing behavior recognition and classification by applying Softmax to the output values.
2. The method of claim 1, wherein the video stream data is obtained by down-sampling, cropping, and data augmentation of the input video;
and the motion data stream is obtained by filtering, outlier removal, and normalization of the input inertial sensor data.
3. The method of claim 1, wherein the global spatial features of the frame images are extracted by convolving the video stream data with a 1x7x7 convolution;
and the global spatial features are fed into a 1x3x3 max pooling layer for feature compression.
4. The method of claim 1, wherein the 3D convolutional network takes the form of four sequentially cascaded 3D residual structure group blocks.
5. The method of claim 4, wherein one 3D residual structure group block is a 3D residual structure consisting of 1x1x1 convolution, 1x3x3 convolution and 1x1x1 convolution.
CN202210229686.6A 2022-03-10 2022-03-10 Behavior identification method based on space-time convolution and time sequence feature fusion Pending CN114821766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210229686.6A CN114821766A (en) 2022-03-10 2022-03-10 Behavior identification method based on space-time convolution and time sequence feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210229686.6A CN114821766A (en) 2022-03-10 2022-03-10 Behavior identification method based on space-time convolution and time sequence feature fusion

Publications (1)

Publication Number Publication Date
CN114821766A true CN114821766A (en) 2022-07-29

Family

ID=82529387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210229686.6A Pending CN114821766A (en) 2022-03-10 2022-03-10 Behavior identification method based on space-time convolution and time sequence feature fusion

Country Status (1)

Country Link
CN (1) CN114821766A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291707A (en) * 2020-02-24 2020-06-16 南京甄视智能科技有限公司 Abnormal behavior identification method and device, storage medium and server
CN111680660A (en) * 2020-06-17 2020-09-18 郑州大学 Human behavior detection method based on multi-source heterogeneous data stream
CN113627326A (en) * 2021-08-10 2021-11-09 国网福建省电力有限公司营销服务中心 Behavior identification method based on wearable device and human skeleton
CN113691542A (en) * 2021-08-25 2021-11-23 中南林业科技大学 Web attack detection method based on HTTP request text and related equipment
CN113743362A (en) * 2021-09-17 2021-12-03 平安医疗健康管理股份有限公司 Method for correcting training action in real time based on deep learning and related equipment thereof
CN113869189A (en) * 2021-09-24 2021-12-31 华中科技大学 Human behavior recognition method, system, device and medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912645A (en) * 2023-03-29 2024-04-19 安徽医科大学第一附属医院 Blood preservation whole-flow supervision method and system based on Internet of things

Similar Documents

Publication Publication Date Title
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
Reddy et al. Spontaneous facial micro-expression recognition using 3D spatiotemporal convolutional neural networks
Gao et al. Human action monitoring for healthcare based on deep learning
CN111539389B (en) Face anti-counterfeiting recognition method, device, equipment and storage medium
CN112818931A (en) Multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
US20200311962A1 (en) Deep learning based tattoo detection system with optimized data labeling for offline and real-time processing
CN111488805A (en) Video behavior identification method based on saliency feature extraction
CN111797702A (en) Face counterfeit video detection method based on spatial local binary pattern and optical flow gradient
CN116311525A (en) Video behavior recognition method based on cross-modal fusion
CN113627256A (en) Method and system for detecting counterfeit video based on blink synchronization and binocular movement detection
Li et al. Dynamic long short-term memory network for skeleton-based gait recognition
CN113673308A (en) Object identification method, device and electronic system
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network
CN114821766A (en) Behavior identification method based on space-time convolution and time sequence feature fusion
CN113626785B (en) Fingerprint authentication security enhancement method and system based on user fingerprint pressing behavior
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network
CN115731620A (en) Method for detecting counter attack and method for training counter attack detection model
CN113205044B (en) Deep fake video detection method based on characterization contrast prediction learning
Gu et al. Depth MHI based deep learning model for human action recognition
CN115205966A (en) Space-time Transformer action recognition method for sign language recognition
Deshpande et al. Abnormal Activity Recognition with Residual Attention-based ConvLSTM Architecture for Video Surveillance.
CN114360034A (en) Method, system and equipment for detecting deeply forged human face based on triplet network
Veerashetty et al. Texture-based face recognition using grasshopper optimization algorithm and deep convolutional neural network
CN115708135A (en) Face recognition model processing method, face recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination