WO2023087420A1 - Method and system for human action recognition on an airport apron based on thermal infrared vision - Google Patents

Method and system for human action recognition on an airport apron based on thermal infrared vision

Info

Publication number
WO2023087420A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
action recognition
target frame
frame
Prior art date
Application number
PCT/CN2021/135634
Other languages
English (en)
French (fr)
Inventor
丁萌
丁圆圆
孔祥浩
徐一鸣
吴仪
卢威
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Publication of WO2023087420A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The invention relates to the technical field of intelligent video surveillance, and in particular to a method and system for human action recognition on an airport apron based on thermal infrared vision.
  • Thermal infrared (TIR) cameras are used in place of visible-light cameras to receive the thermal radiation emitted by different objects and convert the temperature differences of objects into the brightness values of image pixels, capturing activities on the airport apron under low-visibility conditions.
  • The inherent defects of infrared images, such as blurred edges, a low signal-to-noise ratio, and the lack of color and texture information, make action recognition based on infrared image sequences more challenging.
  • The object of the present invention is to provide a method and system for human action recognition on an airport apron based on thermal infrared vision that improves recognition accuracy.
  • To this end, the present invention provides a method for human action recognition on an airport apron based on thermal infrared vision, including:
  • the target tracking result includes the position information of the target frame labeled image in each frame;
  • the position information of the target frame labeled image is added to the enlarged region of the target frame to obtain a three-channel sub-image;
  • the three-channel sub-image includes an abscissa channel image, an ordinate channel image, and the image corresponding to the enlarged region of the target frame;
  • each of the three-channel sub-images forms a sequence of three-channel sub-images in chronological order;
  • the action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network, the output of the spatial feature extraction network being connected to the input of the spatiotemporal feature extraction network;
  • the spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers;
  • the spatiotemporal feature extraction network includes 3 layers of convLSTM.
  • the input of the action recognition model is a 30-frame three-channel sub-image sequence.
  • the action recognition model further includes a Softmax function, and the Softmax function is used to determine a classification result.
  • the enlarged area of the target frame is a square, and the side length of the square is expressed as:
  • L_i = α·max(w_i, h_i)
  • where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, α is the scale coefficient, w_i represents the length of the short side of the target frame, and h_i represents the length of the long side of the target frame.
  • The invention also discloses a system for human action recognition on an airport apron based on thermal infrared vision, including:
  • a video sequence acquisition module, used to obtain a plurality of video sequences from infrared surveillance video, the video sequences including multiple classes of preset target actions;
  • a target frame labeling module, used to label the set target in each frame image of the video sequence with a target frame to obtain a target tracking result;
  • the target tracking result includes the position information of the target frame labeled image in each frame;
  • a target frame enlargement module, used to crop, for each frame image in the video sequence, the enlarged area of the target frame according to the labeled target frame, the side length of the enlarged area being greater than the maximum side length of the corresponding target frame;
  • a three-channel sub-image sequence determination module, used to add, for each frame image in the video sequence, the position information of the target frame labeled image to the enlarged area of the target frame to obtain a three-channel sub-image;
  • the three-channel sub-image includes an abscissa channel image, an ordinate channel image, and the image corresponding to the enlarged region of the target frame;
  • each of the three-channel sub-images constitutes a sequence of three-channel sub-images in chronological order;
  • an action recognition model training module, used to train an action recognition model with the three-channel sub-image sequences corresponding to a plurality of video sequences as a training set, obtaining a trained action recognition model;
  • a to-be-identified video sequence acquisition module, used to acquire the video sequence to be identified from the infrared surveillance video and obtain the three-channel sub-image sequence corresponding to it;
  • a target action recognition module, configured to input the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model and output the target action type.
  • the action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network, the output of the spatial feature extraction network being connected to the input of the spatiotemporal feature extraction network;
  • the spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers;
  • the spatiotemporal feature extraction network includes 3 layers of convLSTM.
  • the input of the action recognition model is a 30-frame three-channel sub-image sequence.
  • the action recognition model further includes a Softmax function, and the Softmax function is used to determine a classification result.
  • the enlarged area of the target frame is a square, and the side length of the square is expressed as:
  • L_i = α·max(w_i, h_i)
  • where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, α is the scale coefficient, w_i represents the length of the short side of the target frame, and h_i represents the length of the long side of the target frame.
  • The invention discloses the following technical effects:
  • the present invention crops the enlarged area of the target frame according to the labeled target frame to obtain effective background information around the target, and adds the position information of the target frame labeled image to the enlarged area to obtain three-channel sub-images, which effectively solves the problems of the low signal-to-noise ratio of infrared images and background interference in surveillance images and improves the recognition accuracy of human actions.
  • Fig. 1 is a schematic flowchart of a method for human action recognition on an airport apron based on thermal infrared vision according to the present invention;
  • Fig. 2 shows example images of the behavior categories of the present invention;
  • Fig. 3 is a schematic diagram of the acquisition principle of the three-channel sub-image sequence of the present invention;
  • Fig. 4 is a schematic diagram of the structure of the spatial feature extraction network of the present invention;
  • Fig. 5 is a schematic diagram of the data flow in the action recognition model of the present invention;
  • Fig. 6 is a schematic structural diagram of a system for human action recognition on an airport apron based on thermal infrared vision according to the present invention.
  • The object of the present invention is to provide a method and system for human action recognition on an airport apron based on thermal infrared vision that improves recognition accuracy.
  • Fig. 1 is a schematic flowchart of the method for human action recognition on an airport apron based on thermal infrared vision of the present invention; as shown in Fig. 1, the method comprises the following steps:
  • Step 101: Obtain multiple video sequences from infrared surveillance video; the video sequences include multiple classes of preset target actions.
  • The preset target actions include standing, walking, running, jumping, squatting, waving, climbing, and crawling under an aircraft; standing and walking are normal behaviors, while running, jumping, squatting, waving, climbing, and crawling under an aircraft are abnormal behaviors, as shown in Fig. 2.
  • Step 102: Label the set target in each frame image of the video sequence with a target frame to obtain a target tracking result; the target tracking result includes the position information of the target frame labeled image in each frame.
  • Step 103: For each frame image in the video sequence, crop the enlarged area of the target frame according to the labeled target frame; the side length of the enlarged area is greater than the maximum side length of the corresponding target frame.
  • The enlarged area of the target frame is a square whose side length is expressed as:
  • L_i = α·max(w_i, h_i)
  • where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, and α is the scale coefficient;
  • α is set to 1.5.
  • Step 104: For each frame image in the video sequence, add the position information of the target frame labeled image to the enlarged area of the target frame to obtain a three-channel sub-image; the three-channel sub-image includes the abscissa channel image, the ordinate channel image, and the image corresponding to the enlarged area of the target frame; the three-channel sub-images form a sequence of three-channel sub-images in chronological order.
  • The abscissa channel image is denoted by U_i, which represents the set of abscissas of each pixel in the target frame;
  • the ordinate channel image is denoted by V_i, which represents the set of ordinates of each pixel in the target frame;
  • the image corresponding to the enlarged area of the target frame is denoted by S_i;
  • the channel sizes of the U_i channel and the V_i channel, which represent the horizontal and vertical coordinates of each pixel in the target frame, are equal to the size of the cropped target image S_i.
  • Fig. 3(a) is a schematic diagram of the enlarged area of the target frame and the acquisition of the U_i and V_i channels;
  • Fig. 3(b) is a schematic diagram of T_i composed of the acquired U_i channel, V_i channel, and target image S_i.
  • Step 105: Train the action recognition model with the three-channel sub-image sequences corresponding to multiple video sequences as a training set to obtain a trained action recognition model.
  • Step 106: Obtain a video sequence to be identified from the infrared surveillance video, and obtain the three-channel sub-image sequence corresponding to it.
  • Step 107: Input the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model, and output the target action type.
  • The action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network; the output of the spatial feature extraction network is connected to the input of the spatiotemporal feature extraction network;
  • the spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers;
  • the spatiotemporal feature extraction network includes 3 layers of convLSTM.
  • The structure of the spatial feature extraction network is shown in Fig. 4: a max-pooling layer follows every two convolutional layers.
  • The input sequence T_i is normalized and resized to obtain input tensors of size 28×28×3; after convolution and pooling, 30 tensors X_i of size 3×3×256 are output.
  • The input of the action recognition model is a three-channel sub-image sequence of 30 frames (about 4 s in duration).
  • The action recognition model also includes a Softmax function, which is used to determine the classification result.
  • A method for human action recognition on an airport apron based on thermal infrared vision according to the present invention is described in detail below.
  • The sampling frequency of the video sequences is 8 Hz, and each frame is 384×288 pixels; the data set contains 2000 action clips (video sequences) of 30 frames each, and the ratio of training-set to validation-set data is 7:1.
  • The enlarged area of the target frame containing the effective background information around the target is cropped as follows: obtain, from the tracking result, the position of the center point of the target and the width and height of the target frame (w_i × h_i), where i is the frame index in the sequence, and then calculate the side length L_i of the cropped region.
  • The step of adding position information to the target image in S14 to obtain the three-channel sub-image sequence includes: according to the target tracking result, i.e., the horizontal and vertical coordinates of the upper-left corner of the target frame and its width and height [u_i, v_i, w_i, h_i], calculate the U_i and V_i channels that represent the horizontal and vertical coordinates of each pixel in the target frame; the sizes of the U_i and V_i channels are equal to the size of the cropped target image S_i.
  • S15: Construct a convolutional neural network (the spatial feature extraction network) for extracting spatial features and a convolutional long short-term memory network (convLSTM) for spatiotemporal feature extraction, and introduce a fully connected layer and a softmax function for classification to generate the network structure model for target action recognition.
  • In Fig. 5, (a) is the infrared video sequence containing n frames in step S11, used as the input for action recognition; (b) is the sub-image obtained from the target tracking result in step S12; (c) is the input tensor obtained after the preprocessing of step S14; (d) is the CNN part (the spatial feature extraction network) used for spatial feature extraction in step S15, whose output sequence is x_1, x_2, ..., x_t, where t denotes the sequence index; (e) is the convLSTM-based spatiotemporal feature extraction network of step S15.
  • The spatiotemporal feature extraction network includes three layers of convLSTM.
  • The outputs of the first convLSTM layer are h_1, h_2, ..., h_t, and the outputs of the second convLSTM layer are h_1', h_2', ..., h_t';
  • (f) shows the two FC (fully connected) layers for action classification in step S15; the downward arrows in Fig. 5(f) indicate dropout processing, and the horizontal arrows indicate fully connected operations.
  • The input of the action recognition model is a preprocessed sub-image sequence of 30 frames (about 4 s in duration).
  • An embodiment of the present invention trains and tests the neural network on a desktop workstation; the hardware platform is an Intel(R) Xeon(R) E5-1620 v4 CPU @ 3.50 GHz with 64 GB of memory and an NVIDIA GTX 1060 6GB GPU; the program runs on the Keras application programming interface (API) with the TensorFlow backend engine and is built and implemented with Python 3.6.10.
  • The method of the present invention integrates a preprocessing module based on target tracking results, a CNN-based spatial feature extraction module, a spatiotemporal feature extraction module based on a three-layer convolutional LSTM (ConvLSTM), and a classification layer composed of two fully connected (FC) layers.
  • The method of the invention can still identify target behavior well under low-visibility conditions and can be applied to the recognition of scene activities in environments with complex personnel.
  • The present invention crops the target and the effective background information around it according to the tracking result, which effectively solves the problems of the low signal-to-noise ratio of infrared images and background interference in surveillance images, and can monitor a specific target in a video in which multiple active targets exist, which is closer to the actual background of engineering applications.
  • The present invention concatenates the coordinate position of the target frame in the original image as two independent channels along the channel dimension of the image, which facilitates subsequent convolution processing, balances a small amount of computation with rich feature information, and improves the accuracy and speed of action classification and recognition.
  • Fig. 6 is a schematic structural diagram of a system for human action recognition on an airport apron based on thermal infrared vision according to the present invention.
  • A system for human action recognition on an airport apron based on thermal infrared vision includes:
  • the video sequence acquisition module 201, configured to obtain multiple video sequences from infrared surveillance video, the video sequences including multiple classes of preset target actions.
  • The target frame labeling module 202 is used to label the set target with a target frame in each frame image of the video sequence to obtain the target tracking result; the target tracking result includes the position information of the target frame labeled image in each frame.
  • The target frame enlargement module 203 is configured to crop, for each frame image in the video sequence, the enlarged target frame area according to the labeled target frame; the side length of the enlarged area is greater than the maximum side length of the corresponding target frame.
  • The three-channel sub-image sequence determination module 204 is used to add, for each frame image in the video sequence, the position information of the target frame labeled image to the enlarged area of the target frame to obtain the three-channel sub-image;
  • the three-channel sub-image includes the abscissa channel image, the ordinate channel image, and the image corresponding to the enlarged area of the target frame;
  • each three-channel sub-image forms a sequence of three-channel sub-images in chronological order.
  • The action recognition model training module 205 is configured to train the action recognition model with the three-channel sub-image sequences corresponding to multiple video sequences as a training set to obtain a trained action recognition model.
  • The to-be-identified video sequence acquisition module 206 is configured to acquire the video sequence to be identified from the infrared surveillance video and obtain the three-channel sub-image sequence corresponding to it.
  • The target action recognition module 207 is configured to input the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model and output the target action type.
  • The action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network.
  • The output of the spatial feature extraction network is connected to the input of the spatiotemporal feature extraction network;
  • the spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers;
  • the spatiotemporal feature extraction network includes 3 layers of convLSTM.
  • the input of the action recognition model is a three-channel sub-image sequence of 30 frames.
  • the action recognition model also includes a Softmax function, which is used to determine the classification result.
  • The enlarged area of the target frame is a square whose side length is expressed as:
  • L_i = α·max(w_i, h_i)
  • where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, α is the scale coefficient, w_i represents the length of the short side of the target frame, and h_i represents the length of the long side of the target frame.


Abstract

The invention discloses a method and system for human action recognition on an airport apron based on thermal infrared vision. The method includes: obtaining multiple video sequences from infrared surveillance video; labeling the set target in each frame image of a video sequence with a target frame; for each frame image, cropping an enlarged area of the target frame according to the labeled target frame; adding the position information of the target frame labeled image to the enlarged area to obtain a three-channel sub-image; forming the three-channel sub-images into a three-channel sub-image sequence in chronological order; training an action recognition model with the three-channel sub-image sequences corresponding to the multiple video sequences as a training set; obtaining a video sequence to be identified from the infrared surveillance video and its corresponding three-channel sub-image sequence; and inputting that sequence into the trained action recognition model to output the target action type. The invention improves the recognition accuracy of human actions in complex environments.

Description

Method and system for human action recognition on an airport apron based on thermal infrared vision
This application claims priority to Chinese patent application No. 202111362718.1, filed with the China National Intellectual Property Administration on November 17, 2021 and entitled "Method and system for human action recognition on an airport apron based on thermal infrared vision", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of intelligent video surveillance, and in particular to a method and system for human action recognition on an airport apron based on thermal infrared vision.
Background Art
To improve the safety and efficiency of transportation, traffic infrastructure and services increasingly rely on intelligent visual surveillance technology. Computer vision is being used to solve a range of problems such as accident detection and road condition monitoring. Civil airports are important providers of transport infrastructure and services, and ensuring their safety and efficiency is essential. Compared with other areas of the airfield, the apron sees frequent aircraft and vehicle activity and a complex mix of personnel, so safety problems are particularly prominent. In addition, owing to low visibility at night and the lack of effective monitoring means, the probability of unsafe events occurring at night is far greater than during the day. It is therefore important to improve the monitoring capability of the apron area under low-visibility conditions.
To accomplish surveillance tasks under low-visibility conditions, thermal infrared (TIR) cameras are used in place of visible-light cameras: they receive the thermal radiation emitted by different objects and convert the temperature differences between objects into the brightness values of image pixels, capturing activities on the airport apron in low visibility. Compared with surveillance based on the visible spectrum, the inherent defects of infrared images, such as blurred edges, a low signal-to-noise ratio, and the lack of color and texture information, pose greater challenges for action recognition based on infrared image sequences.
Summary of the Invention
In view of this, the object of the present invention is to provide a method and system for human action recognition on an airport apron based on thermal infrared vision with improved recognition accuracy.
To achieve the above object, the present invention provides a method for human action recognition on an airport apron based on thermal infrared vision, including:
obtaining multiple video sequences from infrared surveillance video, the video sequences including multiple classes of preset target actions;
labeling the set target in each frame image of the video sequence with a target frame to obtain a target tracking result, the target tracking result including the position information of the target frame labeled image in each frame;
for each frame image in the video sequence, cropping an enlarged area of the target frame according to the labeled target frame, the side length of the enlarged area being greater than the maximum side length of the corresponding target frame;
for each frame image in the video sequence, adding the position information of the target frame labeled image to the enlarged area of the target frame to obtain a three-channel sub-image, the three-channel sub-image including an abscissa channel image, an ordinate channel image, and the image corresponding to the enlarged area of the target frame, the three-channel sub-images forming a three-channel sub-image sequence in chronological order;
training an action recognition model with the three-channel sub-image sequences corresponding to the multiple video sequences as a training set to obtain a trained action recognition model;
obtaining a video sequence to be identified from the infrared surveillance video, and obtaining the three-channel sub-image sequence corresponding to the video sequence to be identified; and
inputting the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model, and outputting the target action type.
Optionally, the action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network, the output of the spatial feature extraction network being connected to the input of the spatiotemporal feature extraction network; the spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers; the spatiotemporal feature extraction network includes 3 layers of convLSTM.
Optionally, the input of the action recognition model is a three-channel sub-image sequence of 30 frames.
Optionally, the action recognition model further includes a Softmax function, the Softmax function being used to determine the classification result.
Optionally, the enlarged area of the target frame is a square whose side length is expressed as:
L_i = α·max(w_i, h_i)
where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, α is the scale coefficient, w_i represents the length of the short side of the target frame, and h_i represents the length of the long side of the target frame.
The invention also discloses a system for human action recognition on an airport apron based on thermal infrared vision, including:
a video sequence acquisition module, configured to obtain multiple video sequences from infrared surveillance video, the video sequences including multiple classes of preset target actions;
a target frame labeling module, configured to label the set target in each frame image of the video sequence with a target frame to obtain a target tracking result, the target tracking result including the position information of the target frame labeled image in each frame;
a target frame enlargement module, configured to crop, for each frame image in the video sequence, the enlarged area of the target frame according to the labeled target frame, the side length of the enlarged area being greater than the maximum side length of the corresponding target frame;
a three-channel sub-image sequence determination module, configured to add, for each frame image in the video sequence, the position information of the target frame labeled image to the enlarged area of the target frame to obtain a three-channel sub-image, the three-channel sub-image including an abscissa channel image, an ordinate channel image, and the image corresponding to the enlarged area of the target frame, the three-channel sub-images forming a three-channel sub-image sequence in chronological order;
an action recognition model training module, configured to train an action recognition model with the three-channel sub-image sequences corresponding to multiple video sequences as a training set to obtain a trained action recognition model;
a to-be-identified video sequence acquisition module, configured to acquire the video sequence to be identified from the infrared surveillance video and obtain the three-channel sub-image sequence corresponding to it; and
a target action recognition module, configured to input the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model and output the target action type.
Optionally, the action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network, the output of the spatial feature extraction network being connected to the input of the spatiotemporal feature extraction network; the spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers; the spatiotemporal feature extraction network includes 3 layers of convLSTM.
Optionally, the input of the action recognition model is a three-channel sub-image sequence of 30 frames.
Optionally, the action recognition model further includes a Softmax function, the Softmax function being used to determine the classification result.
Optionally, the enlarged area of the target frame is a square whose side length is expressed as:
L_i = α·max(w_i, h_i)
where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, α is the scale coefficient, w_i represents the length of the short side of the target frame, and h_i represents the length of the long side of the target frame.
According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects:
The present invention crops the enlarged area of the target frame according to the labeled target frame to obtain effective background information around the target, and adds the position information of the target frame labeled image to the enlarged area to obtain three-channel sub-images, which effectively solves the problems of the low signal-to-noise ratio of infrared images and background interference in surveillance images and improves the recognition accuracy of human actions.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for human action recognition on an airport apron based on thermal infrared vision according to the present invention;
Fig. 2 shows example images of the behavior categories of the present invention;
Fig. 3 is a schematic diagram of the acquisition principle of the three-channel sub-image sequence of the present invention;
Fig. 4 is a schematic diagram of the structure of the spatial feature extraction network of the present invention;
Fig. 5 is a schematic diagram of the data flow in the action recognition model of the present invention;
Fig. 6 is a schematic structural diagram of a system for human action recognition on an airport apron based on thermal infrared vision according to the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a method and system for human action recognition on an airport apron based on thermal infrared vision that improves recognition accuracy.
To make the above objects, features, and advantages of the present invention more comprehensible, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic flowchart of a method for human action recognition on an airport apron based on thermal infrared vision according to the present invention. As shown in Fig. 1, the method includes the following steps.
Step 101: Obtain multiple video sequences from infrared surveillance video; the video sequences include multiple classes of preset target actions.
Against the background of an airport apron, the preset target actions include standing, walking, running, jumping, squatting, waving, climbing, and crawling under an aircraft; standing and walking are normal behaviors, while running, jumping, squatting, waving, climbing, and crawling under an aircraft are abnormal behaviors, as shown in Fig. 2.
Step 102: Label the set target in each frame image of the video sequence with a target frame to obtain a target tracking result; the target tracking result includes the position information of the target frame labeled image in each frame.
The target tracking result is denoted by [u_i, v_i, w_i, h_i], i = 1, 2, ..., n, where u_i and v_i are the horizontal and vertical coordinates of the upper-left corner of the target frame in the i-th frame image, w_i is the width of the target frame (the short side), h_i is the height of the target frame (the long side), and n is the number of frames in the video sequence.
Step 103: For each frame image in the video sequence, crop the enlarged area of the target frame according to the labeled target frame; the side length of the enlarged area is greater than the maximum side length of the corresponding target frame.
The enlarged area of the target frame is a square whose side length is expressed as:
L_i = α·max(w_i, h_i)
where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, and α is the scale coefficient, set to 1.5.
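By way of illustration only, the cropping of step 103 can be sketched in Python with NumPy as follows. The function name, the edge padding near image borders, and the use of the square side L_i = α·max(w_i, h_i) are assumptions of this sketch (the last being consistent with the stated requirement that L_i exceed the longest side of the target frame):

```python
import numpy as np

def crop_enlarged_box(frame, box, alpha=1.5):
    """Crop the enlarged square region around a tracked target.

    frame: a 2-D array holding one thermal infrared image.
    box:   (u, v, w, h), the top-left corner, width (short side),
           and height (long side) of the tracked target frame.
    alpha: the scale coefficient (1.5 in this embodiment).
    """
    u, v, w, h = box
    L = int(round(alpha * max(w, h)))       # assumed side-length rule
    cx, cy = u + w // 2, v + h // 2         # cropping centre = target centre
    x0, y0 = cx - L // 2, cy - L // 2
    # pad so that crops near the image border still come out L x L
    padded = np.pad(frame, L, mode="edge")
    return padded[y0 + L : y0 + 2 * L, x0 + L : x0 + 2 * L]
```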
Step 104: For each frame image in the video sequence, add the position information of the target frame labeled image to the enlarged area of the target frame to obtain a three-channel sub-image; the three-channel sub-image includes the abscissa channel image, the ordinate channel image, and the image corresponding to the enlarged area of the target frame; the three-channel sub-images form a three-channel sub-image sequence in chronological order.
The abscissa channel image is denoted by U_i, which represents the set of abscissas of each pixel in the target frame; the ordinate channel image is denoted by V_i, which represents the set of ordinates of each pixel in the target frame; the image corresponding to the enlarged area of the target frame is denoted by S_i. The final three-channel sub-image sequence is denoted by T_i, i = 1, 2, ..., n.
The sizes of the U_i and V_i channels, which represent the horizontal and vertical coordinates of each pixel in the target frame, are equal to the size of the cropped target image S_i.
The principle of obtaining the three-channel sub-image sequence is shown in Fig. 3: Fig. 3(a) is a schematic diagram of the enlarged area of the target frame and the acquisition of the U_i and V_i channels, and Fig. 3(b) is a schematic diagram of T_i composed of the acquired U_i channel, V_i channel, and target image S_i.
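A minimal sketch of the channel construction of step 104 is given below, building on crop_enlarged_box above; the exact coordinate normalization is not specified in the text, so the [0, 1] scaling used here is an assumption:

```python
def build_three_channel_subimage(frame, box, alpha=1.5):
    """Stack the cropped target image S_i with coordinate channels U_i, V_i."""
    u, v, w, h = box
    s = crop_enlarged_box(frame, box, alpha)       # S_i, shape (L, L)
    L = s.shape[0]
    # U_i / V_i: horizontal / vertical pixel coordinates of the target frame
    # in the original image, spread over the crop and scaled to [0, 1].
    xs = np.linspace(u, u + w, L) / frame.shape[1]
    ys = np.linspace(v, v + h, L) / frame.shape[0]
    U = np.tile(xs, (L, 1))                        # varies along columns
    V = np.tile(ys[:, None], (1, L))               # varies along rows
    return np.stack([U, V, s / 255.0], axis=-1)    # T_i, shape (L, L, 3)
```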
Step 105: Train the action recognition model with the three-channel sub-image sequences corresponding to multiple video sequences as a training set to obtain a trained action recognition model.
Step 106: Obtain a video sequence to be identified from the infrared surveillance video, and obtain the three-channel sub-image sequence corresponding to it.
Step 107: Input the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model, and output the target action type.
The action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network; the output of the spatial feature extraction network is connected to the input of the spatiotemporal feature extraction network. The spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers; the spatiotemporal feature extraction network includes 3 layers of convLSTM.
The structure of the spatial feature extraction network is shown in Fig. 4: a max-pooling layer follows every two convolutional layers. The input sequence T_i is normalized and resized into input tensors of size 28×28×3; after convolution and pooling, 30 tensors X_i of size 3×3×256 are output.
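A Keras sketch of this spatial feature extraction network follows; the layer counts, the pooling placement, and the 28×28×3 to 3×3×256 shapes come from the text, while the filter widths (64/128/256) and ReLU activations are assumptions:

```python
from tensorflow.keras import layers, models

def spatial_feature_extractor():
    """6 convolutional layers with a max-pooling layer after every two."""
    m = models.Sequential(name="spatial_cnn")
    m.add(layers.Conv2D(64, 3, padding="same", activation="relu",
                        input_shape=(28, 28, 3)))
    m.add(layers.Conv2D(64, 3, padding="same", activation="relu"))
    m.add(layers.MaxPooling2D(2))                  # 28 -> 14
    m.add(layers.Conv2D(128, 3, padding="same", activation="relu"))
    m.add(layers.Conv2D(128, 3, padding="same", activation="relu"))
    m.add(layers.MaxPooling2D(2))                  # 14 -> 7
    m.add(layers.Conv2D(256, 3, padding="same", activation="relu"))
    m.add(layers.Conv2D(256, 3, padding="same", activation="relu"))
    m.add(layers.MaxPooling2D(2))                  # 7 -> 3
    return m                                       # output shape: (3, 3, 256)
```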
The input of the action recognition model is a three-channel sub-image sequence of 30 frames (about 4 s in duration).
The action recognition model also includes a Softmax function, which is used to determine the classification result.
A method for human action recognition on an airport apron based on thermal infrared vision according to the present invention is described in detail below.
S1: Construct an action recognition model for specific target behaviors.
S11: Crop complete video sequences in which each class of target action occurs from infrared surveillance video, and construct the training and validation data sets for human action recognition on the apron.
The sampling frequency of the video sequences is 8 Hz, and each frame is 384×288 pixels; the data set contains 2000 action clips (video sequences) of 30 frames each, and the ratio of training-set to validation-set data is 7:1.
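For illustration, a 7:1 split of the 2000 clips can be obtained as follows, where load_clips is a hypothetical loader returning the labeled 30-frame clips:

```python
import random

clips = load_clips()            # hypothetical: 2000 labeled 30-frame clips
random.shuffle(clips)
n_val = len(clips) // 8         # 7:1 train/validation, i.e. 250 clips
val_set, train_set = clips[:n_val], clips[n_val:]
```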
S12: Label the specific target in each frame of the video with a tracking frame to obtain the continuous target tracking results of the image sequence [u_i, v_i, w_i, h_i], i = 1, 2, ..., n, the four values being the horizontal and vertical coordinates of the upper-left corner of the target frame in the i-th frame image and its width and height.
S13: Based on the target tracking results, crop from each frame image the enlarged area of the target frame containing part of the effective background information around the target, obtaining the target image sequence S_i, i = 1, 2, ..., n.
The enlarged area of the target frame containing part of the effective background information around the target is cropped as follows: obtain, from the tracking result, the position of the center point of the target and the width and height of the target frame (w_i × h_i), where i is the frame index in the sequence, and calculate the side length L_i of the cropped region.
Taking the target center of each frame as the cropping center, a square region S_i with side length L_i is cropped.
S14: Map the position and motion information of the target in the original image to the two-dimensional image size to obtain the tensors U_i and V_i, and add them to the third dimension of the target image S_i to form the final three-channel sub-image sequence T_i, i = 1, 2, ..., n.
The step of adding position information to the target image in S14 to obtain the three-channel sub-image sequence includes: according to the target tracking result, i.e., the horizontal and vertical coordinates of the upper-left corner of the target frame and its width and height [u_i, v_i, w_i, h_i], calculating the U_i and V_i channels that represent the horizontal and vertical coordinates of each pixel in the target frame; the sizes of the U_i and V_i channels are equal to the size of the cropped target image S_i.
By concatenating the normalized U_i and V_i channels to the third dimension of the target image channel S_i, a three-dimensional feature tensor of size L_i × L_i × 3 is formed and input as the sub-image sequence T_i to the subsequent action recognition model, as shown in Fig. 4.
S15: Construct a convolutional neural network (the spatial feature extraction network) for extracting spatial features and a convolutional long short-term memory network (convLSTM) for spatiotemporal feature extraction, and introduce a fully connected layer and a softmax function for classification to generate the network structure model for target behavior recognition.
The specific flow of building the behavior recognition network model in S15 is as follows. First, the T_i, i = 1, 2, ..., n, obtained in step S14 are zero-center normalized and resized into input tensors with a time-series length of 30 and a size of 28×28×3. A spatial feature extraction network composed of 6 convolutional layers and 3 max-pooling layers outputs 30 tensors of size 3×3×256, as shown in Fig. 4. These are then fed into a spatiotemporal feature extraction network composed of 3 layers of convLSTM, which outputs a feature tensor of size 1×3×3×64. The spatiotemporal features are flattened into a vector and fed into two fully connected layers, and a Softmax function is used to obtain the classification result, as shown in Fig. 5. In Fig. 5, (a) is the infrared video sequence containing n frames in step S11, used as the input for action recognition; (b) is the sub-image obtained from the target tracking result in step S12; (c) is the input tensor obtained after the preprocessing of step S14; (d) is the CNN part (the spatial feature extraction network) used for spatial feature extraction in step S15, whose output sequence is x_1, x_2, ..., x_t, where t denotes the sequence index; (e) is the convLSTM-based spatiotemporal feature extraction network of step S15, which includes three layers of convLSTM; the outputs of the first convLSTM layer are h_1, h_2, ..., h_t, and the outputs of the second convLSTM layer are h_1', h_2', ..., h_t'; (f) shows the two FC (fully connected) layers for action classification in step S15; the downward arrows in Fig. 5(f) indicate dropout processing, and the horizontal arrows indicate fully connected operations.
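The data flow of S15 can be sketched in Keras as follows, reusing spatial_feature_extractor from above. The three ConvLSTM layers, the final 3×3×64 feature tensor, the two fully connected layers with dropout, and the Softmax output follow the text; the ConvLSTM filter count of 64 in all three layers, the kernel sizes, and the width of the first FC layer are assumptions:

```python
from tensorflow.keras import layers, models

def action_recognition_model(num_classes=8, seq_len=30):
    inp = layers.Input((seq_len, 28, 28, 3))
    # apply the spatial CNN to each of the 30 frames -> (30, 3, 3, 256)
    x = layers.TimeDistributed(spatial_feature_extractor())(inp)
    # three-layer convLSTM spatiotemporal feature extraction
    x = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=False)(x)
    x = layers.Flatten()(x)                      # flatten the (3, 3, 64) features
    x = layers.Dense(128, activation="relu")(x)  # first FC layer (width assumed)
    x = layers.Dropout(0.5)(x)                   # dropout rate taken from S16
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```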
S16: Train the constructed behavior recognition network using the training data set for apron human action recognition, adjust the hyperparameters of the action recognition model through accuracy evaluation, and determine the network weights, obtaining the final action recognition model suitable for personnel targets active on the apron.
The training strategy of the behavior recognition network model in S16 uses an ADAM optimizer with exponential decay rates β_1 = 0.9 and β_2 = 0.999; the initial learning rate is set to 0.0005; the learning-rate decay strategy uses cosine annealing; the dropout rate of the fully connected layers is set to 0.5; and the cross-entropy loss function is used.
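These hyperparameters translate directly into a compile step; only the epoch and step counts below are assumptions, since the text does not state them:

```python
import tensorflow as tf

EPOCHS, STEPS_PER_EPOCH = 100, 250               # assumed, not given in the text

model = action_recognition_model()
# cosine-annealing schedule starting from the initial learning rate 0.0005
lr = tf.keras.optimizers.schedules.CosineDecay(0.0005, EPOCHS * STEPS_PER_EPOCH)
model.compile(
    optimizer=tf.keras.optimizers.Adam(lr, beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy",             # cross-entropy loss
    metrics=["accuracy"],
)
```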
S2: Recognize the behavior of personnel on the airport apron.
S21: Track the specific target in the infrared surveillance video to obtain target tracking results over a time-series length.
S22: Apply the image sequence preprocessing of steps S13 and S14 to the target tracking results obtained in step S21 to obtain the three-channel sub-image sequence T_i.
S23: Input the obtained three-channel sub-image sequence into the action recognition model for recognition, and obtain the action type of the target.
The input of the action recognition model is a preprocessed sub-image sequence of 30 frames (about 4 s in duration).
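Putting S21 to S23 together, recognition of one tracked clip might look like the following sketch, where clip is a hypothetical iterable of 30 (frame, box) pairs produced by the tracker, and the preprocessing reuses the helpers and the trained model sketched above:

```python
import numpy as np
import tensorflow as tf

ACTIONS = ["standing", "walking", "running", "jumping",
           "squatting", "waving", "climbing", "crawling under aircraft"]

T = np.stack([build_three_channel_subimage(frame, box)  # S13-S14 preprocessing
              for frame, box in clip])                  # clip: 30 tracked frames
T = np.stack([tf.image.resize(t, (28, 28)).numpy() for t in T])  # resize to 28x28
pred = model.predict(T[None, ...])                      # batch of one sequence
print(ACTIONS[int(pred.argmax())])                      # predicted action type
```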
In an embodiment of the present invention, the neural network for the method of human action recognition on an airport apron based on thermal infrared vision is trained and tested on a desktop workstation. The hardware platform is an Intel(R) Xeon(R) E5-1620 v4 CPU @ 3.50 GHz with 64 GB of memory and an NVIDIA GTX 1060 6GB GPU; the program runs on the Keras application programming interface (API) with the TensorFlow backend engine and is built and implemented with Python 3.6.10.
The beneficial effects of the method for human action recognition on an airport apron based on thermal infrared vision of the present invention are as follows:
1. The method of the present invention integrates a preprocessing module based on target tracking results, a CNN-based spatial feature extraction module, a spatiotemporal feature extraction module based on a three-layer convolutional LSTM (ConvLSTM), and a classification layer composed of two fully connected (FC) layers. The method can still identify target behavior well under low-visibility conditions and can be applied to the recognition of scene activities in environments with complex personnel.
2. In view of the small imaging scale of targets in the apron environment and in scene surveillance systems, the present invention crops the target and the effective background information around it according to the tracking result, effectively solving the problems of the low signal-to-noise ratio of infrared images and background interference in surveillance images, and can monitor a specific target in a video in which multiple active targets exist, which is closer to the actual background of engineering applications.
3. Since extracting the tracking frame loses the original position information of the target, the present invention, in order to effectively fuse the motion features of the target, concatenates the coordinate position of the target frame in the original image as two independent channels along the channel dimension of the image, which facilitates subsequent convolution processing, balances a small amount of computation with rich feature information, and improves the accuracy and speed of action classification and recognition.
Fig. 6 is a schematic structural diagram of a system for human action recognition on an airport apron based on thermal infrared vision according to the present invention. As shown in Fig. 6, the system includes:
the video sequence acquisition module 201, configured to obtain multiple video sequences from infrared surveillance video, the video sequences including multiple classes of preset target actions;
the target frame labeling module 202, configured to label the set target in each frame image of the video sequence with a target frame to obtain a target tracking result, the target tracking result including the position information of the target frame labeled image in each frame;
the target frame enlargement module 203, configured to crop, for each frame image in the video sequence, the enlarged area of the target frame according to the labeled target frame, the side length of the enlarged area being greater than the maximum side length of the corresponding target frame;
the three-channel sub-image sequence determination module 204, configured to add, for each frame image in the video sequence, the position information of the target frame labeled image to the enlarged area of the target frame to obtain a three-channel sub-image, the three-channel sub-image including an abscissa channel image, an ordinate channel image, and the image corresponding to the enlarged area of the target frame, the three-channel sub-images forming a three-channel sub-image sequence in chronological order;
the action recognition model training module 205, configured to train the action recognition model with the three-channel sub-image sequences corresponding to multiple video sequences as a training set to obtain a trained action recognition model;
the to-be-identified video sequence acquisition module 206, configured to acquire the video sequence to be identified from the infrared surveillance video and obtain the corresponding three-channel sub-image sequence;
the target action recognition module 207, configured to input the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model and output the target action type.
The action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network; the output of the spatial feature extraction network is connected to the input of the spatiotemporal feature extraction network. The spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers; the spatiotemporal feature extraction network includes 3 layers of convLSTM.
The input of the action recognition model is a three-channel sub-image sequence of 30 frames.
The action recognition model also includes a Softmax function, which is used to determine the classification result.
The enlarged area of the target frame is a square whose side length is expressed as:
L_i = α·max(w_i, h_i)
where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, α is the scale coefficient, w_i represents the length of the short side of the target frame, and h_i represents the length of the long side of the target frame.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts of the embodiments, reference may be made to one another.
Specific examples are used herein to explain the principles and implementations of the present invention. The description of the above embodiments is only intended to help understand the method of the present invention and its core ideas; meanwhile, those of ordinary skill in the art may, in accordance with the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

  1. A method for human action recognition on an airport apron based on thermal infrared vision, characterized by comprising:
    obtaining multiple video sequences from infrared surveillance video, the video sequences including multiple classes of preset target actions;
    labeling the set target in each frame image of the video sequence with a target frame to obtain a target tracking result, the target tracking result including the position information of the target frame labeled image in each frame;
    for each frame image in the video sequence, cropping an enlarged area of the target frame according to the labeled target frame, the side length of the enlarged area being greater than the maximum side length of the corresponding target frame;
    for each frame image in the video sequence, adding the position information of the target frame labeled image to the enlarged area of the target frame to obtain a three-channel sub-image, the three-channel sub-image including an abscissa channel image, an ordinate channel image, and the image corresponding to the enlarged area of the target frame, the three-channel sub-images forming a three-channel sub-image sequence in chronological order;
    training an action recognition model with the three-channel sub-image sequences corresponding to the multiple video sequences as a training set to obtain a trained action recognition model;
    obtaining a video sequence to be identified from the infrared surveillance video, and obtaining the three-channel sub-image sequence corresponding to the video sequence to be identified; and
    inputting the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model, and outputting the target action type.
  2. The method for human action recognition on an airport apron based on thermal infrared vision according to claim 1, characterized in that the action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network, the output of the spatial feature extraction network being connected to the input of the spatiotemporal feature extraction network; the spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers; and the spatiotemporal feature extraction network includes 3 layers of convLSTM.
  3. The method for human action recognition on an airport apron based on thermal infrared vision according to claim 1, characterized in that the input of the action recognition model is a three-channel sub-image sequence of 30 frames.
  4. The method for human action recognition on an airport apron based on thermal infrared vision according to claim 1, characterized in that the action recognition model further includes a Softmax function, the Softmax function being used to determine the classification result.
  5. The method for human action recognition on an airport apron based on thermal infrared vision according to claim 1, characterized in that the enlarged area of the target frame is a square whose side length is expressed as:
    L_i = α·max(w_i, h_i)
    where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, α is the scale coefficient, w_i represents the length of the short side of the target frame, and h_i represents the length of the long side of the target frame.
  6. A system for human action recognition on an airport apron based on thermal infrared vision, characterized by comprising:
    a video sequence acquisition module, configured to obtain multiple video sequences from infrared surveillance video, the video sequences including multiple classes of preset target actions;
    a target frame labeling module, configured to label the set target in each frame image of the video sequence with a target frame to obtain a target tracking result, the target tracking result including the position information of the target frame labeled image in each frame;
    a target frame enlargement module, configured to crop, for each frame image in the video sequence, the enlarged area of the target frame according to the labeled target frame, the side length of the enlarged area being greater than the maximum side length of the corresponding target frame;
    a three-channel sub-image sequence determination module, configured to add, for each frame image in the video sequence, the position information of the target frame labeled image to the enlarged area of the target frame to obtain a three-channel sub-image, the three-channel sub-image including an abscissa channel image, an ordinate channel image, and the image corresponding to the enlarged area of the target frame, the three-channel sub-images forming a three-channel sub-image sequence in chronological order;
    an action recognition model training module, configured to train an action recognition model with the three-channel sub-image sequences corresponding to multiple video sequences as a training set to obtain a trained action recognition model;
    a to-be-identified video sequence acquisition module, configured to acquire the video sequence to be identified from the infrared surveillance video and obtain the three-channel sub-image sequence corresponding to it; and
    a target action recognition module, configured to input the three-channel sub-image sequence corresponding to the video sequence to be identified into the trained action recognition model and output the target action type.
  7. The system for human action recognition on an airport apron based on thermal infrared vision according to claim 6, characterized in that the action recognition model includes a spatial feature extraction network and a spatiotemporal feature extraction network, the output of the spatial feature extraction network being connected to the input of the spatiotemporal feature extraction network; the spatial feature extraction network includes 6 convolutional layers and 3 max-pooling layers; and the spatiotemporal feature extraction network includes 3 layers of convLSTM.
  8. The system for human action recognition on an airport apron based on thermal infrared vision according to claim 6, characterized in that the input of the action recognition model is a three-channel sub-image sequence of 30 frames.
  9. The system for human action recognition on an airport apron based on thermal infrared vision according to claim 6, characterized in that the action recognition model further includes a Softmax function, the Softmax function being used to determine the classification result.
  10. The system for human action recognition on an airport apron based on thermal infrared vision according to claim 6, characterized in that the enlarged area of the target frame is a square whose side length is expressed as:
    L_i = α·max(w_i, h_i)
    where L_i represents the side length of the enlarged area of the target frame corresponding to the i-th frame image in the video sequence, α is the scale coefficient, w_i represents the length of the short side of the target frame, and h_i represents the length of the long side of the target frame.
PCT/CN2021/135634 2021-11-17 2021-12-06 Method and system for human action recognition on an airport apron based on thermal infrared vision WO2023087420A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111362718.1 2021-11-17
CN202111362718.1A CN114067438A (zh) 2021-11-17 2021-11-17 Method and system for human action recognition on an airport apron based on thermal infrared vision

Publications (1)

Publication Number Publication Date
WO2023087420A1 true WO2023087420A1 (zh) 2023-05-25

Family

ID=80273337

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/135634 WO2023087420A1 (zh) 2021-12-06 Method and system for human action recognition on an airport apron based on thermal infrared vision

Country Status (2)

Country Link
CN (1) CN114067438A (zh)
WO (1) WO2023087420A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392606A (zh) * 2023-10-19 2024-01-12 Method and system for monitoring dust equipment maintenance behavior based on image recognition
CN117437635A (zh) * 2023-12-21 2024-01-23 杭州海康慧影科技有限公司 Pre-labeling method and device for biological tissue images

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403162B (zh) * 2023-04-11 2023-10-27 Nanjing University of Aeronautics and Astronautics Airport scene target behavior recognition method, system, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985259A (zh) * 2018-08-03 2018-12-11 Baidu Online Network Technology (Beijing) Co., Ltd. Human body action recognition method and apparatus
CN110889324A (zh) * 2019-10-12 2020-03-17 Nanjing University of Aeronautics and Astronautics Thermal infrared image target recognition method for terminal guidance based on YOLO v3
US11055872B1 (en) * 2017-03-30 2021-07-06 Hrl Laboratories, Llc Real-time object recognition using cascaded features, deep learning and multi-target tracking
CN113158983A (zh) * 2021-05-18 2021-07-23 Nanjing University of Aeronautics and Astronautics Airport scene activity behavior recognition method based on infrared video sequence images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11055872B1 (en) * 2017-03-30 2021-07-06 Hrl Laboratories, Llc Real-time object recognition using cascaded features, deep learning and multi-target tracking
CN108985259A (zh) * 2018-08-03 2018-12-11 Baidu Online Network Technology (Beijing) Co., Ltd. Human body action recognition method and apparatus
CN110889324A (zh) * 2019-10-12 2020-03-17 Nanjing University of Aeronautics and Astronautics Thermal infrared image target recognition method for terminal guidance based on YOLO v3
CN113158983A (zh) * 2021-05-18 2021-07-23 Nanjing University of Aeronautics and Astronautics Airport scene activity behavior recognition method based on infrared video sequence images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DING MENG; DING YUAN-YUAN; WU XIAO-ZHOU; WANG XU-HUI; XU YU-BIN: "Action recognition of individuals on an airport apron based on tracking bounding boxes of the thermal infrared target", Infrared Physics and Technology, Elsevier, GB, vol. 117, 1 September 2021, p. 103859, XP093067847, ISSN: 1350-4495, DOI: 10.1016/j.infrared.2021.103859 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392606A (zh) * 2023-10-19 2024-01-12 Method and system for monitoring dust equipment maintenance behavior based on image recognition
CN117437635A (zh) * 2023-12-21 2024-01-23 杭州海康慧影科技有限公司 Pre-labeling method and device for biological tissue images
CN117437635B (zh) 2023-12-21 2024-04-05 Pre-labeling method and device for biological tissue images

Also Published As

Publication number Publication date
CN114067438A (zh) 2022-02-18

Similar Documents

Publication Publication Date Title
WO2023087420A1 (zh) Method and system for human action recognition on an airport apron based on thermal infrared vision
CN111209810B (zh) Bounding-box-segmentation-supervised deep neural network architecture for accurate real-time pedestrian detection in visible and infrared images
CN110532970B (zh) Method, system, device, and medium for analyzing age and gender attributes of 2D face images
WO2021238019A1 (zh) Real-time traffic flow detection system and method based on a Ghost convolution feature-fusion neural network
CN110795982A (zh) Apparent gaze estimation method based on human posture analysis
CN109086803B (zh) Haze visibility detection system and method based on deep learning and personalized factors
CN106056624A (zh) Small-target detection and tracking system for UAV high-definition images and its detection and tracking method
CN103729620B (zh) Multi-view pedestrian detection method based on a multi-view Bayesian network
CN110852179B (zh) Method for detecting intrusion by suspicious personnel based on a video surveillance platform
KR102309111B1 (ko) Deep-learning-based abnormal behavior detection system and method for detecting and recognizing abnormal behavior
CN111291587A (zh) Pedestrian detection method for dense crowds, storage medium, and processor
CN113762009B (zh) Crowd counting method based on multi-scale feature fusion and a dual attention mechanism
Li et al. Improved YOLOv4 network using infrared images for personnel detection in coal mines
CN108471497A (zh) Real-time ship target detection method based on a pan-tilt camera
CN112465854A (zh) UAV tracking method based on an anchor-free detection algorithm
Liu et al. D-CenterNet: An anchor-free detector with knowledge distillation for industrial defect detection
Wang Exploring intelligent image recognition technology of football robot using omnidirectional vision of internet of things
Dou et al. An improved yolov5s fire detection model
Liu et al. Multi-scale personnel deep feature detection algorithm based on Extended-YOLOv3
CN111708907B (zh) Method, apparatus, device, and storage medium for querying a target person
Deng et al. Deep learning in crowd counting: A survey
Peng et al. [Retracted] Helmet Wearing Recognition of Construction Workers Using Convolutional Neural Network
Aldabbagh et al. Classification of chili plant growth using deep learning
CN110427920B (zh) Real-time pedestrian parsing method for surveillance environments
CN115424352A (zh) Method for recognizing pest intrusion in restaurant kitchens based on computer vision

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21964565

Country of ref document: EP

Kind code of ref document: A1