CN111353452A

CN111353452A - Behavior recognition method, behavior recognition device, behavior recognition medium and behavior recognition equipment based on RGB (red, green and blue) images

Info

Publication number: CN111353452A
Application number: CN202010151359.4A
Authority: CN
Inventors: 熊德智; 陈向群; 胡军华; 柳青; 刘小平; 杨茂涛; 黄瑞; 温和; 欧阳黎; 陈浩; 曾文伟
Original assignee: State Grid Corp of China SGCC; State Grid Hunan Electric Power Co Ltd; Metering Center of State Grid Hunan Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Hunan Electric Power Co Ltd; Metering Center of State Grid Hunan Electric Power Co Ltd
Priority date: 2020-03-06
Filing date: 2020-03-06
Publication date: 2020-06-30

Abstract

The invention discloses a behavior recognition method, device, medium and equipment based on RGB images, which belong to the technical field of behavior recognition and address the technical problem that no intelligent recognition and analysis of compliance with behavior specifications currently exists for service settings. The method comprises the following steps: 1) preprocessing the RGB image, segmenting the region of a worker, and capturing or tracking a target; 2) extracting image feature parameters and sending them into a recurrent neural network to obtain the mapping between the image feature parameters and high-dimensional vectors; 3) on the basis of the high-dimensional vectors of the video frames, establishing a classifier model that maps the high-dimensional vectors to the final irregular behavior categories, and training the classifier model; 4) acquiring RGB images from the monitoring video information, and identifying the behaviors of service personnel in the power supply business hall based on the trained classifier model. The invention offers intelligent identification and analysis of the behavior of service personnel, high identification precision, and improved working efficiency and service level.

Description

Behavior recognition method, behavior recognition device, behavior recognition medium and behavior recognition equipment based on RGB (red, green and blue) images

Technical Field

The invention mainly relates to the technical field of behavior recognition, in particular to a behavior recognition method, a behavior recognition device, a behavior recognition medium and behavior recognition equipment based on RGB images.

Background

The power supply business hall is the most important service window of a power supply enterprise and plays an important social role in communicating, displaying and promoting the enterprise's image. As the front line of the enterprise, it represents the image of the enterprise. Customers who come to the business hall to handle electricity-related business first come into contact with its service staff. Therefore, the service skills of the staff and the way they receive and treat customers often determine how customers perceive the service level of the power supply enterprise. Careless and listless behavior by some staff, such as playing with mobile phones during working hours, sleeping or showing a bad attitude, can leave an extremely bad impression on customers. In addition, self-media platforms such as microblogs are widely used in the information era; if dissatisfied customers post such information on the internet, the image of the enterprise is easily damaged and substantial economic losses can result. At present, the business hall has a well-developed service standard system, but the standards are often incompletely executed and difficult to supervise, and on-site inspection by the competent department alone can hardly provide effective supervision and control. It is therefore necessary to research intelligent recognition, analysis and early warning of compliance with business hall behavior specifications, and to explore and establish demonstration projects.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: aiming at the technical problems in the prior art, the invention provides the behavior recognition method, the behavior recognition device, the behavior recognition medium and the behavior recognition equipment based on the RGB image, which are simple and convenient to operate, high in recognition accuracy and capable of improving the working efficiency and the service level.

In order to solve the technical problems, the technical scheme provided by the invention is as follows:

a behavior recognition method based on RGB images is characterized by comprising the following steps:

1) preprocessing an RGB image in a video frame, segmenting the region of a worker, and capturing or tracking a target;

2) extracting image characteristic parameters in the preprocessed RGB image, and sending the image characteristic parameters into a recurrent neural network to obtain the mapping between the image characteristic parameters and high-dimensional vectors;

3) on the basis of obtaining the high-dimensional vector of the video frame, establishing a final classifier model, establishing mapping from the high-dimensional vector to the final irregular behavior category, and training the classifier model;

4) acquiring RGB images in the monitoring video information, and identifying the behaviors of service personnel in the power supply business hall based on the trained classifier model.

As a further improvement of the above technical solution, the step 3) specifically includes:

3.1) calling each video frame unit used for feature extraction in step 2) a segment, recording the high-dimensional vector output each time as a segment action score (SAS), and finally obtaining, for a video containing T frame images, an SAS feature sequence of equal length T;

3.2) after obtaining the feature sequence of length T, using it as input to the SSAD model; the SSAD model is a network formed entirely of temporal convolutions and mainly comprises three convolution layers: a base layer, an anchor frame layer and a prediction layer, wherein the base layer is used for shortening the length of the feature sequence and increasing the receptive field of each position in the feature sequence;

3.3) the anchor frame layer in the SSAD model continues to reduce the length of the feature sequence, and each position in the feature sequence output by the anchor frame layer is associated with anchor frame instances of multiple scales;

3.4) obtaining, through the prediction layer, the coordinate offset, the overlap confidence and the classification result corresponding to each anchor frame instance;

3.5) the SSAD model obtains action instance predictions at each time scale, from small to large, through multiple layers of feature sequences whose time scale is progressively reduced, thereby establishing the final classifier model.
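The association of multi-scale anchor frame instances with each position of a progressively shortened feature sequence can be sketched as follows. This is only an illustrative sketch: the function name `temporal_anchors`, the layer lengths and the scale ratios are hypothetical and are not given in this description.

```python
def temporal_anchors(layer_len, base_scale, ratios):
    """Return (center, width) anchors, normalized to [0, 1], for one layer.

    layer_len, base_scale and ratios are illustrative assumptions.
    """
    anchors = []
    for j in range(layer_len):
        center = (j + 0.5) / layer_len        # center of the j-th position
        for ratio in ratios:
            anchors.append((center, base_scale * ratio))
    return anchors


# Hypothetical pyramid: each layer halves the sequence length, so each
# position covers a proportionally longer span of the video.
layer_lengths = [16, 8, 4]
ratios = [0.5, 1.0, 2.0]
pyramid = {n: temporal_anchors(n, 1.0 / n, ratios) for n in layer_lengths}
```

Shorter layers produce fewer but wider anchors, which is how small-to-large time scales are covered by a single forward pass.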

As a further improvement of the above technical solution, in step 3), training of a classifier model is further included:

correcting the obtained anchor frames by using the coordinate offsets, and matching the corrected anchor frames with the labeled instances to determine whether each anchor frame instance is a positive or a negative sample; wherein the SSAD model is trained using a loss function comprising a classification loss L_class, an overlap confidence regression loss L_over, a boundary regression loss L_loc and a regularization term L₂:

L = L_class + α·L_over + β·L_loc + λ·L₂(Θ)

where α, β and λ are coefficients;

during testing, the obtained anchor frame instances are corrected by the coordinate offsets, and the final classification result of each anchor frame instance is then obtained.

As a further improvement of the above technical solution, in step 4), after all the predicted action instances of a segment of video are obtained, a non-maximum suppression algorithm is used to deduplicate the overlapping predictions, so as to obtain the final temporal action detection result.

As a further improvement of the above technical solution, in step 2), image feature parameters in the RGB image are extracted through a C3D model; the C3D model includes 8 convolution operations and 5 pooling operations; the convolution kernels are all 3 × 3 × 3 in size, with a stride of 1 × 1 × 1; the pooling kernels are 2 × 2 × 2 in size with a stride of 2 × 2 × 2, except for the first pooling, whose size and stride are both 1 × 2 × 2, so as not to reduce the length in the temporal dimension too early; finally, after two fully connected layers, a 4096-dimensional high-dimensional vector is obtained.

As a further improvement of the above technical solution, in step 1), the preprocessing of the video frame specifically includes: segmenting the region of a worker by a background extraction algorithm, locating the target region by using a voting algorithm and connected-domain computation, capturing or tracking the target, and finally obtaining an image containing only a single target; the motion region in the image is extracted either by subtracting the pixel values of two adjacent frames, or of two frames separated by several frames, in the video stream and thresholding the difference image; or by performing a difference operation between the currently acquired image frame and the background image to obtain a gray-level image of the target motion region and thresholding that gray-level image, wherein the background image is updated according to the currently acquired image frame.

As a further improvement of the above technical solution, in step 1), the preprocessing of the video frame further includes calibrating the specific start frame and end frame of the irregular behavior for identification, and the specific process includes: extracting a feature sequence of the video frames; generating a plurality of proposals of different sizes at each position in the video by using a sliding window mechanism; then training an action classifier and a ranker to classify and rank the proposals; and fine-tuning the action boundaries in the temporal action detection by using a CDC algorithm so that the action boundaries are more accurate.

The invention also discloses a behavior recognition device based on the RGB image, which comprises:

the preprocessing unit, used for preprocessing the RGB image in the video frame, segmenting the region of a worker and capturing or tracking a target;

the feature extraction module is used for extracting image feature parameters in the preprocessed RGB image and sending the image feature parameters into a recurrent neural network to obtain the mapping between the image feature parameters and high-dimensional vectors;

the classifier model establishing and training module is used for establishing a final classifier model on the basis of obtaining the high-dimensional vector of the video frame, establishing mapping from the high-dimensional vector to the final irregular behavior category, and training the classifier model;

and the behavior recognition module is used for acquiring RGB images in the monitoring video information and recognizing the behaviors of the service personnel in the power supply business hall based on the trained classifier model.

The invention further discloses a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the RGB image-based behavior recognition method as described above.

The invention also discloses a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the computer program is characterized in that when being executed by the processor, the computer program executes the steps of the behavior recognition method based on the RGB image.

Compared with the prior art, the invention has the advantages that:

(1) the behavior recognition method based on the RGB image adopts the behavior recognition technology of the RGB image to extract the characteristics, and sends the characteristics into a recurrent neural network to obtain the mapping between the characteristic parameters of the image and high-dimensional vectors; on the basis of obtaining the high-dimensional vector of the video frame, establishing a final classifier model, establishing mapping from the high-dimensional vector to the final irregular behavior category, and training the classifier model; therefore, RGB images in monitoring video information are obtained, the behavior of service personnel in the power supply business hall is recognized based on the trained classifier model, the operation is simple and convenient, and the recognition precision is high; by the method, the administrative department does not need to frequently check the site, but can check the working condition of the service personnel through monitoring information, thereby greatly improving the efficiency; and personalized training can be carried out according to the service level and the defects of different business hall personnel based on the business hall monitoring information.

(2) The method adopts a frame difference method or a background difference method to extract the motion area, has simple operation and is not easily influenced by environmental light; in the background difference method, the method is used for carrying out motion segmentation on a static scene, specifically, difference operation is carried out on a currently acquired image frame and a background image to obtain a gray image of a target motion region, thresholding is carried out on the gray image to extract the motion region, and the background image is updated according to the currently acquired image frame to avoid the influence of environmental illumination change; or different algorithms are respectively applied to the monitoring video frames, and operations such as voting algorithm, calculation of connected domain positioning target area and the like are used for further improving the segmentation accuracy, and finally an image only containing a single target is obtained; the effect of the model is further improved through the combination of the models.

(3) The method extracts a feature sequence of the video frames, generates a plurality of proposals of different sizes at each position in the video by using a sliding window mechanism, trains an action classifier and a ranker to classify and rank the proposals, and fine-tunes the action boundaries in the temporal action detection by using a CDC algorithm so that the action boundaries are more accurate.

Drawings

FIG. 1 is a flow chart of an embodiment of the method of the present invention.

Fig. 2a is a schematic diagram of a single frame 2D convolution.

Fig. 2b is a schematic diagram of a 2D convolution of multiple frames.

Fig. 2c is a schematic diagram of the 3D convolution.

Fig. 3 is a schematic diagram of a C3D network.

FIG. 4 is a schematic diagram of the structure of the SSAD model.

Detailed Description

The invention is further described below with reference to the figures and the specific embodiments of the description.

As shown in fig. 1, the behavior recognition method based on RGB images of this embodiment is applied to behavior recognition of service personnel in a power supply business hall, and specifically includes the following steps:

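1) preprocessing the RGB image, segmenting the region of a worker, and capturing or tracking a target;

2) extracting image feature parameters, and sending the image feature parameters into a recurrent neural network to obtain the mapping between the image feature parameters and high-dimensional vectors;

3) on the basis of obtaining the high-dimensional vector of the video frame, establishing a classifier model, establishing mapping from the high-dimensional vector to the final irregular behavior category, and training the classifier model;

4) acquiring RGB images in the monitoring video information, and identifying the behaviors of service personnel in the power supply business hall based on the trained classifier model.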

The behavior recognition method based on the RGB image adopts the behavior recognition technology of the RGB image to extract the characteristics, and sends the characteristics into a recurrent neural network to obtain the mapping between the characteristic parameters of the image and high-dimensional vectors; on the basis of obtaining the high-dimensional vector of the video frame, establishing a final classifier model, establishing mapping from the high-dimensional vector to the final irregular behavior category, and training the classifier model; therefore, RGB images in monitoring video information are obtained, the behavior of service personnel in the power supply business hall is recognized based on the trained classifier model, the operation is simple and convenient, and the recognition precision is high; by the method, the administrative department does not need to frequently check the site, but can check the working condition of the service personnel through monitoring information, thereby greatly improving the efficiency; and personalized training can be carried out according to the service level and the defects of different business hall personnel based on the business hall monitoring information.

In this embodiment, since the monitored video often contains many people, the preprocessing of the video frame specifically includes: segmenting the regions of workers by a background extraction algorithm, locating the target region by using a voting algorithm and connected-domain computation, capturing or tracking the target, and finally obtaining an image containing only a single target, which lays a foundation for subsequent classification and for behavior analysis and understanding.

Specifically, the background extraction algorithm (or object detection algorithm) includes the optical flow method, the frame difference method, the background difference method, ViBe, and the like. In the frame difference method (inter-frame difference method), the pixel values of two images that are adjacent, or separated by several frames, in a video stream are subtracted, and the difference image is thresholded to extract the motion region in the image. If the two subtracted frames are the k-th frame and the (k+1)-th frame, with images f_k(x, y) and f_{k+1}(x, y) respectively, the binarization threshold of the difference image is T, and the difference image is denoted D(x, y), the formula of the inter-frame difference method is as follows:

$$D(x, y) = \begin{cases} 1, & |f_{k+1}(x, y) - f_k(x, y)| > T \\ 0, & \text{otherwise} \end{cases}$$

The algorithm is simple and is not easily influenced by ambient light.
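The inter-frame difference rule described above (a pixel is marked as moving when the absolute difference between the two frames exceeds T) can be sketched as follows. This is a minimal sketch on small nested lists standing in for gray-level images; the function name `frame_difference` and the sample values are hypothetical.

```python
def frame_difference(prev_frame, next_frame, threshold):
    """Binary motion mask D: 1 where |f_{k+1} - f_k| > T, otherwise 0."""
    return [
        [1 if abs(b - a) > threshold else 0 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


# Two tiny 2 x 3 gray-level frames (illustrative values only).
f_k = [[10, 10, 10], [10, 10, 10]]
f_k1 = [[10, 80, 12], [10, 90, 10]]
mask = frame_difference(f_k, f_k1, threshold=25)
```

Only the middle column changes by more than the threshold, so only that column is marked as a motion region.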

The background difference method performs motion segmentation on a static scene. Specifically, a difference operation is performed between the currently acquired image frame and a background image to obtain a gray-level image of the target motion region, the gray-level image is thresholded to extract the motion region, and the background image is updated according to the currently acquired image frame so as to avoid the influence of changes in ambient illumination. Background difference methods differ in their foreground detection, background maintenance and post-processing. If I_t and B_t are the current frame and the background frame respectively, and T is the foreground gray threshold, one such procedure is as follows:

taking the average of the previous several frames as the initial background image B_t;

performing gray-level subtraction between the current frame image and the background image and taking the absolute value, that is, |I_t(x, y) - B_t(x, y)|;

for a pixel (x, y) of the current frame, if |I_t(x, y) - B_t(x, y)| > T, the pixel is a foreground point;

performing morphological operations (erosion, dilation, opening, closing and the like) on the foreground pixel map;

updating the background image with the current frame image.

The method is simple and overcomes the influence of ambient light to a certain extent.
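The background difference procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the morphological step is omitted, and the running-average update with a blending `rate` is only one possible way to update the background with the current frame.

```python
def mean_background(frames):
    """Initial background B_t: pixel-wise mean of the first few frames."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]


def foreground_mask(frame, background, threshold):
    """1 where |I_t - B_t| > T (foreground point), otherwise 0."""
    return [[1 if abs(i - b) > threshold else 0 for i, b in zip(row_i, row_b)]
            for row_i, row_b in zip(frame, background)]


def update_background(frame, background, rate=0.05):
    """Blend the current frame into the background (rate is an assumption)."""
    return [[(1 - rate) * b + rate * i for i, b in zip(row_i, row_b)]
            for row_i, row_b in zip(frame, background)]


history = [[[10, 10], [10, 10]], [[12, 12], [12, 12]]]
background = mean_background(history)
current = [[11, 60], [11, 11]]
mask = foreground_mask(current, background, threshold=20)
background = update_background(current, background)
```

A slow update rate lets gradual illumination changes be absorbed into the background while a briefly visible object has little effect on it.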

ViBe is an algorithm for pixel-level video background modeling, or foreground detection, and occupies little memory. It differs from other algorithms mainly in its background model update strategy: the sample to be replaced for a pixel is selected at random, and the neighborhood pixel to be updated is also selected at random. When a model of how pixels change cannot be determined, this random update strategy can simulate the uncertainty of pixel changes to a certain extent. ViBe stores a sample set for every pixel; the values stored in the sample set are past values of that pixel and of its neighboring pixels. The new value of the pixel in each subsequent frame is compared with the historical values in the sample set to judge whether it belongs to the background. In this model, the background consists of stationary or very slowly moving objects, and the foreground consists of objects that move relative to the background. Background extraction can therefore be regarded as a classification problem: while traversing the pixels, each pixel is determined to be either a foreground point or a background point. In the ViBe model, the sample set stored for each pixel generally contains 20 samples. For a new frame, a pixel is judged to be a background point when its value is close to the sampled values in its sample set.

Expressed formulaically:

v(x, y): the current pixel value at pixel (x, y);

M(x, y) = {v₁(x, y), v₂(x, y), ..., v_N(x, y)}: the background sample set of pixel (x, y), of size N;

R: the upper and lower value range;

v(x, y) is subtracted from every sample value in M(x, y), and the number of differences within the range ±R is denoted N_b; if N_b is greater than a given threshold min, the current pixel value is similar to several values in the historical samples of that point, and (x, y) is considered a background point.
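The background test above (count the samples within ±R of the current value and compare the count with the threshold) can be sketched as follows. The function name `is_background` and the sample values are hypothetical; a real sample set would typically hold about 20 values per pixel.

```python
def is_background(value, samples, radius, min_matches):
    """Return True when more than min_matches samples lie within +/-radius."""
    matches = sum(1 for s in samples if abs(value - s) <= radius)
    return matches > min_matches


# Illustrative sample set for one pixel (shortened for readability).
samples = [100, 102, 98, 101, 150, 99]
close_pixel = is_background(103, samples, radius=5, min_matches=2)
far_pixel = is_background(200, samples, radius=5, min_matches=2)
```

Here the value 103 matches five stored samples and is classified as background, while 200 matches none and is classified as foreground.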

Initialization is the process of establishing the background model. General detection algorithms need to learn from a video sequence of a certain length before the model is ready, which affects real-time detection, and when the video picture changes suddenly, the background model must be relearned over a long period. Here, the first frame of the video is taken as the background model and, for each pixel in that frame, several surrounding pixels are taken at random to fill the sample set of the pixel, so that the sample set contains the spatio-temporal distribution information of the pixel.

Formulaically, let M₀(x, y) be the sample set of pixel (x, y) in the initial background model, N_G the set of neighboring points, and v₀(x, y) the value of pixel (x, y) in the initial original image; then, writing x for a pixel location and y for one of its neighbors:

M₀(x) = {v₀(y | y ∈ N_G(x))}, t = 0
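Initializing the sample sets from the first frame can be sketched as follows. This is only an illustrative sketch: the function name `init_sample_sets`, the use of the 8-neighborhood as N_G, and the fixed random seed are assumptions, and the frame is assumed to contain at least two pixels so that every pixel has a neighbor.

```python
import random


def init_sample_sets(first_frame, n_samples=20, seed=0):
    """Fill each pixel's sample set with values drawn from its 8-neighborhood."""
    rng = random.Random(seed)
    height, width = len(first_frame), len(first_frame[0])
    model = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect neighbor values, clipping at the image border.
            neighbors = [first_frame[j][i]
                         for j in range(max(0, y - 1), min(height, y + 2))
                         for i in range(max(0, x - 1), min(width, x + 2))
                         if (j, i) != (y, x)]
            row.append([rng.choice(neighbors) for _ in range(n_samples)])
        model.append(row)
    return model


first_frame = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
model = init_sample_sets(first_frame)
```

Because only one frame is needed, detection can start from the second frame instead of waiting for a long learning phase.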

Of course, the different algorithms above can each be applied to the monitoring video frames, and operations such as voting and connected-domain computation can be used to locate the target region and further improve segmentation accuracy, finally yielding an image containing only a single target. Model performance can be further improved by combining models: for example, the high-dimensional feature vectors finally generated are combined by averaging, weighted averaging, taking the maximum, concatenation or similar operations to obtain a synthetic feature vector, which is sent to the classifier. In practice, parameter tuning techniques are also applied to further improve model training efficiency.
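The fusion operations named above (averaging, weighted averaging, taking the maximum, and concatenation) can be sketched as follows. The function name `fuse`, the mode strings and the two short example vectors are hypothetical; real feature vectors would be much longer.

```python
def fuse(vectors, mode="mean", weights=None):
    """Combine equal-length feature vectors into one synthetic feature vector."""
    if mode == "mean":
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    if mode == "weighted":
        total = sum(weights)
        return [sum(w * v for w, v in zip(weights, col)) / total
                for col in zip(*vectors)]
    if mode == "max":
        return [max(col) for col in zip(*vectors)]
    if mode == "concat":
        return [v for vec in vectors for v in vec]
    raise ValueError(f"unknown fusion mode: {mode}")


features = [[1.0, 4.0], [3.0, 2.0]]   # outputs of two hypothetical models
fused_mean = fuse(features, "mean")
fused_weighted = fuse(features, "weighted", weights=[3, 1])
fused_max = fuse(features, "max")
fused_concat = fuse(features, "concat")
```

The first three modes keep the vector length unchanged, whereas concatenation multiplies it by the number of models, which affects the input size of the classifier.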

In this embodiment, to identify irregular behavior, its specific start frame and end frame need to be calibrated: a feature sequence of the video frames is extracted; a plurality of proposals of different sizes are generated at each position in the video by using a sliding window mechanism; an action classifier and a ranker are then trained to classify and rank the proposals; and the action boundaries in the temporal action detection are fine-tuned by using a CDC algorithm so that the action boundaries are more accurate.
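Generating candidate segments of several sizes with a sliding window can be sketched as follows. This is a minimal sketch: the function name `sliding_window_proposals`, the window sizes and the 50% overlap are hypothetical choices that this description does not specify.

```python
def sliding_window_proposals(num_frames, window_sizes, overlap=0.5):
    """Return (start, end) frame index pairs for every window size."""
    proposals = []
    for size in window_sizes:
        if size > num_frames:
            continue                      # window longer than the video
        stride = max(1, int(size * (1 - overlap)))
        for start in range(0, num_frames - size + 1, stride):
            proposals.append((start, start + size))
    return proposals


proposals = sliding_window_proposals(num_frames=64, window_sizes=[16, 32])
```

Each candidate would then be scored by the classifier and ranker, and the boundaries of the retained candidates refined afterwards.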

In this embodiment, a C3D model is adopted to extract features, and the output of the fully connected layer is then sent to the subsequent classifier. Convolutional neural networks (CNNs) have been widely used in computer vision in recent years, for tasks such as classification, detection and segmentation. These tasks are typically performed on images using two-dimensional convolution (that is, the convolution kernel is two-dimensional). For video analysis problems, two-dimensional convolution cannot capture temporal information well, so three-dimensional convolution was proposed. The C3D model was proposed as a general-purpose network and can be used in behavior recognition, scene recognition, video similarity analysis and other fields.

As shown in fig. 2a and fig. 2b, when 2D convolution is applied to a single-channel image or a multi-channel image (where the multiple channels may be the 3 color channels of one picture, or several stacked pictures, that is, a short video clip), the output for one filter is a two-dimensional feature map, and the multi-channel information is completely compressed. The output of the 3D convolution in fig. 2c, by contrast, is still a 3D feature map. The value at position (x, y, z) of the j-th feature map in the i-th layer can be found as follows:

$$v_{ij}^{xyz} = \tanh\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right)$$

where R_i is the size of the 3D convolution kernel in the temporal dimension (P_i and Q_i being its spatial sizes and b_{ij} the bias), and w_{ijm}^{pqr} is the value at position (p, q, r) of the convolution kernel connected to the m-th feature map of the previous layer. Consider a video segment input of size c × l × h × w, where c is the number of image channels (typically 3), l is the length of the video sequence, and h and w are the height and width of the video, respectively. A 3D convolution with kernel size 3 × 3 × 3, stride 1, edge padding and K filters produces an output of size K × l × h × w, after which pooling is performed.

The C3D network is shown in fig. 3; it has 8 convolution operations and 5 pooling operations. The convolution kernels are all 3 × 3 × 3 in size with a stride of 1 × 1 × 1. The number below each layer name is the number of convolution kernels. The pooling kernels are 2 × 2 × 2 in size with a stride of 2 × 2 × 2, except for the first pooling, whose size and stride are both 1 × 2 × 2; this avoids reducing the temporal length too early. After two fully connected layers, the network finally outputs a 4096-dimensional high-dimensional feature vector.
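The effect of the five pooling operations on the clip size can be traced with a short calculation. This is a sketch under stated assumptions: the pooling sizes are read as three-dimensional (time × height × width), that is, 1 × 2 × 2 for the first pooling and 2 × 2 × 2 for the others; the padded convolutions are assumed to preserve the size; pooling is assumed to use no padding; and the 16 × 112 × 112 input clip and the function names are hypothetical.

```python
def pooled_shape(shape, kernel, stride):
    """Output (length, height, width) of one pooling layer without padding."""
    return tuple((s - k) // st + 1 for s, k, st in zip(shape, kernel, stride))


def c3d_pooling_trace(length, height, width):
    """Shapes after each of the 5 pooling layers; convolutions keep the size."""
    shape = (length, height, width)
    trace = [shape]
    pools = [((1, 2, 2), (1, 2, 2))] + [((2, 2, 2), (2, 2, 2))] * 4
    for kernel, stride in pools:
        shape = pooled_shape(shape, kernel, stride)
        trace.append(shape)
    return trace


trace = c3d_pooling_trace(16, 112, 112)
```

The first pooling halves only the spatial size and leaves all 16 frames in place, so the temporal length is reduced to 1 only at the last pooling.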

In this embodiment, in step 3), the classifier model uses softmax and a multi-class support vector machine (multi-class SVM) to establish a mapping from the high-dimensional vector to the final class; the specific construction process is as follows:

In this embodiment, in step 3), training of the classifier model is further included:

correcting the obtained anchor frames by using the coordinate offsets, and matching the corrected anchor frames with the labeled instances to determine whether each anchor frame instance is a positive or a negative sample; wherein the SSAD model is trained using a loss function comprising a classification loss L_class, an overlap confidence regression loss L_over, a boundary regression loss L_loc and a regularization term L₂:

L = L_class + α·L_over + β·L_loc + λ·L₂(Θ)

where α, β and λ are coefficients;
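The weighted combination of the four loss terms can be sketched as follows. This is only an illustration: the function name `ssad_total_loss`, the default coefficient values, and the use of the sum of squared parameters as L₂(Θ) are assumptions, and the three component losses are taken as already computed numbers.

```python
def ssad_total_loss(l_class, l_over, l_loc, params,
                    alpha=1.0, beta=1.0, lam=1e-4):
    """L = L_class + alpha * L_over + beta * L_loc + lam * L2(Theta)."""
    l2 = sum(p * p for p in params)   # assumed form of the regularization term
    return l_class + alpha * l_over + beta * l_loc + lam * l2


total = ssad_total_loss(1.0, 0.5, 0.25, params=[1.0, 2.0],
                        alpha=2.0, beta=4.0, lam=0.1)
```

The coefficients α and β balance the overlap and boundary regression terms against classification, and λ controls the strength of the regularization.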

In this embodiment, in step 4), after all the predicted action instances of a section of video are obtained, a non-maximum suppression algorithm is used to deduplicate the overlapping predictions, so as to obtain the final temporal action detection result.
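Non-maximum suppression over temporal predictions can be sketched as follows. This is a minimal greedy version: the function names, the (start, end, score) tuple layout and the 0.5 overlap threshold are hypothetical choices not fixed by this description.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end, ...) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(predictions, iou_threshold=0.5):
    """Keep the highest-scoring predictions and drop overlapping duplicates."""
    kept = []
    for pred in sorted(predictions, key=lambda p: p[2], reverse=True):
        if all(temporal_iou(pred, k) <= iou_threshold for k in kept):
            kept.append(pred)
    return kept


predictions = [(0, 10, 0.9), (1, 11, 0.8), (20, 30, 0.7)]
detections = temporal_nms(predictions)
```

The second prediction overlaps the first almost entirely and has a lower score, so it is removed, while the third does not overlap and is kept.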

In this embodiment, the staff of the power supply business hall are mainly divided into two categories, namely leaders and service staff; each post shares common behavior specifications and also has its own specific ones. The following table lists the main non-standard behaviors for both categories. For each of the two categories of staff, a six-class classifier covering 5 irregular behaviors plus normal behavior is trained, as shown in Table 1 below:

Table 1:

The categories of non-standard behavior are defined according to the service specification manual of the power supply business hall, representative categories are selected for model training, and a grade is determined for each category of non-standard behavior. The statistics for each service person are reported to the manager at intervals; a program calculates a service standard coefficient from the counted frequency and the grade of each non-standard behavior of each service person using a given formula, and an early warning is issued if the coefficient exceeds a set threshold. In addition, the non-standard behaviors of service personnel are analyzed in the cloud, their frequency of occurrence and proportion are counted, a training classroom is established, and training courses with corresponding weights are assigned according to the non-standard behavior statistics of different service personnel; a demonstration project is also established, so that personalized training is realized.

Specifically, monitoring is carried out by depth cameras arranged in the four directions of the hall and at a 45-degree offset in front of the counter service personnel, so that hall and counter service personnel are monitored in real time. The actions of the service personnel are detected and learned through face recognition and action start and end frame detection, the learned results are compared with a cloud library of non-standard action features, and information such as the non-standard action features and early-warning level of each service person is recorded and stored in the cloud.

The behavior recognition device based on the RGB image is used for executing the behavior recognition method, has the advantages of the method, and is simple in structure and convenient to operate.

The invention further discloses a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, performs the steps of the RGB image-based behavior recognition method as described above.

The invention also discloses a computer device comprising a memory and a processor, wherein the memory is stored with a computer program, and the computer program executes the steps of the behavior recognition method based on the RGB image when being executed by the processor.

All or part of the flow of the method of the embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signals, telecommunications signals, software distribution media, and the like. The memory may be used to store computer programs and/or modules, and the processor performs various functions by running or executing the computer programs and/or modules stored in the memory and by invoking data stored in the memory. The memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment, and all technical solutions that fall under the idea of the present invention belong to its protection scope. It should be noted that those skilled in the art may make modifications and refinements without departing from the principle of the invention, and such modifications and refinements are also regarded as falling within the protection scope of the invention.

Claims

1. A behavior recognition method based on RGB images is characterized by comprising the following steps:

2. The RGB image-based behavior recognition method according to claim 1, wherein the step 3) is specifically:

3.2) after a feature sequence of length T is obtained, it is used as the input of the SSAD model; the SSAD model is a network composed of temporal convolutions and comprises three kinds of convolutional layers: a base layer, an anchor layer and a prediction layer, wherein the base layer is used for shortening the feature sequence and enlarging the receptive field of each position in the feature sequence;
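The effect attributed to the base layer (a shorter feature sequence and a larger receptive field per position) can be illustrated with a small calculation. The sketch below is not the claimed network: the function name `stack_summary` and the layer configuration (two convolutions with kernel 3, each followed by a pooling with kernel 2 and stride 2) are hypothetical values chosen only for illustration.

```python
def stack_summary(length, layers):
    """Track sequence length and receptive field through 1-D conv/pool layers.

    Each layer is a (kernel, stride, padding) tuple. The values used below
    are illustrative assumptions, not parameters taken from the patent.
    """
    receptive_field = 1  # input positions seen by one output position
    jump = 1             # distance (in input positions) between neighbouring outputs
    for kernel, stride, padding in layers:
        length = (length + 2 * padding - kernel) // stride + 1
        receptive_field += (kernel - 1) * jump
        jump *= stride
    return length, receptive_field


# Hypothetical base layer: conv(k=3) -> pool(k=2, s=2) -> conv(k=3) -> pool(k=2, s=2)
base_layers = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0)]

out_len, rf = stack_summary(64, base_layers)
print(out_len, rf)  # 16 10
```

With these assumed layers, a sequence of length T = 64 shrinks to 16 positions, and each remaining position covers 10 input positions.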

3. The RGB image-based behavior recognition method as claimed in claim 2, further comprising, in step 3), training of classifier models:

L = L_class + α·L_over + β·L_loc + λ·L₂(Θ)

wherein α, β and λ are coefficients;
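As a minimal sketch of how the weighted sum above is evaluated, the following function combines the four terms. It assumes that L₂(Θ) is the sum of squared model parameters, which the claim does not spell out; the function name `total_loss` and the default coefficient values are likewise hypothetical.

```python
def total_loss(l_class, l_over, l_loc, params, alpha=1.0, beta=1.0, lam=1e-4):
    """Combine classification, overlap and localization losses with L2 regularization.

    Assumption: L2(Theta) is taken as the sum of squared parameters.
    alpha, beta and lam correspond to the coefficients named in the claim;
    their default values here are illustrative only.
    """
    l2 = sum(p * p for p in params)
    return l_class + alpha * l_over + beta * l_loc + lam * l2


# Example with made-up loss values and two parameters.
loss = total_loss(1.0, 0.5, 0.25, [1.0, 2.0], alpha=2.0, beta=4.0, lam=0.1)
print(loss)
```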

4. The behavior recognition method based on RGB images as claimed in claim 3, wherein in step 4), after all the predicted action instances of a video segment are obtained, the overlapping predictions are de-duplicated by using a non-maximum suppression (NMS) algorithm, so as to obtain the final temporal action detection result.
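The de-duplication step can be sketched as a greedy one-dimensional non-maximum suppression over time intervals. The representation of a prediction as a `(start, end, score)` tuple, the function names and the threshold of 0.5 are assumptions made for this illustration, not details given in the claim.

```python
def temporal_iou(a, b):
    """Intersection over union of two time intervals given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(predictions, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring prediction, drop overlapping ones.

    predictions: list of (start, end, score) tuples (assumed format).
    """
    kept = []
    for pred in sorted(predictions, key=lambda p: p[2], reverse=True):
        if all(temporal_iou(pred, k) <= iou_threshold for k in kept):
            kept.append(pred)
    return kept


preds = [(0, 10, 0.9), (1, 11, 0.8), (20, 30, 0.7)]
print(temporal_nms(preds))  # [(0, 10, 0.9), (20, 30, 0.7)]
```

The second prediction overlaps the first with an IoU of 9/11, so it is suppressed, while the third does not overlap and is kept.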

5. The behavior recognition method based on RGB images as claimed in any one of claims 1-4, wherein in step 2), the image feature parameters in the RGB images are extracted by a C3D model; the C3D model includes 8 convolution operations and 5 pooling operations; the convolution kernels are all 3 × 3 × 3 in size with a stride of 1 × 1 × 1; the pooling kernels are 2 × 2 × 2 in size with a stride of 2 × 2 × 2, except for the first pooling, whose size and stride are both 1 × 2 × 2 so as not to reduce the length in the temporal dimension too early; finally, after two fully connected layers, a 4096-dimensional high-dimensional vector is obtained.
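To see why the first pooling leaves the temporal length untouched, the clip shape can be traced through the five pooling operations. The sketch assumes 3-D kernels written as time × height × width (1 × 2 × 2 for the first pooling, 2 × 2 × 2 for the rest), convolutions padded so that they preserve the shape, no padding in the pooling operations, and a 16-frame 112 × 112 input clip; none of these input sizes is stated in the claim.

```python
def pool_out(shape, kernel, stride):
    """Output (time, height, width) of an unpadded pooling operation."""
    return tuple((s - k) // st + 1 for s, k, st in zip(shape, kernel, stride))


def trace_pooling(shape):
    """Return the shape after each of the five pooling operations.

    Assumption: shape-preserving convolutions, so only pooling changes the shape.
    """
    pools = [((1, 2, 2), (1, 2, 2))] + [((2, 2, 2), (2, 2, 2))] * 4
    shapes = [shape]
    for kernel, stride in pools:
        shape = pool_out(shape, kernel, stride)
        shapes.append(shape)
    return shapes


# Hypothetical input clip: 16 frames of 112 x 112 pixels.
for step, shape in enumerate(trace_pooling((16, 112, 112))):
    print(step, shape)
```

Under these assumptions the temporal length stays at 16 after the first pooling and is then halved by each later pooling, reaching 1 after the fifth.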

6. The behavior recognition method based on RGB images as claimed in any one of claims 1-4, wherein in step 1), the preprocessing of the video frame specifically comprises: segmenting the region of a worker with a background extraction algorithm, locating the target region by computing connected domains with a voting algorithm, capturing or tracking the target, and finally obtaining an image containing only a single target; wherein the motion region in the image is extracted by subtracting the pixel values of two adjacent frames, or of two frames separated by several frames, in the video stream and thresholding the difference image; or by performing a difference operation between the currently acquired image frame and a background image to obtain a grayscale image of the target motion region and thresholding the grayscale image to extract the motion region, the background image being updated according to the currently acquired image frame.
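The two differencing variants described above can be sketched in a few lines. Frames are represented here as nested lists of grayscale values to keep the example dependency-free. The threshold of 25 and the running-average background update with rate 0.05 are assumptions: the claim only states that the background is updated from the current frame, not how.

```python
def difference_mask(reference, frame, threshold=25):
    """Binary motion mask: 1 where |frame - reference| exceeds the threshold.

    `reference` is either an earlier frame (frame differencing) or the
    background image (background subtraction).
    """
    return [[1 if abs(f - r) > threshold else 0 for r, f in zip(rrow, frow)]
            for rrow, frow in zip(reference, frame)]


def update_background(background, frame, rate=0.05):
    """Running-average background update (an assumed update rule)."""
    return [[(1 - rate) * b + rate * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


prev_frame = [[10, 10], [10, 10]]
curr_frame = [[10, 200], [12, 10]]

print(difference_mask(prev_frame, curr_frame))  # [[0, 1], [0, 0]]
background = update_background(prev_frame, curr_frame)
```

Only the pixel that changed by more than the threshold is marked as moving; the small change from 10 to 12 is treated as noise.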

7. The behavior recognition method based on RGB images as claimed in any one of claims 1-4, wherein in step 1), the preprocessing of the video frame further comprises calibrating the specific start frame and end frame of the irregular behavior to be recognized, the specific process being: extracting a feature sequence of the video frames; generating a plurality of proposals of different sizes at each position in the video by using a sliding window mechanism; training an action classifier and a ranker to classify and rank the proposals; and fine-tuning the action boundaries in the temporal action detection by using the CDC algorithm so that the action boundaries are more accurate.
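The sliding window step can be illustrated as follows. This sketch only generates candidate (start, end) frame ranges; the classifier, the ranking and the CDC boundary refinement are not shown. The window sizes and the 50% overlap are hypothetical values, and the function name is chosen for this example.

```python
def sliding_window_proposals(num_frames, window_sizes=(16, 32, 64), overlap=0.5):
    """Generate (start, end) frame ranges of several sizes over a video.

    end is exclusive. Window sizes and overlap are illustrative assumptions.
    """
    proposals = []
    for size in window_sizes:
        if size > num_frames:
            continue  # window does not fit in the video
        step = max(1, int(size * (1 - overlap)))
        for start in range(0, num_frames - size + 1, step):
            proposals.append((start, start + size))
    return proposals


candidates = sliding_window_proposals(64, window_sizes=(16, 32))
print(len(candidates))  # 10
print(candidates[0], candidates[-1])  # (0, 16) (32, 64)
```

Each candidate range would then be scored and ranked, and the boundaries of the retained ranges refined.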

8. A behavior recognition device based on RGB image is characterized by comprising

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the RGB image-based behavior recognition method according to any one of claims 1 to 7.

10. A computer device comprising a memory and a processor, the memory having stored thereon a computer program, wherein the computer program, when executed by the processor, performs the steps of the RGB image-based behavior recognition method according to any one of claims 1 to 7.