CN110610159A - Real-time bus passenger flow volume statistical method

Real-time bus passenger flow volume statistical method

Info

Publication number
CN110610159A
CN110610159A
Authority
CN
China
Prior art keywords
feature
feature mapping
convolution
passenger flow
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910869554.8A
Other languages
Chinese (zh)
Inventor
靳展
章国泰
钟明旸
王红广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Card Intelligent Network Polytron Technologies Inc
Original Assignee
Tianjin Card Intelligent Network Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Card Intelligent Network Polytron Technologies Inc filed Critical Tianjin Card Intelligent Network Polytron Technologies Inc
Priority to CN201910869554.8A
Publication of CN110610159A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a real-time bus passenger flow volume statistical method that detects and tracks human heads in video images. The video data are passenger flow videos captured by cameras mounted above the front and rear doors of a bus, and the statistical method comprises the following steps: front-end data acquisition, model training, head tracking and passenger flow counting. Compared with the prior art, the invention has the advantage that the head is detected with a deep learning method, the method runs in real time, and the passenger flow can be counted efficiently and accurately.

Description

Real-time bus passenger flow volume statistical method
Technical field:
The invention relates to the technical field of image processing in pattern recognition, and in particular to a real-time bus passenger flow volume statistical method.
Background art:
Infrared devices and pressure sensors are earlier techniques for passenger flow statistics, but their detection error is large, so they have gradually been abandoned. In recent years, with the continuous development of deep learning and GPU parallel computing, computer vision has advanced rapidly. Passenger flow statistics is an important application in the field of image processing, and is also a new field and direction in current intelligent video surveillance.
In recent years, passenger flow statistics methods have fallen mainly into three categories: detection based on feature points, detection based on human body segmentation and tracking, and detection based on deep learning. The accuracy of the methods based on feature points and on human body segmentation and tracking still needs to be improved. As hardware conditions continue to improve and spread, detection methods based on deep learning are receiving more and more attention.
Summary of the invention:
the invention aims to provide a method for detecting the human head in a video image and realize end-to-end training. Such a method should be able to detect the passenger flow efficiently and accurately. The specific technical scheme is as follows:
the video data of the method is passenger flow video data shot from the top areas of the front door and the rear door of the bus by using a camera, and the statistical method comprises the following steps:
step 1: front-end data acquisition:
step 1.1: video acquisition: vertically installing a camera right above a bus door, and acquiring an image video of passengers getting on and off the bus;
step 1.2: video framing, namely framing the video and dividing the video into 640 × 480 RGB three-channel images;
step 1.3: image zooming: scaling each frame of image into 224 × 224 data;
step 2: training a model;
step 2.1: feature calculation: extracting the features of the image with a 3 × 3 convolution kernel; then, to reduce the feature computation time, the ordinary convolution operation is replaced by a depthwise separable convolution operation, a series of mathematical transformations are applied to the feature mapping with Batch Normalization and the nonlinear activation function ReLU, and the spatial size of the feature mapping is repeatedly halved with max pooling; the feature mapping is then convolved with a 1 × 1 convolution of stride 1, and Batch Normalization and ReLU are applied again to set the number of channels of the feature mapping;
step 2.2: feature extraction: because the target sizes are different, feature mappings of different dimensions need to be extracted;
step 2.3: extracting anchor frames: on feature mappings of different dimensions, selecting four frames of different scales centered at each feature point, and taking all of these frames as candidate frames for target classification and regression;
step 2.4: target classification: convolving the feature mapping with the depthwise separable convolution, obtaining the prediction score of each candidate box obtained in step 2.3 for every category, comparing the scores with the ground truth values, and computing the cross entropy between them to obtain the classification loss; the network parameters are optimized with stochastic gradient descent until the classification loss reaches the specified range;
step 2.5: target regression: convolving the feature mapping with the depthwise separable convolution, obtaining the predicted center point, width and height of each candidate frame obtained in step 2.3, comparing the predictions with the ground truth values, and computing the linear regression loss between them; the network parameters are optimized with stochastic gradient descent until the linear regression loss reaches the specified range;
step 3: head detection;
step 3.1: feature calculation: extracting the features of the image with a 3 × 3 convolution kernel; then, to reduce the feature computation time, the ordinary convolution operation is replaced by a depthwise separable convolution operation, the feature mapping is transformed with Batch Normalization and the nonlinear activation function ReLU, and the spatial size of the feature mapping is repeatedly halved with max pooling; the feature mapping is then convolved with a 1 × 1 convolution of stride 1, and Batch Normalization and ReLU are applied again to set the number of channels of the feature mapping;
step 3.2: feature extraction: because the target sizes are different, feature mappings of different dimensions need to be extracted;
step 3.3: extracting anchor frames: on feature mappings of different dimensions, selecting four frames of different scales centered at each feature point, and taking all of these frames as candidate frames for target classification and regression;
step 3.4: target classification: convolving the feature mapping with the depthwise separable convolution, obtaining the prediction score of each candidate frame obtained in step 2.3 for every category, and screening the candidate frames with Intersection over Union (IoU) and non-maximum suppression (NMS);
step 3.5: target regression: convolving the feature mapping with the depthwise separable convolution, obtaining the predicted center point, width and height of each candidate frame obtained in step 2.3, and screening the candidate frames with Intersection over Union (IoU) and non-maximum suppression (NMS);
step 4: head tracking: a kernelized correlation filter tracks each detection window and forms a trajectory; if the trajectory crosses the specified boundary, the passenger has completed the boarding or alighting action;
step 5: passenger flow counting: if a passenger has completed the boarding or alighting action, the algorithm increases the passenger flow count by 1; otherwise the count remains unchanged;
As one of the preferable schemes, the specific process of step 2.1 is as follows: first, a convolution with a 3 × 3 kernel and stride 1 is applied to the RGB image, the feature mapping is then transformed with Batch Normalization and the nonlinear activation function ReLU, and the feature mapping is down-sampled with max pooling; next, the feature mapping is convolved with a 3 × 3 depthwise separable convolution of stride 1, transformed with Batch Normalization and ReLU, and down-sampled again with max pooling; finally, the feature mapping is convolved with a 1 × 1 convolution of stride 1, and Batch Normalization and ReLU are applied to set the number of channels of the feature mapping.
As a second preferred solution, the selection method of the feature mapping in step 2.2 is as follows: all the different feature mappings need to be selected due to the different sizes of the targets; for small targets, a larger feature mapping needs to be selected for classification and regression; for large targets, a smaller feature map needs to be selected for classification and regression.
As a third preferred scheme, in step 2.3, the candidate frame selection method comprises: selecting four frames with different scales on feature mapping with different dimensions by taking each feature point as a center; this allows the entire area of the image to be covered to avoid missing portions.
As a fourth preferred solution, in step 2.4, the cross entropy loss is calculated as follows:
Lloc = −(1/n) · Σ_{i=1..n} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]
where y_i denotes the ground truth value, ŷ_i denotes the predicted value, and n is the number of candidate frames.
As a fifth preferred embodiment, the linear regression loss in step 2.5 is calculated as follows:
Lreg = (α/n) · Σ_{i=1..n} [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (w_i − ŵ_i)² + (h_i − ĥ_i)² ]
where x_i, y_i, w_i, h_i denote the ground truth values, x̂_i, ŷ_i, ŵ_i, ĥ_i denote the predicted values, α is a coefficient that must be set to a suitable value for the specific scene, and n is the number of candidate frames.
As a further preferable embodiment of the fifth preferable embodiment, the overall loss L is further obtained in the step 2.5, and the calculation manner of L is as follows:
L=Lloc+βLreg
wherein β is a coefficient;
and optimizing the parameters of the network by using a random gradient descent method until the total loss reaches a specified range.
As a sixth preferred embodiment, the method further comprises the step 6: storing the coordinates and width and height of the passenger head detected by each frame of image into a file of a detection result, and storing each frame of image; when the bus stops running, the method can output the final passenger flow.
Compared with the prior art, the invention has the advantages that:
the detection method has real-time performance and can efficiently and accurately detect the passenger flow.
Depth separable convolution (II) can speed up the time of feature extraction.
And (III) in the model training process, end-to-end training can be realized, and all training can be completed on the GPU.
And (IV) maximum value sampling maxporoling is used for replacing the step size to be 2, so that the loss of detail information can be avoided, and the smoothness of the target is improved.
Description of the drawings:
fig. 1 is a flow chart of a bus passenger flow volume statistical method in the embodiment of the invention.
Fig. 2 is a schematic flow chart of a convolution module in the embodiment of the present invention.
FIG. 3 is a flow chart illustrating a depth separable convolution according to an embodiment of the present invention.
Detailed description of the embodiments:
Embodiment:
A real-time bus passenger flow volume statistical method, in which the video data are passenger flow videos captured by cameras mounted above the front and rear doors of the bus; the statistical method comprises the following steps:
step 1: front-end data acquisition:
step 1.1: video acquisition: vertically installing a camera right above a bus door, and acquiring an image video of passengers getting on and off the bus;
step 1.2: video framing, namely framing the video and dividing the video into 640 × 480 RGB three-channel images;
step 1.3: image zooming: scaling each frame of image into 224 × 224 data;
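To illustrate steps 1.2 and 1.3, a minimal Python/OpenCV sketch of the framing and scaling stage is given below; the function name, the generator structure and the handling of the video file path are illustrative assumptions rather than part of the original filing:

```python
import cv2

def door_camera_frames(video_path, capture_size=(640, 480), model_size=(224, 224)):
    """Split a door-camera video into frames: 640x480 RGB capture size, 224x224 network input."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame_bgr = cap.read()                               # one BGR frame per iteration
        if not ok:
            break
        frame_bgr = cv2.resize(frame_bgr, capture_size)          # enforce 640x480
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)   # three-channel RGB
        yield cv2.resize(frame_rgb, model_size)                  # 224x224 model input
    cap.release()
```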
step 2: completing model training on a CPU or a GPU;
step 2.1: feature calculation: first, a convolution with a 3 × 3 kernel and stride 1 is applied to the RGB image, the feature mapping is then transformed with Batch Normalization and the nonlinear activation function ReLU, and the feature mapping is down-sampled with max pooling; next, the feature mapping is convolved with a 3 × 3 depthwise separable convolution of stride 1, transformed with Batch Normalization and ReLU, and down-sampled again with max pooling; finally, the feature mapping is convolved with a 1 × 1 convolution of stride 1, and Batch Normalization and ReLU are applied to set the number of channels of the feature mapping;
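A minimal PyTorch sketch of the feature-computation block of step 2.1 (3 × 3 convolution, Batch Normalization and ReLU, max pooling, a depthwise separable 3 × 3 convolution, and a final 1 × 1 convolution that sets the channel count) is shown below; the channel widths are illustrative assumptions, not values fixed by the filing:

```python
import torch.nn as nn

class FeatureBlock(nn.Module):
    """3x3 conv -> BN/ReLU -> maxpool -> depthwise separable 3x3 conv -> BN/ReLU -> maxpool -> 1x1 conv."""
    def __init__(self, in_ch=3, mid_ch=32, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                       # halve the spatial size instead of using stride 2
        )
        self.depthwise_separable = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1, groups=mid_ch),  # depthwise
            nn.Conv2d(mid_ch, mid_ch, kernel_size=1),                                      # pointwise
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.channel_proj = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1),   # 1x1 conv sets the channel count
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # e.g. a (1, 3, 224, 224) input yields a (1, 64, 56, 56) feature mapping
        return self.channel_proj(self.depthwise_separable(self.stem(x)))
```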
step 2.2: feature extraction: because the target sizes are different, feature mappings of different dimensions need to be extracted; for small targets, a larger feature mapping needs to be selected for classification and regression; for a large target, a smaller feature mapping needs to be selected for classification and regression;
step 2.3: extracting an anchor frame: selecting four frames with different scales on feature mapping with different dimensions by taking each feature point as a center; thus, all areas of the image can be completely covered to avoid missed detection;
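The anchor extraction of step 2.3 can be sketched as follows, placing four differently scaled square boxes at every feature-map location; the stride and the four scales are illustrative assumptions:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(16, 24, 32, 48)):
    """Return (feat_h * feat_w * len(scales), 4) anchor boxes as (cx, cy, w, h) in input-image pixels."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride   # centre of this feature point
            for s in scales:
                anchors.append((cx, cy, s, s))                    # four scales per location
    return np.array(anchors, dtype=np.float32)

# e.g. a 28x28 feature mapping from a 224x224 input (stride 8) yields 28*28*4 = 3136 candidate boxes
```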
step 2.4: target classification: convolving the feature mapping with the depthwise separable convolution, obtaining the prediction score of each candidate box obtained in step 2.3 for every category, comparing the scores with the ground truth values, and computing the cross entropy between them to obtain the classification loss; the network parameters are optimized with stochastic gradient descent until the classification loss reaches the specified range; the cross entropy loss is calculated as follows:
Lloc = −(1/n) · Σ_{i=1..n} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]
where y_i denotes the ground truth value, ŷ_i denotes the predicted value, and n is the number of candidate boxes;
step 2.5: target regression: convolving the feature mapping with the depthwise separable convolution, obtaining the predicted center point, width and height of each candidate frame obtained in step 2.3, comparing the predictions with the ground truth values, and computing the linear regression loss between them; the linear regression loss is calculated as follows:
Lreg = (α/n) · Σ_{i=1..n} [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (w_i − ŵ_i)² + (h_i − ĥ_i)² ]
where x_i, y_i, w_i, h_i denote the ground truth values, x̂_i, ŷ_i, ŵ_i, ĥ_i denote the predicted values, α is a coefficient (α = 0.5 in this scene; other scenes should choose a suitable value according to the specific scene), and n is the number of candidate frames;
the overall loss function of the model is calculated as follows:
L=Lloc+βLreg
wherein β is a coefficient, β is 0.25 in the present scenario, and other scenarios need to select a suitable value according to a specific scenario;
optimizing the parameters of the network with stochastic gradient descent until the total loss reaches the specified range;
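A sketch of the training objective and the stochastic-gradient-descent update of steps 2.4–2.5 is given below, assuming a binary cross-entropy classification loss and a squared-error box regression loss with the coefficients α and β of this embodiment; the tensor shapes, the assumption that the classification scores are already sigmoid probabilities, and the network object are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(cls_scores, cls_targets, box_preds, box_targets, alpha=0.5, beta=0.25):
    """L = Lloc + beta * Lreg, with cross-entropy classification and squared-error box regression."""
    # cls_scores: (n,) sigmoid probabilities; cls_targets: (n,) 0/1 labels for the candidate boxes
    l_loc = F.binary_cross_entropy(cls_scores, cls_targets)             # mean over n candidate boxes
    # box_preds / box_targets: (n, 4) tensors holding (cx, cy, w, h) per candidate box
    l_reg = alpha * ((box_preds - box_targets) ** 2).sum(dim=1).mean()
    return l_loc + beta * l_reg

# One SGD step over a mini-batch of candidate boxes (model and data pipeline assumed to exist):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# loss = total_loss(pred_scores, gt_labels, pred_boxes, gt_boxes)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```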
step 3: head detection, completed on a CPU or GPU;
step 3.1: feature calculation: extracting the features of the image with a 3 × 3 convolution kernel; then, to reduce the feature computation time, the ordinary convolution operation is replaced by a depthwise separable convolution operation, the feature mapping is transformed with Batch Normalization and the nonlinear activation function ReLU, and the spatial size of the feature mapping is repeatedly halved with max pooling; the feature mapping is then convolved with a 1 × 1 convolution of stride 1, and Batch Normalization and ReLU are applied again to set the number of channels of the feature mapping;
step 3.2: feature extraction: because the target sizes are different, feature mappings of different dimensions need to be extracted;
step 3.3: extracting anchor frames: on feature mappings of different dimensions, selecting four frames of different scales centered at each feature point, and taking all of these frames as candidate frames for target classification and regression;
step 3.4: target classification: convolving the feature mapping with the depthwise separable convolution, obtaining the prediction score of each candidate frame obtained in step 2.3 for every category, and screening the candidate frames with Intersection over Union (IoU) and non-maximum suppression (NMS);
step 3.5: target regression: convolving the feature mapping with the depthwise separable convolution, obtaining the predicted center point, width and height of each candidate frame obtained in step 2.3, and screening the candidate frames with Intersection over Union (IoU) and non-maximum suppression (NMS);
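The screening of candidate boxes in steps 3.4–3.5 relies on Intersection over Union and non-maximum suppression; a minimal NumPy sketch is given below, where the score threshold and IoU threshold are illustrative assumptions:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, score_thr=0.5, iou_thr=0.45):
    """Keep high-scoring boxes and drop any box that overlaps an already kept box too much."""
    order = np.argsort(-scores)                    # indices sorted by descending score
    order = order[scores[order] > score_thr]       # discard low-confidence candidates
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thr]
    return boxes[keep]
```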
step 4: head tracking: a kernelized correlation filter is trained from the information of consecutive frames and correlated with the newly input frame; the resulting confidence map is the predicted tracking result, and the point or block with the highest score is taken as the tracking position;
step 5: passenger flow counting: if a passenger has completed the boarding or alighting action, the algorithm increases the passenger flow count by 1; otherwise the count remains unchanged;
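The head-tracking and counting logic of steps 4 and 5 can be sketched with OpenCV's built-in KCF (kernelized correlation filter) tracker and a simple line-crossing rule; the counting-line position, the tracker API from opencv-contrib, and the bookkeeping structure are illustrative assumptions:

```python
import cv2

COUNT_LINE_Y = 240          # assumed horizontal counting line in the 640x480 door view

class HeadTrack:
    """One tracked head: a KCF tracker plus the trajectory of its centre points."""
    def __init__(self, frame, box):
        self.tracker = cv2.TrackerKCF_create()     # opencv-contrib KCF tracker
        self.tracker.init(frame, tuple(box))       # box = (x, y, w, h) from the head detector
        self.centers = [(box[0] + box[2] / 2, box[1] + box[3] / 2)]

    def update(self, frame):
        ok, (x, y, w, h) = self.tracker.update(frame)
        if ok:
            self.centers.append((x + w / 2, y + h / 2))
        return ok

    def crossed_line(self):
        """True once the trajectory has passed the counting line (boarding/alighting finished)."""
        ys = [c[1] for c in self.centers]
        return min(ys) < COUNT_LINE_Y < max(ys)

def update_count(tracks, passenger_count):
    """Step 5: add 1 for every track whose trajectory has crossed the line."""
    finished = [t for t in tracks if t.crossed_line()]
    passenger_count += len(finished)
    remaining = [t for t in tracks if t not in finished]
    return remaining, passenger_count
```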
Step 6: storing the coordinates and width and height of the passenger head detected by each frame of image into a file of a detection result, and storing each frame of image; when the bus stops running, the method can output the final passenger flow.
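Step 6 stores per-frame detection results and the frame images; a small sketch of that bookkeeping is shown below, where the file layout and naming are illustrative assumptions:

```python
import os
import cv2

def save_frame_result(frame_idx, frame_bgr, head_boxes, result_file, image_dir="frames"):
    """Append the detected head boxes (x, y, w, h) of one frame to a text file and save the frame image."""
    os.makedirs(image_dir, exist_ok=True)
    with open(result_file, "a") as f:
        for x, y, w, h in head_boxes:
            f.write(f"{frame_idx} {x} {y} {w} {h}\n")
    cv2.imwrite(os.path.join(image_dir, f"frame_{frame_idx:06d}.jpg"), frame_bgr)
```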

Claims (10)

1. A real-time bus passenger flow rate statistical method is characterized by comprising the following steps:
step 1: front-end data acquisition:
step 1.1: video acquisition: vertically installing a camera right above a bus door, and acquiring an image video of passengers getting on and off the bus;
step 1.2: video framing: framing the video and dividing it into 640 × 480 RGB three-channel images;
step 1.3: image zooming: scaling each frame of image into 224 × 224 data;
step 2: model training:
step 2.1: feature calculation: extracting the features of the image with a 3 × 3 convolution kernel; then, to reduce the feature computation time, the ordinary convolution operation is replaced by a depthwise separable convolution operation, a series of mathematical transformations are applied to the feature mapping with a normalization function and a nonlinear activation function, and the spatial size of the feature mapping is repeatedly halved with the max-value down-sampling method; the feature mapping is then convolved with a 1 × 1 convolution of stride 1, and normalization and the nonlinear activation function are applied to set the number of channels of the feature mapping;
step 2.2: feature extraction: because the target sizes are different, feature mappings of different dimensions need to be extracted;
step 2.3: extracting an anchor frame: on feature mapping with different dimensions, taking each feature point as a center, selecting four frames with different scales, and taking all the frames as candidate frames for target classification and regression;
step 2.4: target classification: convolving the feature mapping with the depthwise separable convolution, obtaining the prediction score of each candidate box obtained in step 2.3 for every category, comparing the scores with the ground truth values, and computing the cross entropy between them to obtain the classification loss; the network parameters are optimized with stochastic gradient descent until the classification loss reaches the specified range;
step 2.5: target regression: convolving the feature mapping with the depthwise separable convolution, obtaining the predicted center point, width and height of each candidate frame obtained in step 2.3, comparing the predictions with the ground truth values, and computing the linear regression loss between them; the network parameters are optimized with stochastic gradient descent until the linear regression loss reaches the specified range;
and step 3: head detection:
step 3.1: feature calculation: extracting the features of the image with a 3 × 3 convolution kernel; then, to reduce the feature computation time, the ordinary convolution operation is replaced by a depthwise separable convolution operation, a series of mathematical transformations are applied to the feature mapping with a normalization function and a nonlinear activation function, and the spatial size of the feature mapping is repeatedly halved with the max-value down-sampling method; the feature mapping is then convolved with a 1 × 1 convolution of stride 1, and normalization and the nonlinear activation function are applied to set the number of channels of the feature mapping;
step 3.2: feature extraction: because the target sizes are different, feature mappings of different dimensions need to be extracted;
step 3.3: extracting anchor frames: on feature mappings of different dimensions, selecting four frames of different scales centered at each feature point, and taking all of these frames as candidate frames for target classification and regression;
step 3.4: target classification: convolving the feature mapping with the depthwise separable convolution, obtaining the prediction score of each candidate frame obtained in step 2.3 for every category, and screening the candidate frames with non-maximum suppression;
step 3.5: target regression: convolving the feature mapping with the depthwise separable convolution, obtaining the predicted center point, width and height of each candidate frame obtained in step 2.3, and screening the candidate frames with non-maximum suppression;
step 4: head tracking:
tracking the detection window with a kernelized correlation filter method and forming a trajectory; if the trajectory crosses a specified boundary, it indicates that the passenger has completed the boarding or alighting action;
step 5: passenger flow counting:
if the passenger has finished getting on or off, the algorithm increases the passenger flow count by 1; otherwise, the passenger flow count remains unchanged.
2. The method for real-time bus passenger flow statistics according to claim 1, wherein the specific process of step 2.1 is as follows: first, a convolution with a 3 × 3 kernel and stride 1 is applied to the RGB image, the feature mapping is then transformed with the normalization and nonlinear activation functions and down-sampled with the max-value down-sampling method; next, the feature mapping is convolved with a 3 × 3 depthwise separable convolution of stride 1, transformed with the normalization and nonlinear activation functions, and down-sampled again with the max-value down-sampling method; finally, the feature mapping is convolved with a 1 × 1 convolution of stride 1, and the normalization and nonlinear activation functions are applied to set the number of channels of the feature mapping.
3. The method for real-time statistics of the passenger flow volume of the bus according to claim 1, wherein the selection method of the feature mapping in the step 2.2 is as follows: all the different feature mappings need to be selected due to the different sizes of the targets; for small targets, a larger feature mapping needs to be selected for classification and regression; for large targets, a smaller feature map needs to be selected for classification and regression.
4. The method for real-time bus passenger flow statistics according to claim 1, wherein the selection method of the candidate frame in the step 2.3 comprises the following steps: on feature maps of different dimensions; selecting four frames with different scales by taking each feature point as a center; this allows the entire area of the image to be covered to avoid missing portions.
5. The method for real-time bus passenger flow statistics according to claim 1, characterized in that the cross entropy loss in step 2.4 is calculated as follows:
Lloc = −(1/n) · Σ_{i=1..n} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]
where y_i denotes the ground truth value, ŷ_i denotes the predicted value, and n is the number of candidate frames.
6. The method for real-time statistics of bus passenger flow according to claim 1, characterized in that the linear regression loss in step 2.5 is calculated as follows:
Lreg = (α/n) · Σ_{i=1..n} [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (w_i − ŵ_i)² + (h_i − ĥ_i)² ]
where x_i, y_i, w_i, h_i denote the ground truth values, x̂_i, ŷ_i, ŵ_i, ĥ_i denote the predicted values, α is a coefficient, and n is the number of candidate frames.
7. The method according to claim 6, wherein the step 2.5 further obtains the total loss L, and the calculation method of L is as follows:
L=Lloc+βLreg
wherein β is a coefficient;
and optimizing the parameters of the network with stochastic gradient descent until the total loss reaches a specified range.
8. The method for real-time bus passenger flow statistics according to any one of claims 1-7, characterized in that in step 2 and step 3, the end-to-end operation is completed on the CPU or GPU.
9. The method for real-time statistics of bus passenger flow according to any one of claims 1-7, wherein in step 4, a correlation kernel filter is trained according to the information of the previous and next frames, and correlation calculation is performed with the newly inputted frame, and the obtained confidence map is the predicted tracking result; the point or block with the highest score is the tracking result.
10. The method for real-time statistics of bus passenger flow according to any one of claims 1-7, characterized by further comprising:
step 6: storing the coordinates and width and height of the passenger head detected by each frame of image into a file of a detection result, and storing each frame of image; when the bus stops running, the method can output the final passenger flow.
CN201910869554.8A 2019-09-16 2019-09-16 Real-time bus passenger flow volume statistical method Pending CN110610159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910869554.8A CN110610159A (en) 2019-09-16 2019-09-16 Real-time bus passenger flow volume statistical method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910869554.8A CN110610159A (en) 2019-09-16 2019-09-16 Real-time bus passenger flow volume statistical method

Publications (1)

Publication Number Publication Date
CN110610159A (en) 2019-12-24

Family

ID=68892777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910869554.8A Pending CN110610159A (en) 2019-09-16 2019-09-16 Real-time bus passenger flow volume statistical method

Country Status (1)

Country Link
CN (1) CN110610159A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2704060A2 (en) * 2012-09-03 2014-03-05 Vision Semantics Limited Crowd density estimation
CN105163121A (en) * 2015-08-24 2015-12-16 西安电子科技大学 Large-compression-ratio satellite remote sensing image compression method based on deep self-encoding network
CN107609512A (en) * 2017-09-12 2018-01-19 上海敏识网络科技有限公司 A video face capture method based on a neural network
CN109285376A (en) * 2018-08-09 2019-01-29 同济大学 A kind of bus passenger flow statistical analysis system based on deep learning
CN109117794A (en) * 2018-08-16 2019-01-01 广东工业大学 A kind of moving target behavior tracking method, apparatus, equipment and readable storage medium storing program for executing
CN110147807A (en) * 2019-01-04 2019-08-20 上海海事大学 A kind of ship intelligent recognition tracking

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈响: "Research on Key Frame Extraction Algorithms for Bus Scene Video", China Master's Theses Full-text Database, Information Science and Technology Series *
黄俊洁 et al.: "Vehicle Object Detection Based on Fusion of Global and Local Convolutional Features", Journal of Southwest University of Science and Technology *

Similar Documents

Publication Publication Date Title
CN108229338B (en) Video behavior identification method based on deep convolution characteristics
CN108053419B (en) Multi-scale target tracking method based on background suppression and foreground anti-interference
CN108550161B (en) Scale self-adaptive kernel-dependent filtering rapid target tracking method
CN110033473B (en) Moving target tracking method based on template matching and depth classification network
CN109583483B (en) Target detection method and system based on convolutional neural network
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN112257569B (en) Target detection and identification method based on real-time video stream
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN109087337B (en) Long-time target tracking method and system based on hierarchical convolution characteristics
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN113822352B (en) Infrared dim target detection method based on multi-feature fusion
CN106529441B (en) Depth motion figure Human bodys' response method based on smeared out boundary fragment
CN114463677A (en) Safety helmet wearing detection method based on global attention
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN111414938B (en) Target detection method for bubbles in plate heat exchanger
CN112926552A (en) Remote sensing image vehicle target recognition model and method based on deep neural network
CN113763427A (en) Multi-target tracking method based on coarse-fine shielding processing
CN107247967B (en) Vehicle window annual inspection mark detection method based on R-CNN
CN112396036A (en) Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
CN110008834B (en) Steering wheel intervention detection and statistics method based on vision
CN113256683B (en) Target tracking method and related equipment
CN107679467B (en) Pedestrian re-identification algorithm implementation method based on HSV and SDALF
CN112767450A (en) Multi-loss learning-based related filtering target tracking method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
Effective date of abandoning: 20231229