CN111814755A - Multi-frame image pedestrian detection method and device for night motion scene


Info

Publication number
CN111814755A
Authority
CN
China
Prior art keywords
network
frame
night
detection
pedestrian detection
Prior art date
Legal status
Pending
Application number
CN202010832374.5A
Other languages
Chinese (zh)
Inventor
陈海波
罗志鹏
徐振宇
姚粤汉
Current Assignee
Shenyan Technology Beijing Co ltd
Original Assignee
Shenyan Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Shenyan Technology Beijing Co ltd filed Critical Shenyan Technology Beijing Co ltd
Priority to CN202010832374.5A
Publication of CN111814755A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-frame image pedestrian detection method and device for night motion scenes. The method comprises the following steps: acquiring a data set containing multiple night multi-frame image sequences, and performing enhancement processing on the night multi-frame images in the data set; constructing a neural network comprising a feature extraction network and a prediction network, wherein the feature extraction network fuses multiple backbone networks and includes a feature pyramid network, a deformable convolution network is fused into each backbone network, and the prediction network comprises a double-branch structure; training the neural network on the enhanced data set, and judging pedestrian targets according to the inter-frame IoU values of the multi-frame images during training, to obtain a pedestrian detection model; and performing pedestrian detection on the night multi-frame images to be detected with the pedestrian detection model. The method can detect pedestrians in multi-frame images of night scenes with high accuracy and robustness.

Description

Multi-frame image pedestrian detection method and device for night motion scene
Technical Field
The invention relates to the technical field of target detection, and in particular to a multi-frame image pedestrian detection method for night motion scenes, a corresponding multi-frame image pedestrian detection device, a computer device, a non-transitory computer-readable storage medium, and a computer program product.
Background
With the great improvement in computer storage and computing capacity, video has become an increasingly common information medium in daily life, making video processing and analysis highly important. As a fundamental problem in video analysis, video object detection has long been a research hotspot in both academia and industry. Automatic pedestrian detection in video is widely applied in intelligent transportation, autonomous driving, intelligent video surveillance, and related fields, yet it faces great challenges from large deformation, varying postures, and shadow occlusion during pedestrian movement. Night video sequences in particular suffer from weak illumination and high image noise, making outstanding results even harder to obtain.
Disclosure of Invention
To solve these technical problems, the invention provides a multi-frame image pedestrian detection method and device for night motion scenes, which achieve pedestrian detection in multi-frame images of night scenes with high accuracy and robustness.
The technical scheme adopted by the invention is as follows:
a multi-frame image pedestrian detection method facing a night motion scene comprises the following steps: acquiring a data set containing a plurality of night multi-frame images, and performing enhancement processing on the night multi-frame images in the data set; constructing a neural network, wherein the neural network comprises a feature extraction network and a prediction network, the feature extraction network fuses a plurality of backbone networks and comprises a feature pyramid network, a deformable convolution network is fused in each backbone network, and the prediction network comprises a double-branch structure; training the neural network through the enhanced data set, and judging a pedestrian target according to an inter-frame IOU (Intersection Over Unit) value of a plurality of frames of images in the training process to obtain a pedestrian detection model; and carrying out pedestrian detection on the night multi-frame image to be detected through the pedestrian detection model.
Spatial-level image enhancement is performed on the night multi-frame images in the data set in the form of batch data.
The backbone network is ResNeXt, and the two branches of the double-branch structure are an FC-head and a Conv-head, with the FC-head serving as a classification network and the Conv-head as a regression network.
Judging pedestrian targets according to the inter-frame IoU values of the multi-frame images during training comprises: filtering the detection boxes obtained in training, keeping those whose category score exceeds a first threshold θ, and denoting them Boxes1; for the current frame, first computing the IoU values between the detection boxes Boxes1 of the current frame and the tracking boxes of the previous frame's tracking queue, and judging the maximum IoU value of each detection box; if the maximum IoU value exceeds a second threshold σ, considering the detection box correct; otherwise, if the maximum IoU value is below σ, judging whether the maximum detection score of the matched tracking box over previous video frames exceeds a third threshold and whether the number of times the tracking box has appeared in previous frames exceeds a minimum-occurrence threshold T, and if both exceed their corresponding thresholds, judging the detection box of the current frame to be erroneous.
The regression loss L_loc in training the network uses the smoothed L1 loss, where x is an ROI, b is the predicted coordinates for the ROI, g is the label coordinate values, and f denotes the regressor:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise

b = (b_x, b_y, b_w, b_h)

To ensure the invariance of the regression operation to scale and location, L_loc operates on the offset vector Δ = (δ_x, δ_y, δ_w, δ_h):

δ_x = (g_x − b_x)/b_w, δ_y = (g_y − b_y)/b_h, δ_w = log(g_w/b_w), δ_h = log(g_h/b_h)

A regularization operation is performed on Δ:

δ_x′ = (δ_x − u_x)/σ_x

The total loss of each Head_i (i = 1, 2, 3) in the detection network is:

L(x^t, g) = L_cls(h_t(x^t), y^t) + λ[y^t ≥ 1] L_loc(f_t(x^t, b^t), g)

f(x, b) = f_T ∘ f_{T−1} ∘ … ∘ f_1(x, b)

b^t = f_{t−1}(x^{t−1}, b^{t−1})

where T represents the total number of branches superposed in Cascade RCNN and t the current branch; each branch f_t in Cascade RCNN is optimized with the training data b^t of its own branch, b^t being derived from the outputs of all preceding branches starting from b^1; λ = 1 is a weighting coefficient; [y^t ≥ 1] means that the regression loss is computed only on positive samples; and y^t is the label of x^t computed according to the corresponding IoU threshold u_t.
A multi-frame image pedestrian detection device for a night motion scene comprises: an enhancement module, used for acquiring a data set containing multiple night multi-frame images and performing enhancement processing on the night multi-frame images in the data set; a construction module, used for constructing a neural network, wherein the neural network comprises a feature extraction network and a prediction network, the feature extraction network fuses multiple backbone networks and comprises a feature pyramid network, a deformable convolution network is fused into each backbone network, and the prediction network comprises a double-branch structure; a training module, used for training the neural network with the enhanced data set and judging pedestrian targets according to the inter-frame IoU values of the multi-frame images during training, to obtain a pedestrian detection model; and a detection module, used for performing pedestrian detection on the night multi-frame images to be detected through the pedestrian detection model.
A computer device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the above multi-frame image pedestrian detection method for a night motion scene is implemented.

A non-transitory computer-readable storage medium stores a computer program which, when executed by a processor, implements the above multi-frame image pedestrian detection method for a night motion scene.

A computer program product contains instructions which, when executed by a processor, perform the above multi-frame image pedestrian detection method for a night motion scene.
The invention has the beneficial effects that:
the method inputs the enhanced multi-frame images into the neural network for training, fuses a plurality of trunk networks in the characteristic extraction network of the neural network, fuses a deformable convolution network in each trunk network, sets a double-branch structure in the prediction network, and judges the pedestrian target according to the inter-frame IOU value of the multi-frame images in the training process, so that the obtained pedestrian detection model can realize pedestrian detection aiming at the multi-frame images in night scenes, and has high accuracy and robustness.
Drawings
FIG. 1 is a flowchart of a method for detecting pedestrians through multiple frames of images facing a night motion scene according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature extraction network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the RPN according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of the structure of Cascade RCNN according to one embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a Double Head according to an embodiment of the present invention;
Fig. 6 is a block diagram of a multi-frame image pedestrian detection device facing a night motion scene according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in Fig. 1, the multi-frame image pedestrian detection method for night motion scenes according to the embodiment of the present invention includes the following steps:
and S1, acquiring a data set containing a plurality of night multi-frame images, and performing enhancement processing on the night multi-frame images in the data set.
The data set may include a large number of multi-frame image sequences captured in night motion scenes, such as videos captured at night by cameras arranged along the corresponding roads, or images in GIF format; some of the sequences contain moving pedestrians and some contain none. The data set serves as the training set, and the more multi-frame images it contains, the higher the accuracy of the subsequently trained detection model.
In one embodiment of the invention, spatial-level image enhancement can be performed on night multi-frame images in a data set in the form of batch data so as to remove image noise without destroying structural information of original images.
Specifically, the multi-frame images in the data set can be randomly sampled. For each sampled image I_i, its width I_i_w and height I_i_h are compared: the long side max(I_i_w, I_i_h) is scaled to L, and the short side min(I_i_w, I_i_h) is scaled to S, where S is randomly selected from the range S1~S2. The sampled images I_i (i = 1, 2, 3, …, n) are fed into the feature extraction network in the form of a batch, in which the long side of every multi-frame image is L and the short sides are unified in size: taking the maximum value max(S_i) of the short sides S_i (i = 1, 2, 3, …, n) over the whole batch as the reference S_base, each remaining short side S_i is padded up to S_base:

S_base = S_i + padding
In one embodiment of the present invention, L may be 2048 and the short-side range S1~S2 may be 1024~1536.
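The scale-jitter batching described above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the patented implementation; the function name, tensor layout, and the assumption that all images in a batch share one orientation are ours:

```python
import random

import torch
import torch.nn.functional as F

def resize_batch(images, long_side=2048, short_range=(1024, 1536)):
    """Scale each image's long side to L and its short side to a random S in
    [S1, S2], then pad every short side up to the batch maximum S_base."""
    resized, shorts = [], []
    for im in images:                       # im: C x H x W float tensor
        _, h, w = im.shape
        s = random.randint(*short_range)    # S drawn from S1~S2
        size = (s, long_side) if w >= h else (long_side, s)
        resized.append(F.interpolate(im[None], size=size, mode='bilinear',
                                     align_corners=False)[0])
        shorts.append(s)
    s_base = max(shorts)                    # S_base = max(S_i)
    padded = []
    for im in resized:
        _, h, w = im.shape
        pad_w = s_base - w if w < h else 0  # pad only the short side
        pad_h = s_base - h if h < w else 0
        padded.append(F.pad(im, (0, pad_w, 0, pad_h)))  # pads W, then H
    return torch.stack(padded)              # assumes one orientation per batch
```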
S2, constructing a neural network, wherein the neural network comprises a feature extraction network and a prediction network, the feature extraction network fuses multiple backbone networks and comprises a feature pyramid network, a deformable convolution network is fused into each backbone network, and the prediction network comprises a double-branch structure.
In an embodiment of the invention, the backbone network can be ResNeXt, with a deformable convolution network added into ResNeXt to improve the network's capability of modeling spatial information; learning the deformation of targets through these additional parameters also improves, to a certain extent, the robustness of the subsequently trained detection model to object size. A composite backbone network is used to fuse several ResNeXt networks, combining high-level and low-level semantic information to extract more effective feature information. A feature pyramid network is then attached, which fuses multi-scale features by combining shallow and deep feature information and thereby benefits the detection of multi-scale objects.
The two branches of the double-branch structure are an FC-Head and a Conv-Head: addressing their different requirements, the FC-Head serves as the classification network and the Conv-Head as the regression network. The two branches carry different biases, and compared with a single-head structure, the double-head structure achieves higher precision in classification and coordinate regression.
S3, training the neural network with the enhanced data set, and judging pedestrian targets according to the inter-frame IoU values of the multi-frame images during training, to obtain a pedestrian detection model.
Specifically, a 7×7 convolution operation can first be applied to each multi-frame image I in the enhanced data set; its purpose is to directly downsample the input image while retaining as much information of the original image as possible, without increasing the number of channels. Then, as shown in Fig. 2, the image passes sequentially through four stages (Stage_1, Stage_2, Stage_3, Stage_4), each composed of several Residual Blocks arranged horizontally. Each Residual Block extracts finer features on top of the broader features obtained in the previous stage and consists of two branches: one is a residual branch, and the other is composed of three layers in sequence, namely a 1×1 convolution layer, a deformable convolution layer, and a 1×1 convolution layer. The deformable convolution layer involves two steps: first, the positional offset of each pixel required by the deformable convolution is computed through a 3×3 convolution operation, and the offsets are then applied to the convolution kernel to obtain the deformable convolution. The residual branch consists of a 1×1 convolution layer and is mainly used to extract the residual feature information of the image. After the feature maps pass through the two branches of the Residual Block respectively, the resulting feature maps are added together as the input features of the next Stage, as in the sketch below.
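A minimal PyTorch sketch of such a Residual Block follows, using torchvision's DeformConv2d; the class name and channel widths are illustrative assumptions rather than the patented configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Residual Block sketch: a 1x1 -> deformable 3x3 -> 1x1 main branch plus
    a 1x1 residual branch, with the two branches summed at the end."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        # step 1: a 3x3 conv predicts per-pixel sampling offsets
        # (2 values, x and y, for each of the 9 kernel taps)
        self.offset = nn.Conv2d(mid_ch, 2 * 3 * 3, 3, padding=1)
        # step 2: the offsets deform the 3x3 convolution's sampling grid
        self.deform = DeformConv2d(mid_ch, mid_ch, 3, padding=1, bias=False)
        self.expand = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.residual = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.deform(out, self.offset(out)))
        out = self.expand(out)
        # the feature maps of the two branches are added, as described above
        return self.relu(out + self.residual(x))
```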
Before its output enters the next Stage, each Stage also passes its output feature map, as an input feature, to the Stage arranged laterally alongside it. Specifically, the input image passes through Stage_1 to produce feature map F_1; F_1 serves as the input feature of Stage_1_1, the stage arranged laterally alongside Stage_1, and passes through Stage_1_1 to produce feature map F_2. F_1 passes through Stage_2 to produce feature map F_3; F_3 and F_2 are added to obtain the input features of Stage_2_2, arranged laterally alongside Stage_2, which produces feature map F_4. F_3 passes through Stage_3 to produce feature map F_5; F_5 and F_4 are added to obtain the input features of Stage_3_3, arranged laterally alongside Stage_3, which produces feature map F_6. F_5 passes through Stage_4 to produce feature map F_7; F_7 and F_6 are added to obtain the input features of Stage_4_4, arranged laterally alongside Stage_4, which produces feature map F_8.

The F_2, F_4, F_6, F_8 produced by the above process are extracted and first passed through a 1×1 convolution so that their channel counts become equal. Then F_8 is interpolated into a feature map of the same size and channels as F_6 and added to it, fusing the features of the Stage_4_4 and Stage_3_3 stages (denoted M_2); M_2 is interpolated to the size of F_4 and added to it, fusing the features of Stage_3_3 and Stage_2_2 (denoted M_1); M_1 is interpolated to the size of F_2 and added to it, fusing the features of Stage_2_2 and Stage_1_1 (denoted M_0); and F_8 is output directly as M_3.
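This top-down fusion can be sketched as follows in PyTorch; the module name and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """Fuse F2/F4/F6/F8 into M0..M3 as described above: 1x1 laterals equalize
    channels, top-down interpolation plus addition fuses adjacent levels, and
    a 3x3 conv smooths each level before it enters the RPN / Cascade RCNN."""
    def __init__(self, in_chs=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_chs)

    def forward(self, f2, f4, f6, f8):
        l2, l4, l6, l8 = (lat(f) for lat, f in
                          zip(self.laterals, (f2, f4, f6, f8)))
        m3 = l8                                  # F8 is used directly as M3
        m2 = l6 + F.interpolate(m3, size=l6.shape[-2:], mode='nearest')
        m1 = l4 + F.interpolate(m2, size=l4.shape[-2:], mode='nearest')
        m0 = l2 + F.interpolate(m1, size=l2.shape[-2:], mode='nearest')
        return [s(m) for s, m in zip(self.smooth, (m0, m1, m2, m3))]
```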
Next, a 3×3 convolution can be applied to M_3, M_2, M_1, M_0 before they are fed into the two-stage network, i.e., the RPN (Region Proposal Network) and Cascade RCNN respectively. The structure of the first-stage network, the RPN, is shown in Fig. 3: several anchors of fixed size and fixed aspect ratio are set manually as reference boxes for prediction, and proposals with higher confidence are then screened from these anchors by a classification network and a regression network to serve as the reference boxes of the second-stage network. This classification network is a binary network that only predicts the probability that a target exists in an anchor, while the regression network predicts the offset, i.e., how far an anchor that may contain a target deviates from the target's real bounding box. Similarly, the second-stage network uses the proposals as reference boxes for prediction, and then screens the final detection boxes from these proposals through a classification network and a regression network. That classification network is a multi-class network whose number of classes depends on the number of classes to be detected in the data set, and the regression network predicts the offset between every proposal and the real bounding box. A sketch of the anchor layout follows.
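The fixed-size, fixed-ratio anchor grid the RPN starts from can be sketched as below; the scales, ratios, and (x1, y1, x2, y2) output form are illustrative assumptions, not values given by the patent:

```python
import torch

def make_anchors(feat_h, feat_w, stride,
                 scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Tile fixed-size, fixed-ratio anchors over a feature map."""
    ys = (torch.arange(feat_h, dtype=torch.float32) + 0.5) * stride
    xs = (torch.arange(feat_w, dtype=torch.float32) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing='ij')
    centers = torch.stack([cx, cy], dim=-1).reshape(-1, 2)     # (HW, 2)
    # anchor widths/heights: w = s / sqrt(r), h = s * sqrt(r)
    shapes = torch.tensor([[s / r ** 0.5, s * r ** 0.5]
                           for s in scales for r in ratios])   # (A, 2)
    boxes = torch.cat([centers[:, None] - shapes[None] / 2,
                       centers[:, None] + shapes[None] / 2], dim=-1)
    return boxes.reshape(-1, 4)                                # (HW*A, 4)
```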
The structure of the second-stage network, Cascade RCNN, is shown in Fig. 4. It comprises a three-stage cascade: the Proposals1 output by the first-stage network Head_1 serve, after screening, as the input of the second-stage network Head_2; the Proposals2 output by Head_2 serve as the input of the third-stage network Head_3; and the Proposals3 output by Head_3 are the final prediction result. The output boxes (proposals) of the Head at each stage are obtained by feeding the pooled features and the incoming proposals into that stage and predicting each proposal's class score and regression offset. That is, each stage consists of a classification network and a regression network; in the embodiment of the invention, the FC-Head serves as the classification network and the Conv-Head as the regression network. This double-branch structure, the Double Head structure shown in Fig. 5, consists of an ROI Align layer followed by two parallel branches, namely a classification prediction branch and a regression prediction branch. The classification task often needs more image semantic information, while the regression task needs more spatial information; the adopted Double Head structure therefore accounts for these differing requirements, with a more pronounced effect. A minimal sketch is given below.
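A hedged sketch of such a Double Head operating on ROI-aligned features; the layer widths are our assumptions, not the patented configuration:

```python
import torch
import torch.nn as nn

class DoubleHead(nn.Module):
    """FC-head for classification, Conv-head for box regression."""
    def __init__(self, in_ch=256, roi_size=7, num_classes=1):
        super().__init__()
        flat = in_ch * roi_size * roi_size
        self.fc_head = nn.Sequential(               # classification branch
            nn.Flatten(),
            nn.Linear(flat, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True))
        self.cls_score = nn.Linear(1024, num_classes + 1)   # (M + 1)-way
        self.conv_head = nn.Sequential(              # regression branch
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.bbox_pred = nn.Linear(in_ch, 4)

    def forward(self, roi_feats):                    # roi_feats: N x C x 7 x 7
        return (self.cls_score(self.fc_head(roi_feats)),
                self.bbox_pred(self.conv_head(roi_feats)))
```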
In one embodiment of the invention, the classification loss L_cls in training the network uses the cross-entropy loss. Each ROI (Region of Interest) passes through the Head structure Head_i to obtain a classification result C_i (i = 1, 2, 3):

L_cls(h(x), y) = −(1/N) Σ_j log h_{y_j}(x_j), j = 1, …, N

where h(x) denotes the classification branch in Head_i, which outputs a vector of dimension M+1, the ROI being predicted as one of the M+1 categories; N is the number of ROIs in the current Head_i stage; and y is the corresponding category label, determined by the IoU between the ROI and its matching ground-truth label:

y = g_y, if IoU(x, g) ≥ u; y = 0, otherwise

where the IoU threshold u of Head_1 is set to u_1, and those of Head_2 and Head_3 to u_2 and u_3 respectively; x is the ROI and g_y is the class label of the object x. The IoU threshold u defines the quality of the detector, and using different IoU thresholds effectively alleviates the noise-interference problem in detection. In one embodiment of the invention, u_1, u_2, u_3 may be set to 0.5, 0.6, 0.7, respectively. A sketch of this label assignment follows.
The regression loss L_loc in training the network uses the smoothed L1 loss, where x is an ROI, b is the predicted coordinates for the ROI, g is the label coordinate values, and f denotes the regressor:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise

b = (b_x, b_y, b_w, b_h)

To ensure the invariance of the regression operation to scale and location, L_loc operates on the offset vector Δ = (δ_x, δ_y, δ_w, δ_h):

δ_x = (g_x − b_x)/b_w, δ_y = (g_y − b_y)/b_h, δ_w = log(g_w/b_w), δ_h = log(g_h/b_h)

The values in the formulae above are all small, so to improve the efficiency of the multi-task training, a regularization operation is performed on Δ:

δ_x′ = (δ_x − u_x)/σ_x

The total loss of each Head_i (i = 1, 2, 3) in the detection network is:

L(x^t, g) = L_cls(h_t(x^t), y^t) + λ[y^t ≥ 1] L_loc(f_t(x^t, b^t), g)

f(x, b) = f_T ∘ f_{T−1} ∘ … ∘ f_1(x, b)

b^t = f_{t−1}(x^{t−1}, b^{t−1})

where T represents the total number of branches superposed in Cascade RCNN and t the current branch. Each branch f_t in Cascade RCNN is optimized with the training data b^t of its own branch; b^t is derived from the outputs of all preceding branches starting from b^1, rather than directly using the initial RPN distribution b^1 to train f_t. λ is a weighting coefficient, [y^t ≥ 1] means that the regression loss is computed only on positive samples, and y^t is the label of x^t computed according to the above rule with the corresponding threshold u_t. In one embodiment of the invention, T is 3 and λ is 1.
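In code, the offset encoding, regularization, and smoothed L1 loss above look roughly as follows; the normalization statistics u and sigma are assumed to be precomputed:

```python
import torch

def box_deltas(b, g):
    """Offset vector Delta = (dx, dy, dw, dh) from predicted boxes b to
    targets g, both given as (x, y, w, h) rows, matching the encoding above."""
    dx = (g[:, 0] - b[:, 0]) / b[:, 2]
    dy = (g[:, 1] - b[:, 1]) / b[:, 3]
    dw = torch.log(g[:, 2] / b[:, 2])
    dh = torch.log(g[:, 3] / b[:, 3])
    return torch.stack([dx, dy, dw, dh], dim=1)

def smooth_l1(x):
    ax = x.abs()
    return torch.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def loc_loss(b, g, u, sigma):
    """u and sigma are per-component offset mean/std for the regularization
    step (assumed precomputed over the training data)."""
    delta = (box_deltas(b, g) - u) / sigma
    return smooth_l1(delta).sum(dim=1).mean()
```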
Further, a filtering operation may first be applied to the detection boxes obtained through the training process, keeping those whose category score exceeds the first threshold θ and denoting them Boxes1. For the current frame, the IoU values between the detection boxes Boxes1 of the current frame and the tracking boxes of the previous frame's tracking queue are computed first, and the maximum IoU value of each detection box is judged. If the maximum IoU value exceeds the second threshold σ, the detection box is considered correct; otherwise, if the maximum IoU value is below σ, it is judged whether the maximum detection score of the matched tracking box over previous video frames exceeds a third threshold and whether the number of times the tracking box has appeared in previous frames exceeds the minimum-occurrence threshold T, and if both exceed their corresponding thresholds, the detection box of the current frame is judged erroneous. For objects tracked with IoU information, if a detection box in the current frame cannot be matched to any box of the previous frame, it corresponds to a newly appeared object, which needs to be added to the tracking queue.
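A sketch of this inter-frame IoU filtering follows, with the tracking queue held as plain tensors; every threshold value here is an illustrative placeholder, since the patent does not fix them:

```python
import torch

def box_iou(a, b):
    """Pairwise IoU between two box sets in (x1, y1, x2, y2) form."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=2)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def filter_detections(boxes, scores, track_boxes, track_max_score, track_hits,
                      theta=0.5, sigma=0.5, score_thr=0.5, min_hits=3):
    """Keep Boxes1 entries that either match a previous-frame track (IoU >
    sigma) or conflict only with an unreliable track; unmatched survivors are
    new objects to be added to the tracking queue."""
    keep = scores > theta                     # Boxes1: category score > theta
    boxes, scores = boxes[keep], scores[keep]
    if track_boxes.numel() == 0:              # empty queue: everything is new
        return boxes, scores
    best_iou, best_idx = box_iou(boxes, track_boxes).max(dim=1)
    matched = best_iou > sigma
    reliable = (track_max_score[best_idx] > score_thr) & \
               (track_hits[best_idx] > min_hits)
    keep2 = matched | ~reliable               # drop only reliable-track misses
    return boxes[keep2], scores[keep2]
```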
S4, performing pedestrian detection on the night multi-frame images to be detected through the pedestrian detection model.
According to the multi-frame image pedestrian detection method for night motion scenes, the enhanced multi-frame images are input into the neural network for training; multiple backbone networks are fused in the feature extraction network of the neural network, a deformable convolution network is fused into each backbone network, a double-branch structure is set in the prediction network, and pedestrian targets are judged according to the inter-frame IoU values of the multi-frame images during training, so that the obtained pedestrian detection model can detect pedestrians in multi-frame images of night scenes with high accuracy and robustness.
The invention further provides a multi-frame image pedestrian detection device for night motion scenes, corresponding to the multi-frame image pedestrian detection method of the above embodiment.
As shown in Fig. 6, the multi-frame image pedestrian detection device for night motion scenes according to the embodiment of the present invention includes an enhancement module 10, a construction module 20, a training module 30, and a detection module 40. The enhancement module 10 is configured to acquire a data set containing multiple night multi-frame images and perform enhancement processing on the night multi-frame images in the data set; the construction module 20 is configured to construct a neural network, where the neural network comprises a feature extraction network and a prediction network, the feature extraction network fuses multiple backbone networks and includes a feature pyramid network, each backbone network fuses a deformable convolution network, and the prediction network comprises a double-branch structure; the training module 30 is configured to train the neural network with the enhanced data set and judge pedestrian targets according to the inter-frame IoU values of the multi-frame images during training, to obtain a pedestrian detection model; and the detection module 40 is configured to perform pedestrian detection on night multi-frame images to be detected through the pedestrian detection model.
The data set may include a large number of multi-frame image sequences captured in night motion scenes, such as videos captured at night by cameras arranged along the corresponding roads, or images in GIF format; some of the sequences contain moving pedestrians and some contain none. The data set serves as the training set, and the more multi-frame images it contains, the higher the accuracy of the subsequently trained detection model.
In an embodiment of the present invention, the enhancement module 10 may perform spatial-level image enhancement on night multi-frame images in a data set in the form of batch data to remove image noise without destroying structural information of an original image.
Specifically, the multi-frame images in the data set can be randomly sampled. For each sampled image I_i, its width I_i_w and height I_i_h are compared: the long side max(I_i_w, I_i_h) is scaled to L, and the short side min(I_i_w, I_i_h) is scaled to S, where S is randomly selected from the range S1~S2. The sampled images I_i (i = 1, 2, 3, …, n) are fed into the feature extraction network in the form of a batch, in which the long side of every multi-frame image is L and the short sides are unified in size: taking the maximum value max(S_i) of the short sides S_i (i = 1, 2, 3, …, n) over the whole batch as the reference S_base, each remaining short side S_i is padded up to S_base:

S_base = S_i + padding

In one embodiment of the present invention, L may be 2048 and the short-side range S1~S2 may be 1024~1536.
In an embodiment of the invention, the backbone network can be ResNeXt, with a deformable convolution network added into ResNeXt to improve the network's capability of modeling spatial information; learning the deformation of targets through these additional parameters also improves, to a certain extent, the robustness of the subsequently trained detection model to object size. A composite backbone network is used to fuse several ResNeXt networks, combining high-level and low-level semantic information to extract more effective feature information. A feature pyramid network is then attached, which fuses multi-scale features by combining shallow and deep feature information and thereby benefits the detection of multi-scale objects.

The two branches of the double-branch structure are an FC-Head and a Conv-Head: addressing their different requirements, the FC-Head serves as the classification network and the Conv-Head as the regression network. The two branches carry different biases, and compared with a single-head structure, the double-head structure achieves higher precision in classification and coordinate regression.
The training module 30 may first apply a 7×7 convolution operation to each multi-frame image I in the enhanced data set; its purpose is to directly downsample the input image while retaining as much information of the original image as possible, without increasing the number of channels. Then, as shown in Fig. 2, the image passes sequentially through four stages (Stage_1, Stage_2, Stage_3, Stage_4), each composed of several Residual Blocks arranged horizontally. Each Residual Block extracts finer features on top of the broader features obtained in the previous stage and consists of two branches: one is a residual branch, and the other is composed of three layers in sequence, namely a 1×1 convolution layer, a deformable convolution layer, and a 1×1 convolution layer. The deformable convolution layer involves two steps: first, the positional offset of each pixel required by the deformable convolution is computed through a 3×3 convolution operation, and the offsets are then applied to the convolution kernel to obtain the deformable convolution. The residual branch consists of a 1×1 convolution layer and is mainly used to extract the residual feature information of the image. After the feature maps pass through the two branches of the Residual Block respectively, the resulting feature maps are added together as the input features of the next Stage.
Before its output enters the next Stage, each Stage also passes its output feature map, as an input feature, to the Stage arranged laterally alongside it. Specifically, the input image passes through Stage_1 to produce feature map F_1; F_1 serves as the input feature of Stage_1_1, the stage arranged laterally alongside Stage_1, and passes through Stage_1_1 to produce feature map F_2. F_1 passes through Stage_2 to produce feature map F_3; F_3 and F_2 are added to obtain the input features of Stage_2_2, arranged laterally alongside Stage_2, which produces feature map F_4. F_3 passes through Stage_3 to produce feature map F_5; F_5 and F_4 are added to obtain the input features of Stage_3_3, arranged laterally alongside Stage_3, which produces feature map F_6. F_5 passes through Stage_4 to produce feature map F_7; F_7 and F_6 are added to obtain the input features of Stage_4_4, arranged laterally alongside Stage_4, which produces feature map F_8.

The F_2, F_4, F_6, F_8 produced by the above process are extracted and first passed through a 1×1 convolution so that their channel counts become equal. Then F_8 is interpolated into a feature map of the same size and channels as F_6 and added to it, fusing the features of the Stage_4_4 and Stage_3_3 stages (denoted M_2); M_2 is interpolated to the size of F_4 and added to it, fusing the features of Stage_3_3 and Stage_2_2 (denoted M_1); M_1 is interpolated to the size of F_2 and added to it, fusing the features of Stage_2_2 and Stage_1_1 (denoted M_0); and F_8 is output directly as M_3.
Next, a 3×3 convolution can be applied to M_3, M_2, M_1, M_0 before they are fed into the two-stage network, i.e., the RPN and Cascade RCNN respectively. The structure of the first-stage network, the RPN, is shown in Fig. 3: several anchors of fixed size and fixed aspect ratio are set manually as reference boxes for prediction, and proposals with higher confidence are then screened from these anchors by a classification network and a regression network to serve as the reference boxes of the second-stage network. This classification network is a binary network that only predicts the probability that a target exists in an anchor, while the regression network predicts the offset, i.e., how far an anchor that may contain a target deviates from the target's real bounding box. Similarly, the second-stage network uses the proposals as reference boxes for prediction, and then screens the final detection boxes from these proposals through a classification network and a regression network. That classification network is a multi-class network whose number of classes depends on the number of classes to be detected in the data set, and the regression network predicts the offset between every proposal and the real bounding box.
The structure of the second-stage network, Cascade RCNN, is shown in Fig. 4. It comprises a three-stage cascade: the Proposals1 output by the first-stage network Head_1 serve, after screening, as the input of the second-stage network Head_2; the Proposals2 output by Head_2 serve as the input of the third-stage network Head_3; and the Proposals3 output by Head_3 are the final prediction result. The output boxes (proposals) of the Head at each stage are obtained by feeding the pooled features and the incoming proposals into that stage and predicting each proposal's class score and regression offset. That is, each stage consists of a classification network and a regression network; in the embodiment of the invention, the FC-Head serves as the classification network and the Conv-Head as the regression network. This double-branch structure, the Double Head structure shown in Fig. 5, consists of an ROI Align layer followed by two parallel branches, namely a classification prediction branch and a regression prediction branch. The classification task often needs more image semantic information, while the regression task needs more spatial information; the adopted Double Head structure therefore accounts for these differing requirements, with a more pronounced effect.
In one embodiment of the invention, when the training module 30 trains the network, the classification loss L_cls uses the cross-entropy loss. Each ROI passes through the Head structure Head_i to obtain a classification result C_i (i = 1, 2, 3):

L_cls(h(x), y) = −(1/N) Σ_j log h_{y_j}(x_j), j = 1, …, N

where h(x) denotes the classification branch in Head_i, which outputs a vector of dimension M+1, the ROI being predicted as one of the M+1 categories; N is the number of ROIs in the current Head_i stage; and y is the corresponding category label, determined by the IoU between the ROI and its matching ground-truth label:

y = g_y, if IoU(x, g) ≥ u; y = 0, otherwise

where the IoU threshold u of Head_1 is set to u_1, and those of Head_2 and Head_3 to u_2 and u_3 respectively; x is the ROI and g_y is the class label of the object x. The IoU threshold u defines the quality of the detector, and using different IoU thresholds effectively alleviates the noise-interference problem in detection. In one embodiment of the invention, u_1, u_2, u_3 may be set to 0.5, 0.6, 0.7, respectively.
When the training module 30 trains the network, the regression loss L_loc uses the smoothed L1 loss, where x is an ROI, b is the predicted coordinates for the ROI, g is the label coordinate values, and f denotes the regressor:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise

b = (b_x, b_y, b_w, b_h)

To ensure the invariance of the regression operation to scale and location, L_loc operates on the offset vector Δ = (δ_x, δ_y, δ_w, δ_h):

δ_x = (g_x − b_x)/b_w, δ_y = (g_y − b_y)/b_h, δ_w = log(g_w/b_w), δ_h = log(g_h/b_h)

The values in the formulae above are all small, so to improve the efficiency of the multi-task training, a regularization operation is performed on Δ:

δ_x′ = (δ_x − u_x)/σ_x

The total loss of each Head_i (i = 1, 2, 3) in the detection network is:

L(x^t, g) = L_cls(h_t(x^t), y^t) + λ[y^t ≥ 1] L_loc(f_t(x^t, b^t), g)

f(x, b) = f_T ∘ f_{T−1} ∘ … ∘ f_1(x, b)

b^t = f_{t−1}(x^{t−1}, b^{t−1})

where T represents the total number of branches superposed in Cascade RCNN and t the current branch. Each branch f_t in Cascade RCNN is optimized with the training data b^t of its own branch; b^t is derived from the outputs of all preceding branches starting from b^1, rather than directly using the initial RPN distribution b^1 to train f_t. λ is a weighting coefficient, [y^t ≥ 1] means that the regression loss is computed only on positive samples, and y^t is the label of x^t computed according to the above rule with the corresponding threshold u_t. In one embodiment of the invention, T is 3 and λ is 1.
Further, a filtering operation may first be applied to the detection boxes obtained through the training process, keeping those whose category score exceeds the first threshold θ and denoting them Boxes1. For the current frame, the IoU values between the detection boxes Boxes1 of the current frame and the tracking boxes of the previous frame's tracking queue are computed first, and the maximum IoU value of each detection box is judged. If the maximum IoU value exceeds the second threshold σ, the detection box is considered correct; otherwise, if the maximum IoU value is below σ, it is judged whether the maximum detection score of the matched tracking box over previous video frames exceeds a third threshold and whether the number of times the tracking box has appeared in previous frames exceeds the minimum-occurrence threshold T, and if both exceed their corresponding thresholds, the detection box of the current frame is judged erroneous. For objects tracked with IoU information, if a detection box in the current frame cannot be matched to any box of the previous frame, it corresponds to a newly appeared object, which needs to be added to the tracking queue.
According to the multi-frame image pedestrian detection device for night motion scenes, the enhanced multi-frame images are input into the neural network for training; multiple backbone networks are fused in the feature extraction network of the neural network, a deformable convolution network is fused into each backbone network, a double-branch structure is set in the prediction network, and pedestrian targets are judged according to the inter-frame IoU values of the multi-frame images during training, so that the obtained pedestrian detection model can detect pedestrians in multi-frame images of night scenes with high accuracy and robustness.
The invention further provides a computer device corresponding to the embodiment.
The computer device of the embodiment of the invention comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the multi-frame image pedestrian detection method for night motion scenes of the above embodiment of the invention is implemented.

When the processor of the computer device executes the computer program stored on the memory, the enhanced multi-frame images are input into the neural network for training; multiple backbone networks are fused in the feature extraction network of the neural network, a deformable convolution network is fused into each backbone network, a double-branch structure is set in the prediction network, and pedestrian targets are judged according to the inter-frame IoU values of the multi-frame images during training, so that the obtained pedestrian detection model can detect pedestrians in multi-frame images of night scenes with high accuracy and robustness.
The invention also provides a non-transitory computer readable storage medium corresponding to the above embodiment.
The non-transitory computer-readable storage medium of the embodiment of the present invention stores a computer program which, when executed by a processor, implements the multi-frame image pedestrian detection method for night motion scenes of the above embodiment of the invention.

When a processor executes the computer program stored on the non-transitory computer-readable storage medium, the enhanced multi-frame images are input into the neural network for training; multiple backbone networks are fused in the feature extraction network of the neural network, a deformable convolution network is fused into each backbone network, a double-branch structure is set in the prediction network, and pedestrian targets are judged according to the inter-frame IoU values of the multi-frame images during training, so that the obtained pedestrian detection model can detect pedestrians in multi-frame images of night scenes with high accuracy and robustness.
The present invention also provides a computer program product corresponding to the above embodiments.
When the instructions in the computer program product of the embodiment of the invention are executed by a processor, the multi-frame image pedestrian detection method for night motion scenes of the above embodiment of the invention can be performed.

When the processor executes the instructions, the enhanced multi-frame images are input into the neural network for training; multiple backbone networks are fused in the feature extraction network of the neural network, a deformable convolution network is fused into each backbone network, a double-branch structure is set in the prediction network, and pedestrian targets are judged according to the inter-frame IoU values of the multi-frame images during training, so that the obtained pedestrian detection model can detect pedestrians in multi-frame images of night scenes with high accuracy and robustness.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The meaning of "plurality" is two or more unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (9)

1. A multi-frame image pedestrian detection method for a night motion scene, characterized by comprising the following steps:
acquiring a data set containing a plurality of night multi-frame images, and performing enhancement processing on the night multi-frame images in the data set;
constructing a neural network, wherein the neural network comprises a feature extraction network and a prediction network, the feature extraction network fuses a plurality of backbone networks and comprises a feature pyramid network, a deformable convolution network is fused in each backbone network, and the prediction network comprises a double-branch structure;
training the neural network through the enhanced data set, and judging pedestrian targets according to the inter-frame IoU values of the multi-frame images during training to obtain a pedestrian detection model;
and carrying out pedestrian detection on the night multi-frame image to be detected through the pedestrian detection model.
2. The multi-frame image pedestrian detection method for a night motion scene according to claim 1, characterized in that spatial-level image enhancement is performed on the night multi-frame images in the data set in the form of batch data.
3. The multi-frame image pedestrian detection method for a night motion scene according to claim 1 or 2, characterized in that the backbone network is ResNeXt, the two branches of the double-branch structure are an FC-head and a Conv-head, the FC-head is used as a classification network, and the Conv-head is used as a regression network.
4. The multi-frame image pedestrian detection method for a night motion scene according to claim 3, characterized in that judging pedestrian targets according to the inter-frame IoU values of the multi-frame images during training comprises:

filtering the detection boxes obtained in training, keeping those whose category score exceeds a first threshold θ, and denoting them Boxes1; for the current frame, first computing the IoU values between the detection boxes Boxes1 of the current frame and the tracking boxes of the previous frame's tracking queue, and judging the maximum IoU value of each detection box; if the maximum IoU value exceeds a second threshold σ, considering the detection box correct; otherwise, if the maximum IoU value is below σ, judging whether the maximum detection score of the matched tracking box over previous video frames exceeds a third threshold and whether the number of times the tracking box has appeared in previous frames exceeds a minimum-occurrence threshold T, and if both exceed their corresponding thresholds, judging the detection box of the current frame to be erroneous.
5. The multi-frame image pedestrian detection method oriented to a night motion scene according to claim 4, wherein the regression loss $L_{loc}$ during network training uses the smoothed $L_1$ loss, where $x$ is the ROI, $b$ is the coordinates predicted for the ROI, $g$ is the label coordinate value, and $f$ denotes the regressor:

$$L_{loc}(f(x,b),g)=\sum_{i\in\{x,y,w,h\}}\mathrm{smooth}_{L_1}\big(f_i(x,b)-g_i\big)$$

$$b=(b_x,b_y,b_w,b_h)$$

To ensure the invariance of the regression operation to scale and location, $L_{loc}$ operates on the offset vector $\Delta=(\delta_x,\delta_y,\delta_w,\delta_h)$:

$$\delta_x=(g_x-b_x)/b_w,\quad \delta_y=(g_y-b_y)/b_h,\quad \delta_w=\log(g_w/b_w),\quad \delta_h=\log(g_h/b_h)$$

A regularization operation is applied to $\Delta$ component-wise, e.g.:

$$\delta_x'=(\delta_x-u_x)/\sigma_x$$

The total loss of each $\mathrm{Head}_i$ ($i=1,2,3$) in the detection network is:

$$L(x^t,g)=L_{cls}(h_t(x^t),y^t)+\lambda\,[y^t\ge 1]\,L_{loc}(f_t(x^t,b^t),g)$$

$$y^t=\begin{cases}g_y, & \mathrm{IoU}(x^t,g)\ge u^t\\ 0, & \text{otherwise}\end{cases}$$

$$b^t=f_{t-1}(x^{t-1},b^{t-1})$$

wherein $T$ denotes the total number of branches superposed in the Cascade RCNN and $t$ the current branch; each branch $f_t$ in the Cascade RCNN is optimized with the training data $b^t$ of its own branch, $b^t$ being derived from $b^1$ through the outputs of all preceding branches; $\lambda$ is a weighting coefficient, with $\lambda=1$; $[y^t\ge 1]$ indicates that the regression loss is calculated only on positive samples; and $y^t$ is the label of $x^t$ calculated according to the above formula with the IOU threshold $u^t$.
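A minimal sketch of the regression-target construction and smoothed $L_1$ loss reconstructed above; the normalization statistics u and σ are illustrative assumptions (in practice they would be estimated from the training data).

```python
import numpy as np

def encode_deltas(b, g):
    """Delta = (dx, dy, dw, dh) for boxes b, g given as (x, y, w, h)."""
    return np.array([(g[0] - b[0]) / b[2],
                     (g[1] - b[1]) / b[3],
                     np.log(g[2] / b[2]),
                     np.log(g[3] / b[3])])

def normalize(delta, u, sigma):
    """Per-component regularization, e.g. delta_x' = (delta_x - u_x) / sigma_x."""
    return (delta - u) / sigma

def smooth_l1(x):
    """Elementwise smoothed L1: 0.5 * x^2 where |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

b = np.array([100.0, 100.0, 50.0, 120.0])   # predicted ROI box (x, y, w, h)
g = np.array([104.0, 98.0, 55.0, 110.0])    # label (ground-truth) box
u, sigma = np.zeros(4), np.array([0.1, 0.1, 0.2, 0.2])
delta = normalize(encode_deltas(b, g), u, sigma)
print(smooth_l1(delta).sum())               # L_loc contribution of this ROI
```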
6. A multi-frame image pedestrian detection device oriented to a night motion scene, characterized by comprising:
an enhancement module, configured to acquire a data set containing a plurality of night multi-frame images and to perform enhancement processing on the night multi-frame images in the data set;
a construction module, configured to construct a neural network, wherein the neural network comprises a feature extraction network and a prediction network, the feature extraction network fuses a plurality of backbone networks and comprises a feature pyramid network, a deformable convolution network is fused into each backbone network, and the prediction network comprises a dual-branch structure;
a training module, configured to train the neural network on the enhanced data set and to judge pedestrian targets according to the inter-frame IOU values of the multi-frame images during training, so as to obtain a pedestrian detection model; and
a detection module, configured to perform pedestrian detection on the night multi-frame images to be detected through the pedestrian detection model.
7. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the multi-frame image pedestrian detection method oriented to a night motion scene according to any one of claims 1 to 5.
8. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the multi-frame image pedestrian detection method oriented to a night motion scene according to any one of claims 1 to 5.
9. A computer program product, characterized in that the instructions in the computer program product, when executed by a processor, perform the multi-frame image pedestrian detection method oriented to a night motion scene according to any one of claims 1 to 5.
CN202010832374.5A 2020-08-18 2020-08-18 Multi-frame image pedestrian detection method and device for night motion scene Pending CN111814755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010832374.5A CN111814755A (en) 2020-08-18 2020-08-18 Multi-frame image pedestrian detection method and device for night motion scene


Publications (1)

Publication Number Publication Date
CN111814755A true CN111814755A (en) 2020-10-23

Family

ID=72859207


Country Status (1)

Country Link
CN (1) CN111814755A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091171A (en) * 2014-07-04 2014-10-08 华南理工大学 Vehicle-mounted far infrared pedestrian detection system and method based on local features
US20200082165A1 (en) * 2016-12-16 2020-03-12 Peking University Shenzhen Graduate School Collaborative deep network model method for pedestrian detection
CN110837769A (en) * 2019-08-13 2020-02-25 广州三木智能科技有限公司 Embedded far infrared pedestrian detection method based on image processing and deep learning
CN111460926A (en) * 2020-03-16 2020-07-28 华中科技大学 Video pedestrian detection method fusing multi-target tracking clues

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HE010103: "100kfps multi-object tracker (iou-tracker)", https://zhuanlan.zhihu.com/p/35291325 *
HIROSHI FUKUI et al.: "Pedestrian detection based on deep convolutional neural network with ensemble inference network", 2015 IEEE Intelligent Vehicles Symposium (IV) *
XUAN Xiaogang et al.: "An unsupervised video pedestrian detection and estimation algorithm", Journal of Hangzhou Dianzi University *
LUO Zhipeng: "Two titles and one runner-up at the CVPR 2020 night pedestrian detection challenges: an interpretation of the DeepBlueAI team's winning solutions", https://picture.iczhiku.com/weixin/MESSAGE1592815205387.HTML *
GE Junfeng et al.: "An improved night pedestrian detection algorithm", Computer Engineering *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528782B (en) * 2020-11-30 2024-02-23 北京农业信息技术研究中心 Underwater fish target detection method and device
CN112528782A (en) * 2020-11-30 2021-03-19 北京农业信息技术研究中心 Underwater fish target detection method and device
CN112365497A (en) * 2020-12-02 2021-02-12 上海卓繁信息技术股份有限公司 High-speed target detection method and system based on Trident Net and Cascade-RCNN structures
CN112819858A (en) * 2021-01-29 2021-05-18 北京博雅慧视智能技术研究院有限公司 Target tracking method, device and equipment based on video enhancement and storage medium
CN112819858B (en) * 2021-01-29 2024-03-22 北京博雅慧视智能技术研究院有限公司 Target tracking method, device, equipment and storage medium based on video enhancement
CN112686344A (en) * 2021-03-22 2021-04-20 浙江啄云智能科技有限公司 Detection model for rapidly filtering background picture and training method thereof
CN113378857A (en) * 2021-06-28 2021-09-10 北京百度网讯科技有限公司 Target detection method and device, electronic equipment and storage medium
CN113313078A (en) * 2021-07-02 2021-08-27 昆明理工大学 Lightweight night infrared image pedestrian detection method and system based on model optimization
CN113657467A (en) * 2021-07-29 2021-11-16 北京百度网讯科技有限公司 Model pre-training method and device, electronic equipment and storage medium
CN113657467B (en) * 2021-07-29 2023-04-07 北京百度网讯科技有限公司 Model pre-training method and device, electronic equipment and storage medium
CN113869361A (en) * 2021-08-20 2021-12-31 深延科技(北京)有限公司 Model training method, target detection method and related device
CN113780193A (en) * 2021-09-15 2021-12-10 易采天成(郑州)信息技术有限公司 RCNN-based cattle group target detection method and equipment
CN114972490A (en) * 2022-07-29 2022-08-30 苏州魔视智能科技有限公司 Automatic data labeling method, device, equipment and storage medium
CN114972490B (en) * 2022-07-29 2022-12-20 苏州魔视智能科技有限公司 Automatic data labeling method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111814755A (en) Multi-frame image pedestrian detection method and device for night motion scene
CN111160379B (en) Training method and device of image detection model, and target detection method and device
US11062123B2 (en) Method, terminal, and storage medium for tracking facial critical area
Bautista et al. Convolutional neural network for vehicle detection in low resolution traffic videos
CN108960266B (en) Image target detection method and device
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
Kalsotra et al. Background subtraction for moving object detection: explorations of recent developments and challenges
CN107944403B (en) Method and device for detecting pedestrian attribute in image
CN107273832B (en) License plate recognition method and system based on integral channel characteristics and convolutional neural network
CN102087790B (en) Method and system for low-altitude ground vehicle detection and motion analysis
Ippalapally et al. Object detection using thermal imaging
Luo et al. Traffic analytics with low-frame-rate videos
Wu et al. UAV imagery based potential safety hazard evaluation for high-speed railroad using Real-time instance segmentation
Thomas et al. Moving vehicle candidate recognition and classification using inception-resnet-v2
CN111814754A (en) Single-frame image pedestrian detection method and device for night scene
Toprak et al. Conditional weighted ensemble of transferred models for camera based onboard pedestrian detection in railway driver support systems
Oğuz et al. A deep learning based fast lane detection approach
Babaei Vehicles tracking and classification using traffic zones in a hybrid scheme for intersection traffic management by smart cameras
Ghasemi et al. A real-time multiple vehicle classification and tracking system with occlusion handling
CN111027482B (en) Behavior analysis method and device based on motion vector segmentation analysis
Anees et al. Deep learning framework for density estimation of crowd videos
Yang et al. High-speed rail pole number recognition through deep representation and temporal redundancy
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN116189286A (en) Video image violence behavior detection model and detection method
CN112949634B (en) Railway contact net nest detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201023)