CN112733754A - Infrared night vision image pedestrian detection method, electronic device and storage medium - Google Patents

Infrared night vision image pedestrian detection method, electronic device and storage medium Download PDF

Info

Publication number
CN112733754A
Authority
CN
China
Prior art keywords
convolution
pedestrian detection
pedestrian
night vision
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110051711.1A
Other languages
Chinese (zh)
Inventor
李承政
秦豪
赵明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yogo Robot Co Ltd
Original Assignee
Shanghai Yogo Robot Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yogo Robot Co Ltd filed Critical Shanghai Yogo Robot Co Ltd
Priority to CN202110051711.1A priority Critical patent/CN112733754A/en
Publication of CN112733754A publication Critical patent/CN112733754A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an infrared night vision image pedestrian detection method, an electronic device and a storage medium. In the pedestrian detection model, Quality Focal Loss (QFL) is used to constrain the classification or quality score of pedestrian images in a preset scene, and Distribution Focal Loss (DFL) is used to constrain the bounding-box regression of the pedestrian images, which improves the overall effect. Meanwhile, the backbone network adopts ShuffleNet-v2 as the feature extractor and uses channel shuffling to keep feature information flowing between channels, so that the model parameter count and inference speed of the infrared night vision image pedestrian detection method are greatly improved, solving the problems of complex structure, low speed and low efficiency in existing detection.

Description

Infrared night vision image pedestrian detection method, electronic device and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an infrared night vision image pedestrian detection method, electronic equipment and a storage medium.
Background
During normal driving, a robot often needs some understanding of its surroundings, for example by performing perception operations such as pedestrian detection and road segmentation on the acquired images. These operations assist normal driving: if a pedestrian is detected, obstacles can be avoided effectively and non-drivable areas can be bypassed, and human-computer interaction functions such as voice prompts can additionally be built on top of the pedestrian information. In the daytime, these environment-perception operations can be obtained directly by analyzing color images captured by the front camera. In dim light or at night, however, an ordinary camera cannot acquire usable data, so a night vision camera with infrared fill light is needed to capture images, which the trained models then predict on. Yet general detection models based on convolutional neural networks are too complex, and it is difficult to meet the real-time requirements of actual scenes when running them on robots and similar mobile edge devices, so a detection model that is fast, efficient and comparable in performance needs to be designed.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides an infrared night vision image pedestrian detection method, an electronic device and a storage medium, aiming to provide a fast, efficient and lightweight infrared night vision image pedestrian detection method and model, so as to solve the problems of complex structure, low speed and low efficiency of existing detection.
The technical scheme for solving the above technical problems is as follows: an infrared night vision image pedestrian detection method, comprising the following steps: step 1, acquiring pedestrian images in a preset scene, and preprocessing the pedestrian images in the preset scene to form a training set; step 2, constructing a pedestrian detection network model, wherein the pedestrian detection network model comprises a backbone network based on the ShuffleNet-v2 algorithm, a feature fusion module and a detection head; and step 3, training the pedestrian detection network model with the images of the training set and optimizing the network parameters of the pedestrian detection network model.
Preferably, in the pedestrian detection network model, the classification or quality score of the pedestrian image in the preset scene is constrained by Quality Focal Loss (QFL).
Preferably, in the pedestrian detection network model, the bounding-box regression of the pedestrian image in the preset scene is constrained by Distribution Focal Loss (DFL).
Preferably, the constraint function of Quality Focal Loss (QFL) is:
QFL(x) = -|y - x|^β · ((1 - y)·log(1 - x) + y·log(x)); where x is the sigmoid-activated confidence and y is the prediction target: when the pedestrian image is a negative sample, y is 0; when the pedestrian image is a positive sample, y is the IoU value corresponding to the sample.
Preferably, the constraint function of Distribution Focal Loss (DFL) is:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1})); where S_i and S_{i+1} are the predicted probabilities associated with the endpoints y_i and y_{i+1} of the prediction interval [y_i, y_{i+1}] into which the prediction target y falls.
Preferably, the backbone network includes a plurality of sub-modules, and the convolution layers of the sub-modules employ depth separable convolution, where the depth separable convolution is composed of a depth-wise convolution and a point-wise convolution.
Preferably, when the size of the input picture is D_F × D_F × M, the convolution kernel size is D_K × D_K × M, and the size of the output picture is D_F × D_F × N, convolving the input picture with the depth separable convolution comprises the following steps: S101, performing a channel-by-channel convolution operation on the M feature groups with the depth-wise convolution, where the number of parameters in the depth-wise convolution is (D_K × D_K × 1) × M; S102, converting channels with the point-wise convolution, where the number of parameters of the point-wise convolution is (1 × 1 × M) × N.
Preferably, the sub-modules include a basic module and a down-sampling module, and the structure of the basic module, in the processing order of its computing units, is: a channel separator for splitting the features of the input image; a first convolution layer using standard convolution with a 1 × 1 kernel; a second convolution layer using depth separable convolution with a 3 × 3 kernel, the first and second convolution layers being connected through an activation function; a third convolution layer using standard convolution with a 1 × 1 kernel, obtained after batch normalization of the second convolution layer; a concatenator for joining the feature channels separated by the channel separator together again; and a channel recombiner for recombining the feature channels concatenated by the concatenator.
Preferably, the feature fusion module adopts a two-layer feature fusion mode, the detection head comprises a multi-layer feature pyramid, and the pyramid levels share two convolution layers.
A second aspect of an embodiment of the present application provides an electronic device, including:
a memory; one or more processors; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the methods described above.
A third aspect of the application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described above.
The application provides an infrared night vision image pedestrian detection method, an electronic device and a storage medium. In the pedestrian detection model, Quality Focal Loss (QFL) is used to constrain the classification or quality score of pedestrian images in a preset scene, and Distribution Focal Loss (DFL) is used to constrain the bounding-box regression of the pedestrian images, which improves the overall effect. Meanwhile, the backbone network adopts ShuffleNet-v2 as the feature extractor; in its sub-modules it makes heavy use of depth-wise convolution and point-wise convolution, which together form depth-wise separable convolutions that replace traditional convolution layers, and it uses channel shuffling to keep feature information flowing between channels. As a result, the model parameter count and inference speed of the infrared night vision image pedestrian detection method are greatly improved, and the problems of complex structure, low speed and low efficiency in existing detection are solved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application, as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 is a flow chart diagram illustrating a pedestrian detection method using infrared night vision images according to an embodiment of the present application;
FIG. 2 is a model loss descent curve of the pedestrian detection network model shown in an embodiment of the present application;
FIG. 3 is another flow diagram of pedestrian detection with infrared night vision images in accordance with an embodiment of the present application;
FIG. 4 is a graph comparing a standard convolution with a depth separable convolution as shown in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a base module according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a downsampling module according to an embodiment of the present application;
FIG. 7 is a graph of a size distribution of bounding boxes in a data set according to an embodiment of the present application;
FIG. 8 is a diagram illustrating an example of the detection result according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Detailed Description
Preferred embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
At present, while driving in the daytime the robot can acquire images through a front-mounted camera and perform detection and recognition on them, but effective data cannot be acquired in night scenes, so a night vision camera is required to capture images and a detection algorithm is used to analyze the infrared night vision images. When the detection algorithm runs on the robot's development board, strict requirements are imposed on the parameter count and the computational cost of the detection model.
In order to solve the above problem, an embodiment of the present application provides an infrared night vision image pedestrian detection method, which solves the problem that an existing infrared night vision image detector is complex in design.
The technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flow chart of a pedestrian detection method using infrared night vision images according to a first embodiment of the present application, as shown in fig. 1, the method includes the following steps:
step S1, acquiring a pedestrian image in a preset scene, and preprocessing the pedestrian image in the preset scene to form a training set;
specifically, under the preset scene of two conditions of darkness and existence, targeted data acquisition work is carried out through an infrared night vision camera, so that an infrared night vision pedestrian image is obtained, corresponding marking work is carried out on the pedestrian image, a training set is formed after the pedestrian image is preprocessed, and the training set is used for inputting a detection model.
Step S2, constructing a pedestrian detection network model, wherein the pedestrian detection network model comprises a backbone network based on the ShuffleNet-v2 algorithm, a feature fusion module and a detection head;
specifically, the pedestrian Detection network model is modified based on a GFL (Generalized Focal Loss) model, the GFL inherits a full-volume operation in an FCOS (full volumetric One-Stage Object Detection) model, is easy to move deployment on edge devices, and greatly improves the problem of low confidence in the original FCOS model, and is friendly to setting a threshold of a Detection score in actual use, and the GFL model is a Detection model improved based on the FCOS and mainly improved to Generalized Focal Loss (Generalized Focal Loss); the backbone network adopts Shufflenet-v2 as a feature extractor for extracting high-dimensional features in the pedestrian image, filtering out invalid background information interference and generating a high-dimensional feature map.
In one embodiment, in the pedestrian detection network model, the classification or quality score of the pedestrian image in the preset scene is constrained by Quality Focal Loss (QFL); the constraint function of Quality Focal Loss (QFL) is:
QFL(x) = -|y - x|^β · ((1 - y)·log(1 - x) + y·log(x));
where x is the sigmoid-activated confidence and y is the prediction target: when the pedestrian image is a negative sample, y is 0; when the pedestrian image is a positive sample, y is the IoU value corresponding to the sample.
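To make the constraint concrete, the following is a minimal sketch of how a Quality Focal Loss of this form could be computed, written with PyTorch as an assumed framework (the embodiment does not name one); the function name and the choice β = 2 are illustrative, not taken from the text.

import torch
import torch.nn.functional as F

def quality_focal_loss(logits, target_quality, beta=2.0):
    # x = sigmoid(logits); y = target_quality (0 for negatives, IoU for positives).
    # QFL(x) = -|y - x|^beta * ((1 - y)*log(1 - x) + y*log(x))
    x = torch.sigmoid(logits)
    # Binary cross-entropy against the soft target y gives the bracketed term (with its sign).
    ce = F.binary_cross_entropy_with_logits(logits, target_quality, reduction="none")
    scale = (target_quality - x).abs().pow(beta)
    return (scale * ce).sum()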
In one embodiment, in the pedestrian detection network model, the bounding-box regression of the pedestrian image in the preset scene is constrained by Distribution Focal Loss (DFL); the constraint function of Distribution Focal Loss (DFL) is:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1}));
where S_i and S_{i+1} are the predicted probabilities associated with the endpoints y_i and y_{i+1} of the prediction interval [y_i, y_{i+1}] into which the prediction target y falls.
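Similarly, a hedged sketch of the Distribution Focal Loss over a discretised regression range is given below; it assumes the per-edge regression logits are defined over n_bins equally spaced values and that the continuous target y is already expressed in bin units, both of which are assumptions about details the text leaves open.

import torch

def distribution_focal_loss(logits, target):
    # logits: (N, n_bins) per regression edge; target: (N,) continuous y in bin units.
    s = torch.softmax(logits, dim=-1)                       # S over the discretised bins
    left = target.floor().long()                            # index i, with y_i <= y
    right = left + 1                                         # index i+1
    w_left = right.float() - target                          # (y_{i+1} - y)
    w_right = target - left.float()                          # (y - y_i)
    s_left = s.gather(-1, left.unsqueeze(-1)).squeeze(-1)    # S_i
    s_right = s.gather(-1, right.unsqueeze(-1)).squeeze(-1)  # S_{i+1}
    # DFL = -((y_{i+1} - y)*log(S_i) + (y - y_i)*log(S_{i+1}))
    return -(w_left * torch.log(s_left) + w_right * torch.log(s_right)).sum()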
Step S3, training the pedestrian detection network model with the images of the training set, and optimizing the network parameters of the pedestrian detection network model.
Specifically, in this step the model is trained for 120 epochs with the learning rate set to 0.1, decayed by a factor of 0.1 at epochs 80 and 110, and the batch size set to 160. The loss descent curve of the pedestrian detection network model is shown in fig. 2, where the abscissa is the number of iterations and the ordinate is the loss value; it can be seen that the model converges well.
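As an illustration of this schedule only, the following sketch runs 120 epochs with the learning rate decayed at epochs 80 and 110 using PyTorch's MultiStepLR; the SGD optimiser, the momentum value and the placeholder model are assumptions not specified in the text.

import torch

model = torch.nn.Conv2d(1, 8, 3)   # placeholder standing in for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 110], gamma=0.1)

for epoch in range(120):
    # ... iterate over the training set in batches of 160, compute QFL + DFL, back-propagate ...
    optimizer.step()
    scheduler.step()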
In this embodiment, Quality Focal Loss (QFL) is used in the pedestrian detection model to constrain the classification or quality score of pedestrian images in the preset scene, and Distribution Focal Loss (DFL) is used to constrain the bounding-box regression of the pedestrian images, which improves the overall effect. Meanwhile, the backbone network adopts ShuffleNet-v2 as the feature extractor and uses channel shuffling to keep feature information flowing between channels, so that the model parameter count and inference speed of the infrared night vision image pedestrian detection method are greatly improved, and the problems of complex structure, low speed and low efficiency of existing detection methods are solved.
Referring to fig. 3 and 4, fig. 3 is another flow chart of a pedestrian detection method using infrared night vision images according to a second embodiment of the present application. The method comprises the following specific steps:
in this embodiment, the backbone network includes a plurality of sub-modules, and the convolution layers of the sub-modules adopt depth separable convolution, where the depth separable convolution is composed of a depth convolution and a point-by-point convolution.
Referring to FIG. 3, when the size of the input picture is D_F × D_F × M, the convolution kernel size is D_K × D_K × M, and the size of the output picture is D_F × D_F × N, convolving the input picture with the depth separable convolution comprises the steps of:
S101, performing a channel-by-channel convolution operation on the M feature groups with the depth-wise convolution, where the number of parameters in the depth-wise convolution is (D_K × D_K × 1) × M;
S102, converting channels with the point-wise convolution, where the number of parameters of the point-wise convolution is (1 × 1 × M) × N.
Specifically, referring to fig. 4, fig. 4 is a comparison of the standard convolution and the depth separable convolution, where (a) in fig. 4 is the standard convolution, (b) is the depth-wise convolution, and (c) is the point-wise convolution.
When the size of the input picture is D_F × D_F × M, the convolution kernel size is D_K × D_K × M, and the size of the output picture is D_F × D_F × N, the standard convolution has (D_K × D_K × M) × N parameters. In the depth separable convolution, the convolution operation is split into M groups: the M feature groups are first processed by a channel-by-channel convolution using the depth-wise convolution, followed by a channel conversion using the point-wise convolution, where the number of parameters in the depth-wise convolution is (D_K × D_K × 1) × M and the number of parameters of the point-wise convolution is (1 × 1 × M) × N. The parameters of the depth separable convolution therefore occupy only a small fraction of those of the standard convolution; the ratio of the depth separable convolution to the standard convolution is:
(D_K × D_K × M + M × N) / (D_K × D_K × M × N) = 1/N + 1/(D_K × D_K).
therefore, the backbone network brings great improvement on model parameters and reasoning speed.
Referring to fig. 5 and 6, the sub-module includes a base module and a down-sampling module;
fig. 5 is a schematic structural diagram of a basic module, which has the following structure according to the processing sequence of a computing unit:
a channel separator (channel split) for separating features of an input image;
in the present embodiment, the channel separator separates the features of the input image into two feature channels;
the first convolution layer adopts standard convolution Conv, and the size of a convolution kernel is 1 x 1;
the second convolutional layer adopts a depth separable convolution DWConv, the size of a convolution kernel is 3 x 3, and the first convolutional layer and the second convolutional layer are connected through an activation function (ReLU);
the third convolution layer is standard convolution Conv, the convolution kernel size is 1 x 1, and the third convolution layer is obtained after Batch Normalization (BN) is carried out on the second convolution layer;
a concatenator (Concat) for concatenating the characteristic channels separated by the channel separator again;
in this embodiment, the concatenator is configured to concatenate two feature channels together;
a channel recombiner (channel shuffle) for recombining the characteristic channels concatenated by the concatenator;
in this example, the channel recombiner is used to recombine the two characteristic channels and output the result.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a down-sampling module, where the down-sampling module includes two output feature channels, and the structure of the down-sampling module is, according to a processing sequence of a computing unit:
the first feature channel includes: the first convolution layer adopts standard convolution Conv, the convolution kernel size is 1 x 1, the second convolution layer adopts depth separable convolution DWConv, the convolution kernel size is 3 x 3, the step size is 2, the third convolution layer is standard convolution Conv, and the convolution kernel size is 1 x 1;
the second characteristic channel includes: the first convolution layer adopts a depth separable convolution DWConv, the convolution kernel size is 3 x 3, the step size is 2, the second convolution layer adopts a standard convolution Conv, and the convolution kernel size is 1 x 1;
a concatenator (Concat) for concatenating the first feature channel with the second feature channel;
a channel recombiner (channel shuffle) for recombining the first eigen channel with the second eigen channel.
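The following is a minimal PyTorch sketch of the basic module and the down-sampling module just described (channel split, 1 × 1 standard convolutions, 3 × 3 depth-wise convolutions, concatenation and channel shuffle); the channel widths, normalisation placement details and the framework itself are assumptions, since the embodiment fixes only the kernel sizes, strides and processing order.

import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class BasicModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU(inplace=True),   # 1x1 Conv + ReLU
            nn.Conv2d(half, half, 3, padding=1, groups=half), nn.BatchNorm2d(half),  # 3x3 DWConv + BN
            nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU(inplace=True),   # 1x1 Conv
        )

    def forward(self, x):
        left, right = x.chunk(2, dim=1)                      # channel split
        out = torch.cat([left, self.branch(right)], dim=1)   # concat
        return channel_shuffle(out)                          # channel shuffle

class DownsampleModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Sequential(                        # first feature channel
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.branch2 = nn.Sequential(                        # second feature channel
            nn.Conv2d(channels, channels, 3, stride=2, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x)], dim=1)  # concat doubles the channels
        return channel_shuffle(out)                                  # channel shuffle

For example, with a 64-channel input the basic module preserves the resolution and channel count, while the down-sampling module halves the spatial size and doubles the channels:

x = torch.randn(1, 64, 80, 80)
print(BasicModule(64)(x).shape)        # torch.Size([1, 64, 80, 80])
print(DownsampleModule(64)(x).shape)   # torch.Size([1, 128, 40, 40])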
In one embodiment, the feature fusion module adopts a two-layer feature fusion mode, the detection head comprises a multi-layer feature pyramid, and the pyramid levels share two convolution layers.
Specifically, in the aspect of selecting the feature pyramid, only a 2-layer feature fusion mode is adopted according to the characteristics of the data set, that is, the feature pyramid is divided into two layers.
Referring to fig. 7, fig. 7 is a size distribution diagram of the bounding boxes in the data set (the x-axis is area and the y-axis is frequency); the two-layer feature fusion results obtained with the chosen parameters are indicated by the two red lines.
Specifically, each level of the multi-layer feature pyramid uses the same set of convolutions, i.e. the pyramid levels share convolutions. In this example, to match the scale of the feature extractor, the pyramid levels share two convolution layers, and these two layers are replaced with depth separable convolutions, which further lightens the model. Existing methods use four shared convolution layers on the multi-layer feature pyramid; although sharing parameters saves overall parameter weight in a general detector, it brings no speed-up for CPU inference on edge devices, and is therefore not suited to deployment on mobile edge devices such as robots.
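A hedged sketch of such a shared two-layer head is shown below: both pyramid levels pass through the same pair of depth-wise separable convolution layers before a 1 × 1 prediction layer. The channel count, the number of outputs and the framework are illustrative assumptions.

import torch
import torch.nn as nn

class SharedDetectionHead(nn.Module):
    def __init__(self, channels=96, num_outputs=5):
        super().__init__()
        # Two depth-wise separable convolution layers, shared across the pyramid levels.
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depth-wise
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),       # point-wise
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
        )
        self.predict = nn.Conv2d(channels, num_outputs, 1)

    def forward(self, pyramid_features):
        # The same weights are applied to every feature-pyramid level.
        return [self.predict(self.shared(f)) for f in pyramid_features]

levels = [torch.randn(1, 96, 40, 40), torch.randn(1, 96, 20, 20)]  # the two fused levels
print([o.shape for o in SharedDetectionHead()(levels)])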
In this embodiment, the backbone network adopts ShuffleNet-v2 as the feature extractor. In its sub-modules it makes heavy use of depth-wise convolution and point-wise convolution, which together form depth-wise separable convolutions that replace traditional convolution layers; since the parameters of a depth-wise separable convolution occupy only a small fraction of those of a standard convolution, the backbone network greatly improves model parameter count and inference speed, while channel shuffling keeps feature information flowing between channels, solving the problems of complex structure, low speed and low efficiency of existing detection.
In this embodiment, the infrared night vision image pedestrian detection method achieves accurate pedestrian detection in infrared night vision images of dim or dark scenes, as shown in fig. 8. Fig. 8 is an example of detection results; the labels in fig. 8 give the category and the confidence (here the category is person and the confidences are 0.94 and 1.00, the confidence being the probability that the detection belongs to the category), visually confirming the validity of the model. With an input size of 480 × 320, the fast and efficient detector designed by this method reaches an inference time of 77 ms on two A72 CPU cores, whereas existing models generally take 173 ms, so the inference speed is clearly improved.
Fig. 9 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Referring to fig. 9, the electronic device 400 includes a memory 410 and a processor 420.
The Processor 420 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 410 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processor 420 or other modules of the computer. The persistent storage device may be a read-write storage device, i.e. a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage device. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. Furthermore, the memory 410 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 410 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, miniSD card, Micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 410 has stored thereon executable code that, when processed by the processor 420, may cause the processor 420 to perform some or all of the methods described above.
The aspects of the present application have been described in detail hereinabove with reference to the accompanying drawings. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. Those skilled in the art should also appreciate that the acts and modules referred to in the specification are not necessarily required in the present application. In addition, it can be understood that the steps in the method of the embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and the modules in the device of the embodiment of the present application may be combined, divided, and deleted according to actual needs.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or electronic device, server, etc.), causes the processor to perform part or all of the various steps of the above-described method according to the present application.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the applications disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. An infrared night vision image pedestrian detection method is characterized by comprising the following steps:
step 1, acquiring a pedestrian image in a preset scene, and preprocessing the pedestrian image in the preset scene to form a training set;
step 2, constructing a pedestrian detection network model, wherein the pedestrian detection network model comprises a backbone network based on the ShuffleNet-v2 algorithm, a feature fusion module and a detection head;
and 3, training the pedestrian detection network model by adopting the image of the training set, and optimizing the network parameters of the pedestrian detection network model.
2. The infrared night vision image pedestrian detection method of claim 1, wherein, in the pedestrian detection network model, the classification or quality score of the pedestrian image in the preset scene is constrained by Quality Focal Loss (QFL).
3. The infrared night vision image pedestrian detection method of claim 1, wherein, in the pedestrian detection network model, the bounding-box regression of the pedestrian image in the preset scene is constrained by Distribution Focal Loss (DFL).
4. The infrared night vision image pedestrian detection method of claim 2, wherein the constraint function of Quality Focal Loss (QFL) is:
QFL(x) = -|y - x|^β · ((1 - y)·log(1 - x) + y·log(x));
where x is the sigmoid-activated confidence and y is the prediction target: when the pedestrian image is a negative sample, y is 0; when the pedestrian image is a positive sample, y is the IoU value corresponding to the sample.
5. The infrared night vision image pedestrian detection method of claim 3, wherein the constraint function of Distribution Focal Loss (DFL) is:
DFL(S_i, S_{i+1}) = -((y_{i+1} - y)·log(S_i) + (y - y_i)·log(S_{i+1}));
where S_i and S_{i+1} are the predicted probabilities associated with the endpoints y_i and y_{i+1} of the prediction interval [y_i, y_{i+1}] into which the prediction target y falls.
6. The infrared night vision image pedestrian detection method of claim 1, wherein the backbone network includes a plurality of sub-modules, the convolution layers of the sub-modules employ depth separable convolution, and the depth separable convolution is composed of a depth-wise convolution and a point-wise convolution.
7. The infrared night vision image pedestrian detection method of claim 6, wherein, when the size of the input picture is D_F × D_F × M, the convolution kernel size is D_K × D_K × M, and the size of the output picture is D_F × D_F × N, convolving the input picture with the depth separable convolution comprises the steps of:
S101, performing a channel-by-channel convolution operation on the M feature groups with the depth-wise convolution, where the number of parameters in the depth-wise convolution is (D_K × D_K × 1) × M;
S102, converting channels with the point-wise convolution, where the number of parameters of the point-wise convolution is (1 × 1 × M) × N.
8. The infrared night vision image pedestrian detection method of claim 7, wherein the sub-module includes a base module and a down-sampling module, the base module being configured to, in computational unit processing order:
a channel separator for separating features of an input image;
the first convolution layer adopts standard convolution, and the size of a convolution kernel is 1 x 1;
the second convolution layer adopts depth separable convolution, the convolution kernel size is 3 x 3, and the first convolution layer and the second convolution layer are connected through an activation function;
the third convolutional layer is standard convolution, the size of a convolution kernel is 1 x 1, and the third convolutional layer is obtained after batch normalization of the second convolutional layer;
the concatenator is used for concatenating the characteristic channels separated by the channel separator again;
and the channel recombiner is used for recombining the characteristic channels concatenated by the concatenator.
9. The infrared night vision image pedestrian detection method according to claim 1, characterized in that: the feature fusion module adopts a two-layer feature fusion mode, the detection head comprises a plurality of layers of feature pyramids, and the feature pyramids share two layers of convolution.
10. An electronic device, comprising: a memory; one or more processors; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-8.
11. A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the infrared night vision image pedestrian detection method of any one of claims 1 to 9.
CN202110051711.1A 2021-01-15 2021-01-15 Infrared night vision image pedestrian detection method, electronic device and storage medium Pending CN112733754A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110051711.1A CN112733754A (en) 2021-01-15 2021-01-15 Infrared night vision image pedestrian detection method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110051711.1A CN112733754A (en) 2021-01-15 2021-01-15 Infrared night vision image pedestrian detection method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN112733754A true CN112733754A (en) 2021-04-30

Family

ID=75591548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110051711.1A Pending CN112733754A (en) 2021-01-15 2021-01-15 Infrared night vision image pedestrian detection method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN112733754A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190377940A1 (en) * 2018-06-12 2019-12-12 Capillary Technologies International Pte Ltd People detection system with feature space enhancement
CN111967305A (en) * 2020-07-01 2020-11-20 华南理工大学 Real-time multi-scale target detection method based on lightweight convolutional neural network
CN112163545A (en) * 2020-10-12 2021-01-01 北京易华录信息技术股份有限公司 Head feature extraction method and device, electronic equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463759A (en) * 2022-04-14 2022-05-10 浙江霖研精密科技有限公司 Lightweight character detection method and device based on anchor-frame-free algorithm

Similar Documents

Publication Publication Date Title
US10691952B2 (en) Adapting to appearance variations when tracking a target object in video sequence
CN111291809B (en) Processing device, method and storage medium
WO2020238293A1 (en) Image classification method, and neural network training method and apparatus
KR102557512B1 (en) Context-Based Priors for Object Detection in Images
EP3289529B1 (en) Reducing image resolution in deep convolutional networks
US10083378B2 (en) Automatic detection of objects in video images
US20170032247A1 (en) Media classification
US20190318822A1 (en) Deep image classification of medical images
KR20170140214A (en) Filter specificity as training criterion for neural networks
CN111310604A (en) Object detection method and device and storage medium
US20220156528A1 (en) Distance-based boundary aware semantic segmentation
WO2022179606A1 (en) Image processing method and related apparatus
CN115272987A (en) MSA-yolk 5-based vehicle detection method and device in severe weather
CN112084897A (en) Rapid traffic large-scene vehicle target detection method of GS-SSD
Gruel et al. Neuromorphic event-based spatio-temporal attention using adaptive mechanisms
CN113449548A (en) Method and apparatus for updating object recognition model
CN112733754A (en) Infrared night vision image pedestrian detection method, electronic device and storage medium
CN113139924A (en) Image enhancement method, electronic device and storage medium
Wang et al. An image edge detection algorithm based on multi-feature fusion
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN112927250A (en) Edge detection system and method based on multi-granularity attention hierarchical network
Wadmare et al. A Novel Approach for Weakly Supervised Object Detection Using Deep Learning Technique
Vashisth et al. An Efficient Traffic Sign Classification and Recognition with Deep Convolutional Neural Networks
CN114925739B (en) Target detection method, device and system
Kadam et al. Convolutional neural network strategies for realtime object detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination