CN112163545A - Head feature extraction method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112163545A
CN112163545A
Authority
CN
China
Prior art keywords
human body
trained
body detection
image
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011087869.6A
Other languages
Chinese (zh)
Inventor
杨建权
赵阳
朱涛
张天麒
李高杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Hualu Group Co Ltd
Beijing E Hualu Information Technology Co Ltd
Original Assignee
China Hualu Group Co Ltd
Beijing E Hualu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Hualu Group Co Ltd, Beijing E Hualu Information Technology Co Ltd filed Critical China Hualu Group Co Ltd
Priority to CN202011087869.6A priority Critical patent/CN112163545A/en
Publication of CN112163545A publication Critical patent/CN112163545A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a head feature extraction method and device, an electronic device and a storage medium. The method comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a pre-trained human body detection neural network model to obtain a human body detection frame in the image to be detected; and inputting the human body detection frame into a pre-trained head feature detection classification model to obtain the human head features of the human body detection frame. By first obtaining the human body detection frame of the image to be detected and then detecting the head features within that frame, the method exploits the fact that, for the same pedestrian, the body occupies a far larger portion of the frame than the head, so missed detections are less likely and the accuracy of head feature detection is improved.

Description

Head feature extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of neural networks, in particular to a head feature extraction method and device, electronic equipment and a storage medium.
Background
With the development of deep learning technology, computer vision is applied ever more widely in social production and daily life, from game consoles that recognize gestures to police forces that intelligently track criminals through road surveillance: computers are gradually acquiring the human abilities to see, understand, analyze and respond. Every city has installed a considerable number of road surveillance cameras, which serve to record road conditions, regulate road behavior, trace how incidents unfolded, and deter accidents.
How to apply computer vision technology to automatically mine useful information from video has long been an important topic in smart city development. Object detection algorithms have always played a major role in road surveillance: a deep learning model can automatically infer the positions of objects of interest in a video. In the related art, head features are generally extracted directly for feature recognition. In a real street-scene surveillance video, however, the camera is typically mounted far away and high up to capture an overall view of a crowd of pedestrians or vehicles, and the head of a single pedestrian occupies only a small proportion of the frame's pixels, so heads may be missed and the head feature detection accuracy is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a head feature extraction method and device, an electronic device and a storage medium, to solve the prior-art problem that missed head detections lead to low head feature detection accuracy.
According to a first aspect, an embodiment of the present invention provides a head feature extraction method, including the following steps: acquiring an image to be detected; inputting the image to be detected into a pre-trained human body detection neural network model to obtain a human body detection frame in the image to be detected; and inputting the human body detection frame into a pre-trained head feature detection classification model to obtain the human body head features of the human body detection frame.
Optionally, the human body detection neural network model is a YOLOv3 neural network model, and the head feature detection classification model is a MobileNet classification network.
Optionally, the training process of the human detection neural network model includes: acquiring a first training sample, wherein the first training sample comprises scene images of different regions, different time periods and different illumination conditions and human body pre-labeling information in the scene images; acquiring a first pre-training neural network model trained according to a target data set; and carrying out transfer learning on the first pre-training neural network model according to the first training sample to obtain a human body detection neural network model.
Optionally, the training process of the head feature detection classification model includes: acquiring a second training sample, wherein the second training sample comprises a multi-class sample label, and the multi-class sample label is obtained according to a multi-label binarization function in a target function library; inputting the second training sample to a second pre-trained neural network model; and when the second pre-training neural network model meets the preset condition, obtaining a head feature detection classification model.
Optionally, the method further comprises: and when the second training sample is input into the second pre-training neural network model and the feature extraction error occurs, repeatedly inputting the second training sample with the feature extraction error into the second pre-training neural network model, and performing iterative training for the target times.
Optionally, obtaining the second training sample comprises: acquiring an image to be trained; inputting the image to be trained into a human body detection YOLO V3 model trained in advance to obtain a human body detection frame in the image to be trained; inputting the human body detection frame into a pre-trained feature classification YOLO V3 model to obtain a label corresponding to the human body detection frame, and constructing according to the human body detection frame and the corresponding label to obtain the second training sample.
Optionally, inputting the image to be detected into a pre-trained human detection neural network model, including: adjusting the size of the image to be detected to a first target size, and inputting the image to be detected of the first target size into a human body detection neural network model trained in advance; and/or inputting the human body detection frame into a pre-trained head feature detection classification model, comprising: and adjusting the size of the human body detection frame to a second target size, and inputting the human body detection frame with the second target size into a pre-trained head feature detection classification model.
According to a second aspect, an embodiment of the present invention provides a head feature extraction device, including: the image acquisition module to be detected is used for acquiring an image to be detected; the human body detection module is used for inputting the image to be detected into a pre-trained human body detection neural network model to obtain a human body detection frame in the image to be detected; and the head characteristic detection module is used for inputting the human body detection frame into a pre-trained head characteristic detection classification model to obtain the human body head characteristics of the human body detection frame.
According to a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the head feature extraction method according to the first aspect or any of the embodiments of the first aspect when executing the program.
According to a fourth aspect, an embodiment of the present invention provides a storage medium, on which computer instructions are stored, and the instructions, when executed by a processor, implement the steps of the head feature extraction method according to the first aspect or any of the embodiments of the first aspect.
The technical scheme of the invention has the following advantages:
according to the head feature extraction method/device provided by the embodiments of the invention, the human body detection frame of the image to be detected is obtained first, and the head features inside the detection frame are then detected, so that missed detections are less likely and the accuracy of head feature detection is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a specific example of a head feature extraction method in an embodiment of the present invention;
fig. 2 is a schematic block diagram of a specific example of a head feature extraction device in an embodiment of the present invention;
fig. 3 is a schematic block diagram of a specific example of an electronic device in the embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; the two elements may be directly connected or indirectly connected through an intermediate medium, or may be communicated with each other inside the two elements, or may be wirelessly connected or wired connected. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The present embodiment provides a method for extracting head features, as shown in fig. 1, including the following steps:
s101, acquiring an image to be detected;
for example, the image to be detected may be a street view image containing a pedestrian or a non-motor-vehicle rider, or an image containing a human body in some specific environment. It may be captured by a camera deployed at a street or extracted as a frame from a video. This embodiment does not limit the image to be detected or the way it is acquired; those skilled in the art can determine them as needed.
S102, inputting an image to be detected into a pre-trained human body detection neural network model to obtain a human body detection frame in the image to be detected;
for example, the pre-trained human body detection neural network model may be a model trained with YOLO V3 as its framework, or a model trained with a framework such as YOLO V4 or EfficientDet. The image to be detected is input into the pre-trained human body detection neural network model to obtain the human body detection frame in the image to be detected.
S103, inputting the human body detection frame into a pre-trained head feature detection classification model to obtain the human body head features of the human body detection frame.
Illustratively, the pre-trained head feature detection classification model may be a model trained on a framework such as the residual network ResNet series or the densely connected network DenseNet series. The human body detection frame is input into the pre-trained head feature detection classification model to obtain the head features of the human body detection frame. The head features can be chosen according to actual needs: for helmet-wearing enforcement, the head feature may be whether a helmet is worn; for checking whether masks are worn in public places, the head feature may be whether a mask is worn; for tracking a target, the head feature may be an accessory on the head (e.g., glasses, earrings), and so on.
According to the head feature extraction method provided by the embodiment of the invention, the human body detection frame of the image to be detected is obtained first, and the head features inside the detection frame are then detected. Because the body occupies a far larger portion of the frame than the head, missed detections are less likely and the accuracy of head feature detection is improved.
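The two-stage flow of steps S101 to S103 can be sketched as follows. Note that `detect_bodies` and `classify_head` are hypothetical stand-ins for the trained detection and classification models, with hard-coded stub outputs; a real implementation would run YOLO V3 and MobileNet inference here.

```python
def detect_bodies(image):
    """Stand-in for the pre-trained human body detection model (S102).

    Returns a list of body bounding boxes (x, y, w, h). Stub values
    replace real YOLO V3 inference for illustration.
    """
    return [(100, 50, 80, 200), (300, 60, 75, 190)]

def classify_head(body_crop):
    """Stand-in for the head feature detection classification model (S103).

    Returns a 0/1 vector over head attributes, e.g. [helmet, glasses,
    mask]. Stub values replace real MobileNet inference.
    """
    return [0, 1, 1]

def extract_head_features(image):
    """Two-stage extraction: detect whole bodies first, then classify
    the head features inside each body detection frame."""
    results = []
    for box in detect_bodies(image):          # S102: body detection
        x, y, w, h = box
        crop = ("crop", x, y, w, h)           # placeholder for the image crop
        results.append((box, classify_head(crop)))  # S103: head features
    return results

features = extract_head_features("street_view.jpg")
```

Because detection runs on the large body region rather than the small head region, the first stage is far less likely to miss a pedestrian, which is the core idea of the method.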
As an optional implementation manner of this embodiment, the human body detection neural network model is a YOLOv3 neural network model, and the head feature detection classification model is a MobileNet classification network. The processing speed of the YOLOv3 neural network model (about 40 ms per frame) meets the requirement of real-time video processing, and it maintains a relatively high mAP (mAP@0.5 = 58) despite its outstanding speed. MobileNet is a lightweight classification network: its depthwise-separable network structure reduces the model's computation by nearly an order of magnitude with an accuracy loss of less than one percent, so the classification speed can be improved.
As an optional implementation manner of this embodiment, the training process of the human detection neural network model includes:
firstly, acquiring a first training sample, wherein the first training sample comprises scene images with different regions, different time periods and different illumination conditions and human body pre-labeling information in the scene images;
for example, the first training sample may be obtained by using cameras deployed in different regions to capture street-view images in different time periods, weather conditions and illumination conditions, or by obtaining scene images for different regions, weather conditions and illumination conditions from a database, and then pre-labeling the human body positions in the obtained street-view/scene images. The pre-labeling may be done manually, or by first applying the official YOLO yolov3.weights file to a YOLO V3 neural network model, labeling the human body positions in the scene images with that model, and then manually correcting and refining the labels; the yolov3.weights file contains neural network weight parameters trained on the ImageNet and coco datasets. This embodiment does not limit how the first training sample is obtained; those skilled in the art can determine it as needed.
Secondly, acquiring a first pre-training neural network model trained according to a target data set;
illustratively, the target dataset may be the ImageNet dataset and/or the coco dataset. The first pre-trained neural network model trained on the target dataset may be obtained by downloading the yolov3.weights file from the official YOLO website and loading it into the YOLO v3 neural network framework.
And thirdly, performing transfer learning on the first pre-trained neural network model according to the first training sample to obtain the human body detection neural network model.
Illustratively, a FineTune strategy is adopted to fine-tune the first pre-trained neural network model, because the distribution of scene images in the actual scene differs from that of the ImageNet and/or coco datasets, which would otherwise cause false and missed detections.
Because the ImageNet and coco datasets contain a huge volume of person data covering the distribution of person data in many kinds of scenes, transfer learning, compared with training from scratch on the first training sample, helps improve the model's generalization to unfamiliar data distributions and avoids limiting the model's adaptability to the distribution of the first training sample. Specifically, the transfer learning freezes the first 81 layers of YOLO v3 and only trains and adjusts the weight coefficients of the later layers. The command to freeze the weights of the first 81 layers is as follows:

darknet partial cfg/yolov3.cfg yolov3.weights yolov3.conv.81 81

This produces a pre-trained model named yolov3.conv.81 in the current path. The first pre-trained neural network model is then trained with the first training sample. During training, the validation-set accuracy first rises and then falls; the fall indicates that the model has over-fitted, so the point of highest validation accuracy is taken as the optimal weights of the training, and training stops when the validation accuracy peaks. The command to train the weights of the layers after layer 81 using the first training sample is as follows:

darknet detector train cfg/coco.data cfg/yolov3.cfg yolov3.conv.81 -gpus 0,1
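The freeze-then-fine-tune idea behind these darknet commands can be illustrated abstractly. This is not darknet code; the layer count of 106 is an assumption about the YOLO v3 architecture, while the 81-layer cutoff comes from the text.

```python
def split_trainable(num_layers, freeze_up_to):
    """Mark the first `freeze_up_to` layers as frozen and the rest as
    trainable, mirroring `darknet partial ... 81` followed by training
    only the later layers on the new data."""
    return [{"index": i, "trainable": i >= freeze_up_to}
            for i in range(num_layers)]

# Freeze the first 81 layers (assumed 106-layer YOLO v3 backbone):
layers = split_trainable(num_layers=106, freeze_up_to=81)
frozen = sum(1 for layer in layers if not layer["trainable"])
trainable = sum(1 for layer in layers if layer["trainable"])
```

Only the `trainable` layers would receive gradient updates during fine-tuning on the first training sample; the frozen layers keep the general person-detection features learned from ImageNet/coco.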
as an optional implementation manner of this embodiment, the training process of the head feature detection classification model includes:
firstly, obtaining a second training sample, wherein the second training sample comprises a multi-class sample label, and the multi-class sample label is obtained according to a multi-label binarization function in a target function library;
illustratively, the objective function library may be the sklearn library, and the multi-label binarization function may be its MultiLabelBinarizer. The second training sample may be obtained by manually labeling head features on the human bodies found in the scene images, for example putting a helmet label on every helmet-wearing head frame in the current scene image, a glasses label on every glasses-wearing head frame, a mask label on every mask-wearing head frame, and so on. This embodiment does not limit how the head feature labeling is performed; those skilled in the art can determine it as needed. After the head features of all scene images are labeled, the MultiLabelBinarizer function in the sklearn library is used to convert labels such as helmet/no helmet, glasses/no glasses and mask/no mask into one-dimensional 0/1 vectors (0 meaning absent, 1 meaning present); each position of the fused 0/1 vector corresponds to a fixed label. For example, the vector [0, 1, 1, …] means [no helmet, with glasses, with mask, …].
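The label fusion described above can be sketched without sklearn; the `binarize` helper below is a minimal stand-in that mimics what MultiLabelBinarizer produces for a fixed label order (the label names follow the example in the text).

```python
def binarize(sample_labels, classes):
    """Convert per-sample label sets into fixed-order 0/1 vectors,
    as sklearn's MultiLabelBinarizer does (1 = attribute present)."""
    index = {c: i for i, c in enumerate(classes)}
    vectors = []
    for labels in sample_labels:
        vec = [0] * len(classes)
        for label in labels:
            vec[index[label]] = 1
        vectors.append(vec)
    return vectors

classes = ["helmet", "glasses", "mask"]
# One head with glasses and a mask, one head with only a helmet:
vecs = binarize([{"glasses", "mask"}, {"helmet"}], classes)
```

Each vector position always maps to the same attribute, so a single classifier output can predict all head attributes at once instead of forcing mutually exclusive classes.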
Secondly, inputting a second training sample into a second pre-training neural network model;
and thirdly, when the second pre-training neural network model meets the preset condition, obtaining a head feature detection classification model.
For example, the predetermined condition may be that the loss function value of the second pre-trained neural network model is smaller than a preset threshold, or that the accuracy of the validation set reaches a preset threshold. The preset condition is not limited in this embodiment, and can be determined by those skilled in the art as needed.
According to the head feature extraction method provided by the embodiment of the invention, multi-class sample labels are used during training instead of simple single-class labels, avoiding mutual exclusion among features, so the representation capability of head feature extraction is enhanced.
As an optional implementation manner of this embodiment, the head feature extraction method further includes:
and when the second training sample is input into the second pre-training neural network model and the feature extraction error occurs, repeatedly inputting the second training sample with the feature extraction error into the second pre-training neural network model, and performing iterative training for the target times.
Illustratively, when the second training sample is input into the second pre-trained neural network model and the extracted features are manually verified, the samples with extraction errors are placed into an error-prone feature set; this set usually contains data that are hard even for human eyes to recognize, with blurred head features (caused by human motion, camera shake, and the like). The error-prone feature set is merged into the training set for incremental training; after three rounds of iterative training, the model's head feature extraction accuracy can exceed 95%, so the accuracy of head feature extraction is improved.
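The error-prone-set retraining loop can be sketched as follows. Here `predict` and the dictionary "model" are hypothetical stand-ins: a real system would run the classification network and apply gradient updates, whereas this toy version simply records corrected labels.

```python
def predict(model, sample):
    """Hypothetical stand-in for running model inference on a sample."""
    return model.get(sample)

def mine_hard_examples(model, samples, labels):
    """Collect the samples whose extracted features are wrong into the
    error-prone set, as described in the text."""
    return [(s, y) for s, y in zip(samples, labels)
            if predict(model, s) != y]

def iterative_training(model, samples, labels, target_rounds=3):
    """Re-feed the error-prone set into the model for a target number
    of iterative training rounds (the text uses three)."""
    for _ in range(target_rounds):
        hard = mine_hard_examples(model, samples, labels)
        for sample, label in hard:   # incremental training step
            model[sample] = label    # stand-in for a weight update
    return model

model = {"a": 1}                     # toy "model": sample -> label
model = iterative_training(model, ["a", "b"], [1, 0])
```

The point of the loop is that only the samples the model currently gets wrong are re-fed, concentrating training effort on blurred, hard-to-recognize heads.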
As an optional implementation manner of this embodiment, the obtaining the second training sample includes:
firstly, acquiring an image to be trained;
for example, the image to be trained may be a street view image containing a pedestrian or a non-motor vehicle driver, or may be an image containing a human body in a specific environment. The acquisition mode of the image to be trained can be shooting through a camera arranged at the street or frame-extracting video, the embodiment does not limit the image to be trained and the mode of acquiring the image to be trained, and a person skilled in the art can determine the mode as required.
Secondly, inputting the image to be trained into a human body detection YOLO V3 model trained in advance to obtain a human body detection frame in the image to be trained; inputting the human body detection frame into a pre-trained feature classification YOLO V3 model to obtain a label corresponding to the human body detection frame, and constructing according to the human body detection frame and the corresponding label to obtain a second training sample.
Illustratively, a pre-trained human detection YOLO V3 model and a pre-trained feature classification YOLO V3 model are concatenated to obtain a head detection box and a corresponding label. And taking the head detection frame and the corresponding label as a second training sample, namely taking the detection result of the pre-trained neural network model as the training sample of the head feature detection classification model.
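The cascade that bootstraps the second training sample can be sketched with stand-in models; both `yolo_body` and `yolo_feature` are hypothetical stubs for the two pre-trained YOLO V3 models, returning fixed values for illustration.

```python
def yolo_body(image):
    """Stub for the pre-trained human detection YOLO V3 model:
    returns the body crops found in one image."""
    return ["crop_0", "crop_1"]

def yolo_feature(crop):
    """Stub for the pre-trained feature classification YOLO V3 model:
    returns the label for one body crop."""
    return "helmet"

def build_second_training_sample(images):
    """Cascade the two models: each (crop, label) pair becomes one
    auto-labeled training sample for the classification model."""
    samples = []
    for image in images:
        for crop in yolo_body(image):
            samples.append((crop, yolo_feature(crop)))
    return samples

dataset = build_second_training_sample(["frame_1.jpg"])
```

Run over many video frames, this cascade turns two small hand-labeled models into a large auto-labeled dataset, which is the labeling-cost saving the text describes.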
Compared with the feature classification YOLO V3 model, the head feature detection classification model has a simpler network and a faster processing speed, so in this embodiment the head feature detection classification model is chosen to perform head feature detection on the actual image to be detected. During the training of the head feature detection classification model, however, the pictures input into the network are human-shaped frames while the classification depends on local features of the head, so a small amount of data (a Baidu dataset) cannot teach the model to focus on the feature points of interest, and the model's generalization capability tends to be poor. A large number of training samples is therefore required to achieve a good classification effect.
In this embodiment, the pre-trained human detection YOLO V3 model and the pre-trained feature classification YOLO V3 model are connected in series; the input of the pre-trained feature classification YOLO V3 model is the output of the pre-trained human detection YOLO V3 model, which already contains the position information of the head, so a model trained on a small dataset (500+) has relatively good generalization capability. Using neural network models trained on small datasets to provide a large number of training samples for the head feature detection classification model saves the cost of extensive manual labeling, speeds up the deployment of the project, and offers an approach for quickly building a large-scale classification dataset.
As an optional implementation manner of this embodiment, inputting an image to be detected to a pre-trained human detection neural network model includes: adjusting the size of an image to be detected to a first target size, and inputting the image to be detected of the first target size into a human body detection neural network model trained in advance; and/or
Inputting a human body detection frame into a pre-trained head feature detection classification model, comprising: and adjusting the size of the human body detection frame to a second target size, and inputting the human body detection frame with the second target size into the pre-trained head feature detection classification model.
Illustratively, the first target size may be 608 × 608, and the second target size may be 416 × 416 (or 320 × 320); this embodiment does not limit the first and second target sizes, and those skilled in the art can determine them as needed. Adjusting the size of the image to be detected and/or the human body detection frame improves both the accuracy and the speed of detection.
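The resizing step can be sketched as a plain aspect-preserving (letterbox) computation. The 608 and 416/320 sizes come from the text; the letterbox strategy itself is an assumption, since YOLO-style preprocessing commonly pads rather than stretches, and the patent does not specify the resize method.

```python
def letterbox_dims(width, height, target):
    """Compute the scaled size and symmetric padding needed to fit an
    image into a target x target square without distorting its aspect
    ratio (the remaining border would be filled with a constant color)."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y

# A 1920x1080 surveillance frame resized for the 608x608 detection input:
dims = letterbox_dims(1920, 1080, 608)
```

The same helper applies to the second stage: a body crop would be fitted into the 416 × 416 (or 320 × 320) classification input.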
An embodiment of the present invention provides a head feature extraction device, as shown in fig. 2, including:
an image to be detected acquisition module 201, configured to acquire an image to be detected; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The human body detection module 202 is used for inputting the image to be detected into a pre-trained human body detection neural network model to obtain a human body detection frame in the image to be detected; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The head feature detection module 203 is configured to input the human body detection frame into a pre-trained head feature detection classification model to obtain human body head features of the human body detection frame. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
As an optional implementation manner of this embodiment, the human body detection neural network model is a YOLOv3 neural network model, and the head feature detection classification model is a MobileNet classification network. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
As an optional implementation manner of this embodiment, the human body detection module includes:
the first training sample acquisition module is used for acquiring a first training sample, wherein the first training sample comprises scene images of different regions, different time periods and different illumination conditions and human body pre-labeling information in the scene images; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The first pre-training neural network model acquisition module is used for acquiring a first pre-training neural network model trained according to a target data set; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
And the human body detection neural network model determining module is used for carrying out transfer learning on the first pre-trained neural network model according to the first training sample to obtain the human body detection neural network model. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
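A minimal NumPy sketch of the transfer-learning step, under the assumption that transfer learning here means freezing a backbone pre-trained on the target data set (e.g. a public detection data set) and fitting only a new task head on the first training sample; all shapes, data, and the squared loss are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a backbone pre-trained on the target data set; in this
# transfer-learning sketch it is frozen while a new head is fitted.
backbone_w = rng.normal(size=(8, 4))
backbone_before = backbone_w.copy()
head_w = np.zeros((4, 1))  # re-initialised head, trained on the new samples

x = rng.normal(size=(16, 8))   # toy stand-in for the new scene images
y = rng.normal(size=(16, 1))   # toy stand-in for the pre-labelled targets

feats = x @ backbone_w          # backbone features, computed once (frozen)
for _ in range(200):            # gradient descent on the head only
    grad = feats.T @ (feats @ head_w - y) / len(x)
    head_w -= 0.01 * grad
```

The point of the sketch is structural: the loop never touches `backbone_w`, so the pre-trained weights are preserved while `head_w` adapts to the new samples.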
As an optional implementation manner of this embodiment, the head feature detection module includes:
the second training sample acquisition module is used for acquiring a second training sample, wherein the second training sample comprises a multi-classification sample label, and the multi-classification sample label is obtained according to a multi-label binarization function in the target function library; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The second training sample input module is used for inputting a second training sample to the second pre-training neural network model; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
And the head feature detection classification model determining module is used for obtaining a head feature detection classification model when the second pre-training neural network model meets the preset condition. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
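The "multi-label binarization function in the target function library" is not named in the embodiment; it plausibly corresponds to something like scikit-learn's `MultiLabelBinarizer`, which is an assumption here. A minimal plain-Python equivalent, showing how per-sample head-feature label sets become the fixed-length multi-classification sample labels:

```python
def multilabel_binarize(samples, classes=None):
    """Minimal analogue of a multi-label binarization function:
    turns per-sample label sets into fixed-length 0/1 vectors."""
    if classes is None:
        classes = sorted({label for s in samples for label in s})
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for s in samples:
        row = [0] * len(classes)
        for label in s:
            row[index[label]] = 1  # mark each label present in the sample
        rows.append(row)
    return classes, rows

# Hypothetical head-feature labels for three crops.
labels = [{"hat", "mask"}, {"glasses"}, {"hat"}]
classes, y = multilabel_binarize(labels)
print(classes, y)
```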
As an optional implementation manner of this embodiment, the apparatus further includes:
and the iterative training module is used for, when a feature extraction error occurs after the second training sample is input into the second pre-trained neural network model, repeatedly inputting the second training sample having the feature extraction error into the second pre-trained neural network model and performing iterative training for a target number of times. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
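The iterative-training rule — re-feed the samples whose features were extracted incorrectly, for a target number of rounds — can be sketched with a hypothetical toy model that "learns" a sample after one failed round; the model and its update are placeholders, not the embodiment's network.

```python
class ToyModel:
    """Hypothetical classifier that learns a sample after seeing it fail once."""
    def __init__(self):
        self.known = set()

    def predict_ok(self, sample):
        return sample in self.known  # "no feature extraction error"

    def fit_on(self, sample):
        self.known.add(sample)       # stand-in for one training update

def iterate_on_errors(model, samples, target_rounds):
    """Repeatedly re-feed error samples for at most target_rounds rounds."""
    errors = list(samples)
    for _ in range(target_rounds):
        errors = [s for s in errors if not model.predict_ok(s)]
        for s in errors:
            model.fit_on(s)
        if not errors:
            break
    return errors

model = ToyModel()
remaining = iterate_on_errors(model, ["a", "b", "c"], target_rounds=3)
print(remaining)
```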
As an optional implementation manner of this embodiment, the second training sample obtaining module includes:
the image to be trained acquisition module is used for acquiring an image to be trained; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The image to be trained input module is used for inputting the image to be trained into a human body detection YOLO V3 model trained in advance to obtain a human body detection frame in the image to be trained; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
And the second training sample determining module is used for inputting the human body detection frame into a pre-trained feature classification YOLO V3 model to obtain a label corresponding to the human body detection frame, and constructing the second training sample from the human body detection frame and the corresponding label. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
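This auto-labelling construction of the second training sample — one pre-trained model produces the crops, another labels them — can be sketched with stand-in functions; the box coordinates, toy image, and threshold below are illustrative assumptions, not values from the embodiment.

```python
def detect_boxes(image):
    # Placeholder for the pre-trained human detection YOLO V3 model.
    return [(0, 0, 2, 2), (2, 2, 4, 4)]

def classify_box(crop):
    # Placeholder for the pre-trained feature classification YOLO V3 model,
    # used here purely as an automatic labeller; threshold is arbitrary.
    return "hat" if sum(map(sum, crop)) > 4 else "no_hat"

def build_training_samples(image):
    """Pair each detected crop with its auto-generated label to form
    the (crop, label) entries of the second training sample."""
    samples = []
    for (x1, y1, x2, y2) in detect_boxes(image):
        crop = [row[x1:x2] for row in image[y1:y2]]
        samples.append((crop, classify_box(crop)))
    return samples

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [1, 1, 2, 2],
         [1, 1, 2, 2]]
samples = build_training_samples(image)
print([label for _, label in samples])
```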
As an optional implementation manner of this embodiment, the human body detection module includes:
the first size adjusting module is used for adjusting the size of the image to be detected to a first target size and inputting the image to be detected of the first target size to a pre-trained human body detection neural network model; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again. And/or
A head feature detection module comprising: and the second size adjusting module is used for adjusting the size of the human body detection frame to a second target size and inputting the human body detection frame with the second target size into the pre-trained head feature detection classification model. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The embodiment of the present application also provides an electronic device, as shown in fig. 3, including a processor 310 and a memory 320, where the processor 310 and the memory 320 may be connected by a bus or in other manners.
Processor 310 may be a Central Processing Unit (CPU). The Processor 310 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or any combination thereof.
The memory 320, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the head feature extraction method in the embodiment of the present invention. The processor 310 executes the various functional applications and data processing of the electronic device by running the non-transitory software programs, instructions, and modules stored in the memory 320.
The memory 320 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor, and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 320 may optionally include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 320 and, when executed by the processor 310, perform a head feature extraction method as in the embodiment shown in fig. 1.
The details of the electronic device may be understood with reference to the corresponding related description and effects in the embodiment shown in fig. 1, and are not described herein again.
The present embodiment also provides a computer storage medium storing computer-executable instructions that can execute the head feature extraction method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the above kinds.
It should be understood that the above examples are provided only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to exhaustively list all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A head feature extraction method is characterized by comprising the following steps:
acquiring an image to be detected;
inputting the image to be detected into a pre-trained human body detection neural network model to obtain a human body detection frame in the image to be detected;
and inputting the human body detection frame into a pre-trained head feature detection classification model to obtain the human body head features of the human body detection frame.
2. The method of claim 1, wherein the human detection neural network model is a YOLOv3 neural network model and the head feature detection classification model is a MobileNet classification network.
3. The method of claim 1, wherein the training process of the human detection neural network model comprises:
acquiring a first training sample, wherein the first training sample comprises scene images of different regions, different time periods and different illumination conditions and human body pre-labeling information in the scene images;
acquiring a first pre-training neural network model trained according to a target data set;
and carrying out transfer learning on the first pre-training neural network model according to the first training sample to obtain a human body detection neural network model.
4. The method of claim 1, wherein the training process of the head feature detection classification model comprises:
acquiring a second training sample, wherein the second training sample comprises a multi-class sample label, and the multi-class sample label is obtained according to a multi-label binarization function in a target function library;
inputting the second training sample to a second pre-trained neural network model;
and when the second pre-training neural network model meets the preset condition, obtaining a head feature detection classification model.
5. The method of claim 4, further comprising:
and when a feature extraction error occurs after the second training sample is input into the second pre-trained neural network model, repeatedly inputting the second training sample having the feature extraction error into the second pre-trained neural network model, and performing iterative training for a target number of times.
6. The method of claim 4, wherein obtaining second training samples comprises:
acquiring an image to be trained;
inputting the image to be trained into a human body detection YOLO V3 model trained in advance to obtain a human body detection frame in the image to be trained;
inputting the human body detection frame into a pre-trained feature classification YOLO V3 model to obtain a label corresponding to the human body detection frame, and constructing the second training sample from the human body detection frame and the corresponding label.
7. The method of claim 1, wherein inputting the image to be detected into a pre-trained human detection neural network model comprises: adjusting the size of the image to be detected to a first target size, and inputting the image to be detected of the first target size into a human body detection neural network model trained in advance; and/or
Inputting the human body detection frame into a pre-trained head feature detection classification model, comprising: and adjusting the size of the human body detection frame to a second target size, and inputting the human body detection frame with the second target size into a pre-trained head feature detection classification model.
8. A head feature extraction device characterized by comprising:
the image acquisition module to be detected is used for acquiring an image to be detected;
the human body detection module is used for inputting the image to be detected into a pre-trained human body detection neural network model to obtain a human body detection frame in the image to be detected;
and the head characteristic detection module is used for inputting the human body detection frame into a pre-trained head characteristic detection classification model to obtain the human body head characteristics of the human body detection frame.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the head feature extraction method according to any one of claims 1 to 7 are implemented when the program is executed by the processor.
10. A storage medium having stored thereon computer instructions, which when executed by a processor, carry out the steps of the head feature extraction method of any one of claims 1 to 7.
CN202011087869.6A 2020-10-12 2020-10-12 Head feature extraction method and device, electronic equipment and storage medium Pending CN112163545A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011087869.6A CN112163545A (en) 2020-10-12 2020-10-12 Head feature extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112163545A true CN112163545A (en) 2021-01-01

Family

ID=73866529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011087869.6A Pending CN112163545A (en) 2020-10-12 2020-10-12 Head feature extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112163545A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018121690A1 (en) * 2016-12-29 2018-07-05 北京市商汤科技开发有限公司 Object attribute detection method and device, neural network training method and device, and regional detection method and device
CN108416265A (en) * 2018-01-30 2018-08-17 深圳大学 A kind of method for detecting human face, device, equipment and storage medium
US10083352B1 (en) * 2017-05-22 2018-09-25 Amazon Technologies, Inc. Presence detection and detection localization
CN110414428A (en) * 2019-07-26 2019-11-05 厦门美图之家科技有限公司 A method of generating face character information identification model
CN110490099A (en) * 2019-07-31 2019-11-22 武汉大学 A kind of subway common location stream of people's analysis method based on machine vision
CN111275058A (en) * 2020-02-21 2020-06-12 上海高重信息科技有限公司 Safety helmet wearing and color identification method and device based on pedestrian re-identification
CN111489284A (en) * 2019-01-29 2020-08-04 北京搜狗科技发展有限公司 Image processing method and device for image processing
CN111598066A (en) * 2020-07-24 2020-08-28 之江实验室 Helmet wearing identification method based on cascade prediction

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733754A (en) * 2021-01-15 2021-04-30 上海有个机器人有限公司 Infrared night vision image pedestrian detection method, electronic device and storage medium
CN112883840A (en) * 2021-02-02 2021-06-01 中国人民公安大学 Power transmission line extraction method based on key point detection
CN112883840B (en) * 2021-02-02 2023-07-07 中国人民公安大学 Power transmission line extraction method based on key point detection
CN113240671A (en) * 2021-06-16 2021-08-10 重庆科技学院 Water turbine runner blade defect detection method based on YoloV4-Lite network
CN113762190A (en) * 2021-09-15 2021-12-07 中科微至智能制造科技江苏股份有限公司 Neural network-based parcel stacking detection method and device
CN113762190B (en) * 2021-09-15 2024-03-29 中科微至科技股份有限公司 Method and device for detecting package stacking based on neural network
CN115713715A (en) * 2022-11-22 2023-02-24 天津安捷物联科技股份有限公司 Human behavior recognition method and system based on deep learning
CN115713715B (en) * 2022-11-22 2023-10-31 天津安捷物联科技股份有限公司 Human behavior recognition method and recognition system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination