CN111950507A - Data processing and model training method, device, equipment and medium - Google Patents

Data processing and model training method, device, equipment and medium

Info

Publication number: CN111950507A (application CN202010863364.8A)
Authority: CN (China)
Prior art keywords: sample, detection frame, target detection, information, frame
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN111950507B (granted publication)
Inventors: 朱宏吉, 张彦刚
Current Assignee: Beijing Orion Star Technology Co Ltd
Original Assignee: Beijing Orion Star Technology Co Ltd
Application filed by Beijing Orion Star Technology Co Ltd; priority to CN202010863364.8A

Classifications

    • G06V 40/161 — Human faces: detection; localisation; normalisation
    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/25 — Image preprocessing: determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 20/53 — Scenes, surveillance or monitoring: recognition of crowd images, e.g. recognition of crowd congestion
    • G06V 2201/07 — Indexing scheme relating to image or video recognition or understanding: target detection


Abstract

The invention discloses a data processing method, a model training method, and a corresponding apparatus, device, and medium, which address the large computation cost, low efficiency, and large storage footprint of the existing passenger flow data determination process. In embodiments of the invention, a joint detection model outputs, for an image to be recognized, the position information of each target detection frame containing a human head together with information on whether each target detection frame contains a human face. The image to be recognized therefore only needs to be input into the joint detection model once to obtain both kinds of information, which avoids repeatedly extracting feature vectors for the regions of the target detection frames and improves the efficiency of the passenger flow data determination process.

Description

Data processing and model training method, device, equipment and medium
Technical Field
The invention relates to the technical field of data analysis, and in particular to a data processing method and to a training method, device, equipment, and medium for a joint detection model.
Background
At present, the effectiveness of a device at attracting customers in a scene can be evaluated by collecting and analysing passenger flow data, namely a passenger flow value parameter and an attention number parameter, which provides technical support for merchants to make better business decisions. How to count passenger flow data has therefore received growing attention in recent years.
In the prior art, the attention number parameter is determined as follows. The position information of each target detection frame containing a human head in the image to be recognized is obtained through a pre-trained human head detection model; then, for each target detection frame, a pre-trained human face classification model determines, based on the region of that frame in the image to be recognized, whether it contains a human face. Next, identification information is determined for each target detection frame, and the attention number parameter in the currently stored passenger flow data is updated when the target detection frame is determined to contain a human face and its identification information meets a preset update condition.
With this approach, the human head detection model and the human face classification model must be stored separately, which wastes a large amount of storage space. Moreover, when the target detection frames in the image to be recognized are obtained and each of them is checked for a human face, the feature vector of the region of each target detection frame is inevitably extracted repeatedly, so the computation required to determine the attention number parameter is large and the efficiency is low.
Disclosure of Invention
The embodiments of the invention provide a data processing method and a training method, device, equipment, and medium for a joint detection model, so as to solve the problems of large computation cost, low efficiency, and large storage footprint in the existing passenger flow data determination process.
An embodiment of the invention provides a data processing method, which comprises the following steps:
acquiring, through a joint detection model, the position information of each target detection frame containing a human head in an image to be recognized and information on whether each target detection frame contains a human face;
and determining passenger flow data according to the position information of the target detection frames and/or the information on whether the target detection frames contain a human face.
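As a minimal illustrative sketch (not the patent's actual implementation), the two steps above might be wired together as follows; the `joint_detection_model` object, its output format, and the `flow_data` dictionary are assumptions made purely for illustration:

```python
# Illustrative sketch only: the model object and its output format are assumed.
from typing import Dict, List


def process_image(image, joint_detection_model) -> List[Dict]:
    """Run the joint detection model ONCE and return, per target detection
    frame, its position information and whether it contains a human face."""
    detections = joint_detection_model(image)  # single forward pass
    return [
        {
            "box": det["box"],            # (x1, y1, x2, y2) pixel coordinates
            "has_face": det["has_face"],  # identification value or probability
        }
        for det in detections
    ]


def update_passenger_flow(detections: List[Dict], flow_data: Dict) -> Dict:
    """Determine passenger flow data from the detections (simplified: no
    tracking or de-duplication here; see the later embodiments)."""
    flow_data["passenger_flow_value"] += len(detections)
    flow_data["attention_number"] += sum(1 for d in detections if d["has_face"])
    return flow_data
```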
The embodiment of the invention provides a training method of a joint detection model, which comprises the following steps:
acquiring any sample image in a sample set, wherein the sample image is annotated with first position information of each sample human head frame and with a first identification value and a second identification value for each sample human head frame, the first identification value identifying whether the sample human head frame contains a human head and the second identification value identifying whether it contains a human face;
acquiring, through an original joint detection model, second position information of each sample detection frame containing a human head in the sample image and information on whether each sample detection frame contains a human face;
and training the original joint detection model according to the second position information of each sample detection frame and the first position information of the corresponding sample human head frame, the information on whether the sample detection frame contains a human head and the first identification value of the corresponding sample human head frame, and the information on whether the sample detection frame contains a human face and the second identification value of the corresponding sample human head frame.
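The following is a hedged sketch of how the three supervision signals described above could be combined into a single training objective. The choice of smooth L1 loss for box regression, binary cross-entropy for the two classification tasks, and the weighting factors are assumptions made for illustration; the patent does not specify the loss functions:

```python
# Sketch of a possible joint training objective; loss types and weights are
# illustrative assumptions, not taken from the patent.
import torch.nn.functional as F


def joint_detection_loss(pred_boxes, gt_boxes,        # second vs. first position information
                         pred_head_logits, gt_head,   # head info vs. first identification value
                         pred_face_logits, gt_face,   # face info vs. second identification value
                         w_box=1.0, w_head=1.0, w_face=1.0):
    box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)
    head_loss = F.binary_cross_entropy_with_logits(pred_head_logits, gt_head)
    face_loss = F.binary_cross_entropy_with_logits(pred_face_logits, gt_face)
    # One weighted sum supervises localisation, head classification, and face
    # classification together, so a single model learns all three tasks.
    return w_box * box_loss + w_head * head_loss + w_face * face_loss
```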
An embodiment of the present invention provides a data processing apparatus, including:
an acquisition unit, configured to acquire, through the joint detection model, the position information of each target detection frame containing a human head in the image to be recognized and information on whether each target detection frame contains a human face;
and a processing unit, configured to determine the passenger flow data according to the position information of the target detection frames and/or the information on whether the target detection frames contain a human face.
The embodiment of the invention provides a training device for a joint detection model, which comprises:
a first acquisition module, configured to acquire any sample image in a sample set, wherein the sample image is annotated with first position information of each sample human head frame and with a first identification value and a second identification value for each sample human head frame, the first identification value identifying whether the sample human head frame contains a human head and the second identification value identifying whether it contains a human face;
a second acquisition module, configured to acquire, through the original joint detection model, second position information of each sample detection frame containing a human head in the sample image and information on whether each sample detection frame contains a human face;
and a training module, configured to train the original joint detection model according to the second position information of each sample detection frame and the first position information of the corresponding sample human head frame, the information on whether the sample detection frame contains a human head and the first identification value of the corresponding sample human head frame, and the information on whether the sample detection frame contains a human face and the second identification value of the corresponding sample human head frame.
An embodiment of the present invention further provides an electronic device, where the electronic device at least includes a processor and a memory, and the processor is configured to implement the steps of the data processing method or the steps of the training method of the joint detection model when executing the computer program stored in the memory.
Embodiments of the present invention further provide a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the data processing method or the steps of the training method of the joint detection model.
In the embodiments of the invention, the position information of each target detection frame containing a human head in the image to be recognized and the information on whether each target detection frame contains a human face can both be obtained through the joint detection model. The image to be recognized therefore only needs to be input into the joint detection model once to extract both kinds of information, which avoids the computation of repeatedly extracting feature vectors for the regions of the target detection frames and improves the efficiency of the passenger flow data determination process.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a data processing process according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the angle values of a face according to an embodiment of the present invention;
Fig. 3 is a flow chart of a specific data processing procedure according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a training process of a joint detection model according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the key points of a sample human head frame according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a joint detection model according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a training apparatus for a joint detection model according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Example 1: Fig. 1 is a schematic diagram of a data processing process provided in an embodiment of the present invention, which includes:
S101: acquiring, through a joint detection model, the position information of each target detection frame containing a human head in the image to be recognized and information on whether each target detection frame contains a human face.
The embodiments of the invention are applied to an electronic device, which may be a smart device such as an intelligent robot, a smart screen, or an intelligent monitoring system, or may be a server or the like.
After the electronic device acquires the image to be recognized, it analyses the image based on the data processing method provided by the embodiment of the invention to determine the passenger flow data. The image to be recognized may be captured by the electronic device itself, or may be sent by another image acquisition device; this is not limited here.
To improve the efficiency of determining passenger flow data, a joint detection model is trained in advance in the embodiment of the invention. After the electronic device acquires the image to be recognized, the image is processed by the pre-trained joint detection model to obtain the position information of each target detection frame containing a human head and the information on whether each target detection frame contains a human face.
The position information of a target detection frame consists of the coordinates of the region in which a human head is located in the image to be recognized (for example, pixel coordinates), such as the coordinate values of the top-left pixel and the bottom-right pixel of the target detection frame in the image to be recognized. The region of each target detection frame in the image to be recognized can subsequently be determined from its position information.
It should be noted that the information on whether a target detection frame contains a human face may take the form of an identification value, for example '1' when the target detection frame contains a human face and '0' when it does not, or the form of a probability that the target detection frame contains a human face.
In a possible implementation, if the joint detection model outputs the probability that each target detection frame contains a human face, a decision threshold is preset for deciding whether a target detection frame contains a human face. After the probability is obtained for each target detection frame based on the above embodiment, it is compared with the decision threshold: if the probability is greater than the decision threshold, the target detection frame is determined to contain a human face; otherwise, it is determined not to contain a human face. For example, if the decision threshold is 0.8 and the probability for a certain target detection frame is 0.9, then 0.9 is greater than 0.8 and the target detection frame is determined to contain a human face.
The decision threshold may be set empirically, or set to different values in different scenarios. For example, if high accuracy is required for the information on whether a target detection frame contains a human face, the decision threshold may be set larger; if target detection frames containing a human face should be recognised as far as possible, it may be set smaller. It can be set flexibly according to actual requirements and is not specifically limited here.
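A minimal sketch of the decision-threshold rule described above; the default value 0.8 simply mirrors the example in the text:

```python
def contains_face(face_probability: float, decision_threshold: float = 0.8) -> bool:
    """Return True when the face probability output by the joint detection
    model exceeds the preset decision threshold (e.g. 0.9 > 0.8 above)."""
    return face_probability > decision_threshold
```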
S102: and determining passenger flow data according to the position information of the target detection frame and/or the information whether the target detection frame contains the human face.
After the position information of each target detection frame and the information on whether each target detection frame contains a human face are acquired based on the above embodiment, subsequent processing is performed to determine the passenger flow data, that is, the passenger flow value parameter and/or the attention number parameter. The attention number parameter counts the number of pedestrians who look at the smart device. For example, in an office hall, the counted number of pedestrians who look at an intelligent robot can be analysed to estimate how likely pedestrians in the current scene are to interact with the robot; in a shopping mall, the number of people who look at a smart screen can be counted and analysed to evaluate how well advertisements played on the screen attract customers. The attention number may be counted periodically, for example per day, week, or month.
In one possible implementation, a target detection frame containing a human head in the image to be recognized generally belongs to a pedestrian who has entered the scene, and such pedestrians need to be counted in the passenger flow value parameter of the passenger flow data. Therefore, in the embodiment of the invention, the passenger flow data, that is, the passenger flow value parameter, may be determined directly from the obtained position information of the target detection frames.
In another possible implementation, when a target detection frame contains a human face, the pedestrian to which it belongs may be looking at the smart device, and the number of times such pedestrians look at the smart device needs to be counted in the attention number parameter of the passenger flow data. Therefore, the passenger flow data, that is, the attention number parameter, may be determined from the information on whether the target detection frames contain a human face, for example from the number of target detection frames that contain a human face.
In addition, the attention number parameter counts the number of pedestrians who look at the smart device, but in a practical scene the same person may stare at the device continuously or look at it several times. If the attention number parameter in the currently stored passenger flow data were updated directly for every target detection frame containing a human face in the image to be recognized, the same person would be counted multiple times and the accuracy of the counted attention number parameter would be low. Therefore, to improve the accuracy of the determined passenger flow data, in the embodiment of the invention the passenger flow data, that is, the attention number parameter, may be determined by processing both the position information of the target detection frames and the information on whether they contain a human face.
Example 2: in order to accurately determine the passenger flow data, on the basis of the above embodiment, in an embodiment of the present invention, determining the passenger flow data includes:
determining identification information of a target detection frame; and if the identification information meets a preset first update condition and the target detection frame contains a human face, updating the attention number parameter in the currently stored passenger flow data.
To avoid counting the same person several times in the attention number parameter, in the embodiment of the invention the identification information of a target detection frame is determined after the frame is obtained. For this purpose a tracking queue is preset, which stores the tracked human head frames and the identification information corresponding to each tracked human head frame; a tracked human head frame is the head frame of a pedestrian captured in the surrounding environment. After the position information of each target detection frame containing a human head in the image to be recognized is acquired, the region of each target detection frame in the image to be recognized is extracted according to its position information. The electronic device then determines the identification information of the target detection frame based on that region and the tracked human head frames stored in the current tracking queue, and can subsequently determine the passenger flow data based on the identification information of the target detection frame. The initial tracking queue may be empty; during subsequent data processing it is updated in real time according to the target detection frames to which identification information has been allocated.
In one possible embodiment, the process of determining the identification information of a target detection frame includes: determining the similarity between the target detection frame and each tracked human head frame in the current tracking queue; determining, through the Hungarian algorithm and the similarities, the identification information of the tracked human head frame that is most similar to the target detection frame; if the similarity between the target detection frame and any tracked human head frame corresponding to that identification information is greater than a set threshold, taking that identification information as the identification information of the target detection frame; otherwise, allocating new identification information to the target detection frame and taking the newly allocated identification information as its identification information.
When determining the similarity between the target detection frame and each tracked human head frame in the current tracking queue, the similarity may be computed as the region overlap ratio between the two frames, that is, the overlap ratio between a tracked human head frame and the region of the target detection frame in the image to be recognized, or as the spatial distance between the region of the target detection frame in the image to be recognized and the tracked human head frame, such as the Euclidean distance or the Chebyshev distance. The specific way of determining the similarity is not limited here.
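The following sketch shows one possible way to implement the matching described above, using the region overlap ratio (IoU) as the similarity and `scipy.optimize.linear_sum_assignment` as the Hungarian algorithm. The tracking-queue data structure, the similarity threshold, and the ID allocation scheme are illustrative assumptions rather than details taken from the patent:

```python
# Illustrative sketch: IoU similarity plus Hungarian matching.
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

_next_id = itertools.count(1)  # hypothetical ID generator


def iou(box_a, box_b):
    """Region overlap ratio between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def assign_ids(detections, tracking_queue, sim_threshold=0.3):
    """Assign identification information to each target detection frame by
    matching it against the tracked human head frames in the tracking queue."""
    if not detections:
        return []
    if not tracking_queue:
        return [{"id": next(_next_id), "box": d["box"]} for d in detections]
    sim = np.array([[iou(d["box"], t["box"]) for t in tracking_queue]
                    for d in detections])
    # Hungarian algorithm maximises total similarity (minimises its negative).
    det_idx, trk_idx = linear_sum_assignment(-sim)
    matched = dict(zip(det_idx, trk_idx))
    results = []
    for i, det in enumerate(detections):
        j = matched.get(i)
        if j is not None and sim[i, j] > sim_threshold:
            results.append({"id": tracking_queue[j]["id"], "box": det["box"]})
        else:
            # No sufficiently similar tracked head frame: allocate a new ID.
            results.append({"id": next(_next_id), "box": det["box"]})
    return results
```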
It should be noted that the identification information of a target detection frame is used to uniquely identify the object to which the target detection frame belongs. The identification information may be a number, letters, special symbols, a character string, or any other form, as long as it uniquely identifies the object to which the target detection frame belongs.
In the embodiment of the invention, to improve the accuracy of the determined attention number parameter, a first update condition is also preset. The first update condition may be that the obtained identification information of the target detection frame does not match any statistical identifier stored in the current statistical queue, where matching means that a statistical identifier identical to the identification information of the target detection frame is already stored in the current statistical queue.
Specifically, after the identification information of the target detection frame is determined based on the method in the above embodiment, it is matched against the statistical identifiers stored in the current statistical queue. If the identification information does not match any stored statistical identifier, it meets the preset first update condition; otherwise, it does not.
The statistical queue contains the identification information of pedestrians in the surrounding environment who have already been counted in the attention number parameter. The initial statistical queue may be empty; during the subsequent process of determining the attention number parameter it is updated in real time according to the target detection frames to which identification information has been allocated. In a specific implementation, when the target detection frame is determined to contain a human face and its identification information meets the preset first update condition, the attention number parameter in the currently stored passenger flow data is updated.
Further, after the attention number parameter in the currently stored passenger flow data has been updated, the identification information of the target detection frame is added to the current statistical queue, so that the attention number parameter will not be updated again for a target detection frame with the same identification information.
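A minimal sketch of the first update condition and the resulting update of the attention number parameter; the names `statistical_queue` and `flow_data` are hypothetical:

```python
def update_attention_number(track_id, has_face, statistical_queue: set, flow_data: dict):
    """Update the attention number parameter only when the target detection
    frame contains a human face and its identification information does not
    match any statistical identifier in the current statistical queue."""
    first_update_condition = track_id not in statistical_queue
    if has_face and first_update_condition:
        flow_data["attention_number"] += 1
        # Store the identification information so the same person is not
        # counted again later.
        statistical_queue.add(track_id)
```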
In another possible embodiment, the method further comprises: obtaining, through the joint detection model, a first angle identification value corresponding to each target detection frame, the first angle identification value identifying the angle value of the human face in a target detection frame that contains a human face, or identifying that the target detection frame does not contain a human face;
determining that the identification information meets the preset first update condition then includes: if the identification information does not match any statistical identifier stored in the current statistical queue, and the first angle identification value corresponding to the target detection frame corresponding to the identification information meets a preset second update condition, determining that the identification information meets the first update condition.
In practice, even though a target detection frame in the acquired image to be recognized contains a human face, the object to which the face belongs is not necessarily looking at the smart device; for example, the object in target detection frame A may be looking at the sky, and the object in target detection frame B may have turned its head to look elsewhere. A face that is looking at the smart device generally lies within a certain angle range. Therefore, to further improve the accuracy of the counted attention number parameter, in the embodiment of the invention the preset first update condition may further require that the identification information of the target detection frame does not match any statistical identifier stored in the current statistical queue and that the angle identification value (denoted as the first angle identification value) of the target detection frame corresponding to the identification information meets a preset second update condition, namely that the angle value of any target detection frame corresponding to the identification information is smaller than a preset angle threshold.
The first angle identification value identifies the angle value of the human face in a target detection frame that contains a human face, or identifies that the target detection frame does not contain a human face. For example, a first angle identification value of 30 degrees for target detection frame A, which contains a human face, identifies that the pitch angle of the face in frame A is 30 degrees, while the first angle identification value of target detection frame B, which does not contain a human face, is null.
It should be noted that the angle value may be a pitch angle value of the human face, and/or a yaw angle value. Fig. 2 is a schematic diagram of each angle value of the face according to the embodiment of the present invention, where the angle value corresponding to Yaw is a Yaw angle value, and the angle value corresponding to Pitch is a Pitch angle value.
It should be noted that the first angle identification value used to identify that a target detection frame does not contain a human face may be any angle value greater than 180 degrees, or may take other forms such as characters; any form can be used in the embodiments of the present invention as long as it can be distinguished from the angle value of a face in a target detection frame that does contain a human face.
In the embodiment of the invention, the image to be recognized is input into the joint detection model, and through the joint detection model, not only can the position information of each target detection frame containing the human head in the image to be recognized and the information whether each target detection frame contains the human face be obtained, but also the first angle identification value corresponding to each target detection frame can be obtained.
For each acquired target detection frame, its identification information is determined from its position information based on the method in the above embodiment. It is then checked whether the identification information does not match any statistical identifier stored in the current statistical queue, and whether the first angle identification value corresponding to the target detection frame corresponding to the identification information meets the preset second update condition, that is, whether the first angle identification value of any target detection frame corresponding to the identification information is smaller than the preset angle threshold. If the identification information does not match any stored statistical identifier and the first angle identification value meets the preset second update condition, the identification information is determined to meet the first update condition.
The angle value identified by the first angle identification value of a target detection frame containing a human face may include a pitch angle value and a yaw angle value. The angle thresholds for the pitch angle and the yaw angle may be the same or different, and can be set flexibly according to requirements without specific limitation. However, regardless of whether the two thresholds are the same, both the pitch angle value and the yaw angle value of the face in the target detection frame corresponding to the identification information need to be smaller than their corresponding preset angle thresholds before the first angle identification value is determined to meet the preset second update condition.
For example, suppose the preset angle thresholds for the pitch angle and the yaw angle are 30 and 35 respectively, and the pitch angle value of the face in a target detection frame corresponding to identification information a is 6 and its yaw angle value is 12. Since the pitch angle value 6 is smaller than its threshold 30 and the yaw angle value 12 is smaller than its threshold 35, the first angle identification value corresponding to the target detection frame corresponding to identification information a is determined to meet the preset second update condition.
In practice, the acquired image to be recognized may contain the face of a pedestrian who has merely turned their head and is not actually looking at the smart device; nevertheless, after the position information of the target detection frame containing that head, the information on whether it contains a face, and the corresponding first angle identification value are acquired and processed as in the above embodiment, the first update condition may still be satisfied, making the attention number parameter in the determined passenger flow data inaccurate. Therefore, to improve the accuracy of the attention number parameter in the determined passenger flow data, in the embodiment of the invention the preset second update condition may be that a target detection frame corresponding to a given piece of identification information exists in a consecutively acquired, set number of images to be recognized, and that the first angle identification values of those target detection frames are all smaller than the preset angle threshold. Specifically, determining that the first angle identification value corresponding to the target detection frame corresponding to the identification information meets the preset second update condition includes:
if target detection frames corresponding to the identification information exist in a consecutively acquired, set number of images to be recognized, and the first angle identification values of those target detection frames are all smaller than the preset angle threshold, determining that the first angle identification value corresponding to the target detection frame corresponding to the identification information meets the second update condition.
In a specific implementation, after the identification information of a target detection frame is determined, it is checked whether target detection frames corresponding to that identification information exist in the consecutively acquired, set number of images to be recognized, and whether their first angle identification values are all smaller than the preset angle threshold. If so, the object to which the target detection frame belongs is most likely looking at the smart device, and the first angle identification value corresponding to the target detection frame corresponding to the identification information is determined to meet the second update condition; otherwise, it is determined not to meet the second update condition.
For example, suppose the preset angle thresholds for the pitch angle and the yaw angle are both 30 and the set number is 3. For a certain target detection frame the identification information is determined to be a, and the 3 consecutive images to be recognized each contain a target detection frame with identification information a: in image 1 the pitch and yaw angle values are 17 and 18, in image 2 they are 24 and 18, and in image 3 they are 27 and 28. Since the pitch and yaw angle values of the target detection frames with identification information a in the 3 consecutive images 1, 2, and 3 are all smaller than the preset angle thresholds, the first angle identification value corresponding to the target detection frame corresponding to identification information a is determined to meet the second update condition.
Different values can be chosen for the set number in different scenarios. To count pedestrians looking at the smart device as completely as possible, the set number may be set smaller; to improve the accuracy of the determined attention number parameter, it may be set larger. In a specific implementation it can be set flexibly according to actual requirements and is not specifically limited here.
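A sketch of the second update condition under stated assumptions: `angle_history` is a hypothetical per-identification-information buffer holding the (pitch, yaw) values from the most recent images to be recognized, and the threshold values mirror the examples above:

```python
from collections import deque


def second_update_condition(angle_history: deque, set_number: int = 3,
                            pitch_threshold: float = 30.0,
                            yaw_threshold: float = 35.0) -> bool:
    """angle_history holds, per image to be recognized and in acquisition
    order, the (pitch, yaw) of the face for this identification information;
    None means no matching target detection frame or no face in that image."""
    if len(angle_history) < set_number:
        return False
    recent = list(angle_history)[-set_number:]
    # The condition holds only if a frame with this ID exists in every one of
    # the last `set_number` images and every pitch/yaw value is below its
    # preset angle threshold.
    return all(a is not None and a[0] < pitch_threshold and a[1] < yaw_threshold
               for a in recent)
```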
Example 3: in order to accurately determine the passenger flow value parameter in the passenger flow data, on the basis of the above embodiments, in an embodiment of the present invention, determining the passenger flow data includes:
determining identification information of a target detection frame; and if the identification information of the target detection frame meets a preset third updating condition, updating the passenger flow value parameter in the currently stored passenger flow data.
Generally, the images to be recognized are acquired by the electronic device at a preset time interval, such as 50 ms or 100 ms; the interval is kept small so that images of every person entering the shooting range can be captured in real time. In practice, however, the time a person spends from entering the shooting range to leaving it is usually longer than the preset time interval, so several acquired images to be recognized contain target detection frames of the same person. If the passenger flow value parameter in the passenger flow data were determined directly from the number of target detection frames obtained from each image to be recognized, the same person would be counted many times and the accuracy of the passenger flow value parameter would be low.
Therefore, in the embodiment of the invention, the identification information of a target detection frame is determined before the passenger flow value parameter in the passenger flow data is determined. For each target detection frame, whether a preset third update condition is met is then determined based on its identification information, so as to decide whether to update the passenger flow value parameter in the currently stored passenger flow data. The method for determining the identification information of a target detection frame is the same as in the above embodiments and is not repeated here.
In one possible implementation, the preset third update condition may be that the identification information of the target detection frame is not identical to any identification information stored in the current tracking queue. If the identification information of the target detection frame is identical to identification information already stored in the current tracking queue, the object to which the frame belongs has already been counted in the passenger flow value parameter, so the identification information does not meet the preset third update condition and the passenger flow value parameter in the currently stored passenger flow data is not updated. If the identification information is different from all identification information stored in the current tracking queue, the object has not yet been counted, so the identification information meets the preset third update condition and the passenger flow value parameter in the currently stored passenger flow data is updated.
In another possible implementation, the image to be recognized may be acquired at the entrance of an application scene, such as an office hall, a shopping mall, or a bus. Although such an image contains a target detection frame for every object entering the scene, it is also very likely to contain target detection frames of objects merely passing by the entrance. For example, when counting the passenger flow data of a bus, the image to be recognized is usually acquired at the bus door, and it contains not only target detection frames of people boarding the bus but also target detection frames of people walking past the door.
Therefore, the preset third update condition may also be that the number of tracked human head frames corresponding to the identification information is greater than a set number threshold and that the sum of the distances determined from the image information of every two tracked human head frames corresponding to the identification information that are adjacent in acquisition time is greater than a set distance threshold. Specifically, after the identification information of the target detection frame is obtained based on the above embodiment, it is checked whether tracked human head frames corresponding to that identification information are stored in the current tracking queue. If so, those tracked human head frames are obtained, and the moving distance between every two tracked human head frames adjacent in acquisition time is determined from their image information. The sum of these moving distances is then computed, and it is judged whether the number of tracked human head frames corresponding to the identification information is greater than the number threshold and whether the sum of the distances is greater than the distance threshold.
Further, if the number of tracked human head frames corresponding to the identification information is greater than the number threshold and the sum of the distances is greater than the distance threshold, the identification information meets the preset third update condition and the passenger flow value parameter in the currently stored passenger flow data is updated. If the number of tracked human head frames is not greater than the number threshold, or the sum of the distances is not greater than the distance threshold, the identification information does not meet the preset third update condition and the passenger flow value parameter is not updated.
The moving distance between two tracked human head frames can be determined from the position information, in the image to be recognized, of pixel points at set positions of the two frames, for example the distance between the coordinate values of their top-left pixels, the distance between the coordinate values of their bottom-right pixels, or the distance between the coordinate values of the midpoints of their diagonals.
The number threshold is generally not greater than the quotient of the maximum shooting distance of the preset shooting range and the product of the average walking speed and the acquisition time interval, i.e. roughly the number of frames in which a person walking through the shooting range can appear. To improve the accuracy of the passenger flow value parameter in the determined passenger flow data, the number threshold should not be too small; to avoid missing objects that walk quickly, it should not be too large. In a specific implementation it can be set flexibly according to actual requirements and is not specifically limited here.
The distance threshold is generally not greater than the maximum shooting distance of the preset shooting range. To improve the accuracy of the passenger flow value parameter in the determined passenger flow data, the distance threshold should not be too small; to avoid missing objects that walk quickly, it should not be too large. In a specific implementation it can be set flexibly according to actual requirements and is not specifically limited here.
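A sketch of the third update condition described in this example; the tracked-frame representation (a per-ID list of frames, each carrying one corner coordinate) and the helper names are assumptions made for illustration:

```python
import math


def third_update_condition(tracked_head_frames, number_threshold: int,
                           distance_threshold: float) -> bool:
    """tracked_head_frames: the tracked human head frames stored in the
    tracking queue for one identification information, ordered by acquisition
    time; each entry carries the pixel coordinates of, e.g., its top-left
    corner under the key "corner"."""
    if len(tracked_head_frames) <= number_threshold:
        return False
    # Sum the moving distance between every two tracked head frames that are
    # adjacent in acquisition time.
    total = 0.0
    for prev, cur in zip(tracked_head_frames, tracked_head_frames[1:]):
        total += math.dist(prev["corner"], cur["corner"])
    return total > distance_threshold


def update_passenger_flow_value(tracked_head_frames, number_threshold,
                                distance_threshold, flow_data: dict):
    if third_update_condition(tracked_head_frames, number_threshold, distance_threshold):
        flow_data["passenger_flow_value"] += 1
```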
Example 4: on the basis of the above embodiments, in an embodiment of the present invention, obtaining, through the joint detection model, the position information of each target detection frame containing a human head in the image to be recognized and the information on whether each target detection frame contains a human face includes:
extracting the feature vector of the image to be recognized through the feature extraction network layer of the joint detection model; inputting the feature vector into the position detection layer, the human head classification layer, and the human face classification layer of the joint detection model respectively, to obtain the first position vector, the first human head classification vector, and the first human face classification vector corresponding to the detection frames identified in the image to be recognized; and obtaining, through the output layer of the joint detection model and based on the first position vector, the first human head classification vector, and the first human face classification vector, the position information of each target detection frame containing a human head in the image to be recognized and the information on whether each target detection frame contains a human face.
In the embodiment of the invention, the network structure of the joint detection model comprises a feature extraction network layer, a position detection layer, a human head classification layer, a human face classification layer, and an output layer. Fig. 6 is a schematic diagram of the network structure of a joint detection model according to an embodiment of the present invention, in which the position detection layer, the human head classification layer, and the human face classification layer are each connected to both the feature extraction network layer and the output layer. Specifically, after the image to be recognized is acquired, it is input into the pre-trained joint detection model, and the feature vector of the input image is obtained through the feature extraction network layer. To keep feature extraction efficient, the feature extraction network layer is generally a network structure with a small computation cost, few parameters, and quantization-friendly design, such as FBNet or a quantization-friendly (QF) MobileNet.
In the embodiment of the invention, the feature extraction network layer of the joint detection model is connected to several network layers at the same time, namely the position detection layer, the human head classification layer, and the human face classification layer. In a specific implementation, after the feature extraction network layer obtains the feature vector of the image to be recognized, the feature vector is input into the position detection layer, the human head classification layer, and the human face classification layer respectively. The position detection layer obtains, from the feature vector, the first position vector corresponding to the detection frames identified in the image to be recognized; the human head classification layer obtains the first human head classification vector corresponding to those detection frames; and the human face classification layer obtains the first human face classification vector corresponding to those detection frames. The output layer of the joint detection model then processes the first position vector, the first human head classification vector, and the first human face classification vector to obtain the position information of each target detection frame containing a human head in the image to be recognized and the information on whether each target detection frame contains a human face.
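The following is an architectural sketch of the structure just described, written in PyTorch under stated assumptions: the backbone, the per-anchor output widths, and the use of 1x1 convolutions for the three parallel layers are illustrative choices, and Fig. 6 remains the authoritative description of the model:

```python
# Illustrative PyTorch sketch of the joint detection model structure; layer
# widths, the backbone, and the per-anchor output sizes are assumptions.
import torch
import torch.nn as nn


class JointDetectionModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_channels: int, num_anchors: int):
        super().__init__()
        self.backbone = backbone                                            # feature extraction network layer
        self.position_head = nn.Conv2d(feat_channels, num_anchors * 4, 1)   # position detection layer
        self.head_cls_head = nn.Conv2d(feat_channels, num_anchors * 2, 1)   # human head classification layer
        self.face_cls_head = nn.Conv2d(feat_channels, num_anchors * 2, 1)   # human face classification layer

    def forward(self, image: torch.Tensor):
        # The same feature vector feeds all three layers, so only one feature
        # extraction pass is needed per image to be recognized.
        features = self.backbone(image)
        first_position_vector = self.position_head(features).flatten(1)
        first_head_cls_vector = self.head_cls_head(features).flatten(1)
        first_face_cls_vector = self.face_cls_head(features).flatten(1)
        # An output layer (e.g. the index-based decoding sketched later) turns
        # these vectors into per-frame position and face information.
        return first_position_vector, first_head_cls_vector, first_face_cls_vector
```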
In a possible implementation manner, acquiring, by an output layer of the joint detection model, based on the first position vector, the first head classification vector, and the first face classification vector, position information of each target detection frame including a head in the image to be recognized and information of whether each target detection frame includes a face, includes:
sequentially determining, through the output layer of the joint detection model and based on the first position vector, the position information corresponding to each detection frame and a first index value, wherein the first index value is used for identifying the position, in the first position vector, of the position information corresponding to the detection frame; sequentially determining, based on the first head classification vector, the information of whether each detection frame contains a human head and a second index value, wherein the second index value is used for identifying the position, in the first head classification vector, of the information of whether the detection frame contains a human head; sequentially determining, based on the first face classification vector, the information of whether each detection frame contains a human face and a third index value, wherein the third index value is used for identifying the position, in the first face classification vector, of the information of whether the detection frame contains a human face; determining, from the second index values, a target second index value corresponding to a target detection frame containing a human head, determining, from the first index values, a target first index value that is the same as the target second index value, and determining, from the third index values, a target third index value that is the same as the target second index value; and determining the position information of the detection frame corresponding to the target first index value and the information of whether the detection frame corresponding to the target third index value contains a human face as the position information of the target detection frame and the information of whether the target detection frame contains a human face.
In an embodiment of the present invention, the first position vector includes position information of each detection frame identified in the image to be identified, and the first number of elements in the first position vector required for determining the position information of each detection frame is preconfigured, such as 4. Therefore, through the output layer of the joint detection model, the input first position vector can be sequentially divided into a plurality of sub-position vectors according to a preset first number, and an index value (denoted as a first index value) corresponding to each sub-position vector is determined, where a first number of elements included in each sub-position vector are position information of one detection frame, and the first index value corresponding to any sub-position vector is used to identify the position of the sub-position vector in the first position vector, that is, to identify the position of the position information corresponding to the detection frame in the first position vector.
The first head classification vector includes information whether each detection frame identified in the image to be identified contains a head, and a second number of elements in the first head classification vector required for determining whether each detection frame contains the head information is also pre-configured. Therefore, through the output layer of the joint detection model, the input first human head classification vector can be sequentially divided into a plurality of sub human head classification vectors according to a preset second number, and an index value (denoted as a second index value) corresponding to each sub human head classification vector is determined, wherein the second number of elements included in each sub human head classification vector is information whether a detection frame includes a human head, and the second index value corresponding to any sub human head classification vector is used for identifying the position of the sub human head classification vector in the first human head classification vector, that is, for identifying the position of the information whether the detection frame includes a human head in the first human head classification vector.
Similarly, the first face classification vector includes information whether each detection frame identified in the image to be identified includes a face, and a third number of elements in the first face classification vector required for determining whether each detection frame includes the face information is also preset. Therefore, through the output layer of the joint detection model, the input first face classification vector may be sequentially divided into a plurality of sub-face classification vectors according to a preset third number, and an index value (for convenience of description, denoted as a third index value) corresponding to each sub-face classification vector is determined, where an element of the third number included in each sub-face classification vector is information whether a detection frame includes a face, and the third index value corresponding to any sub-face classification vector is used to identify a position of the sub-face classification vector in the first face classification vector, that is, to identify a position of the information whether the detection frame includes a face in the first face classification vector.
For each of the obtained second index values, if it is determined that the detection frame corresponding to the second index value contains a human head, determining that the second index value is a target second index value, determining the detection frame corresponding to the target second index value as a target detection frame, then searching for a first index value (denoted as a target first index value) identical to the target second index value from each of the obtained first index values, searching for a third index value (denoted as a target third index value) identical to the target second index value from each of the obtained third index values, and finally determining position information of the detection frame corresponding to the target first index value, information of whether the detection frame corresponding to the target third index value contains a human face, as position information of the target detection frame, and information of whether the target detection frame contains a human face.
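The index matching described above can be summarised by the following Python sketch, in which each flat vector is split into fixed-size chunks and only the chunks whose index corresponds to a detection frame classified as containing a head are kept. The chunk sizes and the two-element score layout are assumptions used only for illustration.

```python
def decode_output_layer(position_vec, head_vec, face_vec,
                        pos_size=4, head_size=2, face_size=2):
    """Sketch of the output-layer logic: split each flat vector into
    per-detection-frame sub-vectors, index them, and keep the entries whose
    index matches a detection frame classified as containing a head."""
    def split(vec, size):
        # index value -> sub-vector for one detection frame
        return {i: vec[i * size:(i + 1) * size] for i in range(len(vec) // size)}

    positions = split(position_vec, pos_size)   # first index values
    head_flags = split(head_vec, head_size)     # second index values
    face_flags = split(face_vec, face_size)     # third index values

    results = []
    for idx, head_score in head_flags.items():
        # target second index value: frame classified as containing a head
        if head_score[1] > head_score[0]:
            results.append({
                "index": idx,
                "box": positions[idx],                                      # target first index value
                "contains_face": face_flags[idx][1] > face_flags[idx][0],   # target third index value
            })
    return results
```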
In another possible implementation manner, in order to obtain the first angle identification value corresponding to each target detection frame containing a human head in the image to be recognized, the joint detection model further includes an angle detection layer, and the angle detection layer is respectively connected to the feature extraction network layer and the output layer in the joint detection model. After the feature vector of the image to be recognized is acquired through the feature extraction network layer in the joint detection model, the feature vector is also input to the angle detection layer of the joint detection model, so as to acquire the first angle identification vector corresponding to the detection frames recognized in the image to be recognized. That is, the step of respectively inputting the feature vector into the position detection layer, the head classification layer and the face classification layer of the joint detection model to obtain the first position vector, the first head classification vector and the first face classification vector corresponding to the detection frames recognized in the image to be recognized further comprises: inputting the feature vector into the angle detection layer of the joint detection model to obtain the first angle identification vector corresponding to the detection frames;
and the step of acquiring, through the output layer of the joint detection model and based on the first position vector, the first head classification vector and the first face classification vector, the position information of each target detection frame containing a human head in the image to be recognized and the information of whether each target detection frame contains a human face comprises: acquiring, through the output layer of the joint detection model and based on the first position vector, the first head classification vector, the first face classification vector and the first angle identification vector, the position information of each target detection frame containing a human head in the image to be recognized, the information of whether each target detection frame contains a human face, and the first angle identification value corresponding to each target detection frame.
In the embodiment of the present invention, after the angle detection layer of the joint detection model obtains the first angle identification vector, the first angle identification vector is also input to the output layer of the joint detection model, so as to obtain the first angle identification value corresponding to each target detection frame containing the human head in the image to be recognized.
Specifically, the first angle identification vector includes first angle identification values corresponding to the detection frames identified in the image to be identified, and the fourth number of elements in the first angle identification vector required for determining the first angle identification values corresponding to the detection frames is also preconfigured. Therefore, through the output layer of the joint detection model, according to a preset fourth number, the input first angle identification vector may be sequentially divided into a plurality of sub-angle identification vectors, and an index value (denoted as a fourth index value) corresponding to each sub-angle identification vector is determined, where an element of the fourth number included in each sub-angle identification vector is the first angle identification value corresponding to one detection frame, and the fourth index value corresponding to any sub-angle identification vector is used to identify a position of the sub-angle identification vector in the first angle identification vector, that is, to identify a position of the first angle identification value corresponding to the detection frame in the first angle identification vector.
Further, after the target second index value corresponding to the target detection frame including the human head is determined from each second index value based on the above embodiment, for each target second index value, a fourth index value (denoted as a target fourth index value) identical to the target second index value is determined from each acquired fourth index value, and the position information of the detection frame corresponding to the target first index value, the information whether the detection frame corresponding to the target third index value includes the human face, and the first angle identification value corresponding to the detection frame corresponding to the target fourth index value are determined as the position information of the target detection frame, the information whether the target detection frame includes the human face, and the first angle identification value corresponding to the target detection frame.
Fig. 3 is a schematic diagram of a specific data processing flow provided in an embodiment of the present invention, where the flow includes:
S301: respectively acquiring, through a joint detection model, the position information of each target detection frame containing a human head in the image to be recognized, the information of whether each target detection frame contains a human face, and a first angle identification value corresponding to each target detection frame.
A plurality of target detection frames may be obtained from the image to be recognized in the above step, and the following processing is performed on each obtained target detection frame:
s302: and determining the identification information of the target detection frame.
If the passenger flow data contains the attention frequency parameter, the method continues with S303 to S307; if the passenger flow data contains the passenger flow value parameter, the method continues with S308 to S310.
S303: and judging whether the target detection frame contains a human face, if so, executing S304, otherwise, executing S307.
S304: And judging whether the identification information matches any statistical identification stored in the current statistical queue; if not, executing S305; otherwise, executing S307.
S305: And judging whether target detection frames corresponding to the identification information exist in a set number of consecutive images to be recognized, and whether the first angle identification values corresponding to the target detection frames corresponding to the identification information in the set number of consecutive images to be recognized are all smaller than a preset angle threshold; if so, executing S306; otherwise, executing S307.
S306: And updating the attention frequency parameter in the currently stored passenger flow data.
S307: The attention frequency parameter in the currently stored passenger flow data is not updated.
S308: and judging whether the identification information of the target detection frame is the same as the identification information of any one tracked human head frame in the current tracking queue, if so, executing S309, otherwise, executing S310.
For each piece of identification information stored in the tracking queue, the motion trail of the object to which the tracked head frame of that identification information belongs is determined according to each image to be recognized containing the tracked head frame of that identification information, thereby achieving short-time tracking.
S309: and updating the passenger flow value parameter in the currently stored passenger flow data.
S310: the passenger flow value parameter in the currently stored passenger flow data is not updated.
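Purely as an illustration of the branching in steps S302 to S310, a condensed Python sketch follows. The container names (flow_data, stats_queue, tracking_queue, recent_frames), the angle threshold and the window length are hypothetical and only mirror the conditions described above.

```python
def process_target_frame(frame, flow_data, stats_queue, tracking_queue,
                         recent_frames, angle_threshold=30.0, window=5):
    """Sketch of steps S302-S310 for one target detection frame.
    `frame` is assumed to carry id_info, contains_face and a per-image angle history
    kept in `recent_frames`; `stats_queue` is assumed to be a set of statistical ids."""
    id_info = frame["id_info"]                                     # S302: identification information

    if "attention_frequency" in flow_data:                         # S303-S307 branch
        is_new = id_info not in stats_queue                        # S304: not yet counted
        steady_gaze = (
            len(recent_frames.get(id_info, [])) >= window and
            all(a < angle_threshold for a in recent_frames[id_info][-window:])
        )                                                          # S305: consecutive small angles
        if frame["contains_face"] and is_new and steady_gaze:      # S303 / S304 / S305 all satisfied
            flow_data["attention_frequency"] += 1                  # S306: update attention parameter
            stats_queue.add(id_info)
        # otherwise the attention frequency parameter is left unchanged (S307)

    if "passenger_flow_value" in flow_data:                        # S308-S310 branch
        if id_info in tracking_queue:                              # S308: matches a tracked head frame
            flow_data["passenger_flow_value"] += 1                 # S309: update passenger flow value
        # otherwise the passenger flow value parameter is left unchanged (S310)
```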
Example 5: in order to improve the efficiency of data processing, the embodiment of the invention also provides a training method of the joint detection model. As shown in fig. 4, the method includes:
S401: acquiring any sample image in a sample set, wherein the sample image is marked with first position information of each sample human head frame, and a first identification value and a second identification value corresponding to each sample human head frame, the first identification value being used for identifying whether the sample human head frame contains a human head, and the second identification value being used for identifying whether the sample human head frame contains a human face.
S402: and acquiring second position information of each sample detection frame containing the human head in the sample image and information whether each sample detection frame contains the human face or not through the original joint detection model.
S403: and training the original joint detection model according to the second position information of the sample detection frame and the first position information of the corresponding sample head frame, whether the sample detection frame contains the information of the head and a first identification value corresponding to the corresponding sample head frame, and whether the sample detection frame contains the information of the face and a second identification value corresponding to the corresponding sample head frame.
The training method of the joint detection model provided by the embodiment of the invention is applied to electronic equipment, and the electronic equipment can be a server and the like. The device for training the original joint detection model may be the same as or different from the electronic device for data processing in the above embodiments.
In order to improve the efficiency of data processing, the original joint detection model may be trained according to any sample image in a pre-acquired sample set. The sample image is marked with position information of each sample head frame (denoted as first position information), an identification value (denoted as a first identification value) identifying whether the sample head frame contains a human head, and an identification value (denoted as a second identification value) identifying whether the sample head frame contains a human face.
The first identification value and the second identification value may both be expressed by numbers, for example, the second identification value for a frame containing a face is "1" and the second identification value for a frame not containing a face is "0"; the first identification value and the second identification value may also be expressed in other forms such as character strings. However, in order to distinguish the first identification value from the second identification value, their specific content or representation form should be different, for example, the first identification value for a frame containing a human head is "a", while the second identification value for a frame containing a human face is "1" and for a frame not containing a human face is "0". In specific implementation, the values can be set flexibly according to actual requirements and are not specifically limited herein.
In addition, in order to increase the diversity of the sample images, the angles of the face of the same person should differ as much as possible across the plurality of sample images containing that person; for example, the face of person x in sample image a is a frontal face, the face of person x in sample image b is a side face turned 45 degrees to the right, and the face of person x in sample image c is a side face turned 45 degrees to the left.
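A possible labelling record for one sample image is sketched below; the field names, file name and coordinate format are assumptions that merely mirror the first position information, first identification value and second identification value described above.

```python
# Hypothetical annotation for one sample image with two sample head frames.
sample_annotation = {
    "image": "sample_a.jpg",                      # assumed file name
    "head_frames": [
        {
            "first_position": [120, 40, 80, 96],  # x, y, width, height (assumed format)
            "first_id_value": 1,                  # contains a human head
            "second_id_value": 1,                 # contains a human face (frontal face)
        },
        {
            "first_position": [300, 52, 78, 90],
            "first_id_value": 1,
            "second_id_value": 0,                 # back of the head, no visible face
        },
    ],
}
```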
In the embodiment of the present invention, the position information (denoted as second position information) of each sample detection frame containing a human head in the sample image and the information of whether each sample detection frame contains a human face may be obtained through the original joint detection model. Specifically, the feature vector of the sample image is obtained through the feature extraction network layer of the original joint detection model, and the feature vector is then respectively input into the position detection layer, the head classification layer and the face classification layer of the original joint detection model to obtain a position vector (denoted as a second position vector) corresponding to the detection frames recognized in the sample image, a head classification vector (denoted as a second head classification vector) corresponding to the detection frames, and a face classification vector (denoted as a second face classification vector) corresponding to the detection frames. Finally, the second position information of each sample detection frame containing a human head in the sample image and the information of whether each sample detection frame contains a human face are acquired through the output layer of the original joint detection model based on the second position vector, the second head classification vector and the second face classification vector.
And subsequently training the original joint detection model according to the second position information of the sample detection frame and the first position information of the corresponding sample head frame, whether the sample detection frame contains the information of the head and a first identification value corresponding to the corresponding sample head frame, and whether the sample detection frame contains the information of the face and a second identification value corresponding to the corresponding sample head frame.
It should be noted that, for convenience of describing the training process of the original joint detection model, the embodiment of the present invention describes in detail the training process of any acquired sample image. In the actual training process, the above operation is performed on each sample image in the sample set, and when a preset convergence condition is met, it is determined that the training of the joint detection model is completed.
The preset convergence condition may be, for example, that the number of sample images in the sample set that are correctly identified after passing through the joint detection model is greater than a set number, or that the number of iterations of training the joint detection model reaches a set maximum number of iterations. The specific implementation can be set flexibly and is not particularly limited herein.
In a possible implementation manner, when the original joint detection model is trained, the sample images in the sample set may be divided into training sample images and test sample images, the original joint detection model is trained based on the training sample images, and then the reliability of the trained joint detection model is verified based on the test sample images.
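The overall procedure of steps S401 to S403, including the convergence condition and the optional split into training and test sample images, might be organised as in the following sketch. The compute_losses callable stands in for the loss computation detailed in Example 7 and is an assumed interface rather than one defined by this embodiment; the optimizer is assumed to follow the usual PyTorch interface.

```python
import random

def train_joint_detection_model(model, sample_set, optimizer, compute_losses,
                                max_iterations=100_000, test_ratio=0.1):
    """Sketch of steps S401-S403: split the sample set, iterate over training
    sample images and stop when the preset convergence condition is met
    (here, simply a maximum iteration count)."""
    random.shuffle(sample_set)
    split = int(len(sample_set) * (1 - test_ratio))
    train_samples, test_samples = sample_set[:split], sample_set[split:]

    iteration = 0
    while iteration < max_iterations:                   # preset convergence condition
        for sample in train_samples:
            predictions = model(sample["image"])        # S402: outputs of the original model
            loss = compute_losses(predictions, sample)  # S403: compare with the labelled values
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            iteration += 1
            if iteration >= max_iterations:
                break

    # Held-out test sample images are kept for verifying the reliability of the trained model.
    return model, test_samples
```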
Example 6: on the basis of the above embodiments, in the embodiment of the present invention, a third angle identification value corresponding to each sample head frame is further marked in a sample image, and the third angle identification value is used to identify the angle value of the face in the sample head frame including the face, or identify that the sample head frame does not include the face;
the method further comprises the following steps: acquiring a second angle identification value corresponding to the sample detection frame through the original joint detection model, wherein the second angle identification value is used for identifying the angle value of the face in the sample detection frame containing the face, or identifying that the sample detection frame does not contain the face;
training the original joint detection model, and further comprising: and training the original joint detection model according to the second angle identification value corresponding to the sample detection frame and the third angle identification value corresponding to the corresponding sample human head frame.
In the above embodiment, when determining whether the identification information of the target detection frame meets the preset first update condition, the first angle identification value corresponding to the target detection frame corresponding to the identification information needs to be obtained. Therefore, an angle identification value (denoted as a third angle identification value) corresponding to each sample human head frame is also marked in any sample image in the sample set. The third angle identification value is used for identifying the angle value of the face in the sample head frame containing the face, or identifying that the sample head frame does not contain the face.
In order to further acquire a joint detection model with higher precision, in the embodiment of the present invention, the angle value includes at least one of a yaw angle value, a pitch angle value, and a roll angle value, for example, the yaw angle value, the pitch angle value, and the roll angle value of the face in the sample frame of the human head may all be used as supervision information to train the original joint detection model. As shown in fig. 2, the angle identification value corresponding to Roll is a Roll angle value.
After any sample image in the sample set is obtained, an angle identification value (denoted as a second angle identification value) corresponding to a sample detection frame in the sample image can be obtained through the original joint detection model. The second angle identification value is used for identifying the angle value of the face in the sample detection frame containing the face, or identifying that the sample detection frame does not contain the face. Subsequently, when the original joint detection model is trained, the original joint detection model can be trained according to the second angle identification value corresponding to the obtained sample detection frame and the third angle identification value corresponding to the corresponding sample human head frame.
Specifically, the process of obtaining the second angle identification value corresponding to the sample detection frame through the original joint detection model includes: after the feature vector of the sample image is obtained through the feature extraction network layer of the original joint detection model, the feature vector is also input to the angle detection layer of the original joint detection model so as to obtain second angle identification vectors corresponding to the detection frames identified in the sample image; and acquiring, through the output layer of the original joint detection model and based on the second position vector, the second head classification vector, the second face classification vector and the second angle identification vector, the second position information of each sample detection frame containing a human head in the sample image, the information of whether each sample detection frame contains a human face, and the second angle identification value corresponding to each sample detection frame.
And subsequently, training the original joint detection model according to the second angle identification value corresponding to the sample detection frame and the third angle identification value corresponding to the corresponding sample human head frame.
Example 7: because the face in the target detection frame containing the face must contain the key points on the face, and the position information of the key points on the face is favorable for determining the angle value of the face in the target detection frame containing the face, in order to obtain the key point position vector corresponding to each sample detection frame through the joint detection model, the original joint detection model is trained according to the key point position vector corresponding to the sample detection frame and the sample key point position vector corresponding to the corresponding pre-marked sample human head frame, so as to further improve the accuracy of the trained joint detection model by adding the monitoring information for training the joint detection model, on the basis of the above embodiments, the sample image is also marked with the sample key point position vector corresponding to each sample human head frame;
obtaining a second angle identification value corresponding to each sample detection frame in the sample image through the original joint detection model, and further comprising: acquiring a key point position vector corresponding to each sample detection frame in a sample image through an original joint detection model;
training the original joint detection model, and further comprising: and training the original joint detection model according to the key point position vector corresponding to the sample detection frame and the sample key point position vector corresponding to the corresponding sample human head frame.
In the practical application process, a detection frame containing a face generally includes key points of the face, and the angle value of the face in the detection frame is determined according to the position information of these key points, for example, key points such as the nose tip, the lip peaks and the inner eye corners; the accuracy of the angle value of the face is strongly correlated with the extracted position information of the key points. Therefore, in order to further improve the accuracy of the trained joint detection model, in the embodiment of the present invention, the position information of the key points of the face in the detection frame containing the face may be used as supervision information to train the original joint detection model.
In a specific implementation process, the sample image is also marked with a sample key point position vector corresponding to each sample head frame, and the sample key point position vector comprises position information of the key points of the face in the sample head frame. In order to obtain the key point position vector corresponding to each detection frame in the sample image through the original joint detection model, the original joint detection model further comprises a key point detection layer. As shown in fig. 6, the key point detection layer is respectively connected to the feature extraction network layer and the output layer in the original joint detection model, so that the key point position vector corresponding to each sample detection frame in the sample image is obtained through the original joint detection model, thereby improving the precision of training the original joint detection model.
Specifically, the process of obtaining the key point position vector corresponding to each sample detection frame in the sample image through the original joint detection model includes: after the feature vector of the sample image is obtained through the feature extraction network layer of the original joint detection model, the feature vector is also input to the key point detection layer of the original joint detection model so as to obtain the total key point position vector corresponding to the detection frames identified in the sample image; and acquiring, through the output layer of the original joint detection model and based on the second position vector, the second head classification vector, the second face classification vector, the second angle identification vector and the total key point position vector, the second position information of each sample detection frame containing a human head in the sample image, the information of whether each sample detection frame contains a human face, the second angle identification value corresponding to each sample detection frame, and the key point position vector corresponding to each sample detection frame.
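The additional branches of Examples 6 and 7 can be sketched as an extension of the earlier network sketch. The channel counts, the choice of three angle values (yaw, pitch, roll) per anchor and the 106 key points are assumptions chosen only for illustration.

```python
import torch.nn as nn

class JointDetectionModelWithAngleAndKeypoints(nn.Module):
    """Sketch: the earlier backbone plus an angle detection layer and a key point
    detection layer, all connected to the same feature extraction layer and output layer."""

    def __init__(self, backbone: nn.Module, feat_channels: int = 64,
                 num_anchors: int = 4, num_keypoints: int = 106):
        super().__init__()
        self.backbone = backbone
        self.position_head = nn.Conv2d(feat_channels, num_anchors * 4, 1)
        self.head_cls = nn.Conv2d(feat_channels, num_anchors * 2, 1)
        self.face_cls = nn.Conv2d(feat_channels, num_anchors * 2, 1)
        # Angle detection layer: yaw / pitch / roll values per anchor (Example 6).
        self.angle_head = nn.Conv2d(feat_channels, num_anchors * 3, 1)
        # Key point detection layer: (x, y) per key point per anchor (Example 7).
        self.keypoint_head = nn.Conv2d(feat_channels, num_anchors * num_keypoints * 2, 1)

    def forward(self, image):
        feat = self.backbone(image)
        return (self.position_head(feat), self.head_cls(feat), self.face_cls(feat),
                self.angle_head(feat), self.keypoint_head(feat))
```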
Fig. 5 is a schematic diagram of the key points of a sample head frame according to an embodiment of the present invention, where key points are marked at positions such as the nose tip, lip peaks, lip corners and inner eye corners of the face in the sample head frame, and the sample key point position vector corresponding to the sample head frame can be determined according to the position information of these key points.
The number of key points marked on a sample head frame containing a face can be set to different values according to different scenes. If it is desired to reduce the time taken to train the original joint detection model, the number of key points can be set smaller; if higher accuracy of the trained joint detection model is desired, the number of key points can be set larger. However, the number should not be set too large; otherwise, the amount of calculation during training of the original joint detection model becomes very large, and the trained model is not easy to obtain. Preferably, the number of key points is generally 96 points, 106 points, or the like.
Subsequently, when the original joint detection model is trained, the original joint detection model can be trained through the acquired key point position vector corresponding to the sample detection frame and the sample key point position vector corresponding to the corresponding sample human head frame.
In one possible implementation, the training of the original joint detection model includes:
for each sample detection frame, if the sample detection frame is matched with any sample head frame, determining a position loss value according to the second position information of the sample detection frame and the first position information of the matched sample head frame; determining a first human head loss value according to whether the sample detection frame contains human head information and a first identification value corresponding to the matched sample human head frame; determining a face loss value according to whether the sample detection frame contains face information and a second identification value corresponding to the matched sample head frame; determining an angle loss value according to a second angle identification value corresponding to the sample detection frame and a third angle identification value corresponding to the matched sample human head frame; determining a key point loss value according to the key point position vector corresponding to the sample detection frame and the sample key point position vector corresponding to the matched sample human head frame; determining a sub-loss value according to the position loss value, the first human head loss value, the human face loss value, the angle loss value and the key point loss value; if the sample detection frame is not matched with any sample human head frame, determining a second human head loss value according to whether the sample detection frame contains human head information and a preset first numerical value; determining a sub-loss value according to the second head loss value; and training the original joint detection model according to the sum of the sub-loss values corresponding to each sample detection frame.
In the embodiment of the present invention, when determining the sub-loss value according to the determined position loss value, first head loss value, face loss value, angle loss value and key point loss value, the sub-loss value may be determined directly as the sum of the position loss value, the first head loss value, the face loss value, the angle loss value and the key point loss value; or the sub-loss value may be determined after performing certain algorithmic processing on these loss values, for example, by first determining the product of the position loss value and its corresponding weight value, the product of the first head loss value and its corresponding weight value, the product of the face loss value and its corresponding weight value, the product of the angle loss value and its corresponding weight value, and the product of the key point loss value and its corresponding weight value, and then determining the sub-loss value according to the sum of these products.
Because different information has different influences on the training of the original joint detection model, the weight values corresponding to the position loss value, the first head loss value, the face loss value, the angle loss value and the key point loss value can be the same or different. Each weight value may be set by an adaptive algorithm, or may be set by an artificial empirical value, which is not specifically limited herein.
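Written as a formula (a sketch only; the weight values w are left unspecified, consistent with the adaptive or empirical setting described above), the weighted combination of the loss values for a matched sample detection frame and the total training loss would be:

```latex
% Weighted sub-loss for a matched sample detection frame (weights w illustrative;
% the embodiment allows equal or unequal weights):
L_{\text{sub}} = w_{\text{pos}} L_{\text{pos}} + w_{\text{head}} L_{\text{head}}
              + w_{\text{face}} L_{\text{face}} + w_{\text{angle}} L_{\text{angle}}
              + w_{\text{kp}} L_{\text{kp}},
\qquad
L_{\text{total}} = \sum_{\text{sample detection frames}} L_{\text{sub}}
```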
The third angle identification value corresponding to a sample head frame that does not contain a face is only used for identifying that the sample head frame does not contain a face. Therefore, when the third angle identification value corresponding to the matched sample head frame identifies that the matched sample head frame does not contain a face, the angle loss value does not need to be determined according to the second angle identification value corresponding to the sample detection frame and the third angle identification value corresponding to the matched sample head frame; instead, a preset second numerical value can be directly determined as the angle loss value. Specifically, determining the angle loss value according to the second angle identification value corresponding to the sample detection frame and the third angle identification value corresponding to the matched sample head frame includes:
and if the third angle identification value corresponding to the matched sample human head frame identifies that the matched sample human head frame does not contain a human face, determining the preset second numerical value as an angle loss value.
In order to improve the accuracy of the trained joint detection model, the preset second value is generally a smaller value such as "0", "0.1", and the like.
Similarly, the sample key point position vector corresponding to a sample head frame that does not contain a face is only used for identifying that the sample head frame does not contain a face. Therefore, when the sample key point position vector corresponding to the matched sample head frame identifies that the matched sample head frame does not contain a face, the key point loss value does not need to be determined according to the key point position vector corresponding to the sample detection frame and the sample key point position vector corresponding to the matched sample head frame; instead, a preset third numerical value can be directly determined as the key point loss value. Specifically, determining the key point loss value according to the key point position vector corresponding to the sample detection frame and the sample key point position vector corresponding to the matched sample head frame includes: if the sample key point position vector corresponding to the matched sample head frame identifies that the matched sample head frame does not contain a face, determining the preset third numerical value as the key point loss value.
In order to improve the accuracy of the trained joint detection model, the preset third value is generally a smaller value such as "0", "0.1", and the like.
In another possible implementation, for each sample detection frame, matching the sample detection frame with any sample human head frame, and if the sample detection frame is not matched with any sample human head frame, indicating that the sample detection frame is a detection frame that does not include a human head, determining a human head loss value (for convenience of description, recorded as a second human head loss value) directly according to whether the sample detection frame includes human head information and a preset first numerical value; and determining a sub-loss value according to the second head loss value.
In order to improve the accuracy of the trained joint detection model, the first value is generally a small value such as "0", "0.1", and the like. In the embodiment of the present invention, the first numerical value, the second numerical value, and the third numerical value may be the same or different, and are not limited herein.
When determining the sub-loss value according to the second head loss value, the second head loss value may be directly determined as the sub-loss value, or the sub-loss value may be determined after performing certain algorithmic processing on the second head loss value, for example, determining the sub-loss value according to the product of the second head loss value and its corresponding weight value. The original joint detection model is then trained according to the sum of the acquired sub-loss values so as to update the values of the parameters in the original joint detection model. In specific implementation, when the original joint detection model is trained according to the sum of the sub-loss values, a gradient descent algorithm may be adopted to back-propagate the gradients of the parameters in the original joint detection model, thereby updating the values of the parameters of the original joint detection model.
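A self-contained Python sketch of this loss computation is given below. The IoU-based matching criterion, the simple squared-error and cross-entropy stand-ins for the individual loss functions, and the interpretation of the preset first, second and third numerical values are all assumptions made only to show how the matched and unmatched cases combine into the total loss.

```python
import math

def l2_loss(pred, target):
    """Simple squared-error loss used as a stand-in for the regression losses."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def cls_loss(score, label):
    """Simple binary cross-entropy used as a stand-in for the classification losses."""
    p = min(max(score, 1e-7), 1 - 1e-7)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, w, h] boxes, used here for matching."""
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def compute_total_loss(detections, head_frames, weights,
                       first_value=0.0, second_value=0.0, third_value=0.0,
                       match_threshold=0.5):
    """Sketch of Example 7: sum of sub-loss values over all sample detection frames."""
    total = 0.0
    for det in detections:
        matches = [hf for hf in head_frames
                   if iou(det["position"], hf["first_position"]) >= match_threshold]
        if not matches:
            # Unmatched detection frame: second head loss against the preset first value.
            total += cls_loss(det["head_score"], first_value)
            continue
        hf = matches[0]
        sub = weights["pos"] * l2_loss(det["position"], hf["first_position"])
        sub += weights["head"] * cls_loss(det["head_score"], hf["first_id_value"])
        sub += weights["face"] * cls_loss(det["face_score"], hf["second_id_value"])
        if hf["second_id_value"] == 0:
            # Matched sample head frame contains no face: preset second/third values are used directly.
            sub += weights["angle"] * second_value
            sub += weights["kp"] * third_value
        else:
            sub += weights["angle"] * l2_loss(det["angles"], hf["third_angle_values"])
            sub += weights["kp"] * l2_loss(det["keypoints"], hf["sample_keypoints"])
        total += sub
    return total
```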
Example 8: fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention, where the apparatus includes:
an obtaining unit 71, configured to obtain, through a joint detection model, position information of each target detection frame that includes a human head in an image to be recognized, and information of whether each target detection frame includes a human face;
the processing unit 72 is configured to determine the passenger flow data according to the position information of the target detection frame and/or information whether the target detection frame includes a human face.
In a possible implementation, the processing unit 72 is specifically configured to: determining identification information of a target detection frame; and if the identification information meets a preset first updating condition and the target detection frame contains a human face, updating the attention frequency parameter in the currently stored passenger flow data.
In a possible implementation manner, the obtaining unit 71 is further configured to obtain, through a joint detection model, a first angle identification value corresponding to each target detection frame, where the first angle identification value is used to identify an angle value of a face in a target detection frame that includes the face, or identify that the target detection frame does not include the face;
the processing unit 72 is specifically configured to: and if the identification information is not matched with any statistical identification stored in the current statistical queue, and the first angle identification value corresponding to the target detection frame corresponding to the identification information meets a preset second updating condition, determining that the identification information meets the first updating condition.
In a possible implementation, the processing unit 72 is specifically configured to: if the target detection frames corresponding to the identification information exist in the continuously set number of images to be recognized, and the first angle identification values corresponding to the target detection frames corresponding to the identification information in the continuously set number of images to be recognized are all smaller than the preset angle threshold value, it is determined that the first angle identification values corresponding to the target detection frames corresponding to the identification information meet the second updating condition.
In a possible implementation, the obtaining unit 71 is specifically configured to: obtain a feature vector of the image to be recognized through the feature extraction network layer of the joint detection model; respectively input the feature vector into a position detection layer, a head classification layer and a face classification layer of the joint detection model to obtain a first position vector, a first head classification vector and a first face classification vector corresponding to the detection frames identified in the image to be recognized; and acquire, through an output layer of the joint detection model and based on the first position vector, the first head classification vector and the first face classification vector, the position information of each target detection frame containing a human head in the image to be recognized and the information of whether each target detection frame contains a human face.
In a possible implementation, the obtaining unit 71 is specifically configured to:
sequentially determining, through the output layer of the joint detection model and based on the first position vector, the position information corresponding to each detection frame and a first index value, wherein the first index value is used for identifying the position, in the first position vector, of the position information corresponding to the detection frame; sequentially determining, based on the first head classification vector, the information of whether each detection frame contains a human head and a second index value, wherein the second index value is used for identifying the position, in the first head classification vector, of the information of whether the detection frame contains a human head; sequentially determining, based on the first face classification vector, the information of whether each detection frame contains a human face and a third index value, wherein the third index value is used for identifying the position, in the first face classification vector, of the information of whether the detection frame contains a human face; determining, from the second index values, a target second index value corresponding to a target detection frame containing a human head, determining, from the first index values, a target first index value that is the same as the target second index value, and determining, from the third index values, a target third index value that is the same as the target second index value; and determining the position information of the detection frame corresponding to the target first index value and the information of whether the detection frame corresponding to the target third index value contains a human face as the position information of the target detection frame and the information of whether the target detection frame contains a human face.
In a possible embodiment, the processing unit 72 is further configured to determine identification information of the target detection box; and if the identification information of the target detection frame meets a preset third updating condition, updating the passenger flow value parameter in the currently stored passenger flow data.
Example 9: fig. 8 is a schematic structural diagram of a training apparatus for a joint detection model according to an embodiment of the present invention, where the apparatus includes:
the first obtaining module 81 is configured to obtain any sample image in the sample set, where the sample image is marked with first position information of a head frame of each sample, a first identification value and a second identification value, where the first identification value is used to identify that the head frame of the sample contains a human head, and the second identification value is used to identify whether the head frame of the sample contains a human face;
a second obtaining module 82, configured to obtain, through the original joint detection model, second position information of each sample detection frame that includes a human head in the sample image, and information of whether each sample detection frame includes a human face;
the training module 83 is configured to train the original joint detection model according to the second position information of the sample detection frame and the first position information of the corresponding sample human head frame, whether the sample detection frame includes the information of the human head and the first identification value corresponding to the corresponding sample human head frame, and whether the sample detection frame includes the information of the human face and the second identification value corresponding to the corresponding sample human head frame.
In a possible implementation manner, the sample image is further marked with a third angle identification value corresponding to each sample head frame, and the third angle identification value is used for identifying an angle value of a human face in the sample head frame containing the human face, or identifying that the sample head frame does not contain the human face;
the second obtaining module 82 is further configured to obtain, through the original joint detection model, a second angle identification value corresponding to the sample detection frame, where the second angle identification value is used to identify an angle value of a face in the sample detection frame including the face, or identify that the sample detection frame does not include the face;
the training module 83 is further configured to train the original joint detection model according to the second angle identification value corresponding to the sample detection frame and the third angle identification value corresponding to the sample human head frame.
In a possible implementation manner, a sample key point position vector corresponding to each sample head frame is marked in the sample image;
the second obtaining module 82 is further configured to obtain, through the original joint detection model, a key point position vector corresponding to each sample detection frame in the sample image;
the training module 83 is further configured to train the original joint detection model according to the keypoint location vector corresponding to the sample detection frame and the sample keypoint location vector corresponding to the sample human head frame.
In a possible implementation, the training module 83 is specifically configured to:
for each sample detection frame, if the sample detection frame is matched with any sample human head frame, determining a position loss value according to the second position information of the sample detection frame and the first position information of the matched sample human head frame; determining a first human head loss value according to whether the sample detection frame contains human head information and a first identification value corresponding to the matched sample human head frame; determining a face loss value according to whether the sample detection frame contains face information and a second identification value corresponding to the matched sample head frame; determining an angle loss value according to a second angle identification value corresponding to the sample detection frame and a third angle identification value corresponding to the matched sample human head frame; determining a key point loss value according to the key point position vector corresponding to the sample detection frame and the sample key point position vector corresponding to the matched sample human head frame; determining a sub-loss value according to the position loss value, the first human head loss value, the human face loss value, the angle loss value and the key point loss value; if the sample detection frame is not matched with any sample human head frame, determining a second human head loss value according to whether the sample detection frame contains human head information and a preset first numerical value; determining a sub-loss value according to the second head loss value; and training the original joint detection model according to the sum of the sub-loss values corresponding to each sample detection frame.
In a possible implementation, the training module 83 is specifically configured to:
and if the third angle identification value corresponding to the matched sample human head frame identifies that the matched sample human head frame does not contain a human face, determining the preset second numerical value as an angle loss value.
In a possible implementation, the training module 83 is specifically configured to:
and if the matched sample head frame does not contain the human face, the sample key point position vector corresponding to the matched sample head frame identifies that the matched sample head frame does not contain the human face, and a preset third numerical value is determined as a key point loss value.
Example 10: as shown in fig. 9, which is a schematic structural diagram of an electronic device according to an embodiment of the present invention, on the basis of the foregoing embodiments, the electronic device includes: the system comprises a processor 91, a communication interface 92, a memory 93 and a communication bus 94, wherein the processor 91, the communication interface 92 and the memory 93 are communicated with each other through the communication bus 94; the memory 93 has stored therein a computer program which, when executed by the processor 91, causes the processor 91 to perform the steps of:
acquiring the position information of each target detection frame containing the human head in the image to be recognized and the information whether each target detection frame contains the human face or not through a joint detection model; and determining passenger flow data according to the position information of the target detection frame and/or the information whether the target detection frame contains the human face.
Because the principle of the electronic device for solving the problems is similar to the data processing method, the implementation of the electronic device may refer to the implementation of the method, and repeated details are not repeated.
Example 11: as shown in fig. 10, which is a schematic structural diagram of an electronic device according to an embodiment of the present invention, on the basis of the foregoing embodiments, the electronic device includes: the system comprises a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 are communicated with each other through the communication bus 1004;
the memory 1003 has stored therein a computer program which, when executed by the processor 1001, causes the processor 1001 to perform the steps of: acquiring any sample image in a sample set, wherein the sample image is marked with first position information of each sample human head frame, a first identification value and a second identification value corresponding to each sample human head frame, the first identification value is used for identifying whether the sample human head frame contains a human head, and the second identification value is used for identifying whether the sample human head frame contains a human face or not; acquiring second position information of each sample detection frame containing the human head in the sample image and information whether each sample detection frame contains the human face or not through an original joint detection model; and training the original joint detection model according to the second position information of the sample detection frame and the first position information of the corresponding sample head frame, whether the sample detection frame contains the information of the head and a first identification value corresponding to the corresponding sample head frame, and whether the sample detection frame contains the information of the face and a second identification value corresponding to the corresponding sample head frame.
Because the principle of solving the problem of the electronic device is similar to the training method of the joint detection model, the implementation of the electronic device can refer to the implementation of the method, and repeated details are not repeated.
The communication bus mentioned in the above embodiments may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface 1002 is used for communication between the electronic apparatus and other apparatuses. The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Alternatively, the memory may be at least one storage device located remotely from the processor. The Processor may be a general-purpose processor, including a central processing unit, a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
Example 12: on the basis of the foregoing embodiments, the present invention further provides a computer-readable storage medium, in which a computer program executable by a processor is stored, and when the program is run on the processor, the processor is caused to execute the following steps:
acquiring the position information of each target detection frame containing the human head in the image to be recognized and the information whether each target detection frame contains the human face or not through a joint detection model; and determining passenger flow data according to the position information of the target detection frame and/or the information whether the target detection frame contains the human face.
Since the principle of solving the problem of the computer-readable storage medium is similar to the data processing method in the above embodiment, the specific implementation may refer to the implementation of the data processing method, and repeated details are not repeated.
Example 13: on the basis of the foregoing embodiments, the present invention further provides a computer-readable storage medium, in which a computer program executable by a processor is stored, and when the program is run on the processor, the processor is caused to execute the following steps:
acquiring any sample image in a sample set, wherein the sample image is marked with first position information of each sample human head frame and a first identification value and a second identification value corresponding to each sample human head frame, the first identification value being used for identifying whether the sample human head frame contains a human head, and the second identification value being used for identifying whether the sample human head frame contains a human face; acquiring, through an original joint detection model, second position information of each sample detection frame containing a human head in the sample image and information of whether each sample detection frame contains a human face; and training the original joint detection model according to the second position information of each sample detection frame and the first position information of the corresponding sample human head frame, the information of whether the sample detection frame contains a human head and the first identification value corresponding to the corresponding sample human head frame, and the information of whether the sample detection frame contains a human face and the second identification value corresponding to the corresponding sample human head frame.
Since the principle by which the computer-readable storage medium solves the problem is similar to that of the training method of the joint detection model in the above embodiments, the specific implementation may refer to the implementation of the training method, and repeated details are not described again.
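A minimal sketch of the training objective implied by the steps above follows, assuming each sample detection frame has already been matched to its corresponding sample human head frame, and assuming an unweighted sum of a smooth-L1 position loss and two cross-entropy classification losses; the matching procedure and any loss weighting are not specified in this disclosure.

```python
import torch
import torch.nn.functional as F

def joint_detection_loss(pred_boxes: torch.Tensor,       # (N, 4) second position information
                         pred_head_logits: torch.Tensor, # (N, 2) head / no-head scores
                         pred_face_logits: torch.Tensor, # (N, 2) face / no-face scores
                         gt_boxes: torch.Tensor,         # (N, 4) first position information
                         gt_head_labels: torch.Tensor,   # (N,)   first identification values (0/1)
                         gt_face_labels: torch.Tensor    # (N,)   second identification values (0/1)
                         ) -> torch.Tensor:
    """Loss over one batch of sample detection frames matched to sample human head frames."""
    position_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)         # box regression term
    head_loss = F.cross_entropy(pred_head_logits, gt_head_labels)  # head classification term
    face_loss = F.cross_entropy(pred_face_logits, gt_face_labels)  # face classification term
    return position_loss + head_loss + face_loss
```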
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of data processing, the method comprising:
acquiring, through a joint detection model, position information of each target detection frame containing a human head in an image to be recognized and information of whether each target detection frame contains a human face;
and determining passenger flow data according to the position information of the target detection frame and/or the information of whether the target detection frame contains a human face.
2. The method of claim 1, wherein the determining passenger flow data comprises:
determining identification information of the target detection frame;
and if the identification information meets a preset first updating condition and the target detection frame contains a human face, updating the attention frequency parameter in the currently stored passenger flow data.
3. The method of claim 2, wherein the acquiring, through the joint detection model, the position information of each target detection frame containing a human head in the image to be recognized and the information of whether each target detection frame contains a human face further comprises:
acquiring, through the joint detection model, a first angle identification value corresponding to each target detection frame, wherein the first angle identification value is used for identifying an angle value of the human face in a target detection frame containing a human face, or for identifying that the target detection frame does not contain a human face;
and the determining that the identification information meets a preset first updating condition comprises: if the identification information does not match any statistical identification stored in a current statistical queue, and the first angle identification value corresponding to the target detection frame corresponding to the identification information meets a preset second updating condition, determining that the identification information meets the first updating condition.
4. The method according to claim 3, wherein the determining that the first angle identification value corresponding to the target detection frame corresponding to the identification information meets a preset second updating condition comprises:
if the target detection frame corresponding to the identification information exists in a set number of consecutive images to be recognized, and the first angle identification values corresponding to the target detection frames corresponding to the identification information in the set number of consecutive images to be recognized are all smaller than a preset angle threshold value, determining that the first angle identification value corresponding to the target detection frame corresponding to the identification information meets the second updating condition.
5. The method according to claim 1, wherein the acquiring, through the joint detection model, the position information of each target detection frame containing a human head in the image to be recognized and the information of whether each target detection frame contains a human face comprises:
acquiring a feature vector of the image to be recognized through a feature extraction network layer of the joint detection model;
respectively inputting the feature vector into a position detection layer, a head classification layer and a face classification layer of the joint detection model, to obtain a first position vector, a first head classification vector and a first face classification vector corresponding to each detection frame recognized in the image to be recognized;
and acquiring, through an output layer of the joint detection model and based on the first position vector, the first head classification vector and the first face classification vector, the position information of each target detection frame containing a human head in the image to be recognized and the information of whether each target detection frame contains a human face.
6. A method for training a joint detection model, the method comprising:
acquiring any sample image in a sample set, wherein the sample image is marked with first position information of each sample human head frame and a first identification value and a second identification value corresponding to each sample human head frame, the first identification value being used for identifying whether the sample human head frame contains a human head, and the second identification value being used for identifying whether the sample human head frame contains a human face;
acquiring, through an original joint detection model, second position information of each sample detection frame containing a human head in the sample image and information of whether each sample detection frame contains a human face;
and training the original joint detection model according to the second position information of each sample detection frame and the first position information of the corresponding sample human head frame, the information of whether the sample detection frame contains a human head and the first identification value corresponding to the corresponding sample human head frame, and the information of whether the sample detection frame contains a human face and the second identification value corresponding to the corresponding sample human head frame.
7. A data processing apparatus, characterized in that the apparatus comprises:
an acquisition unit, used for acquiring, through a joint detection model, position information of each target detection frame containing a human head in an image to be recognized and information of whether each target detection frame contains a human face;
and a processing unit, used for determining passenger flow data according to the position information of the target detection frame and/or the information of whether the target detection frame contains a human face.
8. An apparatus for training a joint detection model, the apparatus comprising:
a first acquisition module, used for acquiring any sample image in a sample set, wherein the sample image is marked with first position information of each sample human head frame and a first identification value and a second identification value corresponding to each sample human head frame, the first identification value being used for identifying whether the sample human head frame contains a human head, and the second identification value being used for identifying whether the sample human head frame contains a human face;
a second acquisition module, used for acquiring, through an original joint detection model, second position information of each sample detection frame containing a human head in the sample image and information of whether each sample detection frame contains a human face;
and a training module, used for training the original joint detection model according to the second position information of each sample detection frame and the first position information of the corresponding sample human head frame, the information of whether the sample detection frame contains a human head and the first identification value corresponding to the corresponding sample human head frame, and the information of whether the sample detection frame contains a human face and the second identification value corresponding to the corresponding sample human head frame.
9. An electronic device, characterized in that the electronic device comprises at least a processor and a memory, the processor being configured to carry out the steps of the data processing method according to any one of claims 1 to 5, or the steps of the training method according to claim 6, when executing a computer program stored in the memory.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, carries out the steps of the data processing method according to any one of claims 1 to 5, or the steps of the training method according to claim 6.
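To make the attention-frequency update logic recited in claims 2 to 4 above easier to follow, here is a small Python sketch of the first and second updating conditions. The angle threshold, the set number of consecutive images, the no-face sentinel value and the simplified handling of frame gaps are all assumptions made for illustration, not values given in the claims.

```python
from collections import deque
from typing import Deque, Dict, Set

NO_FACE = -1.0          # assumed sentinel: the target detection frame contains no face
ANGLE_THRESHOLD = 30.0  # preset angle threshold (assumed value, in degrees)
SET_NUMBER = 5          # set number of consecutive images to be recognized (assumed)

# Recent first angle identification values, keyed by identification information (track id).
recent_angles: Dict[int, Deque[float]] = {}

def meets_second_updating_condition(track_id: int, angle_value: float) -> bool:
    """Claim 4 (simplified): the target detection frame appears in a set number of consecutive
    images, and every first angle identification value in that window is below the threshold."""
    history = recent_angles.setdefault(track_id, deque(maxlen=SET_NUMBER))
    history.append(angle_value)
    return (len(history) == SET_NUMBER
            and all(a != NO_FACE and a < ANGLE_THRESHOLD for a in history))

def meets_first_updating_condition(track_id: int, angle_value: float,
                                   statistical_queue: Set[int]) -> bool:
    """Claim 3 (simplified): the identification information is not in the current statistical
    queue, and the second updating condition holds."""
    return (track_id not in statistical_queue
            and meets_second_updating_condition(track_id, angle_value))
```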
CN202010863364.8A 2020-08-25 2020-08-25 Data processing and model training method, device, equipment and medium Active CN111950507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010863364.8A CN111950507B (en) 2020-08-25 2020-08-25 Data processing and model training method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN111950507A 2020-11-17
CN111950507B (en) 2024-06-11

Family

ID=73367614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010863364.8A Active CN111950507B (en) 2020-08-25 2020-08-25 Data processing and model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111950507B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011060026A (en) * 2009-09-10 2011-03-24 Dainippon Printing Co Ltd System and device for analyzing face detection result, and computer program
CN103021059A (en) * 2012-12-12 2013-04-03 天津大学 Video-monitoring-based public transport passenger flow counting method
US20140376772A1 (en) * 2013-06-24 2014-12-25 Utechzone Co., Ltd Device, operating method and computer-readable recording medium for generating a signal by detecting facial movement
CN111274848A (en) * 2018-12-04 2020-06-12 北京嘀嘀无限科技发展有限公司 Image detection method and device, electronic equipment and storage medium
CN110222673A (en) * 2019-06-21 2019-09-10 杭州宇泛智能科技有限公司 A kind of passenger flow statistical method based on head detection
CN111444850A (en) * 2020-03-27 2020-07-24 北京爱笔科技有限公司 Picture detection method and related device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
S. H. ELDAKDOKY: "A Study of Equitable Accessibility and Passengers Flow in Future Stations of Cairo Metro", Journal of Engineering Sciences, Assiut University, Faculty of Engineering, vol. 44, no. 4, pages 403-417 *
段巨力: "Machine-vision-based analysis and evaluation *** of students' concentration in class", China Master's Theses Full-text Database, Social Sciences II, no. 5, pages 127-59 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822111A (en) * 2021-01-19 2021-12-21 北京京东振世信息技术有限公司 Crowd detection model training method and device and crowd counting method and device
CN113822111B (en) * 2021-01-19 2024-05-24 北京京东振世信息技术有限公司 Crowd detection model training method and device and crowd counting method and device
CN113627403A (en) * 2021-10-12 2021-11-09 深圳市安软慧视科技有限公司 Method, system and related equipment for selecting and pushing picture

Also Published As

Publication number Publication date
CN111950507B (en) 2024-06-11

Similar Documents

Publication Publication Date Title
CN109272509B (en) Target detection method, device and equipment for continuous images and storage medium
CN108416250B (en) People counting method and device
CN110245579B (en) People flow density prediction method and device, computer equipment and readable medium
CN113177968A (en) Target tracking method and device, electronic equipment and storage medium
CN112307886A (en) Pedestrian re-identification method and device
CN113642431A (en) Training method and device of target detection model, electronic equipment and storage medium
CN113901911B (en) Image recognition method, image recognition device, model training method, model training device, electronic equipment and storage medium
CN111950507B (en) Data processing and model training method, device, equipment and medium
CN111401196A (en) Method, computer device and computer readable storage medium for self-adaptive face clustering in limited space
CN110705531A (en) Missing character detection and missing character detection model establishing method and device
CN111652181B (en) Target tracking method and device and electronic equipment
CN111291646A (en) People flow statistical method, device, equipment and storage medium
CN113065379B (en) Image detection method and device integrating image quality and electronic equipment
CN114169425B (en) Training target tracking model and target tracking method and device
CN114299546A (en) Method and device for identifying pet identity, storage medium and electronic equipment
CN113627334A (en) Object behavior identification method and device
CN107292284A (en) Target re-detection method, device and unmanned plane
CN110781223A (en) Data processing method and device, processor, electronic equipment and storage medium
CN111860261B (en) Passenger flow value statistical method, device, equipment and medium
CN114927236A (en) Detection method and system for multiple target images
CN112446355A (en) Public place pedestrian identification method and pedestrian flow statistical system
CN110956098B (en) Image processing method and related equipment
CN113989720A (en) Target detection method, training method, device, electronic equipment and storage medium
CN113936231A (en) Target identification method and device and electronic equipment
CN112306243A (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant