CN117238039B - Multitasking human behavior analysis method and system based on top view angle - Google Patents


Info

Publication number
CN117238039B
CN117238039B (application CN202311523413.3A)
Authority
CN
China
Prior art keywords
human body
frame
head
prediction
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311523413.3A
Other languages
Chinese (zh)
Other versions
CN117238039A (en)
Inventor
赵惠
张鹏飞
苏江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd filed Critical DMAI Guangzhou Co Ltd
Priority to CN202311523413.3A priority Critical patent/CN117238039B/en
Publication of CN117238039A publication Critical patent/CN117238039A/en
Application granted granted Critical
Publication of CN117238039B publication Critical patent/CN117238039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multitasking human behavior analysis method and system based on a top view angle, relating to the technical field of computer vision. The method comprises: collecting target top-view scene pictures and labeling the human body characteristic data in them; dividing the labeled human body characteristic data into a training set and a verification set; constructing a multi-task human behavior analysis model comprising a backbone network and a plurality of prediction heads connected to it; inputting the training set into the model for training; and using the trained model to predict the left and right shoulder point coordinates of each human body, calculating the human body orientation from those coordinates, and simultaneously outputting the human body frame, head top point coordinates, and a human body characterization vector for pedestrian tracking. The invention improves both the tracking recall rate and the tracking speed for pedestrians.

Description

Multitasking human behavior analysis method and system based on top view angle
Technical Field
The invention relates to the technical field of computer vision, in particular to a multitasking human behavior analysis method and system based on a top view angle.
Background
In the field of computer vision, object detection is a fundamental task whose aim is to locate objects in an image; when the detected object is a person, the task is called human detection. Human detection, human orientation estimation, and human identification are important subjects in the field of computer vision.
Currently, in the prior art, Chinese patent application No. 2022106674341 discloses a pedestrian cross-camera positioning method, system, and medium based on top-view cameras. The method comprises: acquiring an original image from each camera in a target area; detecting the position of each person's head in the original image, and regressing the foot position from the head position; mapping the foot position in each original image to the corresponding position on a correction chart preset for the corresponding camera; and mapping the foot position on the correction chart to the corresponding position on a local map pre-built from the target area, thereby positioning the pedestrian on the local map. By pre-building each camera's correction chart and the whole local map in an initialization stage, the method provides a convenient and fast way to map in real time from the original image to the correction chart, from the correction chart to the local map, and onward to trajectory tracking, while avoiding positioning and tracking errors caused by illumination changes, pedestrian occlusion, similar clothing, and the like. However, because this method positions people by directly detecting the head, it cannot detect a person whose head is occluded even when most of the body is visible.
The Chinese patent with application number 2021110350656 discloses a multi-target tracking method based on coarse-to-fine occlusion handling. It proposes a coarse-to-fine occlusion processing scheme: the first step completes coarse processing of occluded targets and full processing of non-occluded targets, the second step completes fine processing of occluded targets, and finally the results of the two steps are merged to obtain the final pedestrian positions and corresponding appearance feature vectors, so that the method adapts well to pedestrian tracking in public environments across different periods and pedestrian densities. However, in this method the appearance characterization vector of a non-occluded target is computed per grid cell, and when the multiple anchors of one grid cell cover different targets, ids cannot be assigned properly.
The Chinese patent with application number 201610024401X discloses a real-time human orientation angle detection method based on ellipse matching. It first obtains the parameters of an asymmetric ellipse model of a reference foreground area from a reference image, then collects a shoulder cross-section point set at an arbitrary orientation angle and generates an asymmetric ellipse model point set from the asymmetric ellipse model, and finally obtains the human orientation angle by matching the shoulder cross-section point sets between two adjacent frames against the asymmetric ellipse model point set. The method can detect human orientation in real time with an accuracy of 1 degree. However, it has the drawback of being complex and computationally expensive.
Therefore, improving the precision and recall of human detection, the accuracy of human orientation prediction, and the accuracy of appearance characterization vectors, while reducing model training and prediction costs, are problems that urgently need to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method and a system for analyzing multi-task human behavior based on top view angle, which are used for at least partially solving the problems in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in one aspect, the invention discloses a method for analyzing multitasking human behavior based on a top view angle, which comprises the following steps:
1) Data acquisition and labeling:
acquiring a target top view scene picture, and labeling human body characteristic data in the target top view scene picture, wherein the human body characteristic data comprises a human body id, a human body frame, head top point position coordinates, left shoulder point position coordinates and right shoulder point position coordinates; and dividing the marked human body characteristic data into a training set and a verification set.
2) Model construction:
The method comprises constructing a multi-task human body behavior analysis model, wherein the model comprises a backbone network and a plurality of prediction heads connected to the backbone network, and the prediction heads comprise a plurality of point prediction heads, a plurality of frame prediction heads, and an appearance characterization prediction head.
3) Model training:
The training set is input into the multitasking human behavior analysis model for model training; during training an id is assigned to each positive anchor, and the appearance characterization vector output by the appearance characterization prediction head is fed into a softmax function.
4) Model prediction:
Predicting the left and right shoulder point position coordinates of each human body in the target top-view scene picture to be detected by using the trained multitask human behavior analysis model, calculating the human body orientation based on those coordinates, and simultaneously outputting the human body frame, head top point position coordinates, and human body characterization vector for pedestrian tracking.
Further, the frame prediction head is used for predicting human body frame coordinates and confidences, wherein each grid cell in the frame prediction head predicts a plurality of anchors, and each anchor is used for predicting four-dimensional frame coordinates, a one-dimensional category confidence, and a one-dimensional target confidence.
Further, each grid cell in the point prediction heads corresponds to the frame prediction head and is used for predicting the two-dimensional coordinates and one-dimensional confidence of the three human key points of the corresponding anchor; the three human body key points are the head top point, the left shoulder point, and the right shoulder point.
Further, the frame coordinates are supervised by using a CIOU loss function, the frame category confidence and the target confidence are supervised by using a BCE loss function, the key point confidence is supervised by using a BCE loss function, the key point position is supervised by using an OKS loss function, and the appearance characterization is supervised by using a CE loss function.
On the other hand, the invention also discloses a multitasking human behavior analysis system based on the top view angle, which comprises,
the data acquisition module is used for acquiring a target top view scene picture, marking human body characteristic data in the target top view scene picture, and constructing a training set and a verification set, wherein the human body characteristic data comprises a human body id, a human body frame, a head top point position coordinate, a left shoulder point position coordinate and a right shoulder point position coordinate.
The model construction module is used for constructing a multi-task human body behavior analysis model; the model comprises a backbone network and a plurality of prediction heads connected to the backbone network, and the prediction heads comprise a plurality of point prediction heads, a plurality of frame prediction heads, and an appearance characterization prediction head.
The model training module is used for training the multitasking human behavior analysis model with the training set, assigning an id to each positive anchor during training, and feeding the appearance characterization vector output by the appearance characterization prediction head into a softmax function.
The model prediction module is used for predicting the left and right shoulder point position coordinates of the human body in the top view scene picture of the target to be detected by using the trained multitask human body behavior analysis model, calculating the human body orientation based on the left and right shoulder point position coordinates of the human body, and outputting the human body frame, the top head point position coordinates and the human body characterization vector for pedestrian tracking.
Further, each grid cell in the frame prediction head predicts a plurality of anchors, and each anchor is used for predicting four-dimensional frame coordinates, a one-dimensional category confidence, and a one-dimensional target confidence.
Further, each grid cell in the point prediction heads corresponds to the frame prediction head and is used for predicting the two-dimensional coordinates and one-dimensional confidence of the three human key points of the corresponding anchor; the three human body key points are the head top point, the left shoulder point, and the right shoulder point.
Further, the frame coordinates are supervised by using a CIOU loss function, the frame category confidence and the target confidence are supervised by using a BCE loss function, the key point confidence is supervised by using a BCE loss function, the key point position is supervised by using an OKS loss function, and the appearance characterization is supervised by using a CE loss function.
Compared with the prior art, the invention discloses a multitasking human behavior analysis method and system based on a top view angle, which has the following beneficial effects:
The invention outputs both the human bounding frame and the head top point for positioning a person; compared with methods that detect only the head region or head top point, it can detect true positives in which the head is not visible but most of the body is, improving the tracking recall rate;
compared with methods that detect only the human bounding frame, the additionally output head top point can be combined with a depth map to reduce background false positives, enhancing the generalization ability of the model and improving its precision.
The invention replaces the embedding branch of JDE with features of the corresponding predicted frame obtained via RoI Align, and assigns an id to each positive anchor during training, avoiding the problem that ids cannot be assigned when one grid cell corresponds to multiple anchors, and improving the accuracy of the feature vector used for pedestrian re-identification.
The invention additionally uses the output left and right shoulder points to estimate the human body orientation, which handles cases such as a person rotating in place, where a rough motion direction cannot be estimated from frame positions alone, thereby approximating the human body orientation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic overall flow chart of a method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a multi-task human behavior analysis system based on a top view perspective according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a top view scene of a target according to an embodiment of the present invention;
fig. 4 is a schematic diagram of estimating a human body orientation by using left shoulder points and right shoulder points according to an embodiment of the present invention.
Fig. 5 is a graph of the results of human body detection, head-shoulder points and orientation prediction for multiple targets according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The embodiment of the invention discloses a multitasking human behavior analysis method based on a top view angle, which is shown in fig. 1 and comprises the following steps:
1) Data acquisition and labeling:
acquiring a target top view scene picture, and labeling human body characteristic data in the target top view scene picture, wherein the human body characteristic data comprises a human body id, a human body frame, head top point position coordinates, left shoulder point position coordinates and right shoulder point position coordinates; and dividing the marked human body characteristic data into a training set and a verification set.
2) Model construction:
The method comprises constructing a multi-task human body behavior analysis model, wherein the model comprises a backbone network and a plurality of prediction heads connected to the backbone network, and the prediction heads comprise a plurality of point prediction heads, a plurality of frame prediction heads, and an appearance characterization prediction head.
In the embodiment of the invention, the multitask human behavior analysis model adopts the JDE framework overall; the backbone network can adopt the backbone and neck of the YOLOv7 network, where the YOLOv7 neck uses a PAFPN structure, an improved variant of the FPN structure.
In other embodiments, the backbone network may also employ conventional FPN network structures or other variants to obtain feature maps of different scales.
3) Model training:
The training set is input into the multitasking human behavior analysis model for model training; during training an id is assigned to each positive anchor, and the appearance characterization vector output by the appearance characterization prediction head is fed into a softmax function.
4) Model prediction:
Predicting the left and right shoulder point position coordinates of each human body in the target top-view scene picture to be detected by using the trained multitask human behavior analysis model, calculating the human body orientation based on those coordinates, and simultaneously outputting the human body frame, head top point position coordinates, and human body characterization vector for pedestrian tracking.
In the above method, the frame prediction head is used for predicting human body frame coordinates and confidences, wherein each grid cell in the frame prediction head predicts a plurality of anchors, and each anchor is used for predicting four-dimensional frame coordinates, a one-dimensional category confidence, and a one-dimensional target confidence.
In the method, each grid cell in the point prediction heads corresponds to the frame prediction head and is used for predicting the two-dimensional coordinates and one-dimensional confidence of the three human key points of the corresponding anchor; the three human body key points are the head top point, the left shoulder point, and the right shoulder point.
In the method, the frame coordinates are supervised with a CIoU loss function, the frame category confidence and target confidence with BCE loss functions, the key point confidence with a BCE loss function, the key point positions with an OKS loss function, and the appearance characterization with a CE loss function.
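The OKS (object keypoint similarity) formula itself is not spelled out in the text; the sketch below follows the COCO-style OKS definition for the three keypoints used here, where the per-keypoint falloff constants `k` and the use of frame area as the scale term are assumptions, not values from the patent:

```python
import math

def oks_loss(pred_pts, gt_pts, area, k=(0.026, 0.079, 0.079)):
    """1 - OKS over the three keypoints (head top, left shoulder, right shoulder).

    pred_pts, gt_pts: [(x, y), ...]; area: object (frame) area as the squared
    scale term; k: per-keypoint falloff constants (COCO-style, assumed here).
    """
    sims = []
    for (px, py), (gx, gy), ki in zip(pred_pts, gt_pts, k):
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        sims.append(math.exp(-d2 / (2.0 * area * ki ** 2 + 1e-9)))
    return 1.0 - sum(sims) / len(sims)

# A perfect prediction yields zero loss; any displacement raises it.
print(oks_loss([(10, 10), (5, 20), (15, 20)],
               [(10, 10), (5, 20), (15, 20)], area=400.0))  # → 0.0
```

Only keypoints with confidence above the threshold would contribute in practice; the visibility mask is omitted here for brevity.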
Example 2
The embodiment of the invention discloses a multitasking human behavior analysis system based on a top view angle, which comprises the following components: the data acquisition module is used for acquiring a target top view scene picture, marking human body characteristic data in the target top view scene picture, and constructing a training set and a verification set, wherein the human body characteristic data comprises a human body id, a human body frame, a head top point position coordinate, a left shoulder point position coordinate and a right shoulder point position coordinate.
The model construction module is used for constructing a multi-task human body behavior analysis model, the multi-task human body behavior analysis model comprises a main network and a plurality of prediction heads connected with the main network, and each prediction head comprises a plurality of point prediction heads, a plurality of frame prediction heads and an appearance representation prediction head.
The model training module is used for training the multitasking human behavior analysis model by utilizing the training set, distributing id to the positive anchor in the training process, and inputting the appearance characterization vector output by the appearance characterization pre-measuring head into the softmax function.
The model prediction module is used for predicting the left and right shoulder point position coordinates of the human body in the top view scene picture of the target to be detected by using the trained multitask human body behavior analysis model, calculating the human body orientation based on the left and right shoulder point position coordinates of the human body, and outputting the human body frame, the top head point position coordinates and the human body characterization vector for pedestrian tracking.
The specific structure of the multitasking human behavior analysis model in the system is shown in fig. 2, and the operation process is as follows:
The target scene top-view picture shown in fig. 3 is taken as input and passed through the backbone network; in this embodiment the backbone network adopts the backbone and neck of the YOLOv7 network, and the input picture yields feature maps at three scales: large, medium, and small. A frame prediction head and a point prediction head are connected to each of the three scale feature maps, wherein the frame prediction head predicts frame coordinates and confidences, and the point prediction head predicts the coordinates and confidences of the three human body key points (head top point, left shoulder point, right shoulder point).
The frame coordinates output by the three frame prediction heads are mapped into the small-scale feature map, the corresponding regions are cropped out with the RoI Align algorithm, and the cropped regions are fed into the appearance characterization prediction head to obtain the appearance characterization vectors.
In this embodiment, feature maps of different scales are output for detecting targets of different scales. The small-scale feature map has a small downsampling factor: for example, a 640x640 network input downsampled 8 times yields an 80x80 feature map, which is suited to detecting small targets. The large-scale feature map has a large downsampling factor: downsampling 32 times yields a 20x20 feature map, on which small targets are no longer resolvable and cannot be detected, so it is suited to detecting large targets. Across the three scale feature maps, the human key point coordinates and confidences can be obtained for targets of different scales.
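The grid sizes quoted above follow directly from the input resolution and the downsampling stride; a trivial sketch of the arithmetic:

```python
def grid_size(input_hw, stride):
    """Feature-map size for a given input resolution and downsampling stride."""
    h, w = input_hw
    return h // stride, w // stride

# 640x640 input, as in the embodiment: strides 8/16/32 give the grids
# used for small/medium/large targets respectively.
for stride in (8, 16, 32):
    print(stride, grid_size((640, 640), stride))
# → 8 (80, 80); 16 (40, 40); 32 (20, 20)
```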
In the embodiment of the invention, the data acquisition module labels the human body id, human body frame, head top point position coordinates, left shoulder point position coordinates, right shoulder point position coordinates, and so on in the collected target top-view scene pictures; a labeled human body frame is called GT (ground truth). During training, all prediction frames generated from the three scale feature maps are matched against the GTs as positive and negative samples: according to certain rules (for example, intersection-over-union and center point position), some prediction frames are designated positive samples (responsible for predicting a GT) and the rest are negative samples, and the loss is then computed. After training, all prediction frames the model generates from an image are filtered by confidence and NMS, and the remaining frames are the model's final output.
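The positive/negative matching rule is only sketched above ("intersection ratio and center point position"); the following is a deliberately simplified IoU-threshold version for illustration, not the exact YOLOv7 assignment strategy:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_positives(preds, gts, thr=0.5):
    """Label each prediction frame positive if it overlaps some GT above thr."""
    return [any(iou(p, g) >= thr for g in gts) for p in preds]

gts = [(0, 0, 10, 10)]
preds = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(match_positives(preds, gts))  # → [True, False]
```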
As shown in fig. 4, the human body orientation is estimated using the left shoulder point and the right shoulder point, and specific examples of use are as follows:
data acquisition and labeling: and collecting top view scene data or similar top view scene video data of the target, and sampling and labeling frames. Each picture needs to be marked with a human body frame (upper left corner coordinate, lower right corner coordinate), a head top point position, a left shoulder point position, a right shoulder point position and a person id, and a training set and a test set are manufactured.
Model construction: according to the network structure constructed in fig. 2, it should be noted that, during training, the appearance characterization prediction head needs to add a softmax function to convert into an id prediction problem after obtaining an appearance characterization vector. The frame prediction head predicts a plurality of anchors (such as three anchors) per grid, and each anchor predicts four-dimensional frame coordinates, one-dimensional category confidence and one-dimensional target confidence. Each grid of the point prediction head corresponds to the frame prediction head, and two-dimensional coordinates and one-dimensional confidence degrees of three human body key points (head top point, left shoulder point and right shoulder point) of the corresponding anchor are predicted.
Model training: the predicted human frame center point is relative to the anchor center, and the predicted length and width of the frame are standardized according to the height and width of the anchor. The human body frame may be represented by center point coordinates (cx, cy) and width and height (w, h) of the frame. The frame prediction is generally based on an anchor, the anchor is a frame with a predefined position and size, the network output (or network direct prediction) is an offset relative to the anchor, and finally the predicted coordinates of the center point of the human frame and the width and height are calculated according to the anchor and the offset.
The predicted three human key points (head top point, left shoulder point, right shoulder point) are expressed relative to the anchor center. Assigning ids to positive anchors during training avoids the problem that ids cannot be assigned when one grid cell corresponds to multiple anchors, and improves the accuracy of the appearance characterization vector. The frame coordinates are supervised with CIoU loss, the frame category confidence and target confidence with BCE loss, the key point confidence with BCE loss, the key point positions with OKS loss, and the appearance characterization with CE loss. The total loss is:
$$L_{total} = \sum_{i=1}^{M} \sum_{j=1}^{6} \frac{1}{2}\left(e^{-s_j^i} L_j^i + s_j^i\right)$$
where M is the number of scales (three here: large, medium, and small); $L_1^i, \ldots, L_6^i$ respectively denote the bounding frame loss, frame confidence loss, frame category loss, key point confidence loss, key point position loss, and appearance characterization loss at scale i; and each $s_j^i$ is the weighting parameter associated with the corresponding loss, modeled as a learnable parameter.
Prediction: after model training is complete, the softmax following the appearance characterization prediction head is removed, and the appearance characterization vector is obtained directly. A key point is considered reliable only when its confidence exceeds a confidence threshold, which may be set to 0.5. The center point of the left and right shoulder points is computed, and the vector from this center point to the right shoulder point is rotated 90 degrees counterclockwise to approximate the human body orientation.
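The orientation step above can be sketched directly; note that the 90-degree counterclockwise rotation is written here for a standard x-right/y-up plane, so in image coordinates (y pointing down) the on-screen rotation direction is mirrored:

```python
import math

def body_orientation(left_shoulder, right_shoulder):
    """Approximate body orientation from the two shoulder points.

    Takes the vector from the shoulders' midpoint to the right shoulder and
    rotates it 90 degrees counterclockwise, as described above; returns the
    angle in degrees.
    """
    lx, ly = left_shoulder
    rx, ry = right_shoulder
    mx, my = (lx + rx) / 2.0, (ly + ry) / 2.0
    vx, vy = rx - mx, ry - my
    # 90-degree counterclockwise rotation: (x, y) -> (-y, x)
    ox, oy = -vy, vx
    return math.degrees(math.atan2(oy, ox))

# Shoulders level, right shoulder to the right: facing is the +y direction.
print(body_orientation((0.0, 0.0), (2.0, 0.0)))  # → 90.0
```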
The appearance characterization prediction head natively outputs an appearance characterization vector, which can be understood as the image features of the human body inside its frame; during training this vector is converted through a softmax into a human id in order to compute the loss. At prediction time only the appearance characterization vector is needed and no loss is computed, so the softmax part used in training is dropped. The output of the multitask human behavior analysis system includes not only the orientation but also the human body frame, head top point coordinates, left and right shoulder point coordinates, and appearance characterization vector: the human body frame and head top point coordinates can be used for pedestrian positioning, the appearance characterization vector can be used to compute similarity for assigning an id to each person during tracking, the left and right shoulder points can be used to compute orientation, and the orientation can likewise assist id assignment during tracking.
The invention can detect multiple targets in a target top-view scene picture simultaneously and predict the human body orientation of each target from its left and right shoulder points. As shown in fig. 5, multi-target pedestrian tracking must distinguish different targets, and therefore must assign an id to each person in the target scene. A tracking algorithm can use the per-frame human frame detections, head top point results, orientation results, and the appearance characterization vector of each human frame provided by the invention to compare the similarity of detections between consecutive frames (including frame intersection-over-union, orientation angle similarity, and human feature vector similarity), judge which human frame in the next frame is the same person as a human frame in the previous frame, and assign the same id.
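A greedy, appearance-only version of this id assignment can be sketched as follows; the real similarity described above also mixes in frame IoU and orientation angle, which are omitted here for brevity, and the threshold is an assumed value:

```python
def cosine(a, b):
    """Cosine similarity of two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / (den + 1e-9)

def assign_ids(prev_tracks, detections, thr=0.6):
    """Greedily match each detection to the most similar unused track.

    prev_tracks: {id: embedding}; detections: list of embeddings.
    Returns one id (existing, or freshly allocated) per detection.
    """
    next_id = max(prev_tracks, default=-1) + 1
    used, out = set(), []
    for det in detections:
        best_id, best_sim = None, thr
        for tid, emb in prev_tracks.items():
            sim = cosine(det, emb)
            if tid not in used and sim >= best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:  # no track similar enough: start a new one
            best_id, next_id = next_id, next_id + 1
        used.add(best_id)
        out.append(best_id)
    return out

tracks = {0: [1.0, 0.0], 1: [0.0, 1.0]}
print(assign_ids(tracks, [[0.0, 0.9], [0.9, 0.1]]))  # → [1, 0]
```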
In the invention, the ground-truth positive samples in human detection are the labeled human frames; a predicted human frame whose IoU with a labeled human frame exceeds a certain threshold counts as a predicted true positive, and the recall rate equals the total number of predicted true positives divided by the number of labeled human frames. The detection metrics adopted by the invention are the mAP (reflecting detection precision) and mAR (reflecting detection recall) series of metrics defined by the COCO (Common Objects in COntext) dataset. AR is the maximum recall achievable when a fixed number of frames is detected per picture; mAR is the average of AR over multiple classes, and since there is only one class (person) here, mAR = AR. Different thresholds yield different mAP and mAR values, hence "series of metrics": for example, [email protected] denotes mAP computed at a threshold of 0.5, while mAP denotes the average over 10 thresholds from 0.5 to 0.95 at intervals of 0.05, and mAR is defined analogously. The higher the threshold, the stricter the overlap requirement between predicted and ground-truth frames, and the more accurate the detection must be.
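The recall computation described above can be sketched for a single image; this is a simplified stand-in for the full COCO evaluation, matching each ground-truth frame at most once:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    ua = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (ua + 1e-9)

def recall(pred_boxes, gt_boxes, thr=0.5):
    """Recall = matched GT frames / total GT frames at the given IoU threshold."""
    matched = set()
    for p in pred_boxes:
        for i, g in enumerate(gt_boxes):
            if i not in matched and iou(p, g) >= thr:
                matched.add(i)  # each GT may be matched at most once
                break
    return len(matched) / max(len(gt_boxes), 1)

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 11, 11)]  # only the first person is found
print(recall(preds, gts))  # → 0.5
```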
In the embodiment of the invention, for human body targets in the picture to be detected whose heads are not shown but whose other body parts are visible, the data collection process covers such samples, labels them as human body samples and includes them in the training process, so that the model also acquires the ability to recognize these targets.
Because the definition of a human body includes samples in which the head is not shown but part of the body is visible, and such samples are added to training, the invention, compared with models that define targets by the human head, can detect human samples whose heads are not shown but whose bodies are visible, thereby improving the person recall rate.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may refer to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and for the relevant points reference may be made to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. The multitasking human behavior analysis method based on the top view angle is characterized by comprising the following steps:
1) Data acquisition and labeling:
acquiring a target top view scene picture, and labeling human body characteristic data in the target top view scene picture, wherein the human body characteristic data comprises a human body id, a human body frame, head top point position coordinates, left shoulder point position coordinates and right shoulder point position coordinates; dividing the marked human body characteristic data into a training set and a verification set;
2) Model construction:
constructing a multi-task human body behavior analysis model, wherein the multi-task human body behavior analysis model comprises a main network and a plurality of prediction heads connected with the main network, and the prediction heads comprise a plurality of point prediction heads, a plurality of frame prediction heads and an appearance characterization prediction head;
3) Model training:
inputting a training set into the multitasking human behavior analysis model to perform model training, assigning an id to each positive anchor in the training process, and inputting the appearance characterization vector output by the appearance characterization prediction head into a softmax function;
4) Model prediction:
predicting the left and right shoulder point position coordinates of a human body in a top view scene picture of a target to be detected by using the trained multitask human body behavior analysis model, calculating the human body orientation based on the left and right shoulder point position coordinates, and simultaneously outputting the human body frame, the head top point position coordinates and the human body characterization vector for pedestrian tracking;
detecting a plurality of targets in the target top view scene picture, and predicting the human body orientation by using the left and right shoulder points of each target; distinguishing different targets requires assigning an id to each person in the target scene, which specifically comprises: comparing the similarity of the detection results of the previous and next frame images by using the human body frame detection result, the head top point result, the orientation detection result and the appearance characterization vector result of each human body frame, wherein the similarity comprises a frame intersection-over-union, an orientation angle similarity and a human feature vector similarity; determining that a human body frame in the next frame and a human body frame in the previous frame belong to the same person, and assigning the same id.
2. The method of claim 1, wherein the frame prediction head is configured to predict frame coordinates and confidence levels of a human body, wherein each grid in the frame prediction head predicts a plurality of anchor points, each anchor point being configured to predict four-dimensional frame coordinates, one-dimensional category confidence levels, and one-dimensional target confidence levels.
3. The method for analyzing the human behavior based on the top view angle according to claim 2, wherein each grid in the point prediction heads corresponds to a frame prediction head and is used for predicting the two-dimensional coordinates and one-dimensional confidence of three human body key points corresponding to an anchor point; the three human body key points comprise a head vertex, a left shoulder point and a right shoulder point.
4. The method of claim 1, wherein the frame coordinates are supervised using a CIOU loss function, the frame class confidence and the target confidence are supervised using a BCE loss function, the key point confidence is supervised using a BCE loss function, the key point location is supervised using an OKS loss function, and the appearance characterization is supervised using a CE loss function.
5. A multi-task human behavior analysis system based on a top view angle, characterized by comprising:
the data acquisition module is used for acquiring a target top view scene picture, marking human body characteristic data in the target top view scene picture, and constructing a training set and a verification set, wherein the human body characteristic data comprises a human body id, a human body frame, a head top point position coordinate, a left shoulder point position coordinate and a right shoulder point position coordinate;
the model construction module is used for constructing a multi-task human body behavior analysis model, the multi-task human body behavior analysis model comprises a main network and a plurality of prediction heads connected with the main network, and the prediction heads comprise a plurality of point prediction heads, a plurality of frame prediction heads and an appearance representation prediction head;
the model training module is used for training the multitasking human behavior analysis model by utilizing the training set, assigning an id to each positive anchor in the training process, and inputting the appearance characterization vector output by the appearance characterization prediction head into a softmax function;
the model prediction module is used for predicting the left and right shoulder point position coordinates of the human body in the top view scene picture of the target to be detected by utilizing the trained multitask human body behavior analysis model, calculating the human body orientation based on the left and right shoulder point position coordinates, and simultaneously outputting the human body frame, the head top point position coordinates and the human body characterization vector for pedestrian tracking;
detecting a plurality of targets in the target top view scene picture, and predicting the human body orientation by using the left and right shoulder points of each target; distinguishing different targets requires assigning an id to each person in the target scene; comparing the similarity of the detection results of the previous and next frame images by using the human body frame detection result, the head top point result, the orientation detection result and the appearance characterization vector result of each human body frame, wherein the similarity comprises a frame intersection-over-union, an orientation angle similarity and a human feature vector similarity; determining that a human body frame in the next frame and a human body frame in the previous frame belong to the same person, and assigning the same id.
6. The system of claim 5, wherein each grid in the frame prediction head predicts a plurality of anchor points, each anchor point being configured to predict four-dimensional frame coordinates, a one-dimensional category confidence and a one-dimensional target confidence.
7. The system for analyzing the human behavior based on the top view angle according to claim 6, wherein each grid in the point prediction heads corresponds to a frame prediction head and is used for predicting the two-dimensional coordinates and one-dimensional confidence of three human body key points corresponding to an anchor point; the three human body key points comprise a head vertex, a left shoulder point and a right shoulder point.
8. The system of claim 5, wherein the frame coordinates are supervised using a CIOU loss function, the frame class confidence and target confidence are supervised using a BCE loss function, the key point confidence is supervised using a BCE loss function, the key point location is supervised using an OKS loss function, and the appearance characterization is supervised using a CE loss function.
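The loss supervision scheme named in claims 4 and 8 (CIOU for box coordinates, BCE for confidences, OKS for key point positions, CE for the appearance characterization) can be sketched per sample as follows. This is a plain illustration of the standard forms of these losses, not the patented training code; the loss weights and any batching, masking and anchor assignment details are omitted assumptions.

```python
import math

def ciou_loss(a, b):
    """CIoU loss for boxes (x1, y1, x2, y2): 1 - IoU plus a normalized
    center-distance term and an aspect-ratio consistency term."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    iou_v = inter / (wa * ha + wb * hb - inter + 1e-9)
    cw = max(a[2], b[2]) - min(a[0], b[0])          # enclosing box width
    ch = max(a[3], b[3]) - min(a[1], b[1])          # enclosing box height
    c2 = cw ** 2 + ch ** 2 + 1e-9                   # enclosing diagonal^2
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
            + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou_v + v + 1e-9)
    return 1 - iou_v + rho2 / c2 + alpha * v

def bce_loss(p, y):
    """Binary cross entropy for a probability p and target y in {0, 1}."""
    p = min(max(p, 1e-7), 1 - 1e-7)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def oks_loss(pred_pt, gt_pt, scale, kappa):
    """1 - OKS for one key point, with object scale and a per-key-point
    falloff constant kappa (values here are illustrative)."""
    d2 = (pred_pt[0] - gt_pt[0]) ** 2 + (pred_pt[1] - gt_pt[1]) ** 2
    return 1.0 - math.exp(-d2 / (2 * scale ** 2 * kappa ** 2 + 1e-9))

def ce_loss(logits, target_idx):
    """Cross entropy over id logits for the appearance characterization."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_idx]
```

In training, the total loss would be a weighted sum of these terms over all positive anchors; a perfect prediction drives each term to zero (e.g. `ciou_loss` of a box against itself is 0).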
CN202311523413.3A 2023-11-16 2023-11-16 Multitasking human behavior analysis method and system based on top view angle Active CN117238039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311523413.3A CN117238039B (en) 2023-11-16 2023-11-16 Multitasking human behavior analysis method and system based on top view angle


Publications (2)

Publication Number Publication Date
CN117238039A CN117238039A (en) 2023-12-15
CN117238039B true CN117238039B (en) 2024-03-19

Family

ID=89088469


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6013669B1 (en) * 2016-03-28 2016-10-25 株式会社3D body Lab Human body model providing system, human body model transformation method, and computer program
CN106296720A (en) * 2015-05-12 2017-01-04 株式会社理光 Human body based on binocular camera is towards recognition methods and system
CN110348418A (en) * 2019-07-17 2019-10-18 上海商汤智能科技有限公司 Method for tracking target and device, Intelligent mobile equipment and storage medium
CN112101150A (en) * 2020-09-01 2020-12-18 北京航空航天大学 Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN114005167A (en) * 2021-11-29 2022-02-01 西安电子科技大学 Remote sight estimation method and device based on human skeleton key points
CN116363583A (en) * 2023-03-15 2023-06-30 暗物智能科技(广州)有限公司 Human body identification method, device, equipment and medium for top view angle
CN116543019A (en) * 2023-05-09 2023-08-04 重庆大学 Single-target tracking method based on accurate bounding box prediction




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant