CN113392697A - Human body action recognition method based on bag-of-words model - Google Patents

Human body action recognition method based on bag-of-words model

Info

Publication number
CN113392697A
Authority
CN
China
Prior art keywords
time
histogram
space
human body
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110451802.4A
Other languages
Chinese (zh)
Other versions
CN113392697B (en)
Inventor
黄慧
李愈
马燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Normal University
Original Assignee
Shanghai Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Normal University
Priority to CN202110451802.4A
Publication of CN113392697A
Application granted
Publication of CN113392697B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a human body action recognition method based on a bag-of-words model, which comprises the following steps: collecting human body joint point information and preprocessing it; extracting the spatial and temporal features of each action sequence from the preprocessed joint point information, and dividing the data set into a training set and a test set; encoding the time features and space features of the training set separately, and counting the encoding results to obtain a joint frequency histogram of the training set; and then training a classifier with the training set and testing it with the test set to automatically recognize the action type. The method embeds time features into the action description, considers the time features and space features of an action separately and describes each independently, and provides a stable encoding method to construct stable time and space bags of words.

Description

Human body action recognition method based on bag-of-words model
Technical Field
The invention relates to the field of motion recognition, in particular to a human body motion recognition method based on a bag-of-words model.
Background
The bag-of-words model was first used in text classification and was later gradually applied to motion recognition. Traditional visual bag-of-words action recognition mainly comprises the following steps:
firstly, preprocessing data and detecting a moving target;
then, extracting the characteristics of the action;
and finally, classifying and identifying the action based on a standard action image library established with the visual bag of words.
Action recognition with the traditional visual bag of words has the following problems: 1. The traditional bag-of-words model ignores the temporal characteristics of an action: taking the histogram as the action's description vector only reflects how often each static posture occurs within the action, not the order in which the postures are executed, so recognition of reverse-order actions is poor. 2. The initial cluster centers of K-means clustering are selected randomly, so the clustering result, and hence the visual dictionary, is unstable, and multiple experiments are needed to obtain an accurate result.
Disclosure of Invention
In view of the above defects in the prior art, the technical problems to be solved by the present invention are that the traditional bag-of-words model ignores the temporal characteristics of actions, recognizes reverse-order actions poorly, and yields unstable recognition results. The invention provides a human body action recognition method based on a bag-of-words model with a new temporal feature descriptor that embeds time features into the action description. The time features and space features of an action are considered and described separately, and a stable encoding method is provided to solve the instability of visual dictionaries constructed by conventional clustering. Finally, the classifier is improved: different classifiers distinguish actions with large differences from actions with small differences.
In order to achieve the purpose, the invention provides a human body action recognition method based on a bag-of-words model, which comprises the following steps:
collecting human body joint point information, and preprocessing the collected human body joint point information;
extracting motion sequence spatial features and time features according to the preprocessed joint point information, and dividing a data set into a training set and a test set;
symmetrically expanding the training set data;
respectively coding the time features and the space features of the training set, and counting a time feature histogram and a space feature histogram of the training set according to a coding result;
respectively coding the time features and the space features of the test set, and counting a time feature histogram and a space feature histogram of the test set according to a coding result;
obtaining a joint frequency histogram of the training set according to the time characteristic histogram and the space characteristic histogram of the training set, and obtaining a joint frequency histogram of the test set according to the time characteristic histogram of the test set and the space characteristic histogram of the test set;
and then training the classifier by using the training set, and testing the classifier by using the test set to obtain the automatic identification of the action type.
Further, collecting human body joint point information, and preprocessing the collected human body joint point information, specifically comprising:
collecting human body joint point information by using a Kinect device;
and sequentially carrying out origin normalization, direction normalization and scale normalization on the collected human body joint point information.
Further, the processed three-dimensional coordinates of the joint points are used as the space feature descriptor;
for each frame, 28 space angles of the human joints are extracted, the inter-frame differences of the 28 angles with respect to the previous frame are calculated, and the inter-frame differences of these limb key angles of the joint points are used as the temporal feature descriptor.
Further, the data set is divided into a training set and a test set with a 50/50 split, and the training set data is expanded;
the data in the training set are symmetrically flipped: taking the trunk of the human body as the central axis, the three-dimensional coordinates of the left-side and right-side joint points are exchanged, and the symmetric version of each original action is added to the training set.
Further, each frame in the action sequence is regarded as a data point, time features and space features are extracted from each data point, the time features and the space features of the training set data are respectively clustered, and a time feature label and a space feature label are obtained from each data point.
Further, performing motion coding in a training set to obtain a temporal feature histogram and a spatial feature histogram of the training set, specifically comprising the following steps:
the training set comprises a plurality of action sequences, each action sequence comprises a plurality of posture frames, the time characteristics of all the frames in the training set are used as time domain characteristics, the space characteristics are used as space domain characteristics, the time domain and the space domain are respectively clustered by utilizing a hierarchical clustering method, each cluster after clustering is regarded as a visual word, and a time bag and a space bag are respectively obtained;
after each frame in the training set obtains its time label and space label, the time labels and space labels of each action sequence are counted separately to obtain the time feature histogram and space feature histogram of each action sequence in the training set.
Further, performing motion coding in the test set to obtain a temporal feature histogram and a spatial feature histogram of the test set, specifically including the following steps:
respectively calculating the average distance from the time characteristic of each frame of data in the test set to each cluster in the time word bag, taking the cluster type with the minimum average distance as the time characteristic label of the frame, and obtaining the space characteristic label in the same way;
after each frame in the test set obtains the time label and the space label, respectively counting the time label and the space label of each action sequence to obtain a time characteristic histogram and a space characteristic histogram of each action sequence in the test set.
Further, obtaining a spatiotemporal joint histogram of the training set according to the temporal feature histogram and the spatial feature histogram of the training set, and obtaining a spatiotemporal joint histogram of the test set according to the temporal feature histogram and the spatial feature histogram of the test set; and taking the space-time joint histogram of the action sequence as a final expression vector.
Further, a hierarchical classification method built from SVM classifiers is adopted to classify the actions: a first-layer classifier first sorts similar actions into large classes, a second-layer classifier then sorts them into small classes within each large class, and the final classification result is obtained.
Technical effects
According to the human body action recognition method based on the bag-of-words model, an effective time descriptor is constructed that reflects how each main angle changes over time within an action, improving recognition accuracy on reverse-order actions; separate descriptors are constructed for the time features and space features of actions, avoiding the situation where mixed spatio-temporal features cannot highlight the temporal and spatial differences between actions; stable time and space bags of words are constructed, enhancing the stability of the classification result; and a two-layer classifier solves the problem that a single-layer classifier cannot efficiently distinguish actions with large differences and actions with small differences at the same time, improving classification efficiency.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a flow chart of a human body motion recognition method based on bag-of-words model according to a preferred embodiment of the present invention;
FIG. 2 is a diagram illustrating joint activities of a human body motion recognition method based on bag-of-words model according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating angles between major joints of a human motion recognition method based on bag-of-words model according to a preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of the main activity parts of a human body motion recognition method based on bag-of-words model according to a preferred embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating angles between main moving parts and coordinate axes of a human body motion recognition method based on a bag-of-words model according to a preferred embodiment of the present invention;
FIG. 6 is a schematic diagram of joint markers of a human body motion recognition method based on bag-of-words model according to a preferred embodiment of the present invention;
FIG. 7 is a schematic diagram of gesture encoding of a human body motion recognition method based on bag-of-words model according to a preferred embodiment of the present invention;
FIG. 8 is a time feature histogram of a human motion recognition method based on bag-of-words model according to a preferred embodiment of the present invention;
fig. 9 is a spatial feature histogram of a human body motion recognition method based on a bag-of-words model according to a preferred embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular internal procedures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
As shown in fig. 1, the present invention provides a human body motion recognition method based on a bag-of-words model, comprising the following steps:
step 100, collecting human body joint point information, preprocessing the collected human body joint point information, and constructing a time characteristic descriptor and a space characteristic descriptor by using the processed data;
step 200, extracting motion sequence spatial features and time features according to the preprocessed joint point information, and dividing a data set into a training set and a test set:
A training set and a test set are obtained with a 50/50 split. To expand the training set, its data are symmetrically flipped: taking the trunk of the human body as the central axis, the three-dimensional coordinates of the left-side and right-side joint points are exchanged, and the symmetric version of each original action is added to the training set, as sketched below.
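A minimal sketch of this flip, assuming normalized (T, 20, 3) coordinate arrays and hypothetical left/right joint index pairs (the real Kinect indices depend on the SDK's joint ordering):

```python
import numpy as np

# Hypothetical left/right joint index pairs for a 20-joint Kinect skeleton;
# the actual indices depend on the Kinect SDK joint ordering.
LEFT_RIGHT_PAIRS = [(4, 8), (5, 9), (6, 10), (7, 11),        # arm joints
                    (12, 16), (13, 17), (14, 18), (15, 19)]  # leg joints

def mirror_sequence(seq):
    """Symmetric flip of an action sequence about the trunk's central axis.

    seq: (T, 20, 3) array of normalized joint coordinates in which the
    left-right hip line is parallel to the X axis, so mirroring negates
    the x coordinate and swaps the roles of left and right joints.
    """
    flipped = seq.copy()
    flipped[..., 0] *= -1.0                   # reflect across the Y-Z plane
    for left, right in LEFT_RIGHT_PAIRS:      # exchange left/right joint roles
        flipped[:, [left, right], :] = flipped[:, [right, left], :]
    return flipped
```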
Step 300, respectively coding the time features and the space features of the training set, and counting a time feature histogram and a space feature histogram of the training set according to a coding result;
step 400, respectively encoding the time features and the space features of the test set, and counting a time feature histogram and a space feature histogram of the test set according to an encoding result;
Step 500, obtaining a space-time joint histogram of the training set from the time feature histogram and space feature histogram of the training set, and a space-time joint histogram of the test set from the time feature histogram and space feature histogram of the test set;
Step 600, training the classifier with the training set and testing it with the test set to automatically recognize the action type.
Wherein, step 100 specifically comprises:
collecting human body joint point information by using a Kinect device;
carrying out origin normalization, direction normalization and scale normalization operation on the collected human body joint point information in sequence;
The three-dimensional coordinates of the joint points after origin normalization, direction normalization and scale normalization are used as the space feature descriptor. Specifically, origin normalization converts the spatial coordinates from the coordinate system with the Kinect camera as origin to a coordinate system with the hip center point as origin; direction normalization rotates the human skeleton so that the body faces the X axis; and scale normalization adjusts each subject's skeleton to the same size. Origin, direction and scale normalization are applied to the training set and the test set with identical steps.
28 space angles representative of human body motion are selected as the limb key angles, and the inter-frame differences of these limb key angles of the joint points are used as the time feature descriptor.
There are many ways to construct a temporal feature, for example:
1. taking the interframe difference value of the three-dimensional coordinates of the joint points as a time characteristic;
2. taking the direction angle and the elevation angle of the difference vector of two adjacent frames of the joint point as time characteristics;
3. taking each main movable limb as a vector, and using the inter-frame differences of the direction angle and elevation angle of the limb vector as the time feature.
The descriptor adopted by this method, as stated above, is the inter-frame difference of the 28 limb key angles; a minimal sketch follows.
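A sketch of the adopted temporal descriptor and the accompanying spatial descriptor (the per-frame angle matrix itself is computed as illustrated later in the embodiment):

```python
import numpy as np

def temporal_descriptor(angles):
    """Inter-frame differences of the 28 limb key angles.

    angles: (T, 28) array with one row of key angles per posture frame.
    Returns a (T-1, 28) array; row t holds angles[t+1] - angles[t].
    """
    return np.diff(angles, axis=0)

def spatial_descriptor(coords):
    """Normalized joint coordinates flattened per frame: (T, J, 3) -> (T, J*3)."""
    return coords.reshape(coords.shape[0], -1)
```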
Step 300-step 500 specifically includes:
Each action sequence is treated as an ordered arrangement of posture frames, and the training set contains multiple action sequences. The time features of all frames in the training set are taken as the time-domain features and the space features as the space-domain features; the time domain and space domain are clustered separately with a hierarchical clustering method, each resulting cluster is regarded as a visual word, a time bag and a space bag are obtained respectively, and the time feature histogram and space feature histogram of each action sequence in the training set are counted;
for each frame of data in the test set, the average distance from its time feature to each cluster in the time bag is calculated, and the cluster with the minimum average distance is taken as that frame's time feature label; the space feature label is obtained in the same way;
after each frame in the test set obtains its time label and space label, the time labels and space labels of each action sequence are counted separately to obtain the time feature histogram and space feature histogram of the test set;
the time feature histogram and space feature histogram are then combined into a joint frequency histogram, which serves as the final representation vector of the action sequence. To construct the joint histogram, the time features and space features are considered independently, a time-feature frequency histogram and a space-feature frequency histogram are built separately, and the two are concatenated. The joint histogram of each action sequence in the training set and test set is taken as its final representation vector.
Step 600, specifically, an SVM classifier is adopted to construct a hierarchical classification method to classify the actions, firstly, a first-layer classifier is used to classify the similar actions into large classes, then, a second-layer classifier is used to classify the small classes based on the large classes, and finally, a classification result is obtained.
There are also a number of ways to classify the actions, for example (a sketch of the first follows this list):
1. assigning different weights to different joint points according to their characteristics: during spatio-temporal feature extraction, the three-dimensional coordinates and joint-angle differences of different joint points are multiplied by different weight values to obtain time-weighted and space-weighted features, after which the subsequent steps proceed as before;
2. grading the joint points by their influence on the action type: joint points that play a key role in distinguishing action types are recorded as first-grade joint points and the rest as second-grade joint points; a first-layer SVM classifier is trained on the spatio-temporal joint frequency histogram of the first-grade joint points to sort actions into large classes, and a second-layer SVM classifier is then trained on the spatio-temporal joint frequency histogram of the second-grade joint points to sort them into small classes.
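A sketch of the first alternative, per-joint weighting (the weight vectors are placeholders; the patent does not prescribe their values):

```python
import numpy as np

def weight_features(coords, angle_diffs, joint_weights, angle_weights):
    """Scale spatial and temporal features by per-joint importance.

    coords:        (T, J, 3) normalized joint coordinates.
    angle_diffs:   (T-1, 28) inter-frame key-angle differences.
    joint_weights: (J,) weight per joint point (e.g. larger for wrists).
    angle_weights: (28,) weight per key angle.
    """
    spatial_weighted = coords * joint_weights[None, :, None]
    temporal_weighted = angle_diffs * angle_weights[None, :]
    return spatial_weighted, temporal_weighted
```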
The following will illustrate specific steps of a human body motion recognition method based on a bag-of-words model by taking a specific example:
the data adopted by the embodiment of the invention is derived from human body joint point information acquired by Kinect equipment, and the acquired joint point data needs to be preprocessed before motion characteristics are extracted. The preprocessing operation comprises origin point normalization, direction normalization and scale normalization. After the origin point normalization processing, the origin point of the human body skeleton is converted to the center of the hip, the connecting line of the left hip and the right hip is parallel to the X axis, and the length of each part of each human body skeleton is scaled to be the same as the length of the reference size. In the scale normalization, the embodiment proposes a self-defined reference size, that is, the average value of the corresponding parts of all the experimental subjects is used as the length of the reference size of the limb part. After preprocessing, dividing the data set into a training set and a test set, dividing by adopting a five-fifth strategy, symmetrically turning the actions of the training set, and bringing the symmetrical actions of the actions in the training set into the training set. And respectively performing feature extraction, action coding and combined frequency histogram construction in a training set and a test set, then training a classifier by using training set data, and testing by using test set data to realize automatic identification of action types. The method comprises the following concrete steps:
1. After data normalization, the normalized three-dimensional coordinates of the 20 joint points are used as the space feature descriptor;
the calculation formula of the coordinates of each joint point after the coordinate system conversion is as follows:
P̂_t^i = P_t^i - P_t^0

where P_t^i(x_t, y_t, z_t) is the original spatial position of joint point i in frame t, with x_t the abscissa, y_t the ordinate and z_t the distance of the point from the camera; P_t^0(x_t, y_t, z_t) is the spatial position of the hip center in frame t; and P̂_t^i is the spatial position of the joint point after origin normalization.
2. A schematic diagram of human joint movement is shown in fig. 2. According to the joint movement characteristics, the joint points with larger rotation amplitudes among the shoulder, elbow, wrist, hip and knee are selected as the main joint points of human movement, and the limbs supporting the movement of these joints are taken as the main movable limbs. Then, based on the selected main joint points and limb parts, 28 distinctive angles are constructed as the action key angles, divided into the following two types:
1) The four main movable joint points (left elbow, right elbow, left knee and right knee) form 4 angles with their adjacent joint points; the selected angles are shown in fig. 3.
2) The 8 main movable parts (the left and right upper arms, forearms, thighs and calves) each form included angles with the three coordinate axes, giving 24 angles. The 8 selected parts are shown in fig. 4; taking the right arm as an example, fig. 5 shows the angles formed between two main movable parts and the coordinate axes.
For convenience of description, the joint points are labeled with symbols; the labeling scheme is shown in fig. 6, and the 28 limb key angles expressed in this notation are listed in Table 1. The inter-frame differences of the 28 limb key angles serve as the time feature descriptor of each posture in the action sequence.
TABLE 1: the 28 limb key angles (rendered as images in the original publication and not reproduced here).
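A sketch of the two angle types (joint identities omitted; the 4 joint angles plus 8 limbs × 3 axes = 24 axis angles give the 28 key angles):

```python
import numpy as np

AXES = np.eye(3)   # unit vectors of the X, Y and Z coordinate axes

def joint_angle(a, b, c):
    """Angle at joint b between the bone vectors b->a and b->c (radians)."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def limb_axis_angles(p_start, p_end):
    """Included angles between the limb vector and the three coordinate axes."""
    v = p_end - p_start
    cos = AXES @ v / (np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))
```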
3. Each frame of an action sequence is regarded as a data point, and the time feature and space feature of each data point are obtained as described above. First, a stable hierarchical clustering method clusters the time features and space features of the training set data separately, giving each data point a time feature label and a space feature label. In the time domain and space domain of the training set, each resulting cluster is regarded as a visual word, yielding a time bag of words and a space bag of words respectively. Then, for each data point in the test set, the average distance between its time feature and each cluster in the time bag is computed, and the feature is encoded to the cluster with the minimum average distance; the space bag is encoded in the same way. Each static posture frame in the test set thus obtains a time label and a space label. The action encoding is illustrated in fig. 7; a sketch of the bag construction and encoding follows.
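The sketch below uses scikit-learn's agglomerative (hierarchical) clustering as a stand-in for the stable hierarchical clustering named above; this library choice is an assumption, since the patent does not name an implementation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def build_bag(train_feats, n_words):
    """Cluster training-frame features (an (N, D) array) into visual words.

    Agglomerative clustering is deterministic, avoiding the unstable random
    initialization of K-means. Returns per-frame labels and the member
    features of each cluster (needed for average-distance encoding).
    """
    labels = AgglomerativeClustering(n_clusters=n_words).fit_predict(train_feats)
    clusters = [train_feats[labels == k] for k in range(n_words)]
    return labels, clusters

def encode_frame(feat, clusters):
    """Label a test frame with the word whose members are closest on average."""
    avg_dist = [np.linalg.norm(members - feat, axis=1).mean()
                for members in clusters]
    return int(np.argmin(avg_dist))
```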
4. The occurrence frequency of each code word in the time bag and the space bag is counted for the training set and test set respectively, and a time feature histogram and a space feature histogram are constructed for each action sequence, as shown in fig. 8 and 9. The two frequency histograms are then combined into the final representation vector of the action sequence (a sketch follows);
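A sketch of the histogram construction and concatenation (normalizing by sequence length is an assumption, added so that sequences of different durations remain comparable):

```python
import numpy as np

def sequence_histogram(frame_labels, n_words):
    """Frequency histogram of word labels over one action sequence."""
    hist = np.bincount(np.asarray(frame_labels), minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)

def joint_histogram(time_labels, space_labels, n_time_words, n_space_words):
    """Concatenate the time and space histograms into the final vector."""
    return np.concatenate([sequence_histogram(time_labels, n_time_words),
                           sequence_histogram(space_labels, n_space_words)])
```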
5. A hierarchical classification method built from SVM classifiers categorizes the actions. First, similar actions are grouped into the same large class and each action sequence is given a first-level class label; the sequences within each large class are then subdivided and given second-level class labels. The first-layer SVM classifier is trained with the joint frequency histograms of the training set as features and the first-level classes as labels. A second-layer classifier is trained within each large class, again with the joint frequency histograms as features and the second-level classes as labels. At test time, the first-layer classifier yields the first-level class of a test sequence, and the corresponding second-layer classifier then yields its specific class, as sketched below.
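A sketch of the two-layer classification with scikit-learn SVCs (the RBF kernel and the requirement that each large class contain at least two small classes are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def train_hierarchical(X, coarse_y, fine_y):
    """First-layer SVM for large classes, one second-layer SVM per class.

    X: (N, D) joint frequency histograms; coarse_y / fine_y: (N,) labels.
    Assumes every coarse class contains at least two distinct fine classes.
    """
    top = SVC(kernel='rbf').fit(X, coarse_y)
    sub = {}
    for c in np.unique(coarse_y):
        mask = coarse_y == c                      # sequences in this large class
        sub[c] = SVC(kernel='rbf').fit(X[mask], fine_y[mask])
    return top, sub

def predict_hierarchical(top, sub, X):
    """Route each sample through its predicted large class's sub-classifier."""
    coarse = top.predict(X)
    return np.array([sub[c].predict(x[None, :])[0] for c, x in zip(coarse, X)])
```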
In summary, the human body action recognition method based on the bag-of-words model constructs an effective time descriptor that reflects how each main angle changes over time within an action, improving recognition accuracy on reverse-order actions; it constructs separate descriptors for the time features and space features of actions, avoiding the situation where mixed spatio-temporal features cannot highlight the temporal and spatial differences between actions; it constructs stable time and space bags of words, enhancing the stability of the classification result; and its two-layer classifier solves the problem that a single-layer classifier cannot efficiently distinguish actions with large differences and actions with small differences at the same time, improving classification efficiency.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (9)

1. A human body action recognition method based on a bag-of-words model comprises the following steps:
collecting human body joint point information, and preprocessing the collected human body joint point information;
extracting motion sequence spatial features and time features according to the preprocessed joint point information, and dividing a data set into a training set and a test set;
respectively coding the time features and the space features of the training set, and counting a time feature histogram and a space feature histogram of the training set according to a coding result;
respectively coding the time features and the space features of the test set, and counting a time feature histogram and a space feature histogram of the test set according to a coding result;
obtaining a joint frequency histogram of the training set according to the time characteristic histogram and the space characteristic histogram of the training set, and obtaining a joint frequency histogram of the test set according to the time characteristic histogram of the test set and the space characteristic histogram of the test set;
and then training the classifier by using the training set, and testing the classifier by using the test set to obtain the automatic identification of the action type.
2. The human body motion recognition method based on the bag-of-words model as claimed in claim 1, wherein collecting human body joint point information, and preprocessing the collected human body joint point information specifically comprises:
collecting human body joint point information by using a Kinect device;
and sequentially carrying out origin normalization, direction normalization and scale normalization on the collected human body joint point information.
3. The human body motion recognition method based on the bag-of-words model as claimed in claim 2, characterized in that the processed three-dimensional coordinates of the joint points are used as a spatial feature descriptor;
for each frame, 28 space angles of the human joints are extracted, the inter-frame differences of the 28 angles with respect to the previous frame are calculated, and the inter-frame differences of these limb key angles of the joint points are used as the temporal feature descriptor.
4. The human body motion recognition method based on the bag-of-words model as claimed in claim 3, characterized in that the data set is divided into a training set and a test set with a 50/50 split, and the training set data is expanded;
the data in the training set are symmetrically flipped: taking the trunk of the human body as the central axis, the three-dimensional coordinates of the left-side and right-side joint points are exchanged, and the symmetric version of each original action is added to the training set.
5. The bag-of-words model-based human motion recognition method as claimed in claim 4, wherein each frame in the motion sequence is regarded as a data point, and each data point extracts a temporal feature and a spatial feature; and clustering the time characteristic and the spatial characteristic of the training set data respectively, and obtaining a time characteristic label and a spatial characteristic label for each data point.
6. The human body motion recognition method based on the bag-of-words model as claimed in claim 5, wherein constructing the temporal feature histogram and the spatial feature histogram of the training set specifically comprises the following steps:
the training set comprises a plurality of action sequences, each action sequence comprises a plurality of posture frames, the time characteristics of all the frames in the training set are used as time domain characteristics, the space characteristics are used as space domain characteristics, the time domain and the space domain are respectively clustered by utilizing a hierarchical clustering method, each cluster after clustering is regarded as a visual word, and a time bag and a space bag are respectively obtained;
after each frame in the training set obtains its time label and space label, the time labels and space labels of each action sequence are counted separately to obtain the time feature histogram and space feature histogram of each action sequence in the training set.
7. The human body motion recognition method based on the bag-of-words model as claimed in claim 6, wherein obtaining the time feature histogram and the spatial feature histogram of the test set specifically comprises the following steps:
respectively calculating the average distance from the time characteristic of each frame of data in the test set to each cluster in the time word bag, taking the cluster type with the minimum average distance as the frame time characteristic label, and obtaining a space characteristic label in the same way;
after each frame in the test set obtains the time label and the space label, respectively counting the time label and the space label of each action sequence to obtain a time characteristic histogram and a space characteristic histogram of each action sequence in the test set.
8. The human body motion recognition method based on the bag-of-words model as claimed in claim 7, wherein a spatiotemporal joint histogram of the training set is obtained according to the temporal feature histogram and the spatial feature histogram of the training set, and a spatiotemporal joint histogram of the test set is obtained according to the temporal feature histogram and the spatial feature histogram of the test set; and taking the space-time joint histogram of the action sequence as a final expression vector.
9. The human body motion recognition method based on the bag-of-words model as claimed in claim 8, wherein a hierarchical classification method is constructed with SVM classifiers to classify the actions: a first-layer classifier first sorts similar actions into large classes, a second-layer classifier then sorts them into small classes within each large class, and the final classification result is obtained.
CN202110451802.4A 2021-04-26 2021-04-26 Human body action recognition method based on bag-of-words model Active CN113392697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110451802.4A CN113392697B (en) 2021-04-26 2021-04-26 Human body action recognition method based on bag-of-words model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110451802.4A CN113392697B (en) 2021-04-26 2021-04-26 Human body action recognition method based on bag-of-words model

Publications (2)

Publication Number Publication Date
CN113392697A 2021-09-14
CN113392697B CN113392697B (en) 2024-07-09

Family

ID=77617573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110451802.4A Active CN113392697B (en) 2021-04-26 2021-04-26 Human body action recognition method based on bag-of-words model

Country Status (1)

Country Link
CN (1) CN113392697B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226713A (en) * 2013-05-16 2013-07-31 中国科学院自动化研究所 Multi-view behavior recognition method
WO2014200437A1 (en) * 2013-06-12 2014-12-18 Agency For Science, Technology And Research Method and system for human motion recognition
US20160148391A1 (en) * 2013-06-12 2016-05-26 Agency For Science, Technology And Research Method and system for human motion recognition
CN105825240A (en) * 2016-04-07 2016-08-03 浙江工业大学 Behavior identification method based on AP cluster bag of words modeling
CN106056043A (en) * 2016-05-19 2016-10-26 中国科学院自动化研究所 Animal behavior identification method and apparatus based on transfer learning
CN106840166A (en) * 2017-02-15 2017-06-13 北京大学深圳研究生院 A kind of robot localization and air navigation aid based on bag of words woodlot model
CN107203745A (en) * 2017-05-11 2017-09-26 天津大学 A kind of across visual angle action identification method based on cross-domain study
CN109508684A (en) * 2018-11-21 2019-03-22 中山大学 A kind of method of Human bodys' response in video
CN110084211A (en) * 2019-04-30 2019-08-02 苏州大学 A kind of action identification method
CN111914798A (en) * 2020-08-17 2020-11-10 四川大学 Human body behavior identification method based on skeletal joint point data

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
HONG LIU等: "Sequential Bag-of-Words model for human action classification", 《CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY》, vol. 1, no. 2, 21 October 2016 (2016-10-21), pages 125 - 136 *
PARUL SHUKLA等: "Bag-of-Features based Activity Classification using Body-joints Data", 《IN PROCEEDINGS OF THE 10TH INTERNATIONAL CONFERENCE ON COMPUTER VISION THEORY AND APPLICATIONS》, 31 December 2015 (2015-12-31), pages 314 - 322 *
XIAOJIANG PENG等: "Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice", 《COMPUTER VISION AND IMAGE UNDERSTANDING》, vol. 150, 23 March 2016 (2016-03-23), pages 109 - 125, XP029628346, DOI: 10.1016/j.cviu.2016.03.013 *
YU LI等: "Human Action Recognition Method Based on Bag-of-Words Model", 《2023 IEEE 11TH JOINT INTERNATIONAL INFORMATION TECHNOLOGY AND ARTIFICIAL INTELLIGENCE CONFERENCE (ITAIC)》, 1 February 2024 (2024-02-01), pages 1871 - 1879 *
李建新: "Behavior recognition algorithm based on the co-occurrence relationship of local features", Journal of Hefei University of Technology (Natural Science Edition), vol. 43, no. 11, 28 November 2020 (2020-11-28), pages 1500-1505 *
李愈 et al.: "Human body action recognition method based on the bag-of-words model", Computer Applications and Software, vol. 40, no. 11, 12 November 2023 (2023-11-12), pages 170-175 *
李盛楠: "Research on human behavior recognition based on Kinect and the bag-of-words model", China Masters' Theses Full-text Database: Information Science and Technology, no. 2020, 15 July 2020 (2020-07-15), pages 138-1087 *
柳似霖 et al.: "Key-frame selection method for human action recognition based on a local-feature bag-of-words model", Journal of Applied Optics, vol. 40, no. 2, 15 March 2019 (2019-03-15), pages 265-270 *
邵延华: "Research on human behavior recognition based on computer vision", China Doctoral Dissertations Full-text Database: Information Science and Technology, no. 2016, 15 January 2016 (2016-01-15), pages 138-154 *

Also Published As

Publication number Publication date
CN113392697B (en) 2024-07-09

Similar Documents

Publication Publication Date Title
CN108052896B (en) Human body behavior identification method based on convolutional neural network and support vector machine
Ribeiro et al. Human activity recognition from video: modeling, feature selection and classification architecture
Jalal et al. Human daily activity recognition with joints plus body features representation using Kinect sensor
D’Orazio et al. Recent trends in gesture recognition: how depth data has improved classical approaches
KR20180080081A (en) Method and system for robust face dectection in wild environment based on cnn
Potdar et al. A convolutional neural network based live object recognition system as blind aid
CN105469050B (en) Video behavior recognition methods based on local space time's feature description and pyramid words tree
Haque et al. Two-handed bangla sign language recognition using principal component analysis (PCA) and KNN algorithm
Chen et al. TriViews: A general framework to use 3D depth data effectively for action recognition
CN111914643A (en) Human body action recognition method based on skeleton key point detection
CN105095880A (en) LGBP encoding-based finger multi-modal feature fusion method
Waheed et al. A novel deep learning model for understanding two-person interactions using depth sensors
Chan et al. A 3-D-point-cloud system for human-pose estimation
CN103577804A (en) Abnormal human behavior identification method based on SIFT flow and hidden conditional random fields
Chakraborty et al. View-invariant human-body detection with extension to human action recognition using component-wise HMM of body parts
Batool et al. Fundamental recognition of ADL assessments using machine learning engineering
Wali et al. Incremental learning approach for events detection from large video dataset
Cho et al. Human action recognition system based on skeleton data
Hachaj et al. Human actions modelling and recognition in low-dimensional feature space
Ma et al. Sports competition assistant system based on fuzzy big data and health exercise recognition algorithm
CN113392697A (en) Human body action recognition method based on bag-of-words model
Ramanathan et al. Combining pose-invariant kinematic features and object context features for rgb-d action recognition
CN114943873A (en) Method and device for classifying abnormal behaviors of construction site personnel
Chen et al. A Human Activity Recognition Approach Based on Skeleton Extraction and Image Reconstruction
Li et al. Pedestrian detection based on clustered poselet models and hierarchical and–or grammar

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant