CN114187650A - Action recognition method and device, electronic equipment and storage medium - Google Patents

Action recognition method and device, electronic equipment and storage medium

Info

Publication number: CN114187650A
Application number: CN202111276949.0A
Authority: CN (China)
Prior art keywords: action, image, loss, key, part detection
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 张纯阳
Current assignee: Lumi United Technology Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Lumi United Technology Co Ltd
Application filed by Lumi United Technology Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides an action recognition method, which comprises the following steps: acquiring an image to be recognized; extracting a target image feature from the image to be recognized, wherein the target image feature associates a key part with which a target object performs an action and an associated part that has an association relationship with the key part; detecting, according to the target image feature, the part with which the target object performs the action in the image to be recognized, to obtain a part detection result; and recognizing, according to the part detection result, the action performed by the target object in the image to be recognized, to obtain an action recognition result. The method provided by the present application can effectively alleviate the problem of low accuracy of action recognition in the prior art.

Description

Action recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to an action recognition method and apparatus, an electronic device, and a storage medium.
Background
Currently, action recognition is mainly applied in interaction and monitoring scenarios such as public places, hospitals, security and device control, so as to provide a more intuitive, natural and familiar way of interacting or monitoring. It can be understood that, especially in some interactive scenarios, for example a home scenario, more complex actions may be designed for the home user in order to better control smart home devices, so as to distinguish between different smart home devices or to control the same smart home device to perform different operations. In other interactive scenarios, such as public places like museums and exhibition halls, the designed actions are mostly simple and intuitive because visitors are untrained users.
Existing action recognition usually adopts a detection-plus-classification approach. However, whether the action is complex or simple and intuitive, the key part performing the action (such as a hand) is easily misdetected, especially under weak light or when objects similar to the key part (such as hand toys) are present. The subsequent action classification, based on this erroneous detection result, then tends to misrecognize the action, and false triggering finally occurs in the interactive scenario.
Therefore, existing action recognition still suffers from low accuracy.
Disclosure of Invention
Embodiments of the present application provide an action recognition method and apparatus, an electronic device, and a storage medium, which can solve the problem of low accuracy of action recognition in the related art.
The technical solutions adopted by the present application are as follows:
according to an aspect of an embodiment of the present application, a motion recognition method includes: acquiring an image to be identified; extracting target image features from the image to be recognized, wherein the target image features are associated with key parts of target objects for executing actions and associated parts having association relations with the key parts; detecting the part of the target object executing the action in the image to be recognized according to the characteristics of the target image to obtain a part detection result; and identifying the action executed by the target object in the image to be identified according to the part detection result to obtain an action identification result.
According to an aspect of an embodiment of the present application, a motion recognition apparatus includes: the image acquisition module is used for acquiring an image to be identified; the feature extraction module is used for extracting target image features from the image to be recognized, wherein the target image features are associated with key parts of target objects for executing actions and associated parts having association relations with the key parts; the part detection module is used for detecting the part of the target object executing the action in the image to be recognized according to the characteristics of the target image to obtain a part detection result; and the action recognition module is used for recognizing the action executed by the target object in the image to be recognized according to the part detection result to obtain an action recognition result.
In one exemplary embodiment, the site detection module includes: the key part detection unit is used for detecting a key part of the action executed by the target object in the image to be recognized according to the characteristics of the target image to obtain a key part detection result; the associated part detection unit is used for detecting an associated part which has an association relation with the key part in the image to be identified according to the characteristics of the target image to obtain an associated part detection result; a part detection result determination unit configured to determine the part detection result based on the key part detection result and the associated part detection result.
In one exemplary embodiment, the association site detecting unit includes: the position determining subunit is used for determining the relative position of the associated part in the image to be identified relative to the key part according to the target image feature; a relevant part determining subunit, configured to determine the relevant part detection result based on a relative position of the relevant part with respect to the key part.
In one exemplary embodiment, the site detection module includes: the position positioning unit is used for positioning the region of the key part of the target object executing the action in the image to be identified based on the target image characteristics to obtain the position of the key part of the target object executing the action; a key part determining unit, configured to determine a key part detection result according to a position of a key part at which the target object performs an action; and the result determining unit is used for determining the part detection result according to the key part detection result.
In one exemplary embodiment, the action recognition module includes: the region determining unit is used for determining the region of the part of the target object executing the action in the image to be recognized according to the part detection result; the type prediction unit is used for performing motion type prediction on the motion executed by the target object in the image to be recognized based on the region of the part of the target object executing the motion in the image to be recognized to obtain the motion type of the target object executing the motion; and the action recognition result determining unit is used for determining the action recognition result according to the action type of the action executed by the target object.
In an exemplary embodiment, the action recognition result is obtained by calling an action recognition model; the motion recognition model is obtained through a training module; the training module is used for training an initial machine learning model based on at least one sample image; the training module comprises at least: a first supervised training unit, configured to perform supervised training on the machine learning model for key part detection and associated part detection based on a first label and a second label carried by the sample image in a part detection stage, where the first label is used to indicate a position of a key part of the sample object performing an action, and the second label is used to indicate a relative position of the associated part with respect to the key part.
In an exemplary embodiment, the first supervised training unit comprises: the characteristic extraction subunit is used for extracting the characteristics of the sample image from the sample image; the first prediction subunit is used for predicting the position of a key part of the sample object executing the action in the sample image according to the sample image characteristics to obtain first prediction information, and predicting the relative position of a related part in the sample image relative to the key part to obtain second prediction information; a first loss determination subunit operable to determine a first loss based on a difference between the first label and the first prediction information, and determine a second loss based on a difference between the second label and the second prediction information; and a first parameter adjusting subunit, configured to adjust a parameter of the machine learning model corresponding to the part detection stage according to the first loss and the second loss.
In an exemplary embodiment, the training module further comprises: a second supervised training unit, configured to, in an action classification stage, perform supervised training on the machine learning model for action recognition based on a third label carried by the sample image and a supervised training result in the part detection stage, where the third label is used to indicate an action category of the sample object for performing an action; the second supervised training unit comprises: the second prediction subunit is used for predicting the action type of the sample object in the sample image for executing the action according to the first prediction information and the second prediction information in the supervised training result to obtain third prediction information; a second loss determination subunit for determining a third loss based on a difference between the third tag and the third prediction information; a second parameter adjusting subunit, configured to adjust a parameter of the machine learning model corresponding to the action classification phase according to the third loss, and/or the first loss and the second loss in the supervised training result; and the model generation subunit is used for obtaining the machine learning model after the training is finished as the action recognition model when the training stopping condition is met.
In an exemplary embodiment, the apparatus further comprises: a loss feedback module for feeding back the third loss to the part detection stage; the first parameter adjusting subunit includes: a joint adjusting subunit, configured to adjust a parameter of the machine learning model corresponding to the part detection stage according to the first loss, the second loss, and the third loss.
In an exemplary embodiment, the first prediction information is obtained by prediction of a key part detection branch in the machine learning model, and the second prediction information is obtained by prediction of an associated part detection branch in the machine learning model; the joint adjusting subunit includes: a first branch parameter adjusting subunit, configured to adjust a parameter of the key part detection branch according to the first loss, the second loss, and the third loss; and a second branch parameter adjusting subunit, configured to adjust a parameter of the associated part detection branch according to the second loss and the third loss.
According to an aspect of an embodiment of the present application, an electronic device includes at least one processor, at least one memory and at least one communication bus, wherein the memory stores a computer program and the processor reads the computer program from the memory through the communication bus; when executed by the processor, the computer program implements the action recognition method as described above.
According to an aspect of an embodiment of the present application, a storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the action recognition method as described above.
According to an aspect of an embodiment of the present application, a computer program product includes a computer program stored in a storage medium; a processor of a computer device reads the computer program from the storage medium and executes it, so that the computer device implements the training method of the action recognition model as described above.
The beneficial effects brought by the technical solutions provided by the present application are as follows:
Based on the acquired image to be recognized, a target image feature is extracted from the image to be recognized. Because the target image feature closely associates the key part with which the target object performs the action and the associated part that has an association relationship with the key part, the precision of the part detection result obtained by detecting, based on the target image feature, the part with which the target object performs the action in the image to be recognized is improved, and accordingly the precision of the action recognition result obtained by recognizing, based on the part detection result, the action performed by the target object in the image to be recognized is also improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic illustration of an implementation environment according to the present application;
FIG. 2 is a flow diagram illustrating a method of motion recognition in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of training a motion recognition model in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram of a human joint point shown in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a training process and a prediction process of a motion recognition model, according to an exemplary embodiment;
FIG. 6 is a diagram illustrating one embodiment of step 250 in the corresponding embodiment of FIG. 2;
FIG. 7 is a diagram illustrating one embodiment of step 270 of the corresponding embodiment of FIG. 2;
FIG. 8 is a block diagram illustrating a motion recognition device in accordance with an exemplary embodiment;
FIG. 9 is a diagram illustrating a hardware architecture of a server in accordance with an illustrative embodiment;
FIG. 10 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The following is a description and explanation of several terms involved in the present application:
Action recognition: typically involves two main tasks, part detection and action classification. Part detection means determining at which position (or in which area) of an image a key part performing an action may appear, and framing the area of the key part in the image with a locating box (abbreviated as a Bbox). Action classification means determining which category the action performed by the key part framed in the image (or area) belongs to.
Deep learning model: a machine learning model obtained by training a deep neural network with supervised and/or unsupervised learning methods, where a deep neural network is a neural network structure comprising multiple hidden layers. With the great progress of deep learning on image classification tasks, current mainstream recognition algorithms are mainly based on deep learning models.
Precision: a statistic computed from the perspective of the prediction results; it measures how many of the samples predicted as positive are actually positive, that is, the proportion "found correctly". It is calculated as: (samples predicted positive and actually positive) / (all samples predicted positive). The larger the value, the better; 100% is ideal.
Recall: a statistic computed from the perspective of the real sample set; it measures how many of all actual positive samples are found, that is, the proportion "found completely". It is calculated as: (samples predicted positive and actually positive) / (all actual positive samples). The larger the value, the better; 100% is ideal.
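Expressed as formulas (using TP, FP and FN for true positives, false positives and false negatives, notation introduced here for clarity rather than taken from the original text), these two statistics are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}
```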
As mentioned above, action recognition is mainly applied in interaction and monitoring scenarios such as public places, hospitals, security and device control, so as to provide a more intuitive, natural and familiar way of interacting or monitoring. For example, in a monitoring scenario formed in a public place such as a museum or an exhibition hall, it is monitored whether a visitor performs a prohibited action such as smoking or photographing; in an interactive scenario such as a smart home scenario, actions of a home user related to smart home devices, such as raising the head, raising a hand or squatting, are recognized.
In order to enable actions in these interaction/monitoring scenarios to be robustly identified and to minimize false alarms, existing action identification usually employs a detection + classification method. For example, a part detection model and an action classification model are adopted to respectively detect key parts for executing actions in an interaction/monitoring scene and identify action categories for executing the actions. Specifically, in the training process, a part detection model and a motion classification model are trained by using the image labeled with the region where the key part is located and the motion category. In the prediction process, a part detection model detects a key part for executing the action, and the action classification stage is carried out based on the detected key part, so that the action type of the action executed by the key part is output, and the whole action recognition process is completed.
In this action recognition process, whether for complex actions or for simple and intuitive actions, the action recognition function can be realized, but the key part is easily misdetected under weak light or when objects similar to the key part (such as hand toys resembling a hand) are present. As a result, the action classification based on the erroneous detection result tends to misrecognize the action, and false triggering finally occurs in the interactive scenario.
From the above, the existing action recognition still has the limitation of low accuracy.
To overcome the above problems, the present application provides an action recognition method that can effectively reduce the false detection rate of non-key parts and thereby avoid false triggering in interactive scenarios. Accordingly, the action recognition method is suitable for an action recognition apparatus, which can be deployed in an electronic device with a von Neumann architecture, for example a personal computer (PC), a notebook computer, a server, and the like.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment involved in the action recognition method. In one embodiment, the implementation environment is suitable for a smart home scenario; as shown in FIG. 1(a), the implementation environment includes a gateway 110, an image capturing device 130 deployed in the gateway 110, a server 150, and a number of control devices 170.
Specifically, the image capturing device 130 may be a video camera, a camera, or other electronic devices such as a smart phone and a tablet computer configured with a camera, which is not limited herein.
The server 150 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. For example, in the present implementation environment, the server 150 provides an action recognition service.
The image capturing device 130 is disposed in the gateway 110, and communicates with the gateway 110 through a communication module configured in itself, so as to implement interaction with the gateway 110. In one embodiment, image capture device 130 is deployed in gateway 110 by accessing gateway 110 through a local area network. The process of accessing the gateway 110 by the image capturing device 130 through the local area network includes first establishing a local area network by the gateway 110, and the image capturing device 130 accesses the local area network established by the gateway 110 by connecting to the gateway 110. The local area network includes: bluetooth, WIFI, ZIGBEE, or LORA.
The server 150 establishes a communication connection with the gateway 110 in advance, and in one embodiment, the server 150 establishes a communication connection with the gateway 110 through 2G/3G/4G/5G, WIFI, etc. Data transmission with the gateway 110 is realized through this communication connection. For example, the transmitted data includes at least an image to be recognized and the like.
Similar to the image capturing device 130, the control device 170 is also disposed in the gateway 110, and communicates with the gateway 110 through its own configured communication module, and is controlled by the gateway 110. In one embodiment, control device 170 accesses gateway 110 via a local area network and is thus deployed in gateway 110. The process of accessing the gateway 110 by the control device 170 through the local area network includes first establishing a local area network by the gateway 110, and the control device 170 connecting to the gateway 110 to access the local area network established by the gateway 110. The local area network includes: bluetooth, ZIGBEE, or LORA.
The control device 170 includes, but is not limited to: a smart television, a smart printer, a smart fax machine, a smart speaker, a smart camera, a smart air conditioner, a smart desk lamp, a smart door lock, a human body sensor, a door and window sensor, a temperature and humidity sensor, a water leak sensor, a vision sensor, a natural gas alarm, a smoke alarm, a buzzer alarm, a wall switch, a wall socket, a wireless switch, a wireless wall switch, a cube controller, a curtain motor, and the like, which are not specifically limited here.
In this implementation environment, if a home user wants the control device 170 to perform a related operation, the user may perform the action corresponding to that operation in the smart home scene. The image capturing device 130 then captures the action performed by the home user to generate an image to be recognized, and sends the image to be recognized to the server 150 through the gateway 110, so that the server 150 provides the action recognition service.
After receiving the image to be recognized, the server 150 can detect and recognize the image to be recognized, obtain a motion recognition result, and send the motion recognition result to the gateway 110.
After the server 150 sends the action recognition result, the gateway 110 forwards the action recognition result to the corresponding control device 170, so as to trigger the control device 170 to perform corresponding operations according to the indication of the action recognition result, thereby implementing simple and efficient device control.
Of course, in other embodiments, the action recognition result may be directly transmitted from the server 150 to the control device 170, and this is not a specific limitation.
Unlike in FIG. 1(a), in FIG. 1(b) the implementation environment further includes a user terminal 190. A client having a device control function may run on the user terminal 190, which may be an electronic device such as a desktop computer, a notebook computer, a tablet computer or a smart phone, and is not limited here. The client having the device control function may take the form of an application program or a web page; accordingly, the user interface through which the client controls devices may take the form of a program window or a web page, which is not limited here either.
A communication connection is pre-established between the user terminal 190 and the gateway 110, and data transmission between the user terminal 190 and the gateway 110 is realized through the communication connection. For example, the transmitted data may be the result of execution of the control device 170 or the like.
In this implementation environment, after the control device 170 performs the relevant operation, the execution result thereof may be fed back to the gateway 110. At this time, the gateway 110 can send the execution result to the user terminal 190 for the user terminal 190 to display the execution result, so as to provide a natural and more familiar interaction manner for the home user.
It should be noted that, in one embodiment, the user terminal 190 may also serve as the image capturing device 130. In this case, a home user may capture the image to be recognized with the camera module configured in the user terminal 190 and initiate an action recognition request to the server 150, so that the server 150 provides the action recognition service, thereby implementing trigger control of the control device 170.
In another embodiment, action recognition may be completed through an action recognition model, which may be obtained by offline training on the server 150 and issued to the user terminal 190. In this case, the user terminal 190 serving as the image capturing device also has the action recognition capability, so that an action recognition service can be provided for the captured image to be recognized, thereby implementing trigger control of the control device 170.
Referring to fig. 2, in an exemplary embodiment, a motion recognition method is provided, which is described by taking an example of applying the method to an electronic device. The electronic device may specifically be the server in fig. 1, and may also be the user terminal in fig. 1.
The motion recognition method may include the steps of:
step 210, acquiring an image to be identified.
The image to be recognized may be a dynamic image or a static image. A dynamic image is a video comprising a plurality of frames; in contrast, a static image is a photograph or picture comprising a single frame. Based on this, the action recognition in this embodiment may be performed on a video comprising a plurality of frames, or on a photograph or picture comprising a single frame.
The image to be recognized is generated by shooting the environment where the target object is located. With the image acquisition device arranged around the environment where the target object is located, the image of the environment can be shot and acquired, and an image related to the target object, for example, an image of the target object executing the action can be acquired. The environment of the target object may be a relatively private scene such as a bedroom, a living room, a hotel room, or a relatively open scene such as a movie theater, a library, a conference room, which is not limited in this embodiment.
In addition, the image capturing device may be a camera, or even other electronic devices such as a smart phone and a tablet pc, which are disposed around the environment where the target object is located and have image capturing and capturing functions, for example, the environment around the target object may refer to a corner of a ceiling inside a building, a lamp post outside the building, or a control device such as an intelligent robot, and accordingly, the image to be recognized may be any image inside the building or any image outside the building, which is not limited herein.
The above-mentioned target object refers to various objects to be motion-recognized in the image to be recognized, for example, at least one of a person, an animal, an intelligent robot, and the like that perform a gesture, a squat, a jump, and the like in the image to be recognized.
And step 230, extracting the target image characteristics from the image to be identified.
The target image feature associates a key part of the target object performing the action with an associated part having an association relationship with the key part.
In other words, the target image features not only can reflect the key part of the target object executing the action, but also can reflect the associated part having an association relation with the key part, so that objects which do not have the associated part but are similar to the key part can be distinguished in the subsequent part detection stage, and the accuracy of part detection can be improved.
And 250, detecting the part of the target object executing the action in the image to be recognized according to the characteristics of the target image to obtain a part detection result.
In one embodiment, the part detection result is obtained by detecting a key part of the action performed by the target object in the image to be recognized based on the target image feature.
In one embodiment, the part detection result is obtained by detecting a key part for performing an action on the target object in the image to be recognized and an associated part having an association relationship with the key part, respectively, based on the target image feature. The method specifically comprises the following steps: detecting a key part of a target object executing action in an image to be recognized according to the characteristics of the target image to obtain a key part detection result; detecting a relevant part having a relevant relation with the key part in the image to be recognized according to the characteristics of the target image to obtain a relevant part detection result; and determining a part detection result according to the key part detection result and the related part detection result.
And correspondingly, the part detection result is used for indicating whether the key part of the action executed by the target object is detected in the image to be recognized. In one embodiment, the part detection result is represented by the position of a key part of the action performed by the target object.
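The embodiment above does not prescribe how the key part detection result and the associated part detection result are combined into the part detection result. Purely as an illustration, the following Python sketch shows one plausible combination rule that keeps a key-part detection only when an associated part is predicted at a consistent relative position; all names, types and the tolerance value are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple        # (x1, y1, x2, y2) locating box in the image to be recognized
    score: float      # detection confidence

def combine_part_detections(key_parts, associated_offsets, max_offset_norm=0.5):
    """Keep a key-part detection only if at least one predicted associated part
    lies at a plausible relative position with respect to the key part.

    key_parts          : list[Detection] for the key part (e.g. a hand)
    associated_offsets : predicted (dx, dy) offsets of associated parts relative
                         to the key-part center, normalized by the box size
    max_offset_norm    : hypothetical tolerance on the normalized offset length
    """
    result = []
    for det in key_parts:
        # A hand-like object with no wrist/elbow/shoulder at a consistent
        # relative position is treated as a false detection and discarded.
        supported = any((dx * dx + dy * dy) ** 0.5 <= max_offset_norm
                        for dx, dy in associated_offsets)
        if supported:
            result.append(det)
    return result
```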
And 270, identifying the action executed by the target object in the image to be identified according to the part detection result to obtain an action identification result.
Here, recognition means classifying the action performed by the target object in the image to be recognized to obtain the action category of the action performed by the target object.
In one embodiment, the action recognition result is the action category of the target object performing the action.
Of course, in other embodiments, a reliability representing the credibility of the action category and a corresponding set value are further provided. If the reliability is greater than or equal to the set value, the action recognition model takes the action category of the action performed by the target object as the action recognition result; otherwise, if the reliability is less than the set value, the action recognition model does not output an action recognition result. The set value can be chosen according to the actual requirements of the application scenario, so as to keep the precision and the recall of the action recognition model balanced. For example, for an application scenario with a high precision requirement, a relatively high set value is used; for an application scenario with a high recall requirement, a relatively low set value is used.
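As a minimal sketch of this thresholding step (the function name and the default set value of 0.8 are assumptions, not taken from the embodiment):

```python
def decide_action(action_category: str, reliability: float, set_value: float = 0.8):
    """Return the action recognition result only when the reliability of the
    predicted action category reaches the set value; otherwise return None.

    A higher set_value favors precision (fewer false triggers), a lower one
    favors recall, matching the trade-off described above.
    """
    if reliability >= set_value:
        return action_category   # action recognition result
    return None                  # no action recognition result is output
```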
Through the process, the target image characteristics of closely associating the key part and the associated part of the action executed by the target object are utilized to realize part detection and action recognition, so that the detection precision of the part detection is favorably improved, the false detection of an object which does not have the associated part but is similar to the key part is reduced, the accuracy of the action recognition is favorably improved, and the false triggering of the action recognition on the control equipment is reduced.
The motion recognition process can be realized by a motion recognition algorithm or by calling a motion recognition model. The motion recognition model is obtained by training an initial machine learning model based on at least one sample image.
Now, the training method of the motion recognition model will be described in detail as follows:
referring to fig. 3, in an exemplary embodiment, a method for training a motion recognition model is provided, which is described by taking the method as an example for being applied to an electronic device. The electronic device may specifically be the server in fig. 1.
The training method of the motion recognition model can comprise the following steps:
step 310, a training set is obtained.
The training set comprises at least one sample image, where a sample image refers to a labeled image used for training.
First, the sample image includes a moving image and a still image. A moving image is a video including a plurality of frames of images, and a still image is a photograph or a picture including one frame of image, as opposed to a moving image. Based on this, the motion recognition in the present embodiment may be performed based on a video including a plurality of frames of images, or may be performed based on a photograph or a picture including one frame of image.
The sample image may be obtained from an image captured by the image capturing device in real time, or may be an image captured by the image capturing device in a historical time period stored in the server in advance, for example, a video or a photo of a user performing an action captured and captured by a camera in a smart home scene, or may be obtained from an image stored in the third-party image database in advance, for example, a desktop wallpaper including an animal performing an action, and the like, which is not limited herein.
Secondly, the sample image at least comprises key parts and associated parts of the sample object for executing the action, so as to facilitate the training of a subsequent action recognition model. The sample object refers to various objects that perform actions in the sample image, for example, at least one of a person, an animal, an intelligent robot, and the like that perform actions such as a gesture, a squat, a jump, and the like in the sample image.
Correspondingly, the key part is used for representing a key joint point in the corresponding sample image when the sample object performs the action. And the associated part has an associated relation with the key part and is used for representing at least one other joint point which has an associated relation with the key joint point in the corresponding sample image when the sample object executes the action. Fig. 4 is a schematic diagram of joint points of a human body, which is illustrated by taking a sample object as a human body, and as shown in fig. 4, when the human body performs a gesture, a hand joint point 4 or 7 is taken as a key joint point, and then, other joint points having an association relationship with the hand joint point 4 or 7 may be a wrist joint point 3 or 6, a shoulder joint point 2 or 5, an elbow joint point 3 or 6, or a head joint point 1, and thus, the association relationship may specifically refer to a connection relationship between joint points of the human body.
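For illustration only, such association relationships can be stored as a simple adjacency mapping. The sketch below uses joint names instead of the numeric indices of FIG. 4, and the particular connections are merely an example consistent with the description above, not a definitive skeleton definition.

```python
# Hypothetical association (connection) relationships between human joint points,
# keyed by the key joint point that performs the action.
ASSOCIATED_JOINTS = {
    "right_hand": ["right_wrist", "right_elbow", "right_shoulder", "head"],
    "left_hand":  ["left_wrist", "left_elbow", "left_shoulder", "head"],
}

def associated_parts_of(key_part: str) -> list:
    """Return the associated parts that have an association relationship
    with the given key part (empty list if none are defined)."""
    return ASSOCIATED_JOINTS.get(key_part, [])
```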
And thirdly, labeling means adding a mark in the sample image so that the sample image carries a label.
In one embodiment, the specimen image carries at least a first label indicating the location of a critical part of the specimen object for performing the action. Specifically, in the sample image, the area where the key portion of the sample object performing the action is located, and the position of the key portion of the sample object performing the action is obtained as the first label.
As can be seen from the above, the labeling of the first label essentially means that the region where the key part is located is framed by the positioning frame in the sample image, and the positioning frame is regarded as the first label. In one embodiment, the location information/location box may be represented by coordinates. The shape of the positioning frame may be rectangular, circular, triangular, etc., which is not limited herein.
In one embodiment, the specimen image carries at least a second label indicating the relative position of the associated site having an associated relationship with the key site. Specifically, the second label is generated according to the relative position of the associated part in the sample image with respect to the key part.
Here, the relative position refers to the offset between the center position of the associated part in the sample image and the center position of the key part. As for labeling the second label, the area where the associated part is located may be framed with a locating box in the sample image according to the relative position and the locating box used as the second label; alternatively, the relative position itself may be used directly as the second label, which is not limited here.
In one embodiment, the sample image carries at least a third label indicating an action category for the sample object to perform the action.
For example, for a photo containing a cat performing a jumping action, the sample object is the cat, the performed action is a jump, and the action category is "jump". By labeling, for example adding the text mark "jump" to the photo, the "jump" mark is regarded as the third label, thereby forming a sample image carrying the third label. Of course, in other embodiments, the third label may be a mark in the form of a number, a letter, a character, a figure, a color, and the like, which is not specifically limited here.
Therefore, based on the acquired training set, not only the first label but also the second label is introduced, so that the key part and the associated part of the sample object executing the action are closely associated together through the combination of the first label and the second label, and therefore, the object which does not have the associated part but is similar to the key part is prevented from being detected as the key part, and the false detection rate of the non-key part is effectively reduced.
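As a concrete but purely illustrative sketch of how the first, second and third labels could be derived from annotated locating boxes, assuming boxes are given as (x1, y1, x2, y2) coordinates and the relative position is normalized by the key-part box size (a convention assumed here, not mandated by the embodiment):

```python
def box_center(box):
    """Center (cx, cy) of an (x1, y1, x2, y2) locating box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def make_labels(key_box, associated_box, action_category):
    """Build the three labels carried by one sample image.

    first label  : the locating box of the key part performing the action
    second label : the relative position of the associated part with respect
                   to the key part (here: center offset normalized by the
                   key-part box size, an assumed convention)
    third label  : the action category of the action performed by the sample object
    """
    kx, ky = box_center(key_box)
    ax, ay = box_center(associated_box)
    w = max(key_box[2] - key_box[0], 1e-6)
    h = max(key_box[3] - key_box[1], 1e-6)
    first_label = key_box
    second_label = ((ax - kx) / w, (ay - ky) / h)
    third_label = action_category
    return first_label, second_label, third_label
```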
In the step 330, in the part detection stage, the machine learning model is supervised and trained for key part detection and associated part detection based on the first label and the second label carried by the sample image.
As previously mentioned, motion recognition includes two main tasks: part detection and motion classification. Based on this, whether it is a training process or a prediction process, there are two phases: the method comprises a part detection stage and an action classification stage, wherein the part detection stage is used for executing part detection in an action recognition task, and the action classification stage is used for executing action classification in the action recognition task, so that the accuracy of action recognition is fully guaranteed.
The inventor realized that, in the action recognition process, the key part is easily misdetected under weak light or when objects (such as hand toys) similar to the key part (such as a hand) are present. Therefore, in the training process of the part detection stage, on the premise of performing supervised training for key part detection based on the first label, supervised training for associated part detection based on the second label is introduced, so that the associated part assists the key part in improving the accuracy of the part detection stage in the action recognition process, which in turn helps improve the accuracy of the subsequent action classification stage.
Based on this, the training process of the part detection phase is now explained as follows:
specifically, sample image features are extracted from a sample image;
predicting the position of a key part of a sample object executing action in the sample image according to the characteristics of the sample image to obtain first prediction information, and predicting the relative position of the related part in the sample image relative to the key part to obtain second prediction information;
determining a first loss based on a difference between the first label and the first prediction information, and determining a second loss based on a difference between the second label and the second prediction information;
and adjusting parameters of the machine learning model corresponding to the part detection stage according to the first loss and the second loss.
Therefore, the supervision training result in the part detection stage can be obtained, and the training in the subsequent action classification stage is facilitated.
As can be seen from the above, in the part detection stage, in the training process, under the premise that the supervised training for the detection of the key part is performed, the supervised training for the detection of the associated part is added, so that the key part and the associated part of the sample object performing the action are closely associated, and the detection of an object which does not have the associated part but is similar to the key part as the key part is avoided, so that the false detection rate of the non-key part is effectively reduced, and the accuracy of the part detection is fully ensured.
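The following sketch shows one possible implementation of a supervised training step of the part detection stage, assuming PyTorch, smooth L1 losses for both position predictions, and single-box outputs; the embodiment itself only requires that the first and second losses measure the differences from the first and second labels.

```python
import torch
import torch.nn.functional as F

def part_detection_step(model, optimizer, sample_image, first_label, second_label):
    """One supervised training step of the part detection stage.

    model returns (first_prediction, second_prediction):
      first_prediction  : predicted position of the key part, shape (N, 4)
      second_prediction : predicted relative position of the associated part, shape (N, 2)
    """
    first_pred, second_pred = model(sample_image)

    # First loss: difference between the first label and the first prediction.
    first_loss = F.smooth_l1_loss(first_pred, first_label)
    # Second loss: difference between the second label and the second prediction.
    second_loss = F.smooth_l1_loss(second_pred, second_label)

    # Adjust the parameters corresponding to the part detection stage
    # according to the first loss and the second loss (first joint loss).
    loss = first_loss + second_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return first_loss.item(), second_loss.item()
```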
And 350, in the action classification stage, performing supervised training on the machine learning model for action recognition based on the third label carried by the sample image and the supervised training result in the part detection stage.
Specifically, according to first prediction information and second prediction information in a supervised training result, predicting an action type of a sample object executing action in a sample image to obtain third prediction information;
determining a third loss based on a difference between the third label and the third prediction information;
and adjusting parameters of the machine learning model corresponding to the action classification stage according to the third loss and the first loss and the second loss in the supervised training result. Of course, in other embodiments, the parameters of the machine learning model corresponding to the action classification phase may also be adjusted based on the third loss alone.
Therefore, the supervised training result of the part detection stage is introduced to participate in the supervised training related to the action recognition, so that more part characteristics which are closer to the sample object can be obtained in the action recognition process, and the accuracy of the recognition of the action executed by the key part can be improved.
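A corresponding sketch of a supervised training step of the action classification stage is given below, again assuming PyTorch and a cross-entropy third loss. Whether the first and second losses from the supervised training result are included in the update (the joint adjustment described above) is controlled by optional arguments.

```python
import torch
import torch.nn.functional as F

def action_classification_step(classifier, optimizer, first_pred, second_pred,
                               third_label, first_loss=None, second_loss=None):
    """One supervised training step of the action classification stage.

    The classifier predicts third_prediction (action-category logits) from the
    first and second prediction information of the part detection stage.
    """
    third_pred = classifier(first_pred, second_pred)

    # Third loss: difference between the third label and the third prediction.
    third_loss = F.cross_entropy(third_pred, third_label)

    # Adjust the parameters corresponding to the action classification stage
    # according to the third loss, optionally combined with the first and
    # second losses from the supervised training result.
    # Note: first_loss / second_loss, if given, must come from the same forward
    # pass as first_pred / second_pred (their computation graphs must still be alive).
    loss = third_loss
    if first_loss is not None and second_loss is not None:
        loss = first_loss + second_loss + third_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return third_loss.item()
```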
And step 370, when the training stopping condition is met, obtaining the machine learning model which completes the training as the action recognition model.
That is, when the part detection stage completes training and the motion classification stage completes training, or it can be understood that the part detection stage converges and the motion classification stage converges, the training stop condition is satisfied, and at this time, the initial machine learning model converges, i.e., the machine learning model completing training is obtained.
Further, in an embodiment, model performance evaluation is performed on the trained machine learning model, and the action recognition model is obtained according to the result of the model performance evaluation. Specifically, the precision and recall of the trained machine learning model are calculated; if both the precision and the recall meet a set condition, the machine learning model that passes the model performance evaluation has converged, and the action recognition model is obtained.
Based on this, the motion recognition model has a motion recognition capability of performing a motion with respect to the target object.
Through the process, the incidence relation between the key part and the incidence part of the sample object executing action is fully considered, so that the false detection rate of the non-key part is reduced, the phenomenon of false triggering in an interactive scene is avoided, and the problem of low accuracy of action identification in the prior art can be effectively solved.
As shown in fig. 5, in one embodiment, the machine learning model may be divided into a key part detection branch and an associated part detection branch during the part detection phase of the training process. That is, in the training process, the first prediction information is obtained by the key part detection branch prediction in the machine learning model, and the second prediction information is obtained by the associated part detection branch prediction in the machine learning model.
Now, with reference to fig. 5, the training process of the key part detection branch and the associated part detection branch is described in detail as follows:
the method comprises the steps of firstly, constructing an initial machine learning model, and determining a first parameter of a key part detection branch and a second parameter of a relevant part detection branch in the machine learning model. The machine learning model includes, but is not limited to, a deep learning model.
And secondly, updating the first parameter and the second parameter according to the first label and the second label. The method specifically comprises the following steps:
firstly, sample image features are extracted from one sample image at least carrying a first label and a second label, the position of a key part of a sample object executing action in the one sample image is predicted according to the sample image features, first prediction information is obtained, the relative position of a related part in the one sample image relative to the key part is predicted, and second prediction information is obtained.
And randomly initializing the first parameters, and determining a first loss function corresponding to the machine learning model based on the randomly initialized first parameters and the first prediction information. And similarly, randomly initializing the second parameter, and determining a second loss function corresponding to the machine learning model based on the randomly initialized second parameter and the second prediction information. Wherein the first/second loss functions include, but are not limited to: cosine loss functions, cross entropy functions, intra-class distribution functions, inter-class distribution functions, activation classification functions, and the like.
Next, a first loss of the first loss function is calculated, i.e., the first loss is determined based on a difference between the first label and the first prediction information. Similarly, a second loss of the second loss function is calculated, i.e., the second loss is determined based on a difference between the second label and the second prediction information.
And thirdly, adjusting the first parameter and the second parameter according to the first loss and the second loss.
In one embodiment, the first parameter is adjusted according to the first loss and the second loss, and the second parameter is adjusted according to the second loss until the training stop condition is satisfied.
In one embodiment, a first joint loss is calculated based on the first loss and the second loss, for example, the first joint loss is the first loss + the second loss, and the first parameter and the second parameter are further adjusted based on the first joint loss until the training stop condition is satisfied.
The training stop condition will be described below by taking the first joint loss as an example.
And if the first joint loss does not reach the minimum value, judging that the training is not finished in the part detection stage, adjusting the first parameter and the second parameter according to the first joint loss, and reconstructing a first loss function and a second loss function corresponding to the machine learning model according to the adjusted first parameter, the adjusted second parameter and the next sample image carrying the first label and the second label so as to continuously judge whether the obtained first joint loss reaches the minimum value.
Conversely, if the first joint loss reaches a minimum value, the training stop condition is deemed satisfied.
Of course, an iteration threshold can be set according to the actual needs of the application scenario, so as to speed up training in the part detection stage. Meanwhile, setting different iteration thresholds can meet different accuracy requirements of part detection. For example, a larger iteration threshold is beneficial to improving the accuracy of part detection.
Based on this, when the first joint loss reaches the minimum value or the iteration number reaches the iteration threshold, the training stopping condition is considered to be satisfied, and then the training of the part detection stage is stopped.
Under the effect of this embodiment, an associated part detection branch is added to the training process of the part detection stage, so that the key part and the associated part with which the sample object performs the action are closely associated, and an object that does not have the associated part but is similar to the key part is prevented from being detected as the key part. The false detection rate of non-key parts is thus effectively reduced, and the accuracy of the part detection stage is fully guaranteed.
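For concreteness, the sketch below shows one possible network layout with a shared backbone, a key part detection branch and an associated part detection branch, in the spirit of FIG. 5. PyTorch, the layer sizes and the single-box output are assumptions; the embodiment does not fix a particular architecture.

```python
import torch
import torch.nn as nn

class PartDetectionNet(nn.Module):
    """Backbone + key part detection branch + associated part detection branch."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Shared feature extractor producing the (sample/target) image features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Key part detection branch: predicts the key-part locating box (first prediction).
        self.key_branch = nn.Linear(feat_dim, 4)
        # Associated part detection branch: predicts the relative position of the
        # associated part with respect to the key part (second prediction).
        self.assoc_branch = nn.Linear(feat_dim, 2)

    def forward(self, image: torch.Tensor):
        feat = self.backbone(image)
        return self.key_branch(feat), self.assoc_branch(feat)
```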
In an exemplary embodiment, the training process of the action classification phase may include the following steps:
In the first step, a third parameter of the action classification stage in the machine learning model is determined based on the initial machine learning model. The machine learning model includes, but is not limited to, a deep learning model.
In the second step, the third parameter is updated according to the supervised training result of the part detection stage and the third label. This specifically comprises the following steps:
First, for a sample image carrying a third label, the supervised training result of the part detection stage is obtained, and the action category of the action performed by the sample object in that sample image is predicted from the first prediction information and the second prediction information in the supervised training result, yielding third prediction information.
The third parameter is then randomly initialized, and a third loss function corresponding to the machine learning model is determined based on the randomly initialized third parameter and the third prediction information. The third loss function includes, but is not limited to: cosine loss functions, cross-entropy functions, intra-class distribution functions, inter-class distribution functions, activation classification functions, and the like.
Next, a third loss of a third loss function is calculated, i.e. the third loss is determined based on the difference between the third label and the third prediction information.
Third, the third parameter is adjusted according to the third loss.
In one embodiment, the third parameter is adjusted according to the third loss until the training stop condition is satisfied.
In one embodiment, the first loss and the second loss are obtained from the foregoing supervised training result, and a second joint loss is calculated from the first loss, the second loss and the third loss, for example, second joint loss = first loss + second loss + third loss; the third parameter is then adjusted according to the second joint loss until the training stop condition is satisfied.
The training stop condition will be described below by taking the second joint loss as an example.
If the second joint loss has not reached its minimum value, the action classification stage is judged not to have finished training. In that case, the third parameter is adjusted according to the second joint loss, and the third loss function corresponding to the machine learning model is reconstructed from the adjusted third parameter and the next sample image carrying a third label, so that whether the newly obtained second joint loss reaches the minimum value can be judged again.
Otherwise, if the second joint loss reaches the minimum value, the training stop condition is considered satisfied.
Of course, an iteration threshold can also be set according to the actual needs of the application scenario to speed up training of the action classification stage. Setting different iteration thresholds also accommodates different accuracy requirements for action recognition; for example, a larger iteration threshold generally improves the accuracy of action recognition.
Based on this, the training stop condition is considered satisfied when the second joint loss reaches the minimum value or the number of iterations reaches the iteration threshold, and training of the action classification stage then stops.
It should be noted that the action classification stage may use the same iteration threshold as the part detection stage or a different one, which is not limited here.
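As a companion to the part detection sketch above, a minimal sketch of the action classification training step might look as follows; the classification head, the cross-entropy loss standing in for the third loss function, and the input layout (box coordinates plus relative position) are assumptions.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Illustrative action classification stage; its weights play the role of the third parameter."""
    def __init__(self, num_categories=4):
        super().__init__()
        # 4 box values (first prediction info) + 2 relative-position values (second prediction info)
        self.head = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, num_categories))

    def forward(self, first_pred, second_pred):
        return self.head(torch.cat([first_pred, second_pred], dim=-1))  # third prediction information

classifier = ActionClassifier()
third_optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-3)  # steps only the third parameter
third_loss_fn = nn.CrossEntropyLoss()                                 # one possible third loss function

def classification_step(first_pred, second_pred, first_loss, second_loss, third_label):
    """first_loss / second_loss may be detached scalars taken from the supervised training result."""
    third_pred = classifier(first_pred, second_pred)
    third_loss = third_loss_fn(third_pred, third_label)         # difference to the third label
    second_joint_loss = first_loss + second_loss + third_loss   # second joint loss

    third_optimizer.zero_grad()
    second_joint_loss.backward()   # only the third parameter is updated by this optimizer
    third_optimizer.step()
    return second_joint_loss.item()
```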
Referring back to fig. 5, in one embodiment, the third loss is fed back from the action classification stage to the part detection stage during training. That is, the parameters of the part detection stage may be adjusted according to the first loss, the second loss and the third loss. Specifically, the first parameter of the key part detection branch is adjusted according to the first loss, the second loss and the third loss, and the second parameter of the associated part detection branch is adjusted according to the second loss and the third loss. The update process for the first parameter and the second parameter is as described above and is not repeated here.
Through this process, the key part detection branch is supervised on the first parameter by the first label, the associated part detection branch is supervised on the second parameter by the second label, and the action classification stage is supervised on the third parameter by the third label together with the part detection stage. The action recognition process therefore captures not only the key part features of the action performed by the sample object but also the associated part features, which helps improve both the detection accuracy of the key part and the recognition accuracy of the performed action.
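One possible way to realize this selective adjustment, reusing the illustrative PartDetector above, is to back-propagate two different loss sums onto the two branches' parameter groups; the sketch below only illustrates the idea, and the manual update rule and learning rate are assumptions.

```python
import torch

def adjust_part_detection_parameters(model, first_loss, second_loss, third_loss, lr=1e-3):
    """first_loss, second_loss and third_loss are scalar tensors from the same forward pass."""
    key_params = list(model.key_branch.parameters())      # first parameter (key part detection branch)
    assoc_params = list(model.assoc_branch.parameters())  # second parameter (associated part detection branch)

    # Key part detection branch: driven by the first, second and third losses.
    key_grads = torch.autograd.grad(first_loss + second_loss + third_loss,
                                    key_params, retain_graph=True, allow_unused=True)
    # Associated part detection branch: driven by the second and third losses only.
    assoc_grads = torch.autograd.grad(second_loss + third_loss,
                                      assoc_params, allow_unused=True)

    with torch.no_grad():  # manual SGD-style update
        for p, g in zip(key_params, key_grads):
            if g is not None:
                p -= lr * g
        for p, g in zip(assoc_params, assoc_grads):
            if g is not None:
                p -= lr * g
```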
In an exemplary embodiment, for the prediction process of the part detection stage, as shown in fig. 6, step 250 may include the following steps:
Step 251, based on the target image features, the region of the image to be recognized where the key part of the target object performs the action is located, to obtain the position of the key part with which the target object performs the action.
In one embodiment, the position of the key part is associated with a positioning box that indicates the location and/or size of the key part in the image to be recognized. The positioning box may be rectangular, circular, triangular, and so on, which is not limited here, and its position in the image to be recognized may be represented by coordinates.
For example, the positioning box is the maximum circumscribed rectangle of the key part in the image to be recognized and can be represented as (xmin1, ymin1, xmax1, ymax1), indicating that the key part lies between xmin1 and xmax1 on the x-axis and between ymin1 and ymax1 on the y-axis, where (xmin1, ymax1) is the top-left vertex of the positioning box and (xmax1, ymin1) is the bottom-right vertex.
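Purely for illustration, such a positioning box could be represented as in the sketch below, which follows the same (xmin, ymax)-top-left convention as the example above; the class and field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PositioningBox:
    """Maximum circumscribed rectangle of the key part (illustrative)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def top_left(self):      # (xmin, ymax) under the y-axis-up convention used in the text
        return (self.xmin, self.ymax)

    @property
    def bottom_right(self):  # (xmax, ymin) under the same convention
        return (self.xmax, self.ymin)

    def contains(self, x, y):
        """True if point (x, y) lies inside the key part region."""
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

key_box = PositioningBox(xmin=120.0, ymin=80.0, xmax=220.0, ymax=180.0)
```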
Step 253, determining a key part detection result according to the position of the key part of the action executed by the target object.
The key part detection result indicates whether the key part with which the target object performs the action is detected in the image to be recognized.
For example, assuming that the target object is a person and the key part is a hand, if the image to be recognized contains a hand with which the person performs an action, the key part detection result indicates that such a hand is detected in the image to be recognized. Alternatively, if the image to be recognized contains a "hand toy" that merely resembles a hand performing an action, the key part detection result indicates that no hand performing an action is detected in the image to be recognized.
Step 255, the part detection result is determined according to the key part detection result.
As can be seen from this, in the present embodiment, the part detection result also substantially indicates whether a key part of the target object performing the action is detected in the image to be recognized, that is, the part detection result is { key part detection result }.
Of course, in other embodiments, the associated part detection result may assist the key part detection result, i.e. the part detection result is {key part detection result & associated part detection result}, which is not specifically limited in this embodiment. For example, if the image to be recognized contains a "hand toy" resembling a hand with which a person performs an action, there is no associated part; in that case, even if the key part detection result indicates that a hand performing an action is detected, the associated part detection result indicates that no associated part exists in the image to be recognized, so the final part detection result indicates that no hand performing an action is detected.
In one embodiment, obtaining the associated part detection result may comprise the following steps: determining, from the target image features, the relative position of the associated part in the image to be recognized with respect to the key part; and determining the associated part detection result based on that relative position. The specific process is similar to the acquisition of the second label described above and is not repeated here.
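A minimal sketch of how the associated part detection result might assist the key part detection result, i.e. the {key part detection result & associated part detection result} variant, is given below; the function and argument names are assumptions.

```python
def part_detection_result(key_detected: bool, assoc_detected: bool,
                          use_assoc_assist: bool = True) -> bool:
    """Return True only if the key part of the acting target object is considered detected.

    key_detected     -- key part detection result (key part detection branch)
    assoc_detected   -- associated part detection result (associated part detection branch)
    use_assoc_assist -- if False, fall back to {key part detection result} alone
    """
    if not use_assoc_assist:
        return key_detected
    # "Hand toy" case: a key-part look-alike with no associated part is rejected.
    return key_detected and assoc_detected

# Example: the key branch fires on a hand-like toy, but no associated part is found.
assert part_detection_result(key_detected=True, assoc_detected=False) is False
```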
In an exemplary embodiment, for the prediction process of the action classification stage, as shown in fig. 7, step 270 may include the following steps:
Step 271, the region of the image to be recognized occupied by the part with which the target object performs the action is determined according to the part detection result.
Step 273, action category prediction is performed on the action performed by the target object in the image to be recognized, based on the region occupied by the acting part, to obtain the action category of the action performed by the target object.
Action category prediction can be realized by a classifier (such as a softmax function) in the action recognition model; that is, the classifier calculates the probability that the action performed by the target object in the image to be recognized belongs to each action category.
For example, assuming that the target object is a person, the action categories include at least head up, head down, the "V" gesture, jumping, and the like.
The probabilities that the action performed by the person in the image to be recognized belongs to head up, head down, the "V" gesture and jumping are then calculated as P1, P2, P3 and P4, respectively. If P1 is the maximum, the action performed by the person is head up; similarly, if P2 is the maximum, the action is head down; if P3 is the maximum, the action is the "V" gesture; and if P4 is the maximum, the action is a jump.
Of course, in other embodiments a special action category, "non-key part", may be added. When the probability of this category is the highest, it actually indicates that no key part performing an action was detected in the image to be recognized, which further reduces false triggering caused by action misrecognition in interactive scenes.
Step 275, the action recognition result is determined according to the action category of the action performed by the target object.
Specifically, the action recognition result is { action category }.
Of course, in other embodiments a reliability value indicating the degree of confidence in the action category is provided together with a corresponding set value; if the reliability is lower than the set value, the action recognition model does not output an action recognition result. The set value can be chosen according to the actual needs of the application scenario so as to balance the accuracy and recall of the action recognition model: a relatively high set value suits scenarios demanding high accuracy, while a relatively low set value suits scenarios demanding high recall. Through the cooperation of the above embodiments, the prediction process for the image to be recognized is realized, i.e. the key part detection branch of the part detection stage plus the action classification stage.
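Pulling these classification details together, the prediction step of the action classification stage might be sketched as follows; the category list, the extra "non-key-part" category, and the 0.6 set value are illustrative assumptions.

```python
import torch

ACTION_CATEGORIES = ["head_up", "head_down", "v_gesture", "jump", "non_key_part"]  # assumed categories
CONFIDENCE_SET_VALUE = 0.6  # assumed set value balancing accuracy and recall

def classify_action(logits: torch.Tensor):
    """logits: raw scores of shape (num_categories,) from the classification head."""
    probs = torch.softmax(logits, dim=0)        # P1, P2, P3, P4, ... per category
    conf, idx = torch.max(probs, dim=0)
    category = ACTION_CATEGORIES[idx.item()]

    if category == "non_key_part":
        return None           # key part of an acting object was effectively not detected
    if conf.item() < CONFIDENCE_SET_VALUE:
        return None           # reliability below the set value: no action recognition result
    return {"action_category": category, "confidence": conf.item()}

print(classify_action(torch.tensor([0.2, 0.1, 2.5, 0.3, 0.1])))  # -> "v_gesture"
```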
Unlike the training process, where the associated part detection branch is introduced, the prediction process realizes the two main tasks of action recognition through the key part detection branch alone; that is, the associated part detection branch is removed at prediction time. On the one hand, because the associated part detection branch has already participated in training, the prediction result of the part detection stage already reflects richer part features that are closer to the target object, which enables more accurate key part detection. On the other hand, removing the associated part detection branch at prediction time saves system memory, effectively reduces the computation and complexity of action recognition, and helps improve action recognition efficiency.
Further, in an exemplary embodiment, referring back to FIG. 5, the training set includes not only positive samples, but also negative samples that are fed back during the prediction process.
A negative sample contains at least a non-key part unrelated to the key part with which the target object performs the action, such as the "hand toy" contained in the image to be recognized. In other words, whereas positive samples indicate the key part of the acting sample object and its associated part, negative samples indicate non-key parts unrelated to that key part.
In conjunction with fig. 2, if the part detection result indicates that a key part of the target object performing the action is detected in the image to be recognized, step 270 is executed, and the process proceeds from the part detection stage to the action classification stage.
Conversely, if the part detection result indicates that no key part of the target object performing an action is detected in the image to be recognized, the image to be recognized is taken as a negative sample and fed back to the training process of the action recognition model, i.e. added to the training set, as shown in fig. 5.
During training of the action recognition model, negative samples are then input into the initial machine learning model. In one embodiment, the non-key part contained in a negative sample is detected as a non-key part in the part detection stage; in this case the negative sample acts as a "negative sample" for the part detection stage and need not take part in the training of the action classification stage. In another embodiment, the non-key part contained in the negative sample is detected as a key part in the part detection stage, and the misdetection is then corrected in the action classification stage; in this case the negative sample acts as a "positive sample" for the part detection stage and as a "negative sample" for the action classification stage.
It is assumed here that the negative sample is taken as a "positive sample" to participate in training in the part detection stage (the training process is as described above and is not described in detail), and is taken as a "negative sample" to participate in training in the action classification stage, so that the training process for action recognition may include the following steps:
In the first step, a third parameter of the action classification stage in the machine learning model is determined based on the initial machine learning model. The machine learning model includes, but is not limited to, a deep learning model.
In the second step, the third parameter is updated according to the fourth label. Here, the fourth label is carried by a negative sample and indicates a non-key part unrelated to the key part with which the sample object performs the action, and the negative sample is a sample image that contains at least such a non-key part. This specifically comprises the following steps:
First, based on at least one negative sample carrying a fourth label, the action category of the action performed by the sample object in the negative sample is predicted, yielding fourth prediction information.
The third parameter is then randomly initialized, and a fourth loss function corresponding to the machine learning model is determined based on the randomly initialized third parameter and the fourth prediction information. The fourth loss function includes, but is not limited to: cosine loss functions, cross-entropy functions, intra-class distribution functions, inter-class distribution functions, activation classification functions, and the like.
Next, a fourth loss of a fourth loss function is calculated, i.e., the fourth loss is determined based on a difference between the fourth label and the fourth prediction information.
In the third step, the third parameter is adjusted according to the fourth loss until the training stop condition is satisfied. The handling of the training stop condition follows the update process of the third parameter described above and is not repeated here.
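A minimal sketch of how a fed-back negative sample might enter the action classification stage with its fourth label is given below; treating the fourth label as an extra "non-key-part" class index, and the cross-entropy loss standing in for the fourth loss function, are assumptions.

```python
import torch
import torch.nn as nn

NUM_CATEGORIES = 5               # 4 real action categories + 1 "non-key-part" class (assumption)
NON_KEY_PART_INDEX = 4           # fourth label: non-key part unrelated to the acting key part

classifier = nn.Linear(16, NUM_CATEGORIES)          # stands in for the third parameter
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-3)
fourth_loss_fn = nn.CrossEntropyLoss()              # one possible fourth loss function

def train_on_negative_sample(feature: torch.Tensor):
    """feature: (1, 16) feature of a negative sample coming from the part detection stage."""
    fourth_pred = classifier(feature)                            # fourth prediction information
    fourth_label = torch.tensor([NON_KEY_PART_INDEX])
    fourth_loss = fourth_loss_fn(fourth_pred, fourth_label)      # difference to the fourth label

    optimizer.zero_grad()
    fourth_loss.backward()       # adjusts the third parameter according to the fourth loss
    optimizer.step()
    return fourth_loss.item()

train_on_negative_sample(torch.randn(1, 16))
```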
Through this process, negative samples are introduced into the training of the action recognition model, especially into the action classification stage. Whereas the features seen at prediction time are pulled closer to those of positive samples, they are pushed further away from the features of negative samples, so the trained action recognition model can quickly reject images to be recognized that contain no key part. This avoids false triggering caused by action misrecognition in interactive scenes and further improves the accuracy of action recognition.
The following are device embodiments of the present application, which can be used to perform the action recognition method of the present application. For details not disclosed in the device embodiments, refer to the embodiments of the action recognition method and the training of the action recognition model described in the present application.
Referring to fig. 8, in an exemplary embodiment, an action recognition device 700 includes, but is not limited to: an image acquisition module 710, a feature extraction module 730, a part detection module 750, and an action recognition module 770.
The image acquisition module 710 is configured to acquire an image to be recognized.
The feature extraction module 730 is configured to extract target image features from the image to be recognized, where the target image features are associated with the key part with which the target object performs the action and the associated part having an association relationship with that key part.
The part detection module 750 is configured to detect, according to the target image features, the part with which the target object performs an action in the image to be recognized, to obtain a part detection result.
The action recognition module 770 is configured to recognize, according to the part detection result, the action performed by the target object in the image to be recognized, to obtain an action recognition result.
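For illustration, the four modules could be organized as in the following skeleton (plain Python, all names assumed), with each module delegating to the corresponding stage of the action recognition model:

```python
class ActionRecognitionDevice:
    """Illustrative skeleton of device 700 and its four functional modules."""

    def __init__(self, feature_extractor, part_detector, action_classifier):
        self.feature_extractor = feature_extractor  # feature extraction module 730
        self.part_detector = part_detector          # part detection module 750
        self.action_classifier = action_classifier  # action recognition module 770

    def acquire_image(self, source):
        """Image acquisition module 710: fetch the image to be recognized."""
        return source.read()

    def recognize(self, image):
        features = self.feature_extractor(image)              # target image features
        part_result = self.part_detector(features)            # part detection result
        if not part_result:
            return None                                        # candidate negative sample for the training set
        return self.action_classifier(features, part_result)  # action recognition result
```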
It should be noted that when the action recognition device provided in the above embodiment performs action recognition, the division into the above functional modules is merely illustrative; in practical applications, the functions may be assigned to different functional modules as needed, i.e. the internal structure of the action recognition device may be divided into different functional modules to complete all or part of the functions described above.
In addition, the action recognition device provided in the above embodiment and the method embodiments belong to the same concept; the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.
FIG. 9 illustrates a structural schematic of a server in accordance with an exemplary embodiment. The server is suitable for use as the server 150 in the implementation environment shown in fig. 1.
It should be noted that this server is merely an example adapted to the present application and should not be considered to limit the scope of use of the application in any way. Nor should the server be interpreted as needing to rely on, or necessarily include, one or more components of the exemplary server 2000 shown in fig. 9.
The hardware structure of the server 2000 may vary greatly depending on configuration or performance. As shown in fig. 9, the server 2000 includes: a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU) 270.
Specifically, the power supply 210 is used to provide operating voltages for the various hardware devices on the server 2000.
The interface 230 includes at least one wired or wireless network interface 231 for interacting with external devices, for example, for the interaction between the gateway 110 and the server 150 in the implementation environment shown in fig. 1.
Of course, in other examples of the present application, the interface 230 may further include at least one serial-to-parallel conversion interface 233, at least one input/output interface 235, at least one USB interface 237, and the like, as shown in fig. 9, which is not limited thereto.
The memory 250, as a carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored on it include an operating system 251, an application 253, data 255 and so on, and the storage may be transient or permanent.
The operating system 251 manages and controls the hardware devices and the application 253 on the server 2000 so that the central processing unit 270 can operate on and process the mass data 255 in the memory 250; it may be Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
The application 253 is a computer program that performs at least one specific task on the operating system 251, and may include at least one module (not shown in fig. 9), each of which may include a computer program for the server 2000. For example, the action recognition device is considered as an application 253 deployed in the server 2000.
The data 255 may be photos, pictures or videos stored on a disk, or sample images, positive and negative samples, labels and the like, stored in the memory 250.
The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 through at least one communication bus to read the computer programs stored in the memory 250 and thereby operate on and process the mass data 255 in the memory 250. For example, the training of the action recognition model is accomplished by the central processing unit 270 reading a series of computer programs stored in the memory 250.
Furthermore, the present application can be implemented by hardware circuits or by hardware circuits in combination with software, and therefore, the implementation of the present application is not limited to any specific hardware circuits, software, or a combination of the two.
Referring to fig. 10, an embodiment of the present application provides an electronic device 4000. As shown in fig. 10, the electronic device 4000 includes at least one processor 4001, at least one communication bus 4002, and at least one memory 4003.
The processor 4001 is coupled to the memory 4003, for example via the communication bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. In practical applications there may be more than one transceiver 4004, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Communication bus 4002 may include a path that carries information between the aforementioned components. The communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
The memory 4003 may be a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
A computer program is stored in the memory 4003, and the processor 4001 reads the computer program stored in the memory 4003 through the communication bus 4002.
When executed by the processor 4001, the computer program implements the action recognition method in each of the above embodiments.
In addition, an embodiment of the present application provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the action recognition method in the embodiments described above.
An embodiment of the present application further provides a computer program product comprising a computer program stored in a storage medium. A processor of a computer device reads the computer program from the storage medium and executes it, causing the computer device to perform the action recognition method in each of the embodiments described above.
Compared with the prior art, the technical solution provided by the present application brings the following beneficial effects:
1. An associated part having an association relationship with the key part is introduced into the training of the action recognition model, closely associating the key part with the associated part; this reduces the false detection rate of non-key parts, improves the accuracy of the action recognition model, and reduces the false triggering rate caused by action misrecognition in interactive scenes;
2. The associated part detection branch is added only during training and removed during prediction, so the system resources consumed by the action recognition model at prediction time are unchanged and no extra computation is added; this saves system resources, improves action recognition efficiency, and keeps the whole system running efficiently;
3. Key part detection is assisted by associated parts that are larger in scale and easier to distinguish, so key parts can still be detected accurately under difficult conditions such as weak light or small, distant key parts; this improves the recall of the action recognition model and the user experience in interactive scenes.
It should be understood that although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, there is no strict ordering restriction and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which need not be performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several modifications and refinements without departing from the principle of the present application, and such modifications and refinements shall also fall within the protection scope of the present application.

Claims (13)

1. A method of motion recognition, the method comprising:
acquiring an image to be identified;
extracting target image features from the image to be recognized, wherein the target image features are associated with key parts of target objects for executing actions and associated parts having association relations with the key parts;
detecting the part of the target object executing the action in the image to be recognized according to the characteristics of the target image to obtain a part detection result;
and identifying the action executed by the target object in the image to be identified according to the part detection result to obtain an action identification result.
2. The method as claimed in claim 1, wherein the detecting a part of the image to be recognized where the target object performs an action according to the target image feature to obtain a part detection result comprises:
based on the target image characteristics, positioning the region of the key part of the target object executing the action in the image to be recognized to obtain the position of the key part of the target object executing the action;
determining a key part detection result according to the position of a key part of the action executed by the target object;
and determining the part detection result according to the key part detection result.
3. The method as claimed in claim 1, wherein the detecting a part of the image to be recognized where the target object performs an action according to the target image feature to obtain a part detection result comprises:
detecting a key part of the target object executing action in the image to be recognized according to the target image characteristics to obtain a key part detection result;
detecting a relevant part having a relevant relationship with the key part in the image to be recognized according to the target image characteristics to obtain a relevant part detection result;
and determining the part detection result according to the key part detection result and the related part detection result.
4. The method as claimed in claim 3, wherein the detecting, according to the target image feature, the associated part having an association relationship with the key part in the image to be recognized to obtain an associated part detection result includes:
determining the relative position of the associated part in the image to be recognized relative to the key part according to the target image characteristics;
determining the associated part detection result based on the relative position of the associated part with respect to the key part.
5. The method of claim 1, wherein the recognizing the motion performed by the target object in the image to be recognized according to the part detection result to obtain a motion recognition result comprises:
determining the region of the action-executing part of the target object in the image to be recognized according to the part detection result;
based on the area of the part of the target object for executing the action in the image to be recognized, performing action type prediction on the action executed by the target object in the image to be recognized to obtain an action type of the action executed by the target object;
and determining the action recognition result according to the action type of the action executed by the target object.
6. The method of any one of claims 1 to 5, wherein the action recognition result is obtained by calling an action recognition model;
the motion recognition model is obtained by training an initial machine learning model based on at least one sample image; the training at least comprises:
in the part detection stage, the machine learning model is subjected to supervised training for key part detection and associated part detection based on a first label and a second label carried by the sample image, wherein the first label is used for indicating the position of a key part of the sample object for executing action, and the second label is used for indicating the relative position of the associated part relative to the key part.
7. The method of claim 6, wherein the supervised training of the machine learning model for key and associated site detection based on the first and second labels carried by the sample images during the site detection phase comprises:
extracting sample image features from the sample image;
predicting the position of a key part of the sample object executing the action in the sample image according to the sample image characteristics to obtain first prediction information, and predicting the relative position of a related part in the sample image relative to the key part to obtain second prediction information;
determining a first loss based on a difference between the first tag and the first prediction information and a second loss based on a difference between the second tag and the second prediction information;
and adjusting parameters of the machine learning model corresponding to the part detection stage according to the first loss and the second loss.
8. The method of claim 7, wherein the training further comprises:
in an action classification stage, performing supervised training on the machine learning model for action recognition based on a third label carried by the sample image and a supervised training result in the part detection stage, wherein the third label is used for indicating an action category of the sample object for executing the action;
in the action classification stage, the machine learning model is supervised and trained on action recognition based on a third label carried by the sample image and a supervision and training result in the part detection stage, and the method comprises the following steps:
predicting the action type of the sample object in the sample image to execute the action according to the first prediction information and the second prediction information in the supervised training result to obtain third prediction information;
determining a third loss based on a difference between the third label and the third prediction information;
adjusting parameters of the machine learning model corresponding to the action classification phase according to the third loss and/or the first loss and the second loss in the supervised training result;
and when the training stopping condition is met, obtaining a machine learning model which completes training and serving as the action recognition model.
9. The method of claim 8, wherein the method further comprises:
feeding back the third loss to the site detection stage;
the adjusting parameters of the machine learning model corresponding to the part detection phase according to the first loss and the second loss comprises:
adjusting parameters of the machine learning model corresponding to the part detection phase according to the first loss, the second loss and the third loss.
10. The method of claim 9, wherein the first prediction information is derived from a critical site detection branch prediction in the machine learning model, and the second prediction information is derived from an associated site detection branch prediction in the machine learning model;
the adjusting parameters of the machine learning model corresponding to the part detection phase according to the first loss, the second loss and the third loss comprises:
adjusting parameters of the key part detection branch according to the first loss, the second loss and the third loss; and
and adjusting the parameters of the detection branch of the relevant part according to the second loss and the third loss.
11. An action recognition device, comprising:
the image acquisition module is used for acquiring an image to be identified;
the feature extraction module is used for extracting target image features from the image to be recognized, wherein the target image features are associated with key parts of target objects for executing actions and associated parts having association relations with the key parts;
the part detection module is used for detecting the part of the target object executing the action in the image to be recognized according to the characteristics of the target image to obtain a part detection result;
and the action recognition module is used for recognizing the action executed by the target object in the image to be recognized according to the part detection result to obtain an action recognition result.
12. An electronic device, comprising:
a processor; and
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the action recognition method of any one of claims 1 to 10.
13. A storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the action recognition method according to any one of claims 1 to 10.
CN202111276949.0A 2021-10-29 2021-10-29 Action recognition method and device, electronic equipment and storage medium Pending CN114187650A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111276949.0A CN114187650A (en) 2021-10-29 2021-10-29 Action recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111276949.0A CN114187650A (en) 2021-10-29 2021-10-29 Action recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114187650A true CN114187650A (en) 2022-03-15

Family

ID=80540527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111276949.0A Pending CN114187650A (en) 2021-10-29 2021-10-29 Action recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114187650A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821801A (en) * 2022-05-10 2022-07-29 百度在线网络技术(北京)有限公司 Motion recognition method, model training method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
US11735018B2 (en) Security system with face recognition
US11196966B2 (en) Identifying and locating objects by associating video data of the objects with signals identifying wireless devices belonging to the objects
Othman et al. A new IoT combined body detection of people by using computer vision for security application
US10555393B1 (en) Face recognition systems with external stimulus
CN109032356B (en) Sign language control method, device and system
WO2016004673A1 (en) Intelligent target recognition device, system and method based on cloud service
US20170213080A1 (en) Methods and systems for automatically and accurately detecting human bodies in videos and/or images
US10769909B1 (en) Using sensor data to detect events
US20190258866A1 (en) Human presence detection in edge devices
US11195408B1 (en) Sending signals for help during an emergency event
US10212778B1 (en) Face recognition systems with external stimulus
US11659144B1 (en) Security video data processing systems and methods
CN109951363B (en) Data processing method, device and system
US20200401853A1 (en) Smart video surveillance system using a neural network engine
US11501618B1 (en) Security device with user-configurable motion detection settings
US10789832B2 (en) System and method for preventing false alarms due to display images
US10490033B1 (en) Customized notifications based on device characteristics
CN112154383A (en) Processing commands
US20240242496A1 (en) Adversarial masks for scene-customized false detection removal
CN114187650A (en) Action recognition method and device, electronic equipment and storage medium
US10914811B1 (en) Locating a source of a sound using microphones and radio frequency communication
US11032762B1 (en) Saving power by spoofing a device
US20220335725A1 (en) Monitoring presence or absence of an object using local region matching
CN115118536B (en) Sharing method, control device and computer readable storage medium
CN115147894A (en) Image processing method, image processing apparatus, electronic device, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination