CN116403285A - Action recognition method, device, electronic equipment and storage medium - Google Patents

Action recognition method, device, electronic equipment and storage medium

Info

Publication number
CN116403285A
CN116403285A (application CN202310431061.2A)
Authority
CN
China
Prior art keywords
sequence
bone
key point
image
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310431061.2A
Other languages
Chinese (zh)
Inventor
李龙腾
卢飞翔
吕以豪
张良俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310431061.2A
Publication of CN116403285A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an action recognition method, an action recognition device, an electronic device, a storage medium and a program product, and relates to the technical field of artificial intelligence, in particular to the fields of computer vision, deep learning and three-dimensional reconstruction. The specific implementation scheme is as follows: determining a skeleton key point group sequence of a target object based on an image sequence; determining a bone included angle sequence of the target object based on the bone key point group sequence; and performing motion recognition on the target object based on an image feature sequence, a bone key point feature sequence and a bone included angle feature sequence, and determining a motion recognition result of the target object, wherein the image feature sequence is generated based on the image sequence, the bone key point feature sequence is generated based on the bone key point group sequence, and the bone included angle feature sequence is generated based on the bone included angle sequence.

Description

Action recognition method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the fields of computer vision technology, deep learning technology, and three-dimensional reconstruction technology, and more particularly, to a motion recognition method, apparatus, electronic device, storage medium, and program product.
Background
Computer vision technology is a science that studies how to make a computer "see". Computer vision techniques may be applied in scenes such as image recognition, image semantic understanding, image retrieval, three-dimensional object reconstruction, virtual reality, motion recognition, etc. For each scene, how to use computer vision technology to make the generated result reasonable and accurate is worth exploring.
Disclosure of Invention
The present disclosure provides an action recognition method, apparatus, electronic device, storage medium, and program product.
According to an aspect of the present disclosure, there is provided an action recognition method including: determining a skeleton key point group sequence of a target object based on the image sequence; determining a bone angle sequence of the target object based on the bone key point group sequence; and performing motion recognition on the target object based on an image feature sequence, a bone key point feature sequence and a bone included angle feature sequence, and determining a motion recognition result of the target object, wherein the image feature sequence is generated based on the image sequence, the bone key point feature sequence is generated based on the bone key point group sequence, and the bone included angle feature sequence is generated based on the bone included angle sequence.
According to another aspect of the present disclosure, there is provided an action recognition apparatus including: the key point determining module is used for determining a skeleton key point group sequence of the target object based on the image sequence; the included angle determining module is used for determining a skeleton included angle sequence of the target object based on the skeleton key point group sequence; and an action recognition module, configured to perform action recognition on the target object based on an image feature sequence, a bone key point feature sequence, and a bone angle feature sequence, and determine an action recognition result of the target object, where the image feature sequence is generated based on the image sequence, the bone key point feature sequence is generated based on the bone key point group sequence, and the bone angle feature sequence is generated based on the bone angle sequence.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as disclosed herein.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as disclosed herein.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as disclosed herein.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which action recognition methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of motion recognition according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a network structure diagram of an action recognition model according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram for determining a sequence of bone key points groups, according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of an action recognition device according to an embodiment of the present disclosure; and
fig. 6 schematically illustrates a block diagram of an electronic device adapted to implement a method of motion recognition, according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides an action recognition method, apparatus, electronic device, storage medium, and program product.
According to an embodiment of the present disclosure, an action recognition method may include: based on the image sequence, a sequence of skeletal keypoints sets of the target object is determined. And determining a bone included angle sequence of the target object based on the bone key point group sequence. And performing motion recognition on the target object based on the image feature sequence, the bone key point feature sequence and the bone included angle feature sequence, and determining a motion recognition result of the target object. The image feature sequence is generated based on the image sequence, the bone key point feature sequence is generated based on the bone key point group sequence, and the bone included angle feature sequence is generated based on the bone included angle sequence.
In the technical scheme of the disclosure, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing, applying and the like of the personal information of the user all conform to the regulations of related laws and regulations, necessary security measures are adopted, and the public order harmony is not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 1 schematically illustrates an exemplary system architecture to which the action recognition method and apparatus may be applied, according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the action recognition method and apparatus may be applied may include a terminal device, but the terminal device may implement the action recognition method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include an information acquisition device 101, a network 102, and a server 103. Network 102 is the medium used to provide a communication link between information gathering device 101 and server 103. Network 102 may include various connection types, such as wired and/or wireless communication links, and the like.
A user can interact with the server 103 through the network 102 using the information acquisition device 101 to receive or transmit video or the like. Various communication client applications may be installed on the information gathering device 101, such as instant messaging tools, mailbox clients, and/or social platform software, to name a few.
The information gathering device 101 may be a variety of electronic devices having a display screen and supporting web browsing including, but not limited to, one or more of an event camera, a depth camera, a binocular camera, a monocular camera.
The server 103 may be a server providing various services, such as a background management server (merely an example) providing support for a user to utilize video transmitted by the information acquisition apparatus 101. The background management server may analyze the received video and the like.
The action recognition method provided by the embodiments of the present disclosure may be generally performed by the server 103. Accordingly, the action recognition device provided by the embodiments of the present disclosure may be generally disposed in the server 103. The action recognition method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 103 and is capable of communicating with the information acquisition device 101 and/or the server 103. Accordingly, the action recognition device provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 103 and capable of communicating with the information acquisition device 101 and/or the server 103.
For example, in a sports venue, event cameras may be hung above the venue, and binocular cameras and depth cameras may be mounted on the sides of the venue and around the playing field. The server may receive video from the event camera, the binocular camera, and the depth camera, acquire an image sequence from the video, and determine a skeleton key point group sequence of the target object. A bone included angle sequence of the target object is then determined based on the bone key point group sequence, and motion recognition is performed on the target object based on the image feature sequence, the bone key point feature sequence and the bone included angle feature sequence to determine a motion recognition result. Alternatively, these operations may be performed by a server or server cluster capable of communicating with the information acquisition device 101 and/or the server 103, ultimately determining a motion recognition result for the target object, e.g., an athlete.
It should be understood that the number of information gathering devices, networks, and servers in fig. 1 are merely illustrative. There may be any number of information gathering devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely representative of the operations for the purpose of description, and should not be construed as representing the order of execution of the respective operations. The method need not be performed in the exact order shown unless explicitly stated.
Fig. 2 schematically illustrates a flow chart of a method of motion recognition according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S230.
In operation S210, a bone key point group sequence of the target object is determined based on the image sequence.
In operation S220, a bone included angle sequence of the target object is determined based on the bone key point group sequence.
In operation S230, motion recognition is performed on the target object based on the image feature sequence, the bone key point feature sequence, and the bone angle feature sequence, and a motion recognition result of the target object is determined.
According to an embodiment of the present disclosure, the image feature sequence is generated based on the image sequence, the bone key point feature sequence is generated based on the bone key point group sequence, and the bone included angle feature sequence is generated based on the bone included angle sequence.
According to an embodiment of the present disclosure, the image sequence may include a plurality of frame images having a time sequence. For example, the image sequence includes an image acquired at time t1, an image acquired at time t2, and an image acquired at time t3, where times t1, t2, and t3 are three consecutive times.
According to embodiments of the present disclosure, a sequence of bone keypoint groups of a target object may be determined based on the sequence of images. The sequence of bone keypoint groups may comprise a plurality of bone keypoint groups having a time sequence in one-to-one correspondence with the sequence of images. Each bone keypoint group comprises a plurality of bone keypoints, and each bone keypoint may comprise three-dimensional keypoint coordinates.
For example, based on an image sequence including an image acquired at time t1, an image acquired at time t2, and an image acquired at time t3, a bone key group of the target object at time t1, a bone key group at time t2, and a bone key group at time t3 may be determined, forming a bone key group sequence. Each bone keypoint group may include an arm keypoint, a shoulder keypoint, a wrist keypoint, a leg keypoint, an ankle keypoint, and a crotch keypoint of a target object such as a human body.
According to embodiments of the present disclosure, a sequence of bone angles for a target object may be determined based on a sequence of bone key points sets. For example, for each bone key group, a coordinate vector for each segment of bone is derived based on the three-dimensional key coordinates of each bone key. Based on the coordinate vector of each segment of bone, a bone angle is determined. For example, the bone keypoint group includes three-dimensional keypoint coordinates of each of the right shoulder keypoint, the right elbow keypoint, and the right hand keypoint, and a coordinate vector of the bone between the right shoulder and the right elbow is determined based on the three-dimensional keypoint coordinates of each of the right shoulder keypoint and the right elbow keypoint. The coordinate vector of the bone between the right elbow and the right hand is determined based on the three-dimensional keypoint coordinates of the right elbow keypoint and the right hand keypoint, respectively. The bone angle of the right shoulder-right elbow-right hand is determined based on the coordinate vector of the bone between the right shoulder-right elbow and the coordinate vector of the bone between the right elbow-right hand. Similarly, based on the bone key point group sequence, a bone included angle sequence corresponding to the bone key point group sequence one by one can be determined. The sequence of bone angles may include, but is not limited to, a plurality of bone angles having a timing relationship, and the sequence of bone angles may also include a plurality of bone angle groups having a timing relationship, each bone angle group including a plurality of bone angles.
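As an illustration of the bone-angle computation just described, the following is a minimal sketch in Python. It assumes each keypoint is given as 3D coordinates; the function name, the example coordinates and the small stabilizer term are illustrative, not part of the patent.

```python
import numpy as np

def bone_angle(joint_a, joint_b, joint_c):
    """Angle (degrees) at joint_b formed by bones b->a and b->c,
    e.g. right shoulder - right elbow - right hand."""
    v1 = np.asarray(joint_a, dtype=float) - np.asarray(joint_b, dtype=float)
    v2 = np.asarray(joint_c, dtype=float) - np.asarray(joint_b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical 3D keypoint coordinates for one bone keypoint group:
angle = bone_angle([0.10, 1.40, 0.20],   # right shoulder
                   [0.30, 1.10, 0.20],   # right elbow
                   [0.50, 1.30, 0.30])   # right hand
```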
According to an embodiment of the present disclosure, performing motion recognition on a target object based on an image feature sequence, a bone key point feature sequence, and a bone angle feature sequence, and determining a motion recognition result of the target object may include: inputting the image feature sequence, the bone key point feature sequence and the bone included angle feature sequence into a motion recognition model to obtain the motion recognition result. But it is not limited thereto. It may further include: matching the image feature sequence, the bone key point feature sequence and the bone included angle feature sequence against template feature sequences to determine a matching result, and, in a case where the matching result indicates a match with a template feature sequence, taking the action mode matched with that template feature sequence as the action recognition result.
According to a related example, the motion recognition result of the target object may be determined based on only one or two of the image feature sequence, the bone key point feature sequence, and the bone angle feature sequence. Compared with such approaches, determining the motion recognition result based on all three feature sequences combines data of various forms, so that the data are rich in variety and large in information quantity, improving the accuracy of the motion recognition result.
According to an embodiment of the present disclosure, for operation S230 shown in fig. 2, performing motion recognition on a target object based on an image feature sequence, a bone key point feature sequence, and a bone angle feature sequence, determining a motion recognition result of the target object may include: for each image feature in the image feature sequence, determining a target bone key point feature matched with the image feature from the bone key point feature sequence according to a time sequence relation, and determining a target bone included angle feature matched with the image feature from the bone included angle feature sequence. And generating a target comprehensive feature based on the image feature, the target skeleton key point feature and the target skeleton included angle feature to obtain a target comprehensive feature sequence. And performing action recognition on the target object based on the target comprehensive feature sequence, and determining an action recognition result of the target object.
According to an embodiment of the present disclosure, a target bone keypoint feature that matches the image feature is determined from a sequence of bone keypoint features according to a time-series relationship. For example, for an image feature at time t1, a bone keypoint feature at time t1 may be determined from a sequence of bone keypoint features as a target bone keypoint feature.
According to an embodiment of the present disclosure, a target bone angle feature that matches the image feature is determined from the sequence of bone angle features according to a time sequence relationship. For example, for the image feature at time t1, the bone-angle feature at time t1 may be determined from the bone-angle feature sequence as the target bone-angle feature.
According to an embodiment of the present disclosure, generating a target comprehensive feature based on the image feature, the target bone keypoint feature, and the target bone angle feature may include: splicing the image feature, the target bone key point feature and the target bone included angle feature to generate the target comprehensive feature. But it is not limited thereto. It may further include: splicing the three features to generate an initial comprehensive feature, and performing feature extraction on the initial comprehensive feature to obtain the target comprehensive feature.
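A minimal sketch of this per-timestep fusion, assuming PyTorch; none of the feature dimensions or names below come from the patent:

```python
import torch
import torch.nn as nn

# One feature vector per frame, matched across modalities by timestamp.
img_feats = torch.randn(30, 256)   # image feature sequence, T x D_img
kpt_feats = torch.randn(30, 128)   # bone keypoint feature sequence, T x D_kpt
ang_feats = torch.randn(30, 64)    # bone angle feature sequence, T x D_ang

# Splice (concatenate) the matched features into the initial comprehensive feature.
initial = torch.cat([img_feats, kpt_feats, ang_feats], dim=-1)   # T x 448

# Optional further feature extraction to obtain the target comprehensive feature.
proj = nn.Linear(448, 256)
target_integrated = proj(initial)                                # T x 256
```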
According to an embodiment of the present disclosure, performing motion recognition on the target object based on the target integrated feature sequence and determining the motion recognition result of the target object may include: inputting the target integrated feature sequence into the action recognition module to obtain the action recognition result. The action recognition module may include a Transformer encoder-decoder and a classifier, but is not limited thereto, and may also include an LSTM (Long Short-Term Memory) network and a classifier. The classifier may include a fully connected layer and an activation function. The network structure of the motion recognition module is not limited, as long as it can obtain a motion recognition result based on the target integrated feature sequence. A sketch is given below.
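The following is one plausible recognition module, using a Transformer encoder followed by a fully connected classifier; layer counts, dimensions and the pooling strategy are assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn

class ActionRecognitionHead(nn.Module):
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, num_classes)  # fully connected layer

    def forward(self, x):            # x: (batch, T, dim) target integrated features
        h = self.encoder(x)          # temporal features via self-attention
        pooled = h.mean(dim=1)       # pool over the time dimension
        return self.classifier(pooled).softmax(dim=-1)  # activation -> class probabilities

probs = ActionRecognitionHead()(torch.randn(1, 30, 256))
```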
According to the embodiment of the disclosure, the target object is identified based on the target comprehensive feature sequence, so that the time sequence features with relevance in the target comprehensive feature sequence can be utilized while combining a plurality of features of different types of image features, target skeleton key point features and target skeleton included angle features. Therefore, the action recognition result is accurate and effective, and the processing efficiency is high.
According to an embodiment of the present disclosure, before performing operation S210 as shown in fig. 2, determining a bone key point group sequence of a target object based on an image sequence, the action recognition method may further include the following operations.
For example, feature extraction is performed on the image sequence and the event data sequence with the same time sequence as the image sequence, so as to obtain an image feature sequence. And extracting features of the bone included angle sequence to obtain a bone included angle feature sequence. And extracting features of the skeleton key point group sequences to obtain skeleton key point feature sequences.
According to embodiments of the present disclosure, the images in the image sequence may comprise color images, such as RGB images. The sequence of event data may correspond one-to-one to the sequence of images. Each event data may be acquired by an event camera. Based on the event mechanism, for each pixel location captured, event data for the pixel is generated where the pixel value changes beyond a predetermined pixel threshold. The pixel value may include a gray value, a luminance value, or a light intensity of the pixel. The event data may include: pixel coordinates of the event, a timestamp of the event occurrence, and a polarity of the event occurrence. The polarity of the occurrence of an event may be used to characterize a change in pixel value, for example, to characterize a polarity of a change in pixel value from low to high or to characterize a polarity of a change in pixel value from high to low.
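A sketch of this event-data layout, plus one common way of accumulating events into a per-frame map that can be stacked with the RGB image; the field names and the accumulation scheme are illustrative assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Event:
    x: int           # pixel column where the change occurred
    y: int           # pixel row where the change occurred
    t: float         # timestamp of the event
    polarity: int    # +1: pixel value changed low -> high; -1: high -> low

def accumulate_events(events, height, width):
    """Sum event polarities per pixel over one frame interval."""
    frame = np.zeros((height, width), dtype=np.float32)
    for e in events:
        frame[e.y, e.x] += e.polarity
    return frame
```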
According to an embodiment of the present disclosure, performing feature extraction on the image sequence and the event data sequence to obtain an image feature sequence may include: inputting the image sequence and the event data sequence into an image feature extraction module to obtain the image feature sequence. The image feature extraction module may include, but is not limited to, a ResNet (residual network). Any network structure that can be used to extract image features will do.
According to an alternative embodiment of the present disclosure, the operation "performing feature extraction on the image sequence and the event data sequence to obtain the image feature sequence" may be replaced with the operation "performing feature extraction on the image sequence to obtain the image feature sequence".
Compared with an image feature sequence obtained by extracting features from the image sequence alone, the feature extraction provided by the embodiment of the disclosure can use the event data sequence to denoise the image sequence, highlight the features of a dynamic target object in the image sequence, and filter out the features of static objects, so that the features in the image feature sequence are effective and accurate.
According to an embodiment of the present disclosure, performing feature extraction on the bone included angle sequence to obtain a bone included angle feature sequence may include: inputting the bone included angle sequence into a bone included angle feature extraction module to obtain the bone included angle feature sequence. The bone angle feature extraction module may include a multilayer perceptron (MLP), but is not limited thereto, as long as it is a network structure or algorithm capable of converting bone angle data into feature vectors.
According to an embodiment of the present disclosure, performing feature extraction on the bone key point group sequence to obtain a bone key point feature sequence may include: inputting the bone key point group sequence into a key point feature extraction module to obtain the bone key point feature sequence. The key point feature extraction module may include PointNet (a point cloud processing network), but is not limited thereto, as long as it is a network structure or algorithm capable of extracting features from key point data.
According to the embodiment of the disclosure, different characteristic extraction modes are selected according to different data, so that the pertinence is high and the processing efficiency is high.
Fig. 3 schematically illustrates a network structure diagram of an action recognition model according to an embodiment of the present disclosure.
As shown in fig. 3, the action recognition method may be performed using an action recognition model. The motion recognition model may include an image feature extraction module M310, a bone angle feature extraction module M320, a keypoint feature extraction module M330, and a motion recognition module M340.
The image sequence 310 and the event data sequence 320 may be input into an image feature extraction module M310, resulting in an image feature sequence 330. The bone angle sequence 340 is input to the bone angle feature extraction module M320 to obtain a bone angle feature sequence 350. The bone key point group sequence 360 is input into the key point feature extraction module M330 to obtain a bone key point feature sequence 370. And splicing the image features, the target bone key point features and the target bone included angle features to generate target comprehensive features. Thereby yielding a target integrated feature sequence 380. The target integrated feature sequence 380 is input to the motion recognition module M340, resulting in a motion recognition result 390.
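The forward pass of Fig. 3 might be wired together as in the following skeleton; the module arguments are placeholders for M310 to M340, and everything here is an assumption about one possible implementation rather than the patent's own code:

```python
import torch
import torch.nn as nn

class ActionRecognitionModel(nn.Module):
    def __init__(self, image_net, angle_net, keypoint_net, recognition_net):
        super().__init__()
        self.image_net = image_net              # M310, e.g. a ResNet over image + event data
        self.angle_net = angle_net              # M320, e.g. an MLP over bone angles
        self.keypoint_net = keypoint_net        # M330, e.g. PointNet over keypoint groups
        self.recognition_net = recognition_net  # M340, e.g. Transformer + classifier

    def forward(self, images, events, angles, keypoints):
        f_img = self.image_net(images, events)   # image feature sequence (330)
        f_ang = self.angle_net(angles)           # bone angle feature sequence (350)
        f_kpt = self.keypoint_net(keypoints)     # bone keypoint feature sequence (370)
        fused = torch.cat([f_img, f_kpt, f_ang], dim=-1)  # target integrated sequence (380)
        return self.recognition_net(fused)       # action recognition result (390)
```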
According to an embodiment of the present disclosure, for operation S210 as shown in fig. 2, obtaining a bone key point group sequence of a target object based on an image sequence may include: and obtaining a skeleton key point group sequence of the target object based on the image sequence and the event data sequence with the same time sequence as the image sequence.
According to an embodiment of the present disclosure, obtaining a bone key point group sequence of a target object based on an image sequence and an event data sequence identical to the image sequence timing may include: and inputting the image sequence and the event data sequence into a second image feature extraction module to obtain a second image feature sequence. And inputting the second image characteristic sequence into a key point identification module to obtain a skeleton key point group sequence.
According to an embodiment of the present disclosure, the second image feature extraction module may be the same as or different from the image feature extraction module described above. Any module capable of extracting image features will do, and the event data sequence may be used to denoise the image sequence.
According to embodiments of the present disclosure, the keypoint identification module may include a visual codec, but is not limited thereto, and may also be a lightweight visual codec or a MobilePose (lightweight pose estimation) model.
According to an alternative embodiment of the present disclosure, inputting the second image feature sequence into the keypoint identification module to obtain the skeletal keypoint group sequence may include: inputting the second image feature sequence into a pose simulation model to obtain a model parameter sequence; inputting the model parameter sequence into the pose model to obtain an updated pose simulation model sequence; and obtaining the skeleton key point group sequence based on the updated pose simulation model sequence. But it is not limited thereto. The second image feature sequence may also be directly input into the key point recognition module to obtain the skeleton key point group sequence.
According to embodiments of the present disclosure, the model parameters may include pose parameters and shape parameters. The pose simulation model may include a preset three-dimensional model whose morphological changes are controlled by a fixed number of parameters. For example, the model parameter sequence may be input into the pose simulation model to obtain an updated pose simulation model sequence, which presents a form matched with each bone included angle of the target object in the image sequence. In a case where the target object is a human, the pose simulation model may include an SMPL (Skinned Multi-Person Linear) model. A skeleton key point group sequence is then obtained based on the updated pose simulation model sequence.
According to the embodiment of the disclosure, the skeleton key point group sequence of the target object is determined based on the image sequence and the event data sequence which is the same as the image sequence in time sequence, the image sequence can be subjected to noise reduction processing by using the event data sequence, and noise in the image sequence is removed, so that the obtained skeleton key point group sequence has high precision.
According to an embodiment of the present disclosure, for operation S210 shown in fig. 2, determining a bone key group sequence of a target object based on an image sequence and an event data sequence identical to the image sequence timing may further include a correction operation.
For example, an initial sequence of bone key points of the target object is obtained based on the sequence of images and the sequence of event data that is time-sequential to the sequence of images. Based on the image sequence and the depth data sequence, a sequence of auxiliary skeletal keypoint groups of the target object is determined. A bone key set sequence is determined based on the initial bone key set sequence and the auxiliary bone key set sequence.
According to the embodiment of the disclosure, the skeleton key point group sequence obtained based on the image sequence and the event data sequence having the same time sequence as the image sequence may be taken as the initial skeleton key point group sequence. Based on the image sequence and the depth data sequence, an auxiliary skeleton key point group sequence of the target object is determined, and the initial skeleton key point group sequence is corrected with the auxiliary skeleton key point group sequence, so that the accuracy of the resulting skeleton key point group sequence is high.
According to embodiments of the present disclosure, each image in the image sequence may include an image acquired by a binocular camera. Each depth data in the sequence of depth data may include data acquired by a depth camera. The depth data may include a distance of each pixel of the target object from the depth camera in a world coordinate system.
According to embodiments of the present disclosure, a 3D motion sensing camera may be utilized to acquire both color images and depth images, resulting in an image sequence and a depth data sequence.
According to an embodiment of the present disclosure, determining a sequence of auxiliary skeletal keypoint groups of a target object based on the sequence of images and the sequence of depth data may include: target depth data is determined from the sequence of depth data that matches the image. And determining a three-dimensional key point coordinate set of the target object in the real world based on the image and the target depth data to obtain an auxiliary skeleton key point set. And further determining the auxiliary skeleton key point group sequence.
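Recovering a real-world 3D keypoint from a 2D image keypoint and its matched depth value can be done with a standard pinhole back-projection, as in this sketch; the intrinsics (fx, fy, cx, cy) are illustrative and would come from camera calibration, not from the patent:

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth in meters -> 3D point in camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical keypoint at pixel (640, 360), 3.2 m from the depth camera:
p3d = backproject(640, 360, 3.2, fx=900.0, fy=900.0, cx=640.0, cy=360.0)
```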
According to an embodiment of the present disclosure, determining a bone key point set sequence based on an initial bone key point set sequence and an auxiliary bone key point set sequence may include: in the case of a match between the initial bone key set sequence and the auxiliary bone key set sequence, the bone key set sequence is determined based on the initial bone key set sequence. Under the condition that mismatch between the initial skeleton key point group sequence and the auxiliary skeleton key point group sequence is determined, updating the initial skeleton key point group sequence based on the auxiliary skeleton key point group sequence to obtain a skeleton key point group sequence.
Fig. 4 schematically illustrates a flow diagram for determining a sequence of bone key points sets, according to an embodiment of the present disclosure.
As shown in FIG. 4, determining the bone key point group sequence may include operations S410 to S450.
In operation S410, an initial bone key point group sequence of the target object is obtained based on the image sequence and the event data sequence having the same timing as the image sequence.
In operation S420, a sequence of auxiliary skeletal key points of the target object is determined based on the image sequence and the depth data sequence.
In operation S430, it is determined whether there is a match between the initial bone key point group sequence and the auxiliary bone key point group sequence. In case that a match between the initial bone key group sequence and the auxiliary bone key group sequence is determined, operation S440 is performed. Otherwise, S450 is performed.
In operation S440, a bone key point group sequence is determined based on the initial bone key point group sequence.
In operation S450, the initial bone key point group sequence is updated based on the auxiliary bone key point group sequence, resulting in a bone key point group sequence.
According to an embodiment of the present disclosure, for operation S430 as shown in fig. 4, determining whether there is a match between the initial bone key point group sequence and the auxiliary bone key point group sequence may include: for each initial bone key point group in the initial bone key point group sequence, determining the target auxiliary bone key point group matched in time sequence with the initial bone key point group. Each initial bone key point in the initial bone key point group is compared with the target auxiliary bone key point at the matching pixel position in the target auxiliary bone key point group, and a key point coordinate difference between the two is determined. In a case where the key point coordinate difference is greater than a key point coordinate threshold, the initial bone key point does not match the target auxiliary bone key point; in a case where the difference is less than or equal to the threshold, they match. In a case where the number of unmatched initial bone key points in an initial bone key point group reaches a predetermined number threshold, the initial bone key point group does not match the auxiliary bone key point group. In a case where the number of unmatched initial bone key point groups reaches a predetermined number threshold, the initial bone key point group sequence does not match the auxiliary bone key point group sequence. A sketch of this test is given below.
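A minimal sketch of the two-level threshold test, assuming each group is an (N, 3) array of 3D keypoint coordinates in matched order; both thresholds are illustrative:

```python
import numpy as np

def groups_match(initial_group, auxiliary_group,
                 coord_threshold=0.05, max_mismatched_points=3):
    """True if fewer than max_mismatched_points keypoints differ by more
    than coord_threshold between the two (N, 3) coordinate arrays."""
    diffs = np.linalg.norm(initial_group - auxiliary_group, axis=-1)
    return int((diffs > coord_threshold).sum()) < max_mismatched_points

def sequences_match(initial_seq, auxiliary_seq, max_unmatched_groups=5):
    """True if fewer than max_unmatched_groups timestamps fail the group test."""
    unmatched = sum(not groups_match(i, a)
                    for i, a in zip(initial_seq, auxiliary_seq))
    return unmatched < max_unmatched_groups
```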
According to the embodiment of the disclosure, in a case where a match between the initial bone key point group sequence and the auxiliary bone key point group sequence is determined, the accuracy of the initial bone key point group sequence satisfies a predetermined condition, and the initial bone key point group sequence may be taken as the bone key point group sequence. But it is not limited thereto. The initial skeleton key point group sequence and the auxiliary skeleton key point group sequence may also be fused, for example by weighted summation according to the time sequence relationship, to obtain the skeleton key point group sequence.
According to an embodiment of the present disclosure, in a case where a mismatch between the initial bone key point group sequence and the auxiliary bone key point group sequence is determined, the initial bone key point group sequence may be updated based on the image sequence and the depth data sequence, resulting in a bone key point group sequence.
According to an embodiment of the present disclosure, updating an initial bone key point group sequence based on an image sequence and a depth data sequence to obtain a bone key point group sequence may include: a target initial skeletal keypoint that does not match the auxiliary skeletal keypoint group sequence is determined from the initial skeletal keypoint group sequence, and texture node data, such as three-dimensional coordinates of the texture node in a world coordinate system, of each of a plurality of texture nodes associated with the target initial skeletal keypoint is determined based on the image sequence and the depth data sequence. Based on the plurality of texture node data, fitted skeletal keypoints are determined. And replacing the initial bone key points of the target by the fitting bone key points to finish updating.
For example, suppose the target initial skeletal keypoint is an initial elbow keypoint whose three-dimensional keypoint coordinates do not match those of the auxiliary elbow keypoint. Texture node data, such as texture node three-dimensional coordinates, for texture nodes around the elbow associated with the initial elbow keypoint may be determined based on the image sequence and the depth data sequence. A fitted skeletal keypoint may then be determined based on the plurality of texture node data; for example, weighted summation over the texture node data surrounding the initial elbow keypoint yields the three-dimensional keypoint coordinates of the fitted elbow keypoint, as sketched below.
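A sketch of fitting the replacement keypoint by weighted summation over the surrounding texture nodes; the inverse-distance weighting is one plausible choice, not the patent's specified scheme:

```python
import numpy as np

def fit_keypoint(texture_nodes, initial_estimate):
    """texture_nodes: (K, 3) world coordinates of nodes around the joint;
    initial_estimate: (3,) coordinates of the mismatched initial keypoint."""
    d = np.linalg.norm(texture_nodes - initial_estimate, axis=-1)
    w = 1.0 / (d + 1e-6)          # closer nodes weigh more
    w /= w.sum()
    return (w[:, None] * texture_nodes).sum(axis=0)  # fitted 3D keypoint
```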
According to the embodiment of the disclosure, verifying the initial skeleton key point group sequence with the auxiliary skeleton key point group sequence allows data obtained by different processing modes to corroborate each other, improving the accuracy of the skeleton key point group sequence. In a case where the error between the initial skeleton key point group sequence and the actual result is large, the image sequence and the depth data sequence can be used to update the initial skeleton key point group sequence, improving the effectiveness of the skeleton key point group sequence. In addition, various data of different types are utilized, improving the data utilization rate. When the motion recognition method provided by the embodiment of the disclosure is applied to training or competition scenes of athletes, this raises the value of the motion recognition result and the skeleton key point group sequence as reference results.
According to an embodiment of the present disclosure, after performing operation S230 as shown in fig. 2, performing motion recognition on a target object based on an image feature sequence, a bone key point feature sequence, and a bone angle feature sequence, determining a motion recognition result of the target object, the motion recognition result may further include the following operations.
For example, a motion trajectory of the associated object is determined based on the sequence of event data and the sequence of images. The motion of the associated object is generated based on the motion of the target object. Based on the motion trajectories, a motion pattern of the associated object is determined. When it is determined that the motion pattern matches the motion recognition result, the motion recognition result is set as the target motion recognition result.
According to embodiments of the present disclosure, the associated object may be an object associated with the target object. In a ball game scenario, the target object may be a player and the associated object may be a ball, such as a shuttlecock, football, basketball, or the like.
According to an embodiment of the present disclosure, determining a motion trajectory of an associated object based on an event data sequence and an image sequence may include: and filtering background information in the image sequence by using the event data sequence to obtain a target image sequence containing the associated object and the target object. And carrying out target detection on the target image sequence to obtain a detection frame result sequence of the associated object. And determining the motion trail of the associated object based on the detection frame result sequence.
According to an embodiment of the present disclosure, based on the motion trajectory, an action pattern of the associated object may be determined. But it is not limited thereto. The movement speed may also be determined based on the motion trajectory and the movement duration, and the action pattern of the associated object determined based on the motion trajectory and the movement speed. Since the motion of the associated object is generated by the motion of the target object, there is a force relationship between the two, so the action pattern can help verify the accuracy of the action recognition result. When it is determined that the action pattern matches the action recognition result, the action recognition result is taken as the target action recognition result. In a case where the action pattern does not match the action recognition result, the action recognition method may be re-executed to improve the accuracy of the action recognition result.
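Deriving the movement speed from the trajectory and duration, as mentioned above, reduces to path length over elapsed time; a sketch with assumed array inputs:

```python
import numpy as np

def movement_speed(trajectory, timestamps):
    """trajectory: (T, 3) positions over time; timestamps: (T,) seconds."""
    path_length = np.linalg.norm(np.diff(trajectory, axis=0), axis=-1).sum()
    return path_length / (timestamps[-1] - timestamps[0])  # e.g. m/s
```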
According to the embodiment of the disclosure, by utilizing the action recognition method provided by the embodiment of the disclosure, the results of the motion track, the action mode and the like of the associated object can be determined, and the richness of the recognition result is improved by combining the results with the target action recognition result of the target object.
In accordance with an embodiment of the present disclosure, before performing operation S210 as shown in fig. 2, determining a bone key point group sequence of a target object based on an image sequence, the action recognition method may further include: a sequence of images is acquired.
For example, the video is de-framed to obtain successive video frames. A plurality of target video frames are determined from the continuous video frames in a predetermined frame-extracting manner. An image sequence is determined based on the plurality of target video frames.
According to embodiments of the present disclosure, the predetermined frame extraction method may refer to extracting a plurality of target video frames from consecutive video frames at a certain time interval.
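A sketch of the de-framing and interval-based frame extraction using OpenCV; the interval value is illustrative:

```python
import cv2

def extract_image_sequence(video_path, interval=5):
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()     # de-frame: read consecutive video frames
        if not ok:
            break
        if idx % interval == 0:    # keep every interval-th frame
            frames.append(frame)
        idx += 1
    cap.release()
    return frames                  # target video frames forming the image sequence
```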
According to the embodiment of the disclosure, the image sequence is determined by the method, so that the processing precision is ensured, the key actions are prevented from being missed, and the processing efficiency is improved.
According to an embodiment of the present disclosure, there may be a plurality of image sequences having a time sequence relationship with one another.
For example, the predetermined frame extraction method determines, from the continuous video frames, the target video frame at time t1, the target video frame at time t2, and the target video frame at time t3 as image sequence 1, and determines the target video frame at time t3, the target video frame at time t4, and the target video frame at time t5 as image sequence 2, obtaining a plurality of image sequences. The plurality of image sequences may have video frames that overlap temporally with one another, as sketched below.
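A sketch of building such temporally overlapping image sequences with a sliding window; the window and stride values are illustrative:

```python
def sliding_windows(frames, window=3, stride=2):
    """Split frames into overlapping sequences, e.g. [t1,t2,t3], [t3,t4,t5]."""
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]
```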
According to an embodiment of the present disclosure, after operation S230 as shown in fig. 2, the action recognition method provided by the embodiment of the present disclosure may further include the following operations.
For example, the motion recognition results of each of the plurality of image sequences are determined, and a plurality of motion recognition results are obtained. In the case where it is determined that the plurality of motion recognition results are all used to characterize the same motion, a target video segment is determined from the video based on the plurality of image sequences.
According to an embodiment of the present disclosure, in a case where the plurality of motion recognition results all characterize the same motion, a target video clip is determined from the video based on the plurality of image sequences. In a case where the motion recognition results characterize different motions, a target image sequence may be determined from among the plurality of image sequences, and a target video clip determined from the video frames based on the target image sequence. The key video clips can thus be archived for subsequent viewing, reducing resource consumption and avoiding loss of data.
According to the embodiment of the disclosure, the motion recognition method provided by the embodiment of the disclosure may be applied to a badminton scene: an image sequence, a depth data sequence and an event data sequence can be obtained using a plurality of multi-view cameras such as event cameras, depth cameras and binocular cameras. The operations of the action recognition method are performed on the image sequence, the depth data sequence and the event data sequence to realize dynamic tracking of athletes on the badminton court, recognition of actions such as footwork, swings and smashes, and determination of results such as shuttlecock pattern recognition, shuttlecock drop point and shuttlecock speed detection, providing quantitative analysis data for multidimensional analysis.
By utilizing the action recognition method provided by the embodiment of the disclosure, invalid actions are automatically removed while labor is saved, key actions are classified and archived, and review is facilitated.
Fig. 5 schematically illustrates a block diagram of an action recognition device according to an embodiment of the present disclosure.
As shown in fig. 5, the motion recognition apparatus 500 includes: a keypoint determination module 510, an included angle determination module 520, and an action recognition module 530.
The keypoint determination module 510 is configured to determine a sequence of skeletal keypoint groups of the target object based on the sequence of images.
The included angle determining module 520 is configured to determine a bone included angle sequence of the target object based on the bone key point group sequence.
The motion recognition module 530 is configured to perform motion recognition on the target object based on the image feature sequence, the bone key point feature sequence, and the bone angle feature sequence, and determine a motion recognition result of the target object. The image feature sequence is generated based on the image sequence, the bone key point feature sequence is generated based on the bone key point group sequence, and the bone included angle feature sequence is generated based on the bone included angle sequence.
According to an embodiment of the present disclosure, the action recognition module includes: matching sub-module, concatenation sub-module and discernment sub-module.
A matching sub-module, configured to, for each image feature in the image feature sequence, determine, according to the time sequence relation, a target bone key point feature matched with the image feature from the bone key point feature sequence, and determine a target bone included angle feature matched with the image feature from the bone included angle feature sequence.
And the splicing sub-module is used for generating a target comprehensive feature based on the image feature, the target skeleton key point feature and the target skeleton included angle feature to obtain a target comprehensive feature sequence.
And the recognition sub-module is used for performing action recognition on the target object based on the target comprehensive characteristic sequence and determining the action recognition result of the target object.
According to an embodiment of the present disclosure, the keypoint determination module includes: the first keypoint determination submodule.
And the first key point determining submodule is used for determining a skeleton key point group sequence of the target object based on the image sequence and the event data sequence which is the same as the image sequence in time sequence.
According to an embodiment of the present disclosure, the keypoint determination module includes: the second key point determining submodule, the third key point determining submodule and the fourth key point determining submodule.
The second key point determining sub-module is used for obtaining an initial skeleton key point group sequence of the target object based on the image sequence and the event data sequence which is the same as the image sequence in time sequence.
And a third keypoint determination sub-module for determining a sequence of auxiliary skeletal keypoint groups of the target object based on the image sequence and the depth data sequence.
And a fourth keypoint determination submodule for determining a bone keypoint group sequence based on the initial bone keypoint group sequence and the auxiliary bone keypoint group sequence.
According to an embodiment of the present disclosure, the fourth keypoint determination submodule includes: a first key point determining unit and a second key point determining unit.
A first keypoint determination unit for determining a sequence of bone keypoints sets based on the initial sequence of bone keypoints sets, in case a match between the initial sequence of bone keypoints sets and the sequence of auxiliary bone keypoints sets is determined.
And the second key point determining unit is used for updating the initial skeleton key point group sequence based on the image sequence and the depth data sequence to obtain a skeleton key point group sequence under the condition that the initial skeleton key point group sequence and the auxiliary skeleton key point group sequence are not matched.
According to an embodiment of the present disclosure, the action recognition apparatus further includes: an image feature extraction module, an included angle feature extraction module and a key point feature extraction module.
And the image feature extraction module is used for carrying out feature extraction on the image sequence and the event data sequence with the same time sequence as the image sequence to obtain an image feature sequence.
And the included angle feature extraction module is used for carrying out feature extraction on the bone included angle sequence to obtain a bone included angle feature sequence.
And the key point feature extraction module is used for carrying out feature extraction on the skeleton key point group sequence to obtain a skeleton key point feature sequence.
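One hedged way to picture the three extraction modules is a small multi-branch network (Python/PyTorch; the linear branches are stand-ins for whatever backbones an implementation would use, which the disclosure does not specify):

    import torch.nn as nn

    class FeatureExtractors(nn.Module):
        def __init__(self, img_dim, ang_dim, kp_dim, hidden=128):
            super().__init__()
            # Each branch maps one modality to its own feature sequence.
            self.img_net = nn.Linear(img_dim, hidden)  # image + event data
            self.ang_net = nn.Linear(ang_dim, hidden)  # bone included angles
            self.kp_net = nn.Linear(kp_dim, hidden)    # bone key point groups

        def forward(self, images, angles, keypoints):
            # Inputs of shape (T, dim) yield three (T, hidden) feature sequences.
            return self.img_net(images), self.ang_net(angles), self.kp_net(keypoints)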
According to an embodiment of the present disclosure, the action recognition apparatus further includes: a track determining module, a mode identification module and a first result determining module.
And the track determining module is used for determining a motion trail of the associated object based on the event data sequence and the image sequence. The motion made by the associated object is generated based on the motion of the target object.
And the mode identification module is used for determining the action mode of the associated object based on the motion trail.
And the first result determining module is used for taking the action recognition result as a target action recognition result when the action mode is determined to be matched with the action recognition result.
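A minimal sketch of this consistency check (Python; the compatibility table and the example action names are invented for illustration and are not taken from the disclosure):

    # Which action modes of the associated object (e.g. a ball driven by the
    # target object's action) are compatible with which recognized actions.
    COMPATIBLE = {
        "smash": {"steep_descent"},
        "serve": {"rising_arc"},
    }

    def confirm(action_result, action_mode):
        # Keep the recognition result only when the associated object's
        # action mode matches it; otherwise report no confirmation.
        if action_mode in COMPATIBLE.get(action_result, set()):
            return action_result    # target action recognition result
        return None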
According to an embodiment of the present disclosure, the action recognition apparatus further includes: a frame disassembling module, a frame extraction module and an image determination module.
And the frame disassembling module is used for performing frame disassembling processing on the video to obtain continuous video frames.
And the frame extraction module is used for determining a plurality of target video frames from the continuous video frames according to a preset frame extraction mode.
An image determination module for determining an image sequence based on a plurality of target video frames.
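For example, frame disassembly plus fixed-interval frame extraction might look like the following (Python with OpenCV; the interval step=5 is an arbitrary stand-in for the preset frame extraction mode):

    import cv2

    def sample_frames(video_path, step=5):
        cap = cv2.VideoCapture(video_path)
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()      # frame disassembly: consecutive frames
            if not ok:
                break
            if idx % step == 0:
                frames.append(frame)    # target video frame
            idx += 1
        cap.release()
        return frames                   # frames used to build the image sequence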
According to an embodiment of the present disclosure, there are a plurality of image sequences.
According to an embodiment of the present disclosure, the action recognition apparatus further includes: a second result determining module and a segment determining module.
And the second result determining module is used for determining a respective action recognition result of each of the plurality of image sequences to obtain a plurality of action recognition results.
And the segment determining module is used for determining a target video segment from the video based on the plurality of image sequences under the condition that the plurality of action recognition results are used for representing the same action.
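A hedged sketch of the segment determination (Python; representing each image sequence by the frame span it covers is an assumption about how the segment boundaries would be derived):

    def locate_target_segment(results, spans):
        # results: one action recognition result per image sequence.
        # spans: (start_frame, end_frame) covered by each image sequence.
        if len(set(results)) != 1:
            return None                  # the sequences do not agree on one action
        start = min(s for s, _ in spans)
        end = max(e for _, e in spans)
        return results[0], (start, end)  # action and target video segment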
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as in an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform a method according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements a method according to an embodiment of the present disclosure.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as an action recognition method. For example, in some embodiments, the action recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the action recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the action recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A method of action recognition, comprising:
determining a skeleton key point group sequence of a target object based on the image sequence;
determining a bone included angle sequence of the target object based on the skeleton key point group sequence; and
performing action recognition on the target object based on an image feature sequence, a bone key point feature sequence and a bone included angle feature sequence, and determining an action recognition result of the target object, wherein the image feature sequence is generated based on the image sequence, the bone key point feature sequence is generated based on the skeleton key point group sequence, and the bone included angle feature sequence is generated based on the bone included angle sequence.
2. The method of claim 1, wherein the performing action recognition on the target object based on the image feature sequence, the bone key point feature sequence and the bone included angle feature sequence, and determining the action recognition result of the target object comprises:
for each image feature in the image feature sequence, determining, according to the time sequence relationship, a target bone key point feature matched with the image feature from the bone key point feature sequence, and a target bone included angle feature matched with the image feature from the bone included angle feature sequence;
generating a target comprehensive feature based on the image feature, the target bone key point feature and the target bone included angle feature to obtain a target comprehensive feature sequence; and
performing action recognition on the target object based on the target comprehensive feature sequence, and determining the action recognition result of the target object.
3. The method of claim 1 or 2, wherein the determining a skeleton key point group sequence of the target object based on the image sequence comprises:
determining the skeleton key point group sequence of the target object based on the image sequence and an event data sequence with the same time sequence as the image sequence.
4. The method of any one of claims 1 to 3, wherein the determining a skeleton key point group sequence of the target object based on the image sequence comprises:
obtaining an initial skeleton key point group sequence of the target object based on the image sequence and an event data sequence with the same time sequence as the image sequence;
determining an auxiliary skeleton key point group sequence of the target object based on the image sequence and the depth data sequence; and
the bone key point set sequence is determined based on the initial bone key point set sequence and the auxiliary bone key point set sequence.
5. The method of claim 4, wherein the determining the skeleton key point group sequence based on the initial skeleton key point group sequence and the auxiliary skeleton key point group sequence comprises:
determining the skeleton key point group sequence based on the initial skeleton key point group sequence under the condition that the initial skeleton key point group sequence and the auxiliary skeleton key point group sequence are determined to match; and
under the condition that the initial skeleton key point group sequence and the auxiliary skeleton key point group sequence are not matched, updating the initial skeleton key point group sequence based on the image sequence and the depth data sequence to obtain the skeleton key point group sequence.
6. The method of any one of claims 1 to 5, further comprising:
performing feature extraction on the image sequence and an event data sequence with the same time sequence as the image sequence to obtain the image feature sequence;
performing feature extraction on the bone included angle sequence to obtain the bone included angle feature sequence; and
performing feature extraction on the skeleton key point group sequence to obtain the bone key point feature sequence.
7. The method of any one of claims 1 to 6, further comprising:
determining a motion trail of an associated object based on the event data sequence and the image sequence, wherein a motion made by the associated object is generated based on a motion of the target object;
determining an action mode of the associated object based on the motion trail; and
when it is determined that the action mode matches the action recognition result, taking the action recognition result as a target action recognition result.
8. The method of any of claims 1 to 7, further comprising:
performing frame disassembly processing on a video to obtain continuous video frames;
determining a plurality of target video frames from the continuous video frames according to a preset frame extraction mode; and
determining the image sequence based on the plurality of target video frames.
9. The method of claim 8, wherein there are a plurality of image sequences;
the method further comprises the steps of:
determining a respective action recognition result of each of the plurality of image sequences to obtain a plurality of action recognition results; and
determining a target video segment from the video based on the plurality of image sequences under the condition that the plurality of action recognition results all represent the same action.
10. An action recognition device, comprising:
a key point determining module for determining a skeleton key point group sequence of a target object based on an image sequence;
an included angle determining module for determining a bone included angle sequence of the target object based on the skeleton key point group sequence; and
an action recognition module for performing action recognition on the target object based on an image feature sequence, a bone key point feature sequence and a bone included angle feature sequence, and determining an action recognition result of the target object, wherein the image feature sequence is generated based on the image sequence, the bone key point feature sequence is generated based on the skeleton key point group sequence, and the bone included angle feature sequence is generated based on the bone included angle sequence.
11. The apparatus of claim 10, wherein the action recognition module comprises:
a matching sub-module for, for each image feature in the image feature sequence, determining, according to the time sequence relationship, a target bone key point feature matched with the image feature from the bone key point feature sequence, and a target bone included angle feature matched with the image feature from the bone included angle feature sequence;
a splicing sub-module for generating a target comprehensive feature based on the image feature, the target bone key point feature and the target bone included angle feature to obtain a target comprehensive feature sequence; and
a recognition sub-module for performing action recognition on the target object based on the target comprehensive feature sequence and determining the action recognition result of the target object.
12. The apparatus of claim 10 or 11, wherein the keypoint determination module comprises:
and the first key point determining submodule is used for determining a skeleton key point group sequence of the target object based on the image sequence and an event data sequence with the same time sequence as the image sequence.
13. The apparatus of any of claims 10 to 12, wherein the keypoint determination module comprises:
a second key point determining sub-module for obtaining an initial skeleton key point group sequence of the target object based on the image sequence and an event data sequence with the same time sequence as the image sequence;
a third key point determining sub-module for determining an auxiliary skeleton key point group sequence of the target object based on the image sequence and the depth data sequence; and
a fourth key point determining sub-module for determining the skeleton key point group sequence based on the initial skeleton key point group sequence and the auxiliary skeleton key point group sequence.
14. The apparatus of claim 13, wherein the fourth key point determining sub-module comprises:
a first key point determining unit configured to determine the skeleton key point group sequence based on the initial skeleton key point group sequence in a case where it is determined that the initial skeleton key point group sequence and the auxiliary skeleton key point group sequence match; and
a second key point determining unit configured to update the initial skeleton key point group sequence based on the image sequence and the depth data sequence to obtain the skeleton key point group sequence in a case where the initial skeleton key point group sequence and the auxiliary skeleton key point group sequence do not match.
15. The apparatus of any of claims 10 to 14, further comprising:
an image feature extraction module for performing feature extraction on the image sequence and an event data sequence with the same time sequence as the image sequence to obtain the image feature sequence;
an included angle feature extraction module for performing feature extraction on the bone included angle sequence to obtain the bone included angle feature sequence; and
a key point feature extraction module for performing feature extraction on the skeleton key point group sequence to obtain the bone key point feature sequence.
16. The apparatus of any of claims 10 to 15, further comprising:
a track determining module for determining a motion trail of an associated object based on the event data sequence and the image sequence, wherein a motion made by the associated object is generated based on a motion of the target object;
a mode identification module for determining an action mode of the associated object based on the motion trail; and
a first result determining module for taking the action recognition result as a target action recognition result when it is determined that the action mode matches the action recognition result.
17. The apparatus of any of claims 10 to 16, further comprising:
a frame disassembling module for performing frame disassembly processing on a video to obtain continuous video frames;
a frame extraction module for determining a plurality of target video frames from the continuous video frames according to a preset frame extraction mode; and
an image determination module for determining the image sequence based on the plurality of target video frames.
18. The apparatus of claim 17, wherein there are a plurality of image sequences;
the apparatus further comprises:
a second result determining module for determining a respective action recognition result of each of the plurality of image sequences to obtain a plurality of action recognition results; and
a segment determining module for determining a target video segment from the video based on the plurality of image sequences under the condition that the plurality of action recognition results all represent the same action.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN202310431061.2A 2023-04-20 2023-04-20 Action recognition method, device, electronic equipment and storage medium Pending CN116403285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310431061.2A CN116403285A (en) 2023-04-20 2023-04-20 Action recognition method, device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116403285A (en) 2023-07-07

Family

ID=87007335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310431061.2A Pending CN116403285A (en) 2023-04-20 2023-04-20 Action recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116403285A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863383A (en) * 2023-07-31 2023-10-10 山东大学齐鲁医院(青岛) Walking-aid monitoring method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180235A (en) * 2017-06-01 2017-09-19 陕西科技大学 Human action recognizer based on Kinect
CN110045823A (en) * 2019-03-12 2019-07-23 北京邮电大学 A kind of action director's method and apparatus based on motion capture
CN111680562A (en) * 2020-05-09 2020-09-18 北京中广上洋科技股份有限公司 Human body posture identification method and device based on skeleton key points, storage medium and terminal
CN113887424A (en) * 2021-09-30 2022-01-04 深圳奇迹智慧网络有限公司 Human behavior recognition method and device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111259751B (en) Human behavior recognition method, device, equipment and storage medium based on video
CN112528850B (en) Human body identification method, device, equipment and storage medium
CN109902659B (en) Method and apparatus for processing human body image
CN110941990A (en) Method and device for evaluating human body actions based on skeleton key points
CN111488824A (en) Motion prompting method and device, electronic equipment and storage medium
CN111277912B (en) Image processing method and device and electronic equipment
CN110807410B (en) Key point positioning method and device, electronic equipment and storage medium
CN111611903B (en) Training method, using method, device, equipment and medium of motion recognition model
CN113378770B (en) Gesture recognition method, device, equipment and storage medium
CN111783619B (en) Human body attribute identification method, device, equipment and storage medium
CN116206370B (en) Driving information generation method, driving device, electronic equipment and storage medium
CN111695519A (en) Key point positioning method, device, equipment and storage medium
CN116403285A (en) Action recognition method, device, electronic equipment and storage medium
CN114187392B (en) Virtual even image generation method and device and electronic equipment
CN112241716A (en) Training sample generation method and device
CN114513694A (en) Scoring determination method and device, electronic equipment and storage medium
CN112562045B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN116453222B (en) Target object posture determining method, training device and storage medium
CN105516735B (en) Represent frame acquisition methods and device
CN116524081A (en) Virtual reality picture adjustment method, device, equipment and medium
CN113378005B (en) Event processing method, device, electronic equipment and storage medium
CN116189028B (en) Image recognition method, device, electronic equipment and storage medium
CN116452741B (en) Object reconstruction method, object reconstruction model training method, device and equipment
CN112529895B (en) Method, apparatus, device and storage medium for processing image
US20230401740A1 (en) Data processing method and apparatus, and device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination