CN111539289A - Method and device for identifying action in video, electronic equipment and storage medium - Google Patents

Method and device for identifying action in video, electronic equipment and storage medium

Info

Publication number
CN111539289A
CN111539289A (Application CN202010301132.3A)
Authority
CN
China
Prior art keywords
video frame
frame sequence
action
motion
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010301132.3A
Other languages
Chinese (zh)
Inventor
徐嵚嵛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202010301132.3A priority Critical patent/CN111539289A/en
Publication of CN111539289A publication Critical patent/CN111539289A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a method and a device for identifying actions in a video, electronic equipment and a storage medium. The method comprises the following steps: determining a first sequence of video frames to be identified; inputting the first video frame sequence to be recognized into a motion recognition model to obtain a motion recognition result output by the motion recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence. According to the method, the device, the electronic equipment and the storage medium for recognizing the action in the video, provided by the embodiment of the invention, the video frame sequence to be recognized is input into the action recognition model which is trained in advance, and the action recognition model outputs one or more types of actions corresponding to each moment in the video frame sequence to be recognized, so that the recognition of the actions of multiple types which occur at the same time is realized, and the loss of information is effectively avoided.

Description

Method and device for identifying action in video, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video technologies, and in particular, to a method and an apparatus for identifying an action in a video, an electronic device, and a storage medium.
Background
The video-based action recognition can be applied to multiple fields, such as behavior analysis, man-machine interaction, public safety, intelligent monitoring and the like.
There is also a wide demand for action recognition in sports videos such as soccer, basketball and volleyball videos. For example, when a sports video is edited and highlight moments such as shots need to be located in the video, an action recognition method can quickly achieve this purpose.
In the prior-art motion recognition method, a time interval with potential motion is first selected from the video, features of that time interval are then extracted and classified, and the motion type is recognized according to the classification result. However, the time intervals extracted by this method are independent of each other, the correlation among actions is not considered, only a single action can be recognized in each time interval, and multiple actions that occur simultaneously cannot be recognized.
In sports videos, multiple actions often occur simultaneously; for example, dribbling and defending, dribbling and passing, shooting and blocking, or shooting and rebounding occur together in videos of basketball games. Therefore, the motion recognition method in the prior art cannot be well applied to sports videos.
Disclosure of Invention
The embodiment of the invention provides a method and a device for identifying actions in a video, electronic equipment and a storage medium, which are used for solving the defect that a plurality of actions which occur simultaneously cannot be identified by an action identification method in the prior art.
An embodiment of a first aspect of the present invention provides a method for identifying an action in a video, including:
determining a first sequence of video frames to be identified;
inputting the first video frame sequence to be recognized into a motion recognition model to obtain a motion recognition result output by the motion recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence; wherein,
the action recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label comprises category information of one or more actions corresponding to each time in the sample video frame sequence;
the motion recognition model is used for recognizing motion types in the video based on a first motion characteristic and a time domain space domain characteristic which are obtained by a first video frame sequence to be recognized; wherein the first motion characteristics are capable of characterizing a temporal distribution of the respective category motions in the first sequence of video frames and an association between the respective category motions.
In the above technical solution, the motion recognition model includes a first feature extraction layer, a second feature extraction layer, a feature fusion layer, and a motion classification layer; wherein,
the first feature extraction layer is used for extracting a first action feature from the first video frame sequence;
the second feature extraction layer is used for extracting time domain and space domain features from the first video frame sequence;
the characteristic fusion layer is used for fusing the first action characteristic with the time domain and space domain characteristics to obtain a fused characteristic;
and the action classification layer is used for generating an action recognition result of the first video frame sequence according to the fused features.
In the above technical solution, the motion recognition model includes a first feature extraction layer, a second feature extraction layer, a feature fusion layer, and a motion classification layer;
correspondingly, the step of inputting the first video frame sequence to be recognized into the action recognition model to obtain an action recognition result output by the action recognition model specifically includes:
inputting a first video frame sequence to be identified into a first feature extraction layer of a motion identification model to obtain a first motion feature of the first video frame sequence to be identified;
inputting a first video frame sequence to be identified into a second feature extraction layer of the action identification model to obtain time domain and space domain features of the first video frame sequence to be identified;
inputting the first action characteristic and the time domain and space domain characteristic of the first video frame sequence to be identified into a characteristic fusion layer of an action identification model to obtain a fused characteristic of the first video frame sequence to be identified;
and inputting the fused features of the first video frame sequence to be recognized into the action classification layer of the action recognition model to obtain an action recognition result of the first video frame sequence.
In the above technical solution, the first feature extraction layer includes: an intra-frame feature extraction layer and an action feature extraction layer;
correspondingly, the inputting the first video frame sequence to be recognized into the first feature extraction layer of the motion recognition model to obtain the first motion feature of the first video frame sequence to be recognized specifically includes:
inputting a first video frame sequence to be identified into an intra-frame feature extraction layer of a first feature extraction layer to obtain intra-frame features of all video frames in the first video frame sequence to be identified;
and inputting the intraframe characteristics of each video frame in the first video frame sequence to be identified into the action characteristic extraction layer of the first characteristic extraction layer to obtain the first action characteristics of each category of actions contained in the first video frame sequence to be identified.
In the above technical solution, the action feature extraction layer includes a time structure information extraction layer and a weighting layer;
correspondingly, the step of inputting the intra-frame features of each video frame in the first video frame sequence to be recognized into the action feature extraction layer of the first feature extraction layer to obtain the first action features of each category of actions included in the first video frame sequence to be recognized specifically includes:
inputting the intraframe characteristics of each video frame in the first video frame sequence to be identified into a time structure information extraction layer, and acquiring the second action characteristics of each category of actions contained in the first video frame sequence; wherein the second motion characteristics are capable of characterizing a temporal distribution of respective category motions in the first sequence of video frames;
and inputting the second action characteristics into a weighting layer, wherein the weighting layer gives weighting parameters for describing the relevance among all the classes of actions to the second action characteristics to generate the first action characteristics.
In the above technical solution, the inputting the first video frame sequence to be recognized into the second feature extraction layer of the motion recognition model to obtain the time domain and space domain features of the first video frame sequence to be recognized includes:
and sequentially selecting one video frame from the first video frame sequence to be identified, and generating related time domain and space domain characteristics for the selected video frame according to the second characteristic extraction layer of the action identification model to obtain the time domain and space domain characteristics of the first video frame sequence.
In the above technical solution, the generating, according to the second feature extraction layer of the motion recognition model, a relevant time-domain spatial-domain feature for the selected video frame includes:
selecting a plurality of video frames adjacent to each other in front and back from a first video frame sequence to be identified by taking the selected video frame as a center to form a video frame group;
and inputting the video frame group into a second feature extraction layer to generate time domain and space domain features related to the selected video frame.
The embodiment of the second aspect of the present invention provides an apparatus for identifying actions in a video, including:
a determining module for determining a first sequence of video frames to be identified;
the action recognition module is used for inputting the first video frame sequence to be recognized into an action recognition model to obtain an action recognition result output by the action recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence; wherein,
the action recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label comprises category information of one or more actions corresponding to each time in the sample video frame sequence;
the motion recognition model is used for recognizing motion types in the video based on a first motion characteristic and a time domain space domain characteristic which are obtained by a first video frame sequence to be recognized; wherein the first motion characteristics are capable of characterizing a temporal distribution of the respective category motions in the first sequence of video frames and an association between the respective category motions.
In a third embodiment of the present invention, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the steps of the method for identifying an action in a video according to the first embodiment of the present invention.
A fourth aspect of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method for identifying actions in a video according to the first aspect of the present invention.
According to the method, the device, the electronic equipment and the storage medium for recognizing the action in the video, provided by the embodiment of the invention, the video frame sequence to be recognized is input into the action recognition model which is trained in advance, and the action recognition model outputs one or more types of actions corresponding to each moment in the video frame sequence to be recognized, so that the recognition of the actions of multiple types which occur at the same time is realized, and the loss of information is effectively avoided.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for recognizing an action in a video according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating operation of a motion recognition model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for recognizing actions in a video according to another embodiment of the present invention;
fig. 4 is a schematic diagram of an apparatus for recognizing actions in a video according to an embodiment of the present invention;
fig. 5 illustrates a physical structure diagram of an electronic device.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In some videos such as sports videos, a plurality of actions occur at the same time, and the action recognition method in the prior art can only recognize one of the plurality of actions occurring at the same time, which causes a loss of information.
In view of the above, the embodiment of the present invention provides a method for identifying an action in a video. Fig. 1 is a flowchart of a method for identifying an action in a video according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 101, determining a first video frame sequence to be identified.
In an embodiment of the invention, the first sequence of video frames to be identified may be extracted from a video of a sports game, such as a video of a basketball game, a soccer game or a volleyball game. In other embodiments of the invention, the sequence of video frames to be identified may also be derived from other types of video, such as movies, television shows, etc.
The extraction of the first sequence of video frames to be identified from the video may be performed by methods known in the art, such as ffmpeg or opencv.
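As an illustration of this step, the following is a minimal sketch of extracting a video frame sequence with OpenCV, one of the known tools named above; the file path and the optional sub-sampling step are illustrative assumptions rather than values from the embodiment.

```python
import cv2

def extract_frame_sequence(video_path, step=1):
    """Read a video file and return its frames as a list of BGR images.

    The file path and the sub-sampling step are illustrative; the text only
    states that known tools such as ffmpeg or OpenCV may be used.
    """
    capture = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:          # keep every `step`-th frame
            frames.append(frame)
        index += 1
    capture.release()
    return frames

# e.g. frames = extract_frame_sequence("basketball_game.mp4", step=2)
```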
102, inputting a first video frame sequence to be recognized into a motion recognition model to obtain a motion recognition result output by the motion recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence.
The motion recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label includes information of one or more action categories corresponding to respective times in the sample video frame sequence.
The motion recognition model is used for recognizing motion types in the video based on a first motion feature and a time-domain and space-domain (spatio-temporal) feature obtained from the first video frame sequence to be recognized; wherein the first motion features are capable of characterizing the temporal distribution of the respective category motions in the first sequence of video frames and the association between the respective category motions.
Time-domain and space-domain (spatio-temporal) features are well known to those skilled in the art and are described in detail in, for example, the article "Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks" by Zhaofan Qiu, Ting Yao and Tao Mei, available at https://arxiv.org/abs/1711.10305.
The method for recognizing the action in the video provided by the embodiment of the invention realizes the recognition of the actions of various types which occur at the same time and effectively avoids the loss of information by inputting the video frame sequence to be recognized into the pre-trained action recognition model and outputting the action of one or more types corresponding to each moment in the video frame sequence to be recognized by the action recognition model.
Based on any one of the above embodiments, in an embodiment of the present invention, the motion recognition model includes: the device comprises a first feature extraction layer, a second feature extraction layer, a feature fusion layer and an action classification layer. Correspondingly, fig. 2 is a flow chart of the operation of the motion recognition model provided in the embodiment of the present invention, and as shown in fig. 2, the inputting the first video frame sequence to be recognized into the motion recognition model to obtain the motion recognition result output by the motion recognition model specifically includes:
step 201, inputting a first video frame sequence to be recognized into a first feature extraction layer of a motion recognition model, to obtain a first motion feature in the first video frame sequence to be recognized, where the first motion feature can represent the time distribution of each category motion in the first video frame sequence and the association between each category motion.
Step 202, inputting the first video frame sequence to be identified into a second feature extraction layer of the motion identification model, so as to obtain a time-domain-spatial-domain (spatio-temporal) feature of the first video frame sequence to be identified.
Step 203, inputting the first motion characteristic and the time domain and space domain characteristic in the first video frame sequence to be identified into the characteristic fusion layer of the motion identification model to obtain the fused characteristic of the first video frame sequence to be identified.
Step 204, inputting the fused features of the first video frame sequence to be recognized into a motion classification layer of a motion recognition model to obtain a motion recognition result of the first video frame sequence; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence.
According to the method for recognizing the action in the video, provided by the embodiment of the invention, the sequence of the video frames to be recognized is input into the action recognition model which is trained in advance, and the operations such as feature extraction, feature fusion, action classification and the like are respectively realized by the first feature extraction layer, the second feature extraction layer, the feature fusion layer and the action classification layer in the action recognition model, so that one or more types of actions corresponding to all times in the sequence of the video frames to be recognized are obtained, the recognition of the actions of multiple types occurring at the same time is realized, and the loss of information is effectively avoided.
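To make the data flow between the four layers concrete, the following Python sketch shows the forward pass of steps 201 to 204 under the assumption that the trained sub-networks are available as callables; the callable names and the use of NumPy are illustrative and not the patent's API, and the 0.5 threshold anticipates the classification example given later.

```python
import numpy as np

def recognize_actions(frames, first_extractor, second_extractor,
                      fusion_layer, threshold=0.5):
    """Sketch of the forward pass through the four layers (steps 201-204).

    The three callables stand for the trained sub-networks of the action
    recognition model (their internals are sketched in later snippets).
    """
    first_action_feat = first_extractor(frames)          # step 201: first action features
    spatio_temporal_feat = second_extractor(frames)      # step 202: time/space-domain features
    fused = fusion_layer(first_action_feat,
                         spatio_temporal_feat)           # step 203: fused features, shape (C, T)
    scores = 1.0 / (1.0 + np.exp(-fused))                # step 204: sigmoid mapping values
    return scores > threshold                            # (C, T): action c present at time t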
Based on any one of the above embodiments, in an embodiment of the present invention, the first feature extraction layer further includes: an intra feature extraction layer and an action feature extraction layer. Correspondingly, the step 201 further includes:
step 2011, the first video frame sequence to be identified is input into the intra-frame feature extraction layer of the first feature extraction layer, so as to obtain the intra-frame features of each video frame in the first video frame sequence to be identified.
Step 2012, the intra-frame features of each video frame in the first video frame sequence to be identified are input into the motion feature extraction layer of the first feature extraction layer, so as to obtain the first motion features of each category of motion included in the first video frame sequence to be identified.
Specifically, in the embodiment of the present invention, the intra-frame feature extraction layer referred to in step 2011 is a ResNet34 network, and in other embodiments of the present invention, the intra-frame feature extraction layer may also be another type of image convolution neural network.
Table 1 describes a network structure of the ResNet34 network employed in the embodiment of the present invention.
TABLE 1 (structure of the ResNet34 network; rendered as an image in the original)
The ResNet34 network flattens the output of its average pooling layer and uses it as the intra-frame feature of a video frame, denoted v_t. If the length of a video frame sequence is T, then after the sequence is input into the ResNet34 network, the intra-frame features output for the sequence have dimension T × D.
In a preferred implementation, before the video frame sequence to be recognized is input into the intra-frame feature extraction layer, a step of resizing the video frame images is further included.
For example, each picture is first resized to 256px × 240px and then center-cropped to 224px × 224px.
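A minimal sketch of this preprocessing step is given below, assuming OpenCV and assuming that 256 × 240 refers to width × height (the text does not state which dimension is which); the interpolation mode is likewise an assumption.

```python
import cv2

def preprocess_for_resnet(image, resize_wh=(256, 240), crop_wh=(224, 224)):
    """Resize a video frame and center-crop it before the intra-frame feature layer.

    The 256x240 and 224x224 sizes come from the text; treating 256 as the width
    and 240 as the height is an assumption.
    """
    resized = cv2.resize(image, resize_wh)      # cv2 expects (width, height)
    h, w = resized.shape[:2]
    cw, ch = crop_wh
    top, left = (h - ch) // 2, (w - cw) // 2
    return resized[top:top + ch, left:left + cw]
```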
In the embodiment of the present invention, the action feature extraction layer referred to in step 2012 includes a temporal structure information extraction layer and a weighting layer. Correspondingly, the step 2012 further includes:
step 2012-1, inputting the intra-frame features of each video frame in the first video frame sequence to be identified into the time structure information extraction layer, and obtaining the second motion features of each category of motion included in the first video frame sequence.
In the embodiment of the present invention, the time structure information extraction layer is implemented by using a cauchy distribution filter, and in other embodiments of the present invention, the time structure information extraction layer may also use other types of filters, such as a gaussian distribution filter.
The Cauchy distribution filter comprises N Cauchy distributions, and the expression of one Cauchy distribution is as follows:
f_n[t] = (1/Z_n) · 1/(πγ_n[1 + ((t − x_n)/γ_n)²])  (1)
wherein n is the index of a Cauchy distribution, 0 < n ≤ N, and the size of N can be set according to actual conditions; t represents time, t ∈ {1, 2, ..., T}; Z_n represents a normalization coefficient; x_n represents the center point of the nth Cauchy distribution; and γ_n represents the width of the nth Cauchy distribution. The output of the Cauchy distribution filter has dimension T × N.
The center point and the width of each Cauchy distribution in the Cauchy distribution filter are determined in the training stage, so that the intra-frame features of the video frame can be directly processed by the Cauchy distribution filter in the step.
For the intra-frame features of each video frame in the first video frame sequence to be identified, the Cauchy distribution filter yields the second action features according to the following expression:
S_c[n] = ∑_{t=1}^{T} f_{c,n}[t] · v_t  (2)
where f_{c,n} denotes the nth Cauchy distribution associated with the class-c action.
wherein c ∈ {1, 2, ..., C}, and C is the number of action categories. Taking basketball actions as an example, the categories specifically include: dribbling, defending, passing, dunking, shooting, blocking and rebounding, so C = 7.
S_c[n] reflects the feature obtained for the class-c action under the action of the nth Cauchy distribution and is denoted as the second action feature in the embodiment of the present invention; the second action feature can represent the time distribution of each category of action in the first video frame sequence.
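The following sketch illustrates how the second action features could be computed from the intra-frame features. It assumes that each action class has its own N learned Cauchy distributions and that Z_n normalizes each filter to sum to 1 over time; both points are interpretations of the text rather than details stated in it.

```python
import numpy as np

def cauchy_filters(T, centers, widths):
    """Evaluate N Cauchy filters over T time steps (formula (1)).

    centers, widths: arrays of shape (N,) holding the learned x_n and gamma_n.
    Each filter is normalised to sum to 1 over time, which plays the role of
    the normalisation coefficient Z_n (an interpretation, not stated in the text).
    """
    t = np.arange(1, T + 1, dtype=np.float64)[:, None]                  # (T, 1)
    f = 1.0 / (np.pi * widths * (1.0 + ((t - centers) / widths) ** 2))  # (T, N)
    return f / f.sum(axis=0, keepdims=True)

def second_action_features(intra_feats, centers, widths):
    """Temporally pool intra-frame features with per-class Cauchy filters (formula (2)).

    intra_feats: (T, D) intra-frame features v_t from the ResNet34 branch.
    centers, widths: (C, N) per-class Cauchy parameters; the per-class layout is an
    assumption made so that S depends on the action class c, as in the text.
    Returns S of shape (C, D, N), i.e. S_c[n] is a D-dimensional vector.
    """
    C, N = centers.shape
    T, D = intra_feats.shape
    S = np.empty((C, D, N))
    for c in range(C):
        f = cauchy_filters(T, centers[c], widths[c])   # (T, N)
        S[c] = intra_feats.T @ f                        # (D, T) @ (T, N) -> (D, N)
    return S
```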
Step 2012-2, inputting the second motion characteristics into a weighting layer, and obtaining first motion characteristics capable of characterizing the time distribution of the motion of each category in the first video frame sequence and the relevance between the motion of each category.
The second motion feature mainly reflects the existence of motion at a certain time point in the video frame sequence to be recognized, but cannot reflect the relevance between the motions. For example, dribbling-defense in a video of a basketball game occurs simultaneously, and there is a correlation between the dribbling and defense actions. Therefore, it is also necessary to obtain the first motion characteristics capable of reflecting the degree of association between the different types of motions based on the second motion characteristics.
In the embodiment of the invention, the correlation degree between different types of actions is reflected by soft attention (soft attention).
The formula for the soft attention is:
A_{c,m} = exp(W_{c,m}) / ∑_{m'=1}^{M} exp(W_{c,m'})  (3)
wherein M represents the number of action categories having an association relation with a given category of action; M needs to be smaller than the number of categories, that is, M < C; m ∈ {1, 2, ..., M}; and W_{c,m} represents a weight parameter whose specific value is determined in the training stage.
If M = 2, the soft attention A_{c,m} may also be replaced by sigmoid(W_{c,m}) and 1 − sigmoid(W_{c,m}), where the sigmoid function is sigmoid(x) = 1/(1 + e^(−x)).
According to the second action features and the soft attention, the calculation formula of the first action feature is:
SA_c = ∑_{m=1}^{M} A_{c,m} · S_m  (4)
wherein SA_c represents the first action feature of the class-c action and has dimension D × N.
Compared with the second action feature, the first action feature can reflect the degree of association between a given class of action and the other classes of action.
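The weighting layer can then be sketched as follows. The softmax form of the soft attention follows formula (3), while the choice of which M categories are associated with a given category (here simply the first M) is an assumption made purely for illustration.

```python
import numpy as np

def first_action_features(S, W):
    """Apply the weighting layer to the second action features (formulas (3)-(4)).

    S: (C, D, N) second action features S_c[n].
    W: (C, M) learned weight parameters W_{c,m}, with M < C.
    Which M categories are associated with a given category is not specified in
    the text; here the first M categories are used purely for illustration.
    Returns SA of shape (C, D, N).
    """
    M = W.shape[1]
    A = np.exp(W) / np.exp(W).sum(axis=1, keepdims=True)   # soft attention A_{c,m}
    SA = np.einsum('cm,mdn->cdn', A, S[:M])                 # SA_c = sum_m A_{c,m} * S_m
    return SA
```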
The method for identifying the action in the video provided by the embodiment of the invention can obtain the action time distribution information of each category in the video frame sequence to be identified and the association degree information between the action time distribution information and the action of other categories through the analysis of the video frame sequence to be identified, and is favorable for identifying multiple categories of actions occurring at the same time.
Based on any one of the above embodiments, in an embodiment of the present invention, the second feature extraction layer is a C3D convolutional neural network. Correspondingly, the step 202 further includes:
step 2021, selecting a video frame from the first video frame sequence to be identified, taking the video frame as a center, selecting a group of video frames from the first video frame sequence to be identified, inputting the group of video frames into a C3D convolutional neural network, and obtaining a time domain and space domain feature related to the selected video frame.
Step 2022, executing the operation of step 2021 on each video frame in the first video frame sequence to be identified, to obtain the time domain and space domain features of the first video frame sequence to be identified.
Specifically, in a specific embodiment, the selected video frame together with the 7 frames before it and the 8 frames after it can be used as a video frame group of 16 pictures (i.e., L = 16); if there are not enough video frames before or after the selected frame, the group is padded with zeros. The video frame group is then input into the C3D convolutional neural network. Table 2 describes the network structure of the C3D convolutional neural network employed in the embodiment of the present invention.
TABLE 2 (structure of the C3D convolutional neural network; rendered as an image in the original)
The output of fully-connected layer 2 in the C3D convolutional neural network is used as the time-domain and space-domain feature, denoted v'_t. If the length of a video frame sequence is T, then after the sequence is input into the C3D convolutional neural network, the time-domain and space-domain features output for the sequence have dimension T × D'.
As a preferred implementation, the video frame images to be identified need to be resized before being input into the C3D network. For example, each video frame image is resized to 128px × 171px and then center-cropped to 112px × 112px.
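A sketch of the frame-group construction and preprocessing for the C3D branch is given below; OpenCV is assumed, as is the reading of 128 × 171 as height × width.

```python
import cv2
import numpy as np

def c3d_frame_group(frames, center_index, resize_wh=(171, 128), crop_hw=(112, 112)):
    """Build the 16-frame group around one selected frame for the C3D branch.

    Takes the 7 frames before and the 8 frames after the selected frame, padding
    with zeros when the sequence is too short, as described above; reading
    128x171 as height x width is an assumption.
    """
    T = len(frames)
    h, w = crop_hw
    group = []
    for i in range(center_index - 7, center_index + 9):        # 16 indices in total
        if 0 <= i < T:
            img = cv2.resize(frames[i], resize_wh)              # -> 171 wide, 128 high
            top = (img.shape[0] - h) // 2
            left = (img.shape[1] - w) // 2
            group.append(img[top:top + h, left:left + w])       # center-crop to 112x112
        else:
            group.append(np.zeros((h, w, 3), dtype=np.uint8))   # zero padding
    return np.stack(group)                                      # (16, 112, 112, 3)
```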
In other embodiments of the present invention, the second feature extraction layer may also be other types of 3D convolutional neural networks, such as I3D convolutional neural networks.
The method for identifying the action in the video provided by the embodiment of the invention can obtain the time domain and space domain characteristics of the video frame sequence to be identified by analyzing the video frame sequence to be identified, and is beneficial to identifying various types of actions occurring at the same time.
In any of the above embodiments, in an embodiment of the present invention, the feature fusion layer includes a first mapping transformation layer, a second mapping transformation layer, and a fusion layer. Correspondingly, the step 203 further comprises:
step 2031, inputting the first motion feature in the first video frame sequence to be identified into the first mapping transformation layer in the feature fusion layer to obtain the first motion feature after mapping transformation.
Step 2032, inputting the time domain and space domain features in the first video frame sequence to be identified into the second mapping transformation layer in the feature fusion layer to obtain the time domain and space domain features after mapping transformation.
Step 2033, inputting the mapped first motion feature and the mapped time domain and space domain feature into a fusion layer of the feature fusion layer to obtain a fused feature of the first video frame sequence to be identified.
In the embodiment of the present invention, the first mapping transformation layer implements the mapping transformation of the first action feature by using a parameter H.
In the previous embodiment of the present invention, SA_c was used to represent the first action feature of the class-c action, with dimension D × N; the dimension of the parameter H is consistent with the dimension of SA_c. The mapping transformation of the first action feature through the parameter H can be realized by matrix dot multiplication followed by summation, and the corresponding formula is as follows:
∑ SA_c * H  (5)
wherein, the value of the parameter H is determined in the model training stage.
The dimension of the mapped first motion feature is 1 × 1.
In the embodiment of the invention, the second mapping transformation layer is realized by a 3D convolutional layer whose input dimension is D', output dimension is 1 and step size is 1 × 1 × 1; the dimension of the mapped time-domain and space-domain features (namely conv3D(v'_t)) is T × 1.
In the embodiment of the invention, the fusion layer realizes the fusion of the characteristics by directly adding the mapping-transformed first action characteristics and the mapping-transformed time-domain and space-domain characteristics. The corresponding formula is as follows:
V_c = (∑ SA_c * H) + conv3D(v'_t)  (6)
wherein conv3D represents the 3D convolutional network, the addition uses a broadcasting mechanism, and the dimension of the fused feature is T × 1.
In the fused feature V_c, the element V_{c,t} represents the fused feature corresponding to the c-th category of action at time t.
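The fusion of formulas (5) and (6) can be sketched as follows, assuming the mapped time-domain and space-domain features are supplied as a length-T vector; the broadcasting addition matches the description above.

```python
import numpy as np

def fuse_features(SA, H, conv3d_out):
    """Fuse the two branches as in formulas (5) and (6).

    SA: (C, D, N) first action features.
    H:  (D, N) learned mapping parameter of the first mapping transformation layer.
    conv3d_out: (T,) mapped time-domain and space-domain features conv3D(v'_t).
    Returns V of shape (C, T); the per-class scalar sum(SA_c * H) is broadcast
    over the T time steps, matching the broadcasting mechanism described above.
    """
    per_class = (SA * H).sum(axis=(1, 2))                 # formula (5): one scalar per class
    return per_class[:, None] + conv3d_out[None, :]       # formula (6)
```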
The method for identifying the action in the video, provided by the embodiment of the invention, is beneficial to realizing the identification of multiple types of actions occurring at the same time by fusing the first action characteristic and the time domain and space domain characteristic of the video frame sequence to be identified.
Based on any one of the above embodiments, in the embodiment of the present invention, the action classification layer is implemented by using a sigmoid function. Specifically, the step 204 further includes:
2041, inputting the fused features of the first video frame sequence into a sigmoid function to obtain a mapping value.
For example, the fused feature V_{c,t} corresponding to the action of the c-th category at time t is input into the sigmoid function to obtain the mapping value sigmoid(V_{c,t}) of the class-c action at time t; the value of this mapping value lies between 0 and 1.
2042, comparing the mapping value with a preset threshold value, and obtaining a motion recognition result of the first video frame sequence according to the comparison result; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence.
For example, suppose the preset threshold value is 0.5 and, at time t, the mapping value of the 1st category of action (dribbling) is 0.6, the mapping value of the 2nd category (defending) is 0.9, the mapping value of the 3rd category (passing) is 0.2, the mapping value of the 4th category (dunking) is 0.15, the mapping value of the 5th category (shooting) is 0.01, the mapping value of the 6th category (blocking) is 0.32, and the mapping value of the 7th category (rebounding) is 0.36. By comparing the mapping values of the different categories of actions with the threshold value, it can be determined that two categories of actions, namely dribbling and defending, are present in the first video frame sequence at time t.
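This comparison step can be reproduced directly; the category names and scores below are the hypothetical values from the example above.

```python
import numpy as np

# Hypothetical mapping values at time t for the seven basketball categories,
# taken from the example above.
categories = ["dribbling", "defending", "passing", "dunking",
              "shooting", "blocking", "rebounding"]
scores_at_t = np.array([0.6, 0.9, 0.2, 0.15, 0.01, 0.32, 0.36])

threshold = 0.5
detected = [name for name, score in zip(categories, scores_at_t) if score > threshold]
print(detected)   # ['dribbling', 'defending']
```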
In the embodiment of the invention, the actions in a basketball game are taken as an example to illustrate the method for identifying actions in a video. However, it should be understood by those skilled in the art that the method for identifying actions in a video provided by the embodiment of the present invention is not limited to actions in basketball games, nor even to actions of the human body. The method can be used to recognize actions in a video as long as different types of actions occur at the same time in the video.
The method for identifying the action in the video, provided by the embodiment of the invention, realizes the identification of multiple types of actions occurring at the same time by classifying the fused features of the sequence of the video frames to be identified.
Based on any of the above embodiments, fig. 3 is a flowchart of a method for recognizing actions in a video according to another embodiment of the present invention, and as shown in fig. 3, the method for recognizing actions in a video according to another embodiment of the present invention includes:
step 301, a sample video frame sequence is determined.
In embodiments of the present invention, a sequence of sample video frames may be extracted from a sample video.
The sample video may be a sports-type video, such as a basketball game video, a football game video or a volleyball game video. The sample video may be a full sports game video or a partial video of a sports game. The sample videos should be of a certain scale; for example, the total duration of the sample videos should be greater than 200 hours.
The sample video frame sequence can be extracted from the sample video by the methods known in the art, such as ffmpeg or opencv.
And 302, labeling the sample video frame sequence to obtain a sample video frame sequence label.
In the embodiment of the present invention, labeling the sample video frame sequence requires labeling the time intervals in which actions occur and the corresponding action categories. For example, the action categories of basketball actions include: dribbling, defending, passing, dunking, shooting, blocking, rebounding, and the like.
From the time intervals of the occurrence of the actions in the sample video frame sequence and the corresponding action categories, category information of one or more actions corresponding to each time in the sample video frame sequence can be obtained.
For example, by labeling the sample video frame sequence, it can be known that the action occurring at the 5 th second in the sample video frame sequence appears in the shooting and capping categories of actions.
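A sketch of turning such annotations into per-time label vectors is given below; the interval format, the frame rate and the class indices are illustrative assumptions, not values from the embodiment.

```python
import numpy as np

def build_label_matrix(T, num_classes, annotations, fps=1.0):
    """Turn annotated (start_time, end_time, class_id) tuples into a label matrix Z.

    Z has shape (num_classes, T) with Z[c, t] = 1 when a class-c action is present
    at time step t. The annotation format, the frame rate and the class indices
    below are illustrative assumptions.
    """
    Z = np.zeros((num_classes, T), dtype=np.float32)
    for start, end, c in annotations:
        t0 = max(0, int(start * fps))
        t1 = min(T, int(end * fps) + 1)
        Z[c, t0:t1] = 1.0
    return Z

# Hypothetical example matching the text: shooting (class 4) and blocking (class 5)
# both occur around the 5th second.
# Z = build_label_matrix(T=600, num_classes=7, annotations=[(4.5, 5.5, 4), (4.8, 5.6, 5)])
```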
Step 303, taking the sample video frame sequence as input data for training, taking category information of one or more actions corresponding to each time in the sample video frame sequence as label data for training, and performing training in a deep learning manner to obtain an action recognition model for recognizing an action recognition result from the first video frame sequence to be recognized; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence.
Specifically, in the embodiment of the present invention, the forward propagation process for training the motion recognition model includes:
when the intra-frame extraction layer is trained, the ResNet34 network is trained by using the sample video frame sequence as input data for training, and using the category information of one or more actions corresponding to each time in the sample video frame sequence as label data for training. Initial parameters of the ResNet34 network may employ ImageNet pre-trained parameters.
As a preferred implementation, the video frame images need to be resized before the sample video frame sequence is input to the ResNet34 network.
When training the action feature extraction layer, taking the Cauchy distribution filter as an example, the filter includes N Cauchy distributions whose parameters are updated iteratively during training according to formulas (7) to (9) (given as images in the original).
Formula (7) is the iterative update formula of the Cauchy distribution center point, where x_n is the center point obtained in the previous iteration and the updated center point is computed for each n ∈ {1, 2, ..., N}.
Formula (8) is the iterative update formula of the Cauchy distribution width, where γ_n is the width obtained in the previous iteration and the updated width is computed for each n ∈ {1, 2, ..., N}.
Formula (9) is the expression of the Cauchy distribution, as given in formula (1).
During the training process, the center point and the width of each Cauchy distribution need to be continuously learned so as to better represent the actions. At the beginning of training, x_n is initialized by random sampling from a normal distribution with mean 0 and standard deviation 0.5, and γ_n is initialized by random sampling from a normal distribution with mean 0 and standard deviation 0.0001.
In addition, the weight parameters W_{c,m} in the action feature extraction layer also need to be adjusted during the training process.
When the second feature extraction layer is trained, the C3D network is trained by using the sample video frame sequence as input data for training, and by using time interval information of occurrence of an action and category information of the action in the sample video frame sequence as label data for training. Initial parameters for the C3D network may employ pre-trained kinetics400 parameters.
As a preferred implementation, the video frame images need to be resized before the sample video frame sequence is input to the C3D network. For example, the video frame image size is changed to 128px × 171px, and then center-clipped to 112px × 112px size.
When training the first mapping transformation layer, the value of the parameter H needs to be adjusted during the training process.
In training the second mapping transform layer, the parameters in the 3D convolutional layer need to be adjusted during the training process.
Finally, in the forward propagation process, a binary classification loss is calculated for each of the C categories of actions and the losses are summed to obtain the final loss. The loss function is:
Loss = −∑_{c=1}^{C} ∑_{t=1}^{T} [ Z_{c,t}·log(sigmoid(V_{c,t})) + (1 − Z_{c,t})·log(1 − sigmoid(V_{c,t})) ]  (10)
wherein Z_c is the ground-truth label of the class-c action, a T × 1 vector whose element Z_{c,t} ∈ {0, 1} indicates whether a class-c action is present in the video frame sequence at time t, with 0 indicating absence and 1 indicating presence.
And after the loss is obtained, adjusting parameters in the action recognition model through a back propagation algorithm to reduce the value of the loss function until the value of the loss function accords with an expected value.
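A sketch of the loss computation is given below; it follows the reconstructed binary cross-entropy in formula (10), with the small epsilon clamp added here only for numerical stability.

```python
import numpy as np

def multi_label_loss(V, Z, eps=1e-7):
    """Binary cross-entropy summed over the C action classes and T time steps.

    V: (C, T) fused features; Z: (C, T) ground-truth labels in {0, 1}.
    Follows the reconstructed loss in formula (10); the epsilon clamp is an
    implementation detail, not part of the described method.
    """
    p = 1.0 / (1.0 + np.exp(-V))          # sigmoid(V_{c,t})
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(Z * np.log(p) + (1.0 - Z) * np.log(1.0 - p))
```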
Step 304, determining a sequence of video frames to be identified.
Step 305, inputting the first video frame sequence to be recognized into the motion recognition model, and obtaining a motion recognition result output by the motion recognition model.
The method for recognizing the actions in the video, provided by the embodiment of the invention, trains the action recognition model through the sample video frame sequence and the sample video frame sequence label, and outputs one or more types of actions corresponding to all moments in the video frame sequence to be recognized by the action recognition model, so that the actions of multiple types occurring at the same time are recognized, and the information loss is effectively avoided.
Based on any of the above embodiments, fig. 4 is a schematic diagram of an apparatus for recognizing an action in a video according to an embodiment of the present invention, and as shown in fig. 4, the apparatus for recognizing an action in a video according to an embodiment of the present invention includes:
a determining module 401, configured to determine a sequence of video frames to be identified;
a motion recognition module 402, configured to input the first video frame sequence to be recognized into a motion recognition model, so as to obtain a motion recognition result output by the motion recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence; wherein,
the action recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label comprises category information of one or more actions corresponding to each time in the sample video frame sequence;
the motion recognition model is used for recognizing motion in the video based on a first motion characteristic and a time domain and space domain characteristic which are obtained by a first video frame sequence to be recognized; wherein the first motion characteristics are capable of characterizing a temporal distribution of the respective category motions in the first sequence of video frames and an association between the respective category motions.
The device for recognizing the action in the video provided by the embodiment of the invention inputs the video frame sequence to be recognized into the pre-trained action recognition model, and the action recognition model outputs one or more types of actions corresponding to each moment in the video frame sequence to be recognized, so that the recognition of the actions of multiple types occurring at the same time is realized, and the loss of information is effectively avoided.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may call logic instructions in memory 530 to perform the following method: determining a first sequence of video frames to be identified; inputting the first video frame sequence to be recognized into a motion recognition model to obtain a motion recognition result output by the motion recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence; the motion recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label comprises category information of one or more actions corresponding to each time in the sample video frame sequence; the motion recognition model is used for recognizing motion types in the video based on a first motion characteristic and a time domain space domain characteristic which are obtained by a first video frame sequence to be recognized; wherein the first motion characteristics are capable of characterizing a temporal distribution of the respective category motions in the first sequence of video frames and an association between the respective category motions.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices, as long as the structure includes the processor 510, the communication interface 520, the memory 530, and the communication bus 540 shown in fig. 5, where the processor 510, the communication interface 520, and the memory 530 complete mutual communication through the communication bus 540, and the processor 510 may call the logic instructions in the memory 530 to execute the above method. The embodiment does not limit the specific implementation form of the electronic device.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Further, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which when executed by a computer, the computer is capable of performing the methods provided by the above-mentioned method embodiments, for example, comprising: determining a first sequence of video frames to be identified; inputting the first video frame sequence to be recognized into a motion recognition model to obtain a motion recognition result output by the motion recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence; the motion recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label comprises category information of one or more actions corresponding to each time in the sample video frame sequence; the motion recognition model is used for recognizing motion in the video based on a first motion characteristic and a time domain and space domain characteristic which are obtained by a first video frame sequence to be recognized; wherein the first motion characteristics are capable of characterizing a temporal distribution of the respective category motions in the first sequence of video frames and an association between the respective category motions.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented by a processor to perform the method provided by the foregoing embodiments, for example, including: determining a first sequence of video frames to be identified; inputting the first video frame sequence to be recognized into a motion recognition model to obtain a motion recognition result output by the motion recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence; the motion recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label comprises category information of one or more actions corresponding to each time in the sample video frame sequence; the motion recognition model is used for recognizing motion in the video based on a first motion characteristic and a time domain and space domain characteristic which are obtained by a first video frame sequence to be recognized; wherein the first motion characteristics are capable of characterizing a temporal distribution of the respective category motions in the first sequence of video frames and an association between the respective category motions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for recognizing actions in a video is characterized by comprising the following steps:
determining a first sequence of video frames to be identified;
inputting the first video frame sequence to be recognized into a motion recognition model to obtain a motion recognition result output by the motion recognition model; the action recognition result comprises one or more action categories corresponding to each time in the first video frame sequence; wherein,
the action recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label comprises category information of one or more actions corresponding to each time in the sample video frame sequence;
the motion recognition model is used for recognizing motion types in the video based on a first motion characteristic and a time domain space domain characteristic which are obtained by a first video frame sequence to be recognized; wherein the first motion characteristics are capable of characterizing a temporal distribution of the respective category motions in the first sequence of video frames and an association between the respective category motions.
2. The method for recognizing an action in a video according to claim 1, wherein the action recognition model comprises a first feature extraction layer, a second feature extraction layer, a feature fusion layer and an action classification layer; wherein,
the first feature extraction layer is used for extracting a first action feature from the first video frame sequence;
the second feature extraction layer is used for extracting temporal-spatial features from the first video frame sequence;
the feature fusion layer is used for fusing the first action feature with the temporal-spatial features to obtain a fused feature;
and the action classification layer is used for generating an action recognition result of the first video frame sequence according to the fused feature.
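For illustration only, the four components recited in claim 2 could be arranged as in the following PyTorch sketch; the layer internals, dimensions and names are assumptions introduced here, not the patent's implementation:

    import torch
    import torch.nn as nn

    class ActionRecognitionModel(nn.Module):
        """Illustrative skeleton of the four claimed components; internals are stand-ins."""
        def __init__(self, num_classes, feat_dim=128):
            super().__init__()
            # First feature extraction layer -> "first action features" (per frame).
            self.first_feature_layer = nn.Sequential(
                nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            # Second feature extraction layer -> temporal-spatial features.
            self.second_feature_layer = nn.Sequential(
                nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)),
            )
            # Feature fusion layer and action classification layer.
            self.fusion_layer = nn.Linear(2 * feat_dim, feat_dim)
            self.classification_layer = nn.Linear(feat_dim, num_classes)

        def forward(self, frames):                                          # (B, T, 3, H, W)
            b, t, c, h, w = frames.shape
            f1 = self.first_feature_layer(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
            f2 = self.second_feature_layer(frames.permute(0, 2, 1, 3, 4))   # (B, D, T, 1, 1)
            f2 = f2.squeeze(-1).squeeze(-1).permute(0, 2, 1)                # (B, T, D)
            fused = torch.relu(self.fusion_layer(torch.cat([f1, f2], dim=-1)))
            return torch.sigmoid(self.classification_layer(fused))          # (B, T, num_classes)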
3. The method for recognizing an action in a video according to claim 1, wherein the action recognition model comprises a first feature extraction layer, a second feature extraction layer, a feature fusion layer and an action classification layer;
correspondingly, the step of inputting the first video frame sequence to be recognized into the action recognition model to obtain an action recognition result output by the action recognition model specifically comprises:
inputting the first video frame sequence to be recognized into the first feature extraction layer of the action recognition model to obtain a first action feature of the first video frame sequence to be recognized;
inputting the first video frame sequence to be recognized into the second feature extraction layer of the action recognition model to obtain temporal-spatial features of the first video frame sequence to be recognized;
inputting the first action feature and the temporal-spatial features of the first video frame sequence to be recognized into the feature fusion layer of the action recognition model to obtain a fused feature of the first video frame sequence to be recognized;
and inputting the fused feature of the first video frame sequence to be recognized into the action classification layer of the action recognition model to obtain an action recognition result of the first video frame sequence.
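As a sketch of the four-step flow in claim 3 (illustrative only, and reusing the assumed ActionRecognitionModel sketched after claim 2), the steps map directly onto four module calls:

    import torch

    frames = torch.randn(1, 16, 3, 112, 112)        # first video frame sequence to be recognized
    model = ActionRecognitionModel(num_classes=10)  # assumed sketch class from above

    b, t, c, h, w = frames.shape
    first_action_feat = model.first_feature_layer(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
    temporal_spatial_feat = (model.second_feature_layer(frames.permute(0, 2, 1, 3, 4))
                             .squeeze(-1).squeeze(-1).permute(0, 2, 1))
    fused_feat = torch.relu(model.fusion_layer(torch.cat([first_action_feat, temporal_spatial_feat], dim=-1)))
    recognition_result = torch.sigmoid(model.classification_layer(fused_feat))   # (1, 16, 10)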
4. The method according to claim 3, wherein the first feature extraction layer comprises: an intra-frame feature extraction layer and an action feature extraction layer;
correspondingly, the inputting the first video frame sequence to be recognized into the first feature extraction layer of the action recognition model to obtain the first action feature of the first video frame sequence to be recognized specifically comprises:
inputting the first video frame sequence to be recognized into the intra-frame feature extraction layer of the first feature extraction layer to obtain intra-frame features of each video frame in the first video frame sequence to be recognized;
and inputting the intra-frame features of each video frame in the first video frame sequence to be recognized into the action feature extraction layer of the first feature extraction layer to obtain first action features of each category of action contained in the first video frame sequence to be recognized.
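One possible reading of claim 4 (illustrative only; the module names, the per-frame CNN and the temporal convolution below are assumptions) is a per-frame backbone followed by a temporal module:

    import torch
    import torch.nn as nn

    class FirstFeatureExtractionLayer(nn.Module):
        """Illustrative stand-in: intra-frame features per frame, then action features over time."""
        def __init__(self, num_classes, feat_dim=128):
            super().__init__()
            # Intra-frame feature extraction layer: a small per-frame CNN.
            self.intra_frame_layer = nn.Sequential(
                nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
            )
            # Action feature extraction layer: a temporal convolution over the
            # per-frame features (claim 5 refines this into two sub-layers).
            self.action_feature_layer = nn.Conv1d(feat_dim, num_classes, kernel_size=3, padding=1)

        def forward(self, frames):                    # (B, T, 3, H, W)
            b, t, c, h, w = frames.shape
            intra = self.intra_frame_layer(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
            action = self.action_feature_layer(intra.transpose(1, 2)).transpose(1, 2)
            return action                             # (B, T, num_classes): first action features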
5. The method according to claim 4, wherein the action feature extraction layer comprises a temporal structure information extraction layer and a weighting layer;
correspondingly, the step of inputting the intra-frame features of each video frame in the first video frame sequence to be recognized into the action feature extraction layer of the first feature extraction layer to obtain the first action features of each category of action contained in the first video frame sequence to be recognized specifically comprises:
inputting the intra-frame features of each video frame in the first video frame sequence to be recognized into the temporal structure information extraction layer to obtain second action features of each category of action contained in the first video frame sequence; wherein the second action features are capable of characterizing the temporal distribution of each category of action in the first video frame sequence;
and inputting the second action features into the weighting layer, wherein the weighting layer applies weighting parameters describing the association among the categories of actions to the second action features to generate the first action features.
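For illustration only, the two sub-layers of claim 5 could be realised as a temporal convolution producing the second action features plus a learned class-relation matrix acting as the weighting layer; everything below is an assumption introduced for clarity:

    import torch
    import torch.nn as nn

    class ActionFeatureExtractionLayer(nn.Module):
        """Illustrative stand-in: temporal structure extraction + class-relation weighting."""
        def __init__(self, feat_dim, num_classes):
            super().__init__()
            # Temporal structure information extraction layer: produces the
            # "second action features", one channel per action category over time.
            self.temporal_structure_layer = nn.Conv1d(feat_dim, num_classes, kernel_size=5, padding=2)
            # Weighting layer: learnable class-by-class weights modelling the
            # association between action categories.
            self.class_relation = nn.Parameter(torch.eye(num_classes))

        def forward(self, intra_features):            # (B, T, feat_dim)
            second = self.temporal_structure_layer(intra_features.transpose(1, 2))   # (B, K, T)
            weights = torch.softmax(self.class_relation, dim=-1)                     # (K, K)
            first = torch.einsum('jk,bkt->bjt', weights, second)                     # mix related classes
            return first.transpose(1, 2)              # (B, T, K): first action features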
6. The method according to claim 3, wherein the inputting the first video frame sequence to be recognized into the second feature extraction layer of the action recognition model to obtain the temporal-spatial features of the first video frame sequence to be recognized comprises:
sequentially selecting video frames from the first video frame sequence to be recognized, and generating, by the second feature extraction layer of the action recognition model, the relevant temporal-spatial features for each selected video frame, so as to obtain the temporal-spatial features of the first video frame sequence.
7. The method according to claim 6, wherein the generating the relevant temporal-spatial features for the selected video frame by the second feature extraction layer of the action recognition model comprises:
taking the selected video frame as a center, selecting a plurality of adjacent preceding and following video frames from the first video frame sequence to be recognized to form a video frame group;
and inputting the video frame group into the second feature extraction layer to generate the temporal-spatial features related to the selected video frame.
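As a minimal sketch of claims 6 and 7 (illustrative only; the window size, the stand-in 3-D convolution and the function name are assumptions), each frame's temporal-spatial feature is extracted from a small group of neighbouring frames centred on it:

    import torch
    import torch.nn as nn

    def temporal_spatial_features(frames, second_feature_layer, half_window=2):
        """frames: (T, C, H, W); returns one temporal-spatial feature per frame."""
        t = frames.shape[0]
        features = []
        for center in range(t):
            lo, hi = max(0, center - half_window), min(t, center + half_window + 1)
            group = frames[lo:hi]                                    # video frame group around the centre frame
            group = group.permute(1, 0, 2, 3).unsqueeze(0)           # (1, C, G, H, W)
            features.append(second_feature_layer(group).flatten())   # feature for the centre frame
        return torch.stack(features)                                 # (T, feat_dim)

    # Example stand-in second feature extraction layer: a tiny 3-D convolution.
    layer = nn.Sequential(nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.AdaptiveAvgPool3d(1))
    feats = temporal_spatial_features(torch.randn(16, 3, 112, 112), layer)     # (16, 64)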
8. An apparatus for recognizing an action in a video, comprising:
a determining module, configured to determine a first video frame sequence to be recognized;
an action recognition module, configured to input the first video frame sequence to be recognized into an action recognition model to obtain an action recognition result output by the action recognition model; the action recognition result comprises one or more action categories corresponding to each time point in the first video frame sequence; wherein,
the action recognition model is obtained by training based on a sample video frame sequence and a sample video frame sequence label; the sample video frame sequence label comprises category information of one or more actions corresponding to each time point in the sample video frame sequence;
the action recognition model is used for recognizing action categories in the video based on a first action feature and temporal-spatial features obtained from the first video frame sequence to be recognized; wherein the first action feature is capable of characterizing the temporal distribution of each category of action in the first video frame sequence and the association between the categories of action.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method for recognizing an action in a video according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for recognizing an action in a video according to any one of claims 1 to 7.
CN202010301132.3A 2020-04-16 2020-04-16 Method and device for identifying action in video, electronic equipment and storage medium Pending CN111539289A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010301132.3A CN111539289A (en) 2020-04-16 2020-04-16 Method and device for identifying action in video, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010301132.3A CN111539289A (en) 2020-04-16 2020-04-16 Method and device for identifying action in video, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111539289A true CN111539289A (en) 2020-08-14

Family

ID=71974958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010301132.3A Pending CN111539289A (en) 2020-04-16 2020-04-16 Method and device for identifying action in video, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111539289A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609460A * 2017-05-24 2018-01-19 南京邮电大学 Human behavior recognition method fusing spatio-temporal dual-network streams and an attention mechanism
US20190354835A1 * 2018-05-17 2019-11-21 International Business Machines Corporation Action detection by exploiting motion in receptive fields
CN110222574A * 2019-05-07 2019-09-10 杭州智尚云科信息技术有限公司 Production operation behavior recognition method, apparatus, equipment, system and storage medium based on structured two-stream convolutional neural networks
CN110084228A * 2019-06-25 2019-08-02 江苏德劭信息科技有限公司 Automatic dangerous behavior recognition method based on two-stream convolutional neural networks
CN110569814A * 2019-09-12 2019-12-13 广州酷狗计算机科技有限公司 Video category identification method and device, computer equipment and computer storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AJ PIERGIOVANNI 等: "Learning Latent Super-Events to Detect Multiple Activities in Videos" *
JIANAN LI 等: "SGM-Net: Skeleton-guided multimodal network for action recognition" *
ZHANG Congcong; HE Ning: "Human action recognition method based on a key-frame two-stream convolutional network" *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580523A (en) * 2020-12-22 2021-03-30 平安国际智慧城市科技股份有限公司 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN112818846A (en) * 2021-01-29 2021-05-18 湖南科技学院 Video frame feature extraction method and device and electronic equipment
CN112966551A (en) * 2021-01-29 2021-06-15 湖南科技学院 Method and device for acquiring video frame description information and electronic equipment
WO2022193865A1 (en) * 2021-03-16 2022-09-22 Huawei Technologies Co.,Ltd. Systems, methods and computer media for joint attention video processing
US11902548B2 (en) 2021-03-16 2024-02-13 Huawei Technologies Co., Ltd. Systems, methods and computer media for joint attention video processing
CN113516030A (en) * 2021-04-28 2021-10-19 上海科技大学 Action sequence verification method and device, storage medium and terminal
CN113516030B (en) * 2021-04-28 2024-03-26 上海科技大学 Action sequence verification method and device, storage medium and terminal
CN113569805A (en) * 2021-08-13 2021-10-29 北京建筑大学 Action recognition method and device, electronic equipment and storage medium
CN114241270A (en) * 2022-02-25 2022-03-25 动联(山东)电子科技有限公司 Intelligent monitoring method, system and device for home care
CN116824275A (en) * 2023-08-29 2023-09-29 青岛美迪康数字工程有限公司 Method, device and computer equipment for realizing intelligent model optimization
CN116824275B (en) * 2023-08-29 2023-11-17 青岛美迪康数字工程有限公司 Method, device and computer equipment for realizing intelligent model optimization

Similar Documents

Publication Publication Date Title
CN111539289A (en) Method and device for identifying action in video, electronic equipment and storage medium
Zhang et al. Deep image deblurring: A survey
CN108986050B (en) Image and video enhancement method based on multi-branch convolutional neural network
CN109145784B (en) Method and apparatus for processing video
Li et al. No-reference image quality assessment with deep convolutional neural networks
CN111444878A (en) Video classification method and device and computer readable storage medium
CN111539290B (en) Video motion recognition method and device, electronic equipment and storage medium
CN112950581A (en) Quality evaluation method and device and electronic equipment
KR20180065889A (en) Method and apparatus for detecting target
CN109472193A (en) Method for detecting human face and device
CN111784624B (en) Target detection method, device, equipment and computer readable storage medium
CN110889824A (en) Sample generation method and device, electronic equipment and computer readable storage medium
CN114463218B (en) Video deblurring method based on event data driving
CN113065645A (en) Twin attention network, image processing method and device
CN112699786A (en) Video behavior identification method and system based on space enhancement module
CN110958469A (en) Video processing method and device, electronic equipment and storage medium
CN112052808A (en) Human face living body detection method, device and equipment for refining depth map and storage medium
Yang et al. Deep feature importance awareness based no-reference image quality prediction
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
Choudhary et al. Image dehazing using deep learning techniques
CN112101091B (en) Video classification method, electronic device and storage medium
CN110110651B (en) Method for identifying behaviors in video based on space-time importance and 3D CNN
Huang et al. Learning channel-wise spatio-temporal representations for video salient object detection
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
Ouyang Total variation constraint GAN for dynamic scene deblurring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200814)