CN115035440A - Method and device for generating time sequence action nomination, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115035440A
CN115035440A (Application CN202210612861.XA)
Authority
CN
China
Prior art keywords
video
video segment
feature
segment
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210612861.XA
Other languages
Chinese (zh)
Inventor
李帅成
杨昆霖
侯军
伊帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202210612861.XA priority Critical patent/CN115035440A/en
Publication of CN115035440A publication Critical patent/CN115035440A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a method and a device for generating a time sequence action nomination, an electronic device and a storage medium, wherein the method comprises the following steps: performing feature extraction on a plurality of video segments obtained from a video to be identified to obtain a plurality of first features; for any video segment, updating the first feature of the video segment based on the association relation between the video segment and its adjacent video segments to obtain a second feature of the video segment, wherein the adjacent video segments are located within the target neighborhood of the video segment; and generating a target time sequence action nomination of the video to be identified based on the second features of the plurality of video segments. The method and device can generate a target time sequence action nomination with high accuracy for the video to be identified.

Description

Method and device for generating time sequence action nomination, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a time sequence action nomination, an electronic device, and a storage medium.
Background
In recent years, with the proliferation of short video applications, video time sequence action detection has received a great deal of attention. The main purpose of video time sequence action detection is to locate all action segments in an un-clipped video and determine the action category of each action segment, i.e., a process of localization and recognition. The time sequence action detection task is a basic problem in the field of video understanding and serves downstream tasks such as video analysis and video action analysis. Most time sequence action detection methods adopt a pipeline similar to object detection with a two-stage strategy: first generate time sequence action nominations for the video, and then determine the category of each nomination based on the time sequence action nominations to obtain the final time sequence action detection result. That is, precise time sequence action nominations are a precondition for obtaining a time sequence action detection result with high accuracy.
Disclosure of Invention
The disclosure provides a method and a device for generating a time sequence action nomination, an electronic device and a storage medium.
According to an aspect of the present disclosure, a method for generating a time sequence action nomination is provided, including: performing feature extraction on a plurality of video segments obtained from a video to be identified to obtain a plurality of first features; for any video segment, updating the first feature of the video segment based on the association relation between the video segment and its adjacent video segments to obtain a second feature of the video segment, wherein the adjacent video segments are located within the target neighborhood of the video segment; and generating a target time sequence action nomination of the video to be identified based on the second features of the plurality of video segments.
In a possible implementation manner, the updating the first feature of the video segment based on the association relationship between the video segment and the adjacent video segment to obtain the second feature of the video segment includes: for any video clip, extracting context information of the first features of the video clip and the adjacent video clips on a time sequence to obtain a third feature of the video clip; determining a target similarity weight between the video segment and its neighboring video segments; fusing the third features of the video segments and the adjacent video segments thereof based on the target similarity weight to update the first features of the video segments; and iteratively executing the steps, and determining the updated first characteristic corresponding to the video segment as a second characteristic of the video segment when a preset iteration number is reached.
In one possible implementation, the target neighborhood corresponds to at least one neighborhood dimension; the determining a target similarity weight between the video segment and its neighboring video segments comprises: for any one of the neighborhood scales, determining an initial similarity weight between the video segment and a corresponding adjacent video segment at the neighborhood scale; and fusing the initial similarity weights corresponding to the at least one neighborhood scale to obtain the target similarity weight.
In one possible implementation, the determining an initial similarity weight between the video segment and its corresponding neighboring video segment at the neighboring scale comprises: for any video clip, encoding the third characteristics of the video clip and the adjacent video clip corresponding to the video clip under the neighborhood scale to obtain the relationship characteristics between the video clip and the adjacent video clip corresponding to the video clip under the neighborhood scale; decoding the relation features to obtain initial similarity weights between the video segments and the corresponding adjacent video segments under the neighborhood scale.
In one possible implementation, the determining an initial similarity weight between the video segment and its corresponding neighboring video segment at the neighboring scale comprises: for any one of the video segments, determining a feature distance between a third feature of the video segment and a third feature of a neighboring video segment corresponding to the third feature at the neighborhood scale; and obtaining the initial similarity weight between the video segment and the corresponding adjacent video segment under the neighborhood scale based on the characteristic distance.
In one possible implementation manner, the fusing the third features of the video segment and its adjacent video segment based on the target similarity weight to update the first feature of the video segment includes: and fusing the third characteristics of the video segments and the third characteristics of the adjacent video segments with the target similarity weight larger than a similarity weight threshold value so as to update the first characteristics of the video segments.
In one possible implementation manner, the generating a target time-series action nomination of the video to be identified based on the second feature of the plurality of video segments includes: classifying the video to be identified based on the second characteristics of the video segments to obtain a first initial time sequence action nomination of the video to be identified; performing regression processing on the basis of second characteristics of the plurality of video clips and the first initial time sequence action nomination to obtain a second initial time sequence action nomination of the video to be identified; determining the target time series action nomination based on the first initial time series action nomination and the second initial time series action nomination.
According to an aspect of the present disclosure, there is provided a time-series action nomination generating device, including: the characteristic extraction module is used for extracting the characteristics of a plurality of video clips obtained by videos to be identified to obtain a plurality of first characteristics; the updating module is used for updating the first characteristics of the video clips to obtain the second characteristics of the video clips based on the incidence relation between the video clips and the adjacent video clips, wherein the adjacent video clips are positioned in the target neighborhood of the video clips; and the nomination generating module is used for generating the target time sequence action nomination of the video to be identified based on the second characteristics of the plurality of video clips.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, feature extraction is performed on a plurality of video segments obtained from a video to be identified to obtain a plurality of first features, for any video segment, the first features of the video segment are updated based on the association relationship between the video segment and the adjacent video segments thereof to obtain a second feature of the video segment, and a target time sequence action nomination of the video to be identified is generated based on the second features of the plurality of video segments. For any video clip, only the incidence relation between the video clip and the adjacent video clip in the target neighborhood is considered, the local context information on the time sequence is better utilized, the influence of global background noise in the video to be identified is reduced, and the semantic representation capability of the second feature of the video clip is improved, so that the target time sequence action nomination with higher accuracy corresponding to the video to be identified can be generated based on the second features of a plurality of video clips.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow diagram of a method of generation of a time series action nomination according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of a time series action nomination generating network according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a structured region slot attention module in accordance with an embodiment of the present disclosure;
FIG. 4 shows a block diagram of a device for generating a time series action nomination according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure;
FIG. 6 shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an associative relationship describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C. The terms "first," "second," and "third" herein are used merely to distinguish one element from another, and do not constitute a limitation on the order.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Because the video understanding task involves temporal information beyond ordinary image space information, how to learn more complex and rich temporal information in the video becomes the key to the time sequence action nomination task. In the related art, some methods enlarge the temporal receptive field by using dilated temporal convolution, and some methods use graph networks to construct explicit temporal relations. However, these methods all aim at improving the ability to capture global dependencies in the video, and therefore cannot sufficiently exploit local temporal context information for semantic feature learning. For an un-clipped video, which contains a large amount of redundant background information, selectively capturing the action-related context information is important. In the related art, a self-attention mechanism is usually used only to learn simple long-term dependencies, and whether a better structure can be used to learn better semantic features has not been deeply explored, so the accuracy of the generated time sequence action nominations is low.
The invention provides a time sequence action nomination generating method, which can better utilize context semantic information in a target neighborhood based on a structured target neighborhood corresponding to a video fragment, reduce the influence of global background noise, improve the learning capacity of semantic representation of the video fragment and further generate a target time sequence action nomination with higher accuracy.
The following describes in detail a detailed procedure of the method for generating a time-series action nomination provided by the embodiment of the present disclosure.
Fig. 1 shows a flowchart of a method for generating a time series action nomination according to an embodiment of the present disclosure. The method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling computer-readable instructions stored in a memory. Alternatively, the method may be performed by a server. As shown in fig. 1, the method may include:
in step S11, feature extraction is performed on a plurality of video segments obtained from the video to be recognized, so as to obtain a plurality of first features.
The video to be identified can be an original video which is not clipped and needs to be subjected to the task of generating the time sequence action nomination in the video time sequence action detection. The video to be identified comprises a plurality of video frames which are continuously acquired by image acquisition equipment in a time dimension.
The video to be identified is segmented based on a preset segment length to obtain a plurality of video segments. For example, if the video to be identified includes T video frames arranged in time order and the preset segment length is σ frames, every σ consecutive frames are grouped, one frame is arbitrarily selected from each group of σ consecutive frames as a video segment, and L = T/σ video segments are finally obtained, each video segment being a single video frame. The specific values of T and σ may be determined according to actual situations, which is not specifically limited in this disclosure.
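For illustration, a minimal sketch of this segmentation step is given below, assuming the video is already available as an array of frames; selecting the first frame of each σ-frame group as the representative is an assumption, since the disclosure only states that one frame is arbitrarily selected.

```python
import numpy as np

def segment_video(frames: np.ndarray, sigma: int) -> np.ndarray:
    """Split a video of T frames into L = T // sigma segments,
    keeping one representative frame per group of sigma consecutive frames."""
    T = frames.shape[0]
    L = T // sigma
    # Drop trailing frames that do not fill a complete group.
    grouped = frames[: L * sigma].reshape(L, sigma, *frames.shape[1:])
    # Pick one frame (here: the first) from each consecutive sigma-frame group.
    return grouped[:, 0]

# Example: 160 frames of 224x224 RGB, sigma = 16 -> L = 10 video segments.
video = np.zeros((160, 224, 224, 3), dtype=np.float32)
segments = segment_video(video, sigma=16)
print(segments.shape)  # (10, 224, 224, 3)
```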
And performing feature extraction on the plurality of video clips to obtain a first feature of each video clip.
In an example, feature extraction may be performed on the plurality of video segments by using a video recognition network (e.g., I3D, Inflated 3D ConvNet) to obtain an initial feature of each video segment, where the initial feature may be in the form of a feature map with feature dimension C.
For any video segment, a linear transformation is performed on the initial feature of the video segment to obtain the first feature of the video segment, where the first feature may be in the form of a feature vector with feature dimension C_input.
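A sketch of this two-step extraction is shown below in a PyTorch style; the backbone passed in stands for a pretrained video recognition network such as I3D and is not part of the original disclosure, and the dimensions C and C_input are illustrative values.

```python
import torch
import torch.nn as nn

C, C_INPUT = 2048, 256          # illustrative feature dimensions

class FirstFeatureExtractor(nn.Module):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone            # video recognition network, e.g. I3D
        self.proj = nn.Linear(C, C_INPUT)   # linear transform to first features

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (L, 3, sigma, H, W) -> initial features: (L, C)
        initial = self.backbone(clips)
        # first features: (L, C_input)
        return self.proj(initial)

# Usage with a lightweight dummy backbone standing in for a pretrained I3D:
dummy = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, C))
extractor = FirstFeatureExtractor(dummy)
first_features = extractor(torch.randn(10, 3, 16, 112, 112))   # 10 video segments
print(first_features.shape)  # torch.Size([10, 256])
```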
In step S12, for any video segment, the first feature of the video segment is updated based on the correlation between the video segment and its neighboring video segments, so as to obtain the second feature of the video segment, where the neighboring video segments are located in the target neighborhood of the video segment.
And aiming at any video clip, based on the structured target neighborhood, performing attention learning on the video clip and the adjacent video clip in the target neighborhood thereof so as to update the first feature of the video clip based on local context information and obtain the second feature which is corresponding to the video clip and has semantic representation capability.
Hereinafter, the specific process of updating the first feature of a video segment to obtain its second feature will be described in detail, and is not repeated here.
In step S13, a target time-series action nomination of the video to be recognized is generated based on the second features of the plurality of video segments.
Because the second feature of each video clip has higher semantic representation capability, the target time sequence action nomination with higher accuracy corresponding to the video to be identified can be generated based on the second features of the plurality of video clips.
In combination with possible implementation manners of the present disclosure, a detailed description is given to a specific process of generating a target time sequence action nomination of a video to be identified based on second features of a plurality of video segments, and details are not described herein.
According to the embodiment of the disclosure, only the incidence relation between any video segment corresponding to the video to be identified and the adjacent video segment in the target neighborhood is considered, so that the local context information on the time sequence is better utilized, the influence of global background noise in the video to be identified is reduced, and the semantic representation capability of the second feature of the video segment is improved, so that the target time sequence action nomination with higher accuracy corresponding to the video to be identified can be generated based on the second features of a plurality of video segments.
In an example, the generation of the time series action nomination described above may be implemented using a time series action nomination generation network. Fig. 2 shows a schematic diagram of a time-sequential action nomination generating network according to an embodiment of the present disclosure. As shown in fig. 2, the time sequence action nomination generating network includes a feature extraction module, a plurality of video clips obtained by segmenting a video to be identified are input into the time sequence action nomination generating network, and the feature extraction module is used for performing feature extraction on the plurality of video clips to obtain a plurality of first features.
For example, the video to be recognized is input into the time sequence action nomination generation network, and after feature extraction by the feature extraction module, the first feature of each of the L video segments obtained by segmenting the video is obtained. The first feature of the i-th video segment may be denoted u_i, and each first feature has dimension C_input, so the feature extraction module outputs a first feature sequence of size L × C_input for the video to be recognized.
In a possible implementation manner, updating the first feature of the video segment based on the association relationship between the video segment and the neighboring video segment thereof to obtain the second feature of the video segment includes: for any video clip, extracting the context information of the first characteristics of the video clip and the adjacent video clips thereof on the time sequence to obtain the third characteristics of the video clip; determining a target similarity weight between the video segment and the adjacent video segments; fusing the third features of the video segment and the adjacent video segments thereof based on the target similarity weight to update the first features of the video segment; and iteratively executing the steps, and determining the updated first characteristic corresponding to the video clip as the second characteristic of the video clip under the condition that the preset iteration times are reached.
For any video segment, only the incidence relation between the video segment and the adjacent video segment in the target neighborhood is considered, the local context information on the time sequence is better utilized, the risk of updating overfitting is reduced, meanwhile, the influence of global background noise is reduced, and the semantic representation capability of the second feature of the finally obtained video segment is improved in a multi-iteration updating mode.
In one possible implementation, the target neighborhood corresponds to at least one neighborhood dimension; determining a target similarity weight between a video segment and its neighboring video segment, comprising: aiming at any neighborhood scale, determining initial similarity weight between a video segment and a corresponding adjacent video segment under the neighborhood scale; and fusing the initial similarity weights corresponding to at least one neighborhood scale to obtain a target similarity weight.
Because the duration of different video actions in the video to be identified has great change, in the one-time iterative updating process, the incidence relation between any video segment and the adjacent video segments in a plurality of target neighborhoods corresponding to different neighborhood scales is determined, and different local context information in time sequence is better utilized, so that in the current iterative updating process, the target similarity weight with higher accuracy between the video segment and the adjacent video segment in the target neighborhood with the largest neighborhood scale is obtained.
Still taking the example of FIG. 2 above, as shown in FIG. 2, the time series action nomination generation network includes a structured regional slot attention module. Corresponding LxC to the video to be identified output by the feature extraction module input Inputting the first feature sequence of the size into a structured region slot attention module, performing T times of iterative updating by using the structured region slot attention module, and outputting a second feature of each video segment.
Fig. 3 illustrates a schematic diagram of a structured region slot attention module in accordance with an embodiment of the present disclosure. In the t-th iterative update, context information in the time sequence is extracted from the first feature of each video segment obtained after the (t-1)-th iterative update, yielding the third feature of the video segment used in the t-th iterative update.
In one example, for any video segment, based on time-series convolution and non-linear processing, the context information of the first feature of the video segment and its neighboring video segments in time series is extracted, and the third feature of the video segment is obtained.
For the i-th video segment, the first feature u_i^{t-1} obtained after the (t-1)-th iterative update is subjected to temporal convolution (Convs) and nonlinear (ReLU) processing to obtain the third feature v_i^t of the i-th video segment. The third feature v_i^t serves as both the query feature Q (query) and the value feature V (value) corresponding to the i-th video segment.
In the case that the neighborhood scale is s, the target neighborhood corresponding to the i-th video segment includes the (i-s)-th, ..., (i-1)-th, i-th, (i+1)-th, ..., (i+s)-th video segments. Therefore, the third features of the video segments adjacent to the i-th video segment within the target neighborhood corresponding to the neighborhood scale s are taken as the key feature K (key) of the i-th video segment, that is, K = {v_{i-s}^t, ..., v_{i-1}^t, v_{i+1}^t, ..., v_{i+s}^t}.
In one example, the structured region slot attention module presets a plurality of neighborhood scales {s_1, s_2, ..., s_N}. In this case, the i-th video segment has a key feature K under each of the different neighborhood scales: K_{s_n} = {v_{i-s_n}^t, ..., v_{i-1}^t, v_{i+1}^t, ..., v_{i+s_n}^t}.
For example, the structured region slot attention module presets the neighborhood scales {1, 2} based on a plurality of hierarchical local regions. In this case, the i-th video segment has key features under two different neighborhood scales: K_1 = {v_{i-1}^t, v_{i+1}^t} for s = 1 and K_2 = {v_{i-2}^t, v_{i-1}^t, v_{i+1}^t, v_{i+2}^t} for s = 2.
based on the attention mechanism, for the ith video segment, determining the correlation between the query feature Q corresponding to the ith video segment and the key feature K at each neighborhood scale, that is, determining the initial similarity weight between the ith video segment and the adjacent video segment corresponding to the ith video segment at each neighborhood scale.
Still taking the i-th video segment as an example, the initial similarity weights between its query feature Q, namely the third feature v_i^t, and its key feature K under the neighborhood scale s = 1, namely {v_{i-1}^t, v_{i+1}^t}, include: the initial similarity weight w_{i,i-1}^1 between the third feature v_i^t of the i-th video segment and the third feature v_{i-1}^t of the (i-1)-th video segment, and the initial similarity weight w_{i,i+1}^1 between v_i^t and the third feature v_{i+1}^t of the (i+1)-th video segment.
Likewise, the initial similarity weights between the query feature Q of the i-th video segment and its key feature K under the target neighborhood corresponding to the neighborhood scale s = 2, namely {v_{i-2}^t, v_{i-1}^t, v_{i+1}^t, v_{i+2}^t}, include: the initial similarity weight w_{i,i-2}^2 between v_i^t and the third feature v_{i-2}^t of the (i-2)-th video segment, the initial similarity weight w_{i,i-1}^2 between v_i^t and the third feature v_{i-1}^t of the (i-1)-th video segment, the initial similarity weight w_{i,i+1}^2 between v_i^t and the third feature v_{i+1}^t of the (i+1)-th video segment, and the initial similarity weight w_{i,i+2}^2 between v_i^t and the third feature v_{i+2}^t of the (i+2)-th video segment.
And fusing the initial similarity weights of the ith video segment under a plurality of neighborhood scales to obtain the target similarity weight of the ith video segment under the target neighborhood with the maximum neighborhood scale.
Taking the i-th video segment as an example, its initial similarity weights under the two neighborhood scales s = 1 and s = 2 are fused to obtain its target similarity weights under the target neighborhood of scale s = 2, including: the target similarity weight a_{i,i-2} between the third feature v_i^t of the i-th video segment and the third feature v_{i-2}^t of the (i-2)-th video segment, the target similarity weight a_{i,i-1} between v_i^t and the third feature v_{i-1}^t of the (i-1)-th video segment, the target similarity weight a_{i,i+1} between v_i^t and the third feature v_{i+1}^t of the (i+1)-th video segment, and the target similarity weight a_{i,i+2} between v_i^t and the third feature v_{i+2}^t of the (i+2)-th video segment.
Based on the target similarity weights of the i-th video segment within its target neighborhood, the third features of the video segments included in the target neighborhood are fused to update the first feature of the i-th video segment, yielding the first feature u_i^t after the t-th iterative update. Taking the i-th video segment as an example, this update can be expressed as a weighted sum u_i^t = Σ_j a_{i,j} · v_j^t, where j ranges over the adjacent video segments in the target neighborhood of the i-th video segment.
By analogy, in the t-th iterative update, the first feature of each video segment is updated in this manner. The above steps are then executed iteratively until the preset number of iterations is reached, and the updated first feature of each video segment at that point is determined as its second feature.
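For illustration, a minimal sketch of one such iterative update is given below. The scaled dot-product similarity, the softmax-style normalization over the largest neighborhood, the way the per-scale weights are fused (averaging scale memberships), and all tensor shapes are assumptions for this sketch rather than the exact construction of the disclosure.

```python
import torch
from torch import nn

class RegionSlotAttention(nn.Module):
    """One iterative update: temporal context -> per-scale neighbor weights -> fusion."""

    def __init__(self, dim: int, scales=(1, 2)):
        super().__init__()
        self.scales = scales
        # temporal conv + ReLU producing the third features
        self.context = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (L, C) first features of the L video segments
        v = self.context(u.t().unsqueeze(0)).squeeze(0).t()   # (L, C) third features
        L, C = v.shape
        s_max = max(self.scales)
        # offsets of the adjacent segments within the largest target neighborhood
        offsets = [o for o in range(-s_max, s_max + 1) if o != 0]
        logits = torch.full((L, len(offsets)), float('-inf'))
        for k, o in enumerate(offsets):
            idx = torch.arange(L) + o
            valid = (idx >= 0) & (idx < L)
            # initial similarity: scaled dot product between v_i and v_{i+o},
            # weighted by how many neighborhood scales contain offset o (scale fusion)
            n_scales = sum(abs(o) <= s for s in self.scales)
            sim = (v[valid] * v[idx[valid]]).sum(-1) / (C ** 0.5)
            logits[valid, k] = sim * n_scales / len(self.scales)
        weights = torch.softmax(logits, dim=-1)                # target similarity weights
        weights = torch.nan_to_num(weights)
        out = torch.zeros_like(v)
        for k, o in enumerate(offsets):
            idx = torch.arange(L) + o
            valid = (idx >= 0) & (idx < L)
            out[valid] += weights[valid, k:k + 1] * v[idx[valid]]
        return out                                             # updated first features u^t

# Iterating the module a preset number of times yields the second features:
attn = RegionSlotAttention(dim=256)
feats = torch.randn(10, 256)
for _ in range(2):                 # preset number of iterations
    feats = attn(feats)
print(feats.shape)                 # torch.Size([10, 256])
```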
In one possible implementation, determining, for any neighborhood scale, an initial similarity weight between a video segment and its corresponding adjacent video segment at the neighborhood scale includes: for any video segment, encoding the third features of the video segment and its corresponding adjacent video segments at the neighborhood scale to obtain the relation features between the video segment and its corresponding adjacent video segments at the neighborhood scale; and decoding these relation features to obtain the initial similarity weights between the video segment and its corresponding adjacent video segments at the neighborhood scale.
Still taking the above-described fig. 3 as an example, as shown in fig. 3, the structured region slot attention module includes an encoder and a decoder. The encoder is used to encode the plurality of third features to obtain a target relation feature map.
The L third features are input into the encoder, which performs dimension conversion on the L third features and constructs an L × L initial relation feature map, where the element in row i and column j of the initial relation feature map indicates the initial relation feature between the i-th video segment and the j-th video segment.
2D convolution is then performed on the initial relation feature map using a 2D convolution layer with kernel size (2s+1, 1), padding size (s, 0) and stride (1, 1); that is, local context information of size 2s+1 is aggregated along each column of the 2D feature map, yielding a dense relation feature map F' of size L × L.
Because the dense relation feature map F' includes many interference terms, in order to retain only the association relation between each video segment and the adjacent video segments within its target neighborhood, the dense relation feature map F' is filtered based on the neighborhood scale s of the target neighborhood according to the following formula (1) to obtain the target relation feature map F'':
F''_{i,j} = F'_{i,j}, if |i - j| ≤ s and i ≠ j; F''_{i,j} = 0, otherwise.   (1)
where F''_{i,j} represents the feature relation between the third feature of the i-th video segment and the third feature of the j-th video segment.
For the i-th video segment, the target relation feature map therefore contains the feature relations between its third feature v_i^t and the third features v_{i-1}^t and v_{i+1}^t of the two adjacent video segments (the (i-1)-th and (i+1)-th video segments) in the target neighborhood corresponding to the neighborhood scale s = 1, namely F''_{i,i-1} and F''_{i,i+1}.
The decoder is used to decode the target relation feature map corresponding to each neighborhood scale to obtain the initial similarity weights corresponding to that neighborhood scale. For example, when the neighborhood scale s is 1, the target relation feature map corresponding to the neighborhood scale s = 1 is decoded to obtain the initial similarity weights corresponding to the neighborhood scale s = 1.
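A small sketch of the filtering in formula (1) is given below, assuming the dense relation map F' is an L × L tensor; the band-mask construction and the exclusion of the diagonal follow the reconstruction above and are assumptions of this sketch.

```python
import torch

def filter_relation_map(dense: torch.Tensor, s: int) -> torch.Tensor:
    """Keep only entries whose row/column indices differ by at most s (excluding the
    diagonal), zeroing everything outside the target neighborhood, per formula (1)."""
    L = dense.shape[0]
    i = torch.arange(L).unsqueeze(1)          # row indices
    j = torch.arange(L).unsqueeze(0)          # column indices
    mask = ((i - j).abs() <= s) & (i != j)
    return dense * mask.to(dense.dtype)

# The dense map itself is described as coming from a 2D convolution over the
# initial relation map with kernel (2s+1, 1), padding (s, 0) and stride (1, 1):
conv = torch.nn.Conv2d(1, 1, kernel_size=(3, 1), padding=(1, 0), stride=(1, 1))  # s = 1
F_dense = conv(torch.randn(1, 1, 8, 8)).squeeze()
F_target = filter_relation_map(F_dense, s=1)  # only the two neighbors of each segment survive
```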
In one possible implementation, determining, for any neighborhood scale, an initial similarity weight between a video segment and its corresponding neighboring video segment at the neighborhood scale includes: for any video segment, determining a feature distance between the third feature of the video segment and the third feature of the adjacent video segment corresponding to the video segment in the neighborhood scale; and obtaining an initial similarity weight between the video segment and the corresponding adjacent video segment under the neighborhood scale based on the characteristic distance.
By determining the feature distance between the third feature of the video segment and the third feature of the corresponding adjacent video segment in any neighborhood scale, the initial similarity weight between the video segment and the corresponding adjacent video segment in any neighborhood scale can be quickly determined.
Still taking the i-th video segment as an example, determining the initial similarity weights between its query feature Q, namely v_i^t, and its key feature K under the neighborhood of scale s = 2, namely {v_{i-2}^t, v_{i-1}^t, v_{i+1}^t, v_{i+2}^t}, includes: determining the feature distance between the third feature v_i^t of the i-th video segment and the third feature v_{i-2}^t of the (i-2)-th video segment as the initial similarity weight w_{i,i-2}^2; determining the feature distance between v_i^t and the third feature v_{i-1}^t of the (i-1)-th video segment as the initial similarity weight w_{i,i-1}^2; determining the feature distance between v_i^t and the third feature v_{i+1}^t of the (i+1)-th video segment as the initial similarity weight w_{i,i+1}^2; and determining the feature distance between v_i^t and the third feature v_{i+2}^t of the (i+2)-th video segment as the initial similarity weight w_{i,i+2}^2.
The feature distance between two features may be determined by using a cosine distance algorithm, a Euclidean distance algorithm, a dot product distance algorithm, or the like, which is not specifically limited in this disclosure.
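For illustration, cosine, dot-product and Euclidean variants of this feature-distance similarity might look as follows; the particular normalizations (scaling by the feature dimension, mapping Euclidean distance to a similarity) are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def similarity(v_i: torch.Tensor, v_j: torch.Tensor, metric: str = "cosine") -> torch.Tensor:
    """Initial similarity weight between two third features based on a feature distance."""
    if metric == "cosine":
        return F.cosine_similarity(v_i, v_j, dim=-1)
    if metric == "dot":
        return (v_i * v_j).sum(-1) / v_i.shape[-1] ** 0.5
    if metric == "euclidean":
        # smaller distance -> larger similarity
        return 1.0 / (1.0 + torch.norm(v_i - v_j, dim=-1))
    raise ValueError(metric)

w = similarity(torch.randn(256), torch.randn(256))
```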
In one possible implementation manner, fusing the third features of the video segment and the neighboring video segments thereof based on the target similarity weight to update the first feature of the video segment, including: and fusing the third features of the video segments and the third features of the adjacent video segments with the target similarity weight larger than the similarity weight threshold value so as to update the first features of the video segments.
In order to better utilize the local context in time sequence, a similarity weight threshold value is set, and only the third feature of the video segment and the third feature of the adjacent video segment with the target similarity weight larger than the similarity weight threshold value are fused to update the first feature of the video segment.
Still taking the t-th iterative update of the i-th video segment as an example, its target similarity weights within the target neighborhood include a_{i,i-2}, a_{i,i-1}, a_{i,i+1} and a_{i,i+2}. If the target similarity weights a_{i,i-2} and a_{i,i+2} are less than the similarity weight threshold, and only a_{i,i-1} and a_{i,i+1} are greater than the similarity weight threshold, then only the corresponding third features are fused, and the first feature after the t-th iterative update of the i-th video segment is u_i^t = a_{i,i-1} · v_{i-1}^t + a_{i,i+1} · v_{i+1}^t.
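A minimal sketch of this thresholded fusion is given below, assuming the target similarity weights and the third features of the neighboring video segments are already available; the threshold value and the ordering of the neighbors are illustrative.

```python
import torch

def fuse_with_threshold(weights: torch.Tensor, neighbor_feats: torch.Tensor,
                        threshold: float = 0.1) -> torch.Tensor:
    """Fuse only those neighbor third features whose target similarity weight
    exceeds the similarity weight threshold."""
    keep = weights > threshold                        # gate out weak neighbors
    gated = torch.where(keep, weights, torch.zeros_like(weights))
    return gated @ neighbor_feats                     # weighted sum -> updated first feature

# Example: 4 neighbors (i-2, i-1, i+1, i+2) with 256-dim third features.
w = torch.tensor([0.05, 0.4, 0.5, 0.05])
u_i = fuse_with_threshold(w, torch.randn(4, 256))
```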
In one possible implementation manner, generating a target time-series action nomination of a video to be recognized based on second features of a plurality of video segments includes: classifying the video to be identified based on the second characteristics of the video segments to obtain a first initial time sequence action nomination of the video to be identified; performing regression processing based on the second characteristics of the plurality of video clips and the first initial time sequence action nomination to obtain a second initial time sequence action nomination of the video to be identified; a target time series action nomination is determined based on the first initial time series action nomination and the second initial time series action nomination.
Based on the above description, each video segment is one video frame of the video to be identified. The second features of the plurality of video segments are classified to obtain, for each video segment, a start prediction confidence that it is an action start video frame and an end prediction confidence that it is an action end video frame.
Further, based on the start confidence threshold and the end confidence threshold, a plurality of predicted action start video frames (the start prediction confidence being greater than the start confidence threshold) and a plurality of predicted action end video frames (the end prediction confidence being greater than the end confidence threshold) are determined.
A plurality of first initial time sequence action nominations of the video to be identified are determined based on combinations of any predicted action start video frame and any predicted action end video frame. For example, the first initial time sequence action nominations include: first initial time sequence action nomination A: the 3rd video segment (predicted action start video frame) and the 6th video segment (predicted action end video frame); first initial time sequence action nomination B: the 4th video segment (predicted action start video frame) and the 7th video segment (predicted action end video frame).
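A sketch of how the candidate first initial nominations could be enumerated from the start and end confidences is shown below; the threshold values and the requirement that the end frame follow the start frame are assumptions of this sketch.

```python
import numpy as np

def first_nominations(start_conf: np.ndarray, end_conf: np.ndarray,
                      start_thr: float = 0.5, end_thr: float = 0.5):
    """Pair every predicted action-start segment with every later predicted
    action-end segment to form first initial time sequence action nominations."""
    starts = np.flatnonzero(start_conf > start_thr)
    ends = np.flatnonzero(end_conf > end_thr)
    return [(s, e, start_conf[s], end_conf[e]) for s in starts for e in ends if e > s]

start_conf = np.array([0.1, 0.2, 0.1, 0.8, 0.7, 0.1, 0.1, 0.2])
end_conf   = np.array([0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.9, 0.8])
print(first_nominations(start_conf, end_conf))
# pairs such as (3, 6, ...), (3, 7, ...), (4, 6, ...), (4, 7, ...)
```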
Still taking the above fig. 2 as an example, as shown in fig. 2, the time sequence action nomination generating network includes a nomination classifier, and the nomination classifier is used to classify the second features of the plurality of video segments to obtain a first initial time sequence action nomination of the video to be identified.
For any first initial time sequence action nomination, its nomination feature is determined. For example, for the first initial time sequence action nomination A, consisting of the 3rd video segment (predicted action start video frame) and the 6th video segment (predicted action end video frame), the nomination feature of the first initial time sequence action nomination A is determined based on the second features of the 3rd to 6th video segments. In an example, the four second features corresponding to the 3rd to 6th video segments may be averaged as the nomination feature of the first initial time sequence action nomination A. The nomination feature of the first initial time sequence action nomination B is obtained in the same way.
Regression processing is performed on the nomination features of the first initial time sequence action nominations to obtain the second initial time sequence action nominations and their regression confidences. The number of second initial time sequence action nominations is the same as the number of first initial time sequence action nominations.
For example, regression processing is performed on the nomination feature of the first initial time sequence action nomination A to obtain a second initial time sequence action nomination A': the 3rd video segment (predicted action start video frame) and the 6th video segment (predicted action end video frame), together with the regression confidence of the second initial time sequence action nomination A'.
Regression processing is performed on the nomination feature of the first initial time sequence action nomination B to obtain a second initial time sequence action nomination B': the 3rd video segment (predicted action start video frame) and the 6th video segment (predicted action end video frame), together with the regression confidence of the second initial time sequence action nomination B'.
Still taking the above fig. 2 as an example, as shown in fig. 2, the time sequence action nomination generating network includes a nomination regressor, and performs regression processing by using the nomination regressor based on the second characteristics of the plurality of video segments and the first initial time sequence action nomination, so as to obtain a second initial time sequence action nomination of the video to be identified.
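As a sketch, the nomination feature of a first initial nomination can be taken as the mean of the second features it spans, followed by a small regressor producing boundary adjustments and a regression confidence; the regressor architecture, the offset-based output and the sigmoid confidence are assumptions of this sketch, not the disclosed nomination regressor.

```python
import torch
from torch import nn

class NominationRegressor(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # predicts (start offset, end offset, regression confidence)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, second_feats: torch.Tensor, start: int, end: int):
        # nomination feature: mean of the second features from start to end (inclusive)
        nom_feat = second_feats[start:end + 1].mean(dim=0)
        d_start, d_end, conf = self.head(nom_feat)
        # second initial nomination: refined boundaries plus regression confidence
        return start + d_start, end + d_end, torch.sigmoid(conf)

reg = NominationRegressor()
second_feats = torch.randn(10, 256)
print(reg(second_feats, start=3, end=6))
```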
The corresponding first and second initial time sequence action nominations are then fused. For example, the first initial time sequence action nomination A and the second initial time sequence action nomination A' are fused to obtain a fused initial time sequence action nomination A'', whose target confidence is obtained from the regression confidence of the second initial time sequence action nomination A', the start prediction confidence of the first initial time sequence action nomination A, and the end prediction confidence of the first initial time sequence action nomination A.
Similarly, the first initial time sequence action nomination B and the second initial time sequence action nomination B' are fused to obtain a fused initial time sequence action nomination B'', whose target confidence is obtained from the regression confidence of the second initial time sequence action nomination B', the start prediction confidence of the first initial time sequence action nomination B, and the end prediction confidence of the first initial time sequence action nomination B.
And filtering the fused initial time sequence action nominations by using a target confidence coefficient threshold, and determining the fused initial time sequence action nominations with the target confidence coefficient larger than the target confidence coefficient threshold as the target time sequence action nominations.
In order to further improve the nomination accuracy, the target time sequence action nominations may be further processed by non-maximum suppression (NMS) to obtain the final target time sequence action nominations of the video to be identified.
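To illustrate the final filtering, the sketch below fuses the three confidences by multiplication (an assumption; the disclosure only states that they are combined into a target confidence), filters by a target confidence threshold, and applies a standard greedy temporal NMS; the threshold values are illustrative.

```python
def temporal_iou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def select_target_nominations(props, conf_thr=0.5, iou_thr=0.7):
    """props: list of (start, end, start_conf, end_conf, reg_conf)."""
    scored = [(s, e, sc * ec * rc) for s, e, sc, ec, rc in props]   # fused target confidence
    scored = [p for p in scored if p[2] > conf_thr]                  # confidence filtering
    scored.sort(key=lambda p: p[2], reverse=True)
    kept = []
    for p in scored:                                                 # greedy temporal NMS
        if all(temporal_iou(p[:2], q[:2]) < iou_thr for q in kept):
            kept.append(p)
    return kept

props = [(3, 6, 0.8, 0.9, 0.9), (3, 7, 0.8, 0.8, 0.6), (4, 7, 0.7, 0.8, 0.85)]
print(select_target_nominations(props))
```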
In the embodiment of the disclosure, feature extraction is performed on a plurality of video segments obtained from a video to be identified to obtain a plurality of first features, for any video segment, the first features of the video segment are updated based on the association relationship between the video segment and the adjacent video segments thereof to obtain a second feature of the video segment, and a target time sequence action nomination of the video to be identified is generated based on the second features of the plurality of video segments. For any video clip, only the association relation between the video clip and the adjacent video clip in the target neighborhood is considered, the local context information on the time sequence is better utilized, the influence of the global background noise in the video to be identified is reduced, and the semantic representation capability of the second feature of the video clip is improved, so that the target time sequence action nomination with higher accuracy corresponding to the video to be identified can be generated based on the second features of a plurality of video clips.
It is understood that the above-mentioned embodiments of the method of the present disclosure can be combined with each other to form a combined embodiment without departing from the principle logic, which is limited by the space, and the detailed description of the present disclosure is omitted. Those skilled in the art will appreciate that in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their function and possibly their inherent logic.
In addition, the present disclosure also provides a device, an electronic device, a computer-readable storage medium, and a program for generating a time series action nomination, which are all used to implement any one of the methods for generating a time series action nomination provided by the present disclosure, and the descriptions of the corresponding technical solutions and the corresponding descriptions in the method sections are omitted for brevity.
Fig. 4 shows a block diagram of a device for generating a time-series action nomination according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus 40 includes:
a feature extraction module 41, configured to perform feature extraction on a plurality of video segments obtained from a video to be identified to obtain a plurality of first features;
an updating module 42, configured to update, for any video segment, a first feature of the video segment based on a correlation between the video segment and an adjacent video segment of the video segment, so as to obtain a second feature of the video segment, where the adjacent video segment is located in a target neighborhood of the video segment;
and a nomination generating module 43, configured to generate a nomination of a target time sequence action of the video to be identified, based on the second features of the multiple video segments.
In one possible implementation, the update module 42 includes:
the first determining submodule is used for extracting context information of the first characteristics of the video clip and the adjacent video clip on a time sequence aiming at any video clip to obtain a third characteristic of the video clip;
a second determining submodule, configured to determine a target similarity weight between the video segment and a neighboring video segment;
the third determining submodule is used for fusing the third characteristics of the video segment and the adjacent video segments thereof based on the target similarity weight between the video segment and the adjacent video segments thereof so as to update the first characteristics of the video segment;
and the iteration submodule is used for iteratively executing the steps, and determining the updated first characteristic corresponding to the video segment as the second characteristic of the video segment under the condition that the preset iteration times are reached.
In one possible implementation, the target neighborhood corresponds to at least one neighborhood dimension;
a second determination submodule comprising:
a first determining unit, configured to determine, for any neighborhood scale, an initial similarity weight between the video segment and a neighboring video segment corresponding to the video segment at the neighborhood scale;
and the second determining unit is used for fusing the initial similarity weights corresponding to at least one neighborhood scale to obtain a target similarity weight.
In a possible implementation manner, the first determining unit is specifically configured to:
for any video clip, coding the third characteristics of the video clip and the adjacent video clip corresponding to the video clip under the neighborhood scale to obtain the relationship characteristics between the video clip and the adjacent video clip corresponding to the video clip under the neighborhood scale;
and decoding the relation characteristics between the video segment and the adjacent video segment corresponding to the video segment under the neighborhood scale to obtain the initial similarity weight between the video segment and the adjacent video segment corresponding to the video segment under the neighborhood scale.
In a possible implementation manner, the first determining unit is specifically configured to:
for any video segment, determining a feature distance between the third feature of the video segment and the third feature of its adjacent video segment at the neighborhood scale;
and obtaining the initial similarity weight between the video segment and that adjacent video segment at the neighborhood scale based on the feature distance (a sketch of this distance-based route follows).
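The distance-based route simply turns a feature distance into a weight. The sketch below assumes a Gaussian kernel over the L2 distance, which is one possible monotonically decreasing mapping and is not prescribed by the disclosure.

```python
# A sketch of the distance-based initial similarity weight, assuming a Gaussian
# kernel over the L2 feature distance (an illustrative assumption).
import torch

def distance_based_weight(feat_t, neighbor_feats, sigma=1.0):
    """feat_t: (D,) third feature; neighbor_feats: (N, D) third features of the neighbors."""
    dist = torch.cdist(feat_t.unsqueeze(0), neighbor_feats).squeeze(0)   # feature distances, shape (N,)
    return torch.exp(-dist.pow(2) / (2 * sigma ** 2))                    # initial similarity weights
```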
In a possible implementation manner, the third determining submodule is specifically configured to:
and fusing the third feature of the video segment with the third features of those adjacent video segments whose target similarity weight is greater than a similarity weight threshold, so as to update the first feature of the video segment (a sketch of this threshold-gated fusion follows).
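The thresholded fusion can be sketched as below, assuming that neighbors whose target weight falls below the threshold are excluded and the surviving neighbors are averaged with renormalized weights; the residual mix with the segment's own feature is an added assumption.

```python
# A sketch of threshold-gated fusion, assuming PyTorch tensors; the 0.5 residual
# mix with the segment's own feature is an illustrative assumption.
import torch

def fuse_with_threshold(feat_t, neighbor_feats, weights, thresh=0.5):
    """feat_t: (D,); neighbor_feats: (N, D) third features; weights: (N,) target similarity weights."""
    keep = weights > thresh                                    # drop weakly related neighbors
    if not keep.any():
        return feat_t                                          # nothing passes the threshold; keep the feature as-is
    w = weights[keep] / weights[keep].sum()                    # renormalize the surviving weights
    fused = (w.unsqueeze(-1) * neighbor_feats[keep]).sum(dim=0)
    return 0.5 * (feat_t + fused)                              # updated first feature of the segment
```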
In a possible implementation manner, the nomination generating module 43 is specifically configured to:
classifying the video to be identified based on the second features of the plurality of video segments to obtain a first initial time sequence action nomination of the video to be identified;
performing regression processing based on the second features of the plurality of video segments and the first initial time sequence action nomination to obtain a second initial time sequence action nomination of the video to be identified;
and determining the target time sequence action nomination based on the first initial time sequence action nomination and the second initial time sequence action nomination (a sketch of this classify-then-regress head follows).
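A compact sketch of the nomination generating module is given below, assuming PyTorch: a 1x1 convolution classifies each segment into start/end probabilities (the first initial nomination), and a second 1x1 convolution regresses boundary offsets conditioned on those probabilities (the second initial nomination); combining the two into the target nomination is left to the caller. Head widths and output shapes are assumptions.

```python
# A sketch of the nomination generating module 43, assuming PyTorch; the head
# shapes and the way the two initial nominations are produced are assumptions.
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.classifier = nn.Conv1d(dim, 2, kernel_size=1)      # start/end probabilities per segment
        self.regressor = nn.Conv1d(dim + 2, 2, kernel_size=1)   # boundary offsets per segment

    def forward(self, second_feats):                             # (batch, dim, T) second features
        cls = torch.sigmoid(self.classifier(second_feats))               # first initial nomination scores
        reg = self.regressor(torch.cat([second_feats, cls], dim=1))      # second initial nomination (refined boundaries)
        return cls, reg            # the target time sequence action nomination is determined from both
```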
The method has specific technical relevance to the internal structure of a computer system and can solve technical problems of improving hardware operation efficiency or execution effect (including reducing the amount of data to be stored, reducing the amount of data to be transmitted, increasing hardware processing speed, and the like), thereby obtaining the technical effect of improving the internal performance of the computer system in accordance with natural laws.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
Embodiments of the present disclosure also provide a computer program product, which includes computer readable code or a non-volatile computer readable storage medium carrying computer readable code, when the computer readable code is run in a processor of an electronic device, the processor in the electronic device executes the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 5 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. Referring to fig. 5, the electronic device 800 may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or other terminal device.
Referring to fig. 5, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in memory 804 or transmitted via communications component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 814 may also include a light sensor, such as a complementary metal oxide semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (Wi-Fi), a second generation mobile communication technology (2G), a third generation mobile communication technology (3G), a fourth generation mobile communication technology (4G), a long term evolution of universal mobile communication technology (LTE), a fifth generation mobile communication technology (5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
The disclosure relates to the field of augmented reality. By acquiring image information of a target object in a real environment and detecting or identifying relevant features, states and attributes of the target object by means of various vision-related algorithms, an AR effect combining virtuality and reality that matches a specific application can be obtained. For example, the target object may involve a face, a limb, a gesture, an action, etc. associated with a human body, or a marker associated with an object, or a sand table, a display area, a display item, etc. associated with a venue or a place. The vision-related algorithms may involve visual localization, SLAM, three-dimensional reconstruction, image registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, and the like. The specific application may involve interactive scenarios such as navigation, explanation, reconstruction and virtual-effect overlay display related to a real scene or an article, and may also involve person-related special-effect processing in interactive scenarios such as makeup beautification, body beautification, special-effect display and virtual model display. The detection or identification of the relevant features, states and attributes of the target object can be realized through a convolutional neural network, which is a network model obtained by model training based on a deep learning framework.
Fig. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure. Referring to fig. 6, the electronic device 1900 may be provided as a server or a terminal device. Referring to fig. 6, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the Apple graphical-user-interface-based operating system (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer readable storage medium having computer readable program instructions embodied therewith for causing a processor to implement various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove raised structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a variety of computing/processing devices, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
The foregoing descriptions of the various embodiments each tend to emphasize the differences between the embodiments; for the same or similar points, reference may be made between the embodiments, and for brevity, they will not be described again here.
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict order of execution and does not in any way limit the implementation process; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
If the technical solution of the present application involves personal information, a product applying the technical solution of the present application clearly informs the individual of the personal information processing rules and obtains the individual's separate consent before processing the personal information. If the technical solution of the present application involves sensitive personal information, a product applying the technical solution of the present application obtains the individual's separate consent and additionally satisfies the requirement of "explicit consent" before processing the sensitive personal information. For example, at a personal information collection device such as a camera, a clear and conspicuous notice is provided to inform individuals that they are entering a personal information collection range and that personal information will be collected; if an individual voluntarily enters the collection range, this is regarded as consent to the collection of his or her personal information. Alternatively, on a device that processes personal information, with the personal information processing rules indicated by conspicuous signs or information, personal authorization is obtained by means of a pop-up message or by asking the individual to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information to be processed.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for generating a time sequence action nomination, characterized by comprising the following steps:
performing feature extraction on a plurality of video segments obtained from a video to be identified to obtain a plurality of first features;
aiming at any one video clip, updating the first characteristics of the video clip based on the incidence relation between the video clip and the adjacent video clip thereof to obtain the second characteristics of the video clip, wherein the adjacent video clip is positioned in the target neighborhood of the video clip;
and generating a target time sequence action nomination of the video to be identified based on the second characteristics of the plurality of video segments.
2. The method according to claim 1, wherein the updating the first feature of the video segment based on the association relationship between the video segment and the neighboring video segments thereof to obtain the second feature of the video segment comprises:
for any video clip, extracting the context information of the first features of the video clip and the adjacent video clips on the time sequence to obtain a third feature of the video clip;
determining a target similarity weight between the video segment and its neighboring video segments;
fusing the third features of the video segments and the adjacent video segments thereof based on the target similarity weight so as to update the first features of the video segments;
and iteratively executing the steps, and determining the updated first feature corresponding to the video segment as the second feature of the video segment under the condition that the preset iteration times are reached.
3. The method of claim 2, wherein the target neighborhood corresponds to at least one neighborhood dimension;
the determining a target similarity weight between the video segment and its neighboring video segments comprises:
for any one of the neighborhood scales, determining an initial similarity weight between the video segment and a corresponding adjacent video segment at the neighborhood scale;
and fusing the initial similarity weights corresponding to the at least one neighborhood scale to obtain the target similarity weight.
4. The method of claim 3, wherein determining an initial similarity weight between the video segment and its corresponding neighboring video segment at the neighborhood scale comprises:
for any video clip, encoding the third characteristics of the video clip and the adjacent video clip corresponding to the video clip under the neighborhood scale to obtain the relationship characteristics between the video clip and the adjacent video clip corresponding to the video clip under the neighborhood scale;
and decoding the relation characteristics to obtain the initial similarity weight between the video segment and the adjacent video segment corresponding to the video segment under the neighborhood scale.
5. The method of claim 3, wherein determining an initial similarity weight between the video segment and its corresponding neighboring video segment at the neighborhood scale comprises:
for any one of the video segments, determining a feature distance between a third feature of the video segment and a third feature of a neighboring video segment corresponding to the third feature at the neighborhood scale;
and obtaining an initial similarity weight between the video segment and the adjacent video segment corresponding to the video segment under the neighborhood scale based on the characteristic distance.
6. The method according to any one of claims 2 to 5, wherein the fusing the third features of the video segment and the neighboring video segments thereof to update the first feature of the video segment based on the target similarity weight comprises:
and fusing the third features of the video segments and the third features of the adjacent video segments with the target similarity weight larger than a similarity weight threshold value so as to update the first features of the video segments.
7. The method according to any one of claims 1 to 6, wherein the generating a target time sequence action nomination of the video to be identified based on the second features of the plurality of video segments comprises:
classifying the video to be identified based on the second characteristics of the video segments to obtain a first initial time sequence action nomination of the video to be identified;
performing regression processing on the basis of the second characteristics of the plurality of video clips and the first initial time sequence action nomination to obtain a second initial time sequence action nomination of the video to be identified;
determining the target time sequence action nomination based on the first initial time sequence action nomination and the second initial time sequence action nomination.
8. An apparatus for generating a time sequence action nomination, comprising:
a feature extraction module, configured to perform feature extraction on a plurality of video segments obtained from a video to be identified to obtain a plurality of first features;
an updating module, configured to, for any video segment, update the first feature of the video segment based on an association relationship between the video segment and its adjacent video segments to obtain a second feature of the video segment, wherein the adjacent video segments are located in a target neighborhood of the video segment;
and a nomination generating module, configured to generate a target time sequence action nomination of the video to be identified based on the second features of the plurality of video segments.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 7.
CN202210612861.XA 2022-05-31 2022-05-31 Method and device for generating time sequence action nomination, electronic equipment and storage medium Pending CN115035440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210612861.XA CN115035440A (en) 2022-05-31 2022-05-31 Method and device for generating time sequence action nomination, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210612861.XA CN115035440A (en) 2022-05-31 2022-05-31 Method and device for generating time sequence action nomination, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115035440A true CN115035440A (en) 2022-09-09

Family

ID=83122456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210612861.XA Pending CN115035440A (en) 2022-05-31 2022-05-31 Method and device for generating time sequence action nomination, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115035440A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117640947A (en) * 2024-01-24 2024-03-01 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium
CN117640947B (en) * 2024-01-24 2024-05-10 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium

Similar Documents

Publication Publication Date Title
CN109740516B (en) User identification method and device, electronic equipment and storage medium
CN110287874B (en) Target tracking method and device, electronic equipment and storage medium
CN111881956B (en) Network training method and device, target detection method and device and electronic equipment
US20210248718A1 (en) Image processing method and apparatus, electronic device and storage medium
KR20210102180A (en) Image processing method and apparatus, electronic device and storage medium
CN109615006B (en) Character recognition method and device, electronic equipment and storage medium
CN110532956B (en) Image processing method and device, electronic equipment and storage medium
CN109145150B (en) Target matching method and device, electronic equipment and storage medium
CN111539410B (en) Character recognition method and device, electronic equipment and storage medium
CN112906484B (en) Video frame processing method and device, electronic equipment and storage medium
CN111242303A (en) Network training method and device, and image processing method and device
CN109685041B (en) Image analysis method and device, electronic equipment and storage medium
CN109034106B (en) Face data cleaning method and device
JP2021530047A (en) Image processing methods and devices, electronic devices, and storage media
CN114332503A (en) Object re-identification method and device, electronic equipment and storage medium
CN113781518B (en) Neural network structure searching method and device, electronic equipment and storage medium
CN110781842A (en) Image processing method and device, electronic equipment and storage medium
CN115035440A (en) Method and device for generating time sequence action nomination, electronic equipment and storage medium
CN114842404A (en) Method and device for generating time sequence action nomination, electronic equipment and storage medium
CN111178115B (en) Training method and system for object recognition network
CN110070046B (en) Face image recognition method and device, electronic equipment and storage medium
CN112801116B (en) Image feature extraction method and device, electronic equipment and storage medium
CN114677648A (en) Network training method, pedestrian re-identification method, network training device, pedestrian re-identification device, electronic equipment and storage medium
CN114565962A (en) Face image processing method and device, electronic equipment and storage medium
KR20240046777A (en) Activity recognition methods and devices, electronic devices and storage media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination