CN114419508A - Recognition method, training method, device, equipment and storage medium - Google Patents

Recognition method, training method, device, equipment and storage medium

Info

Publication number
CN114419508A
CN114419508A (application CN202210059020.0A)
Authority
CN
China
Prior art keywords
target
block
frame
video frame
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210059020.0A
Other languages
Chinese (zh)
Inventor
吴文灏
夏博洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210059020.0A
Publication of CN114419508A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an identification method, a training method, a device, equipment and a storage medium, and relates to the field of artificial intelligence, in particular to computer vision, video analysis and deep learning technologies. The specific implementation scheme is as follows: based on a target video frame in a video to be identified, obtaining first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame; selecting a first target value from a decision value representing the first identification decision information and a characteristic value representing the first block characteristic information; and if the first target value represents the characteristic value of the block, taking the block corresponding to the first target value as a first target block. Therefore, the salient block in the video to be identified, namely the first target block, is quickly identified.

Description

Recognition method, training method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to computer vision, video analysis, and deep learning techniques.
Background
Efficient video identification requires both accurate video identification and a limited budget of computing resources. It is widely used in scenes such as automatic driving and video monitoring, and is becoming an increasingly important topic in the computer vision community.
Disclosure of Invention
The disclosure provides an identification method, a training method, an apparatus, a device and a storage medium.
According to an aspect of the present disclosure, there is provided an identification method including:
based on a target video frame in a video to be identified, obtaining first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame;
selecting a first target value from a decision value representing the first identification decision information and a characteristic value representing the first block characteristic information;
and if the first target value represents the characteristic value of the block, taking the block corresponding to the first target value as a first target block.
According to another aspect of the present disclosure, there is provided a model training method, including:
inputting a target sample frame in a sample video to a block model to be trained to obtain second block feature information of a block contained in the target sample frame and second identification decision information aiming at the target sample frame;
selecting a second target value from the decision value representing the second identification decision information and the characteristic value representing the second block characteristic information;
under the condition that the second target value represents the characteristic value of the block, taking the block corresponding to the second target value as a second target block, and inputting the second target block into a preset classification model for classification to obtain a classification result;
and performing joint training on the block model to be trained and a preset classification model based on the classification result, label information corresponding to the target sample frame and a loss function determined by the control parameter of the identification decision information to obtain the target block model and the target classification model.
According to still another aspect of the present disclosure, there is provided an identification apparatus including:
the video frame processing unit is used for obtaining first block feature information of a block contained in a target video frame and first identification decision information aiming at the target video frame based on the target video frame in a video to be identified;
a target value determining unit, configured to select a first target value from a decision value representing the first identification decision information and a feature value representing the first block feature information;
and the target block determining unit is used for taking the block corresponding to the first target value as the first target block under the condition that the first target value represents the characteristic value of the block.
According to still another aspect of the present disclosure, there is provided a model training apparatus including:
the first model processing unit is used for inputting a target sample frame in a sample video to a block model to be trained to obtain second block characteristic information of a block contained in the target sample frame and second identification decision information aiming at the target sample frame;
a result processing unit, configured to select a second target value from a decision value representing the second identification decision information and a feature value representing the second block feature information;
the second model processing unit is used for taking the block corresponding to the second target value as a second target block under the condition that the second target value represents the characteristic value of the block, and inputting the second target block into a preset classification model for classification to obtain a classification result;
and the model training unit is used for performing combined training on the block model to be trained and a preset classification model based on the classification result, the label information corresponding to the target sample frame and a loss function determined by the control parameter of the identification decision information to obtain the target block model and the target classification model.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described identification method, or to perform the above-described training method.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the above-described identification method, or to perform the above-described training method.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described identification method, or performs the above-described training method.
Thus, the recognition efficiency can be improved, and the recognition cost can be reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart of an implementation of an identification method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a target video frame after a blocking process in a specific example according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of an implementation of a model training method according to an embodiment of the present disclosure;
FIG. 4 is a sampling schematic diagram of the identification method in a specific example according to an embodiment of the disclosure;
FIGS. 5(a) and 5(b) are schematic diagrams of an identification flow of the identification method in a specific example according to an embodiment of the disclosure;
FIG. 6 is a schematic diagram of a structure of an identification apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device for implementing a recognition method or a model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In practical application scenarios, accuracy is no longer the main problem of video identification; instead, how to complete identification with few computing resources has become the key concern. The scheme of the present disclosure samples the video to be identified and then identifies and classifies only a salient spatial block (namely the first target block) of each sampled target video frame (such as a salient frame), so that the amount of calculation is far less than that of conventional methods which identify the complete frame, and the calculation cost is greatly reduced without lowering the identification precision.
Specifically, the present disclosure provides an identification method, as shown in fig. 1, including:
step S101: based on a target video frame in a video to be identified, first block feature information of a block included in the target video frame and first identification decision information aiming at the target video frame are obtained.
Here, as shown in fig. 2, the target video frame may include a plurality of blocks, for example, 9 blocks, based on which the first block feature information of each block may be obtained. For example, the first block feature information may specifically be the position of the current block within the target video frame and the feature value of the current block, where the feature value may specifically represent a saliency value distinguishing the current block from the other blocks.
For example, taking the 9 blocks shown in fig. 2 and numbering them from top to bottom and from left to right, the 5th block may be the salient block.
In this example, the first identification decision information is used to characterize identification decision information for continuing identification of the target video frame or stopping identification.
Step S102: and selecting a first target value from the decision value representing the first identification decision information and the characteristic value representing the first block characteristic information.
Step S103: and if the first target value represents the characteristic value of the block, taking the block corresponding to the first target value as a first target block.
Here, in the case where the first target value characterizes the feature value of a block, this indicates that identification needs to continue, and the block corresponding to the first target value is then taken as the first target block.
Therefore, with this scheme, the feature values of the blocks contained in the target video frame and the first identification decision information for the target video frame can be obtained; further, when identification needs to continue, namely when the first target value characterizes the feature value of a block, the block corresponding to the first target value is taken as the first target block, which is the salient block obtained by identification.
In a specific example of the present disclosure, after the first target block is obtained, it may be further classified to obtain a target classification result, that is, the target classification result of the target video frame. In practical application, the classification result of the video to be recognized can be determined based on the target classification results of a plurality of target video frames in the video to be recognized; in other words, based on the target classification results of the target video frames, the classification result of the video to be recognized can be obtained, which provides technical support for rapid video classification.
in this way, since the scheme of the present disclosure classifies the first target block obtained by identification in the identification (also referred to as classification) process, compared with the scheme of identifying the whole video frame, the present disclosure can effectively improve the identification preparation rate, and at the same time, reduce the calculation cost, that is, reduce the identification cost.
In a specific example of the present disclosure, in order to further improve the classification efficiency, a classification model may be further used to complete a classification task, specifically, the classifying the first target block to obtain a target classification result includes: and inputting the first target block into a target classification model to obtain a target classification result.
In practical application, a high-performance classification model can be used as a target classification model, so that detail extraction is performed and classification is completed, and the accuracy of a classification result is further improved.
In this way, since the present disclosure uses the block, i.e. the first target block, instead of the complete video frame when performing the classification processing with the classification model, the identification accuracy can be further improved while the calculation cost is reduced.
In a specific example of the disclosed solution, in the case where the first target value characterizes the decision value, the identification process for the video to be identified is stopped. That is to say, when the first target value characterizes the decision value, identification needs to stop: identification of the current target video frame stops, and, when the video to be identified contains a plurality of target video frames, identification of the target video frames that have not yet been identified also stops. In other words, the identification process for the video to be identified is stopped, so that repeated identification is effectively avoided through the configured early-exit mechanism, the identification efficiency is further improved, and the identification cost is further reduced.
For example, if 10 target video frames are acquired from the video to be identified, 5 of them have been identified, and the obtained identification results are the same, then, based on the early-exit mechanism of the present disclosure, the first target value may characterize the decision value while the 6th target video frame is being processed; that is, the identification task may be stopped directly without identifying the 6th target video frame and the remaining target video frames, so that repeated, ineffective identification is effectively avoided, the identification efficiency is further improved, and the identification cost is further reduced.
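As an illustration of this early-exit mechanism, the following Python sketch loops over the sampled target video frames and stops as soon as the policy output selects the decision value. The helper names (policy_fn, classify_fn, split_blocks) are hypothetical stand-ins for the target block model and target classification model described below, not names used in the disclosure.

```python
# Minimal sketch of the early-exit loop, assuming hypothetical helpers:
#   policy_fn(frame)   -> index a_i in [0, N]  (0 = stop, 1..N = salient block id)
#   classify_fn(block) -> classification result for one block
#   split_blocks(frame)-> list of N blocks (e.g. the 9 blocks of Fig. 2)
def recognize_video(target_frames, policy_fn, classify_fn, split_blocks):
    last_result = None
    for frame in target_frames:
        a_i = policy_fn(frame)          # first target value for this frame
        if a_i == 0:                    # decision value selected: exit early,
            break                       # remaining frames are never processed
        blocks = split_blocks(frame)
        last_result = classify_fn(blocks[a_i - 1])  # classify only the salient block
    return last_result                  # decision of the last frame before exit
```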
In a specific example of the present disclosure, in order to further improve the recognition efficiency and recognition accuracy, a model may be used to obtain the block feature information and the recognition decision information. Specifically, obtaining the first block feature information of the blocks contained in the target video frame and the first identification decision information for the target video frame based on the target video frame in the video to be identified specifically includes: inputting a target video frame in the video to be identified into a target block model to obtain the first block feature information of the blocks contained in the target video frame and the first identification decision information for the target video frame.
Therefore, the model (namely, the target block model) trained in advance is fully utilized to extract the significant features of each block in the target video frame, and the significant features are expressed by quantifiable feature values, and meanwhile, the identification decision information (namely, the decision value) can be quantitatively expressed, so that the identification efficiency is improved, and meanwhile, the identification accuracy is ensured.
In a specific example of the present disclosure, based on different extraction functions, the target block model may include different sub models, for example, three sub models, which are a first target network, a second target network, and a third target network, so that different feature extraction tasks are completed through the three sub models, and the recognition efficiency is further improved.
Specifically, the above inputting a target video frame in a video to be identified into a target block model to obtain first block feature information of a block included in the target video frame and first identification decision information for the target video frame specifically includes:
inputting a target video frame in a video to be identified into a first target network in a target block model to obtain global feature information of the target video frame; inputting the global feature information of the target video frame and the key feature information of the associated frame of the target video frame into a second target network in the target block model to obtain the key feature information of the target video frame; inputting the key feature information of the target video frame into a third target network in the target block model to obtain first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame.
Therefore, different submodels, namely different target networks, are fully utilized to complete different feature extraction tasks so as to exert the advantages of each target network, so that the accuracy of quantifiable feature values is improved, the accuracy of quantifiable decision values is improved, the recognition efficiency is further improved, and meanwhile, the recognition accuracy is further improved.
In a specific example of the disclosed solution, the associated frame of the target video frame is a video frame that is previous to the target video frame. Thus, the accuracy of identification is further improved by utilizing the historical information.
In a specific example of the disclosed solution, the first target network is a lightweight network; and/or the third target network is a fully connected neural network; wherein the feature dimension extracted by the lightweight network is smaller than the feature dimension extracted by the fully-connected neural network.
Here, in a specific example, the first target network is a lightweight network; and the third target network is a fully-connected neural network, and the feature dimension extracted by the lightweight network is smaller than the feature dimension extracted by the fully-connected neural network.
Therefore, in the process of extracting features from the whole video frame (namely, the target video frame), the scheme adopts the lightweight network, which reduces the amount of calculation; for the part of the model that needs to make the decision, a network with rich feature dimensionality, namely the fully-connected neural network, is selected to ensure the recognition accuracy. Thus, the amount of calculation is effectively reduced without lowering the identification accuracy.
For example, the lightweight network may specifically be MobileNetV2. Further, for a scene in which the target video frame includes N blocks, the fully-connected neural network may be composed of one layer of N+1 neurons; in this case, the output of the fully-connected neural network is an (N+1)-dimensional vector, where N of the N+1 dimensions represent the feature value of each block, each of those N dimensions corresponding to one block, and the remaining dimension represents a decision value (i.e. the decision value of the first identification decision information described above), so as to provide data support for the subsequent decision of continuing recognition or stopping recognition (also referred to as exiting recognition).
In a specific example of the present disclosure, the second target network is a gated loop network, where the gated loop network is trained based on key feature information of a video sample frame and key feature information of an associated video frame of the video sample frame. Therefore, based on the loop structure in the gated loop network, the historical information (namely the state variable of the previous frame) is fully utilized, and network support is provided for more accurate characteristic values and decision values.
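The three sub-networks described above can be wired together as in the following PyTorch sketch. MobileNetV2, the GRU and the fully-connected layer with N+1 outputs come from the text; the class name, the hidden size, the pooling step and weights=None are illustrative assumptions, not details fixed by the disclosure.

```python
import torch
import torch.nn as nn
import torchvision

class SamplingPolicy(nn.Module):
    """Illustrative sketch of the target block model: lightweight backbone
    (first target network) -> GRU cell (second target network) ->
    fully-connected layer with N+1 outputs (third target network)."""
    def __init__(self, num_blocks: int, hidden_dim: int = 512):
        super().__init__()
        backbone = torchvision.models.mobilenet_v2(weights=None)
        self.backbone = backbone.features                  # coarse-grained feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gru = nn.GRUCell(1280, hidden_dim)            # 1280 = MobileNetV2 feature dim
        self.head = nn.Linear(hidden_dim, num_blocks + 1)  # N block scores + 1 decision score

    def forward(self, frame, prev_state):
        # frame: (B, 3, H, W); prev_state: key feature info of the previous frame (or None)
        v0 = self.pool(self.backbone(frame)).flatten(1)    # global feature information
        state = self.gru(v0, prev_state)                   # key feature info of this frame
        q_i = self.head(state)                             # (B, N+1): feature values + decision value
        return q_i, state
```

During inference, the state returned for one frame is fed back as prev_state for the next sampled frame, which is how the scheme reuses the historical information mentioned above.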
For example, suppose the video to be identified includes a plurality of target video frames and the frame currently being processed is the i-th target video frame. In particular, to better make recognition decisions, the present example employs the Gumbel Softmax trick to sample a specific action decision (i.e. the first target value) a_i from the output q_i of the fully-connected neural network. To facilitate the subsequent expressions, the processing result (i.e. the first block feature information and the first identification decision information) for the i-th target video frame is written as
q_i = [q_i[0], q_i[1], ..., q_i[N]]
That is, q_i is a vector of dimension N+1.
Further, a softmax is taken over the output q_i of the fully-connected neural network, i.e.
π_i = softmax(q_i), with π_i[k] = exp(q_i[k]) / Σ_{j=0..N} exp(q_i[j]), k = 0, 1, ..., N
Here, π_i is the softmax result corresponding to the i-th target video frame.
The action decision, namely the first target value a_i, can then be obtained based on the following formula:
a_i = argmax_{k ∈ {0, ..., N}} ( log π_i[k] + g_k )
where g_k is the Gumbel noise used for the Gumbel Softmax sampling, and a_i ∈ [0, N], i.e. a_i is 0 or a natural number from 1 to N. Further, suppose that in the (N+1)-dimensional vector q_i = [q_0, q_1, ..., q_l, ..., q_N], the 0-th position q_0 represents the decision value for continuing or stopping recognition, the 1st position q_1 represents the feature value of the first block, and similarly q_l represents the feature value of the l-th block and q_N the feature value of the N-th block. If a_i is non-zero, i.e. a_i ∈ [1, N], identification needs to continue: the block indicated by a_i (i.e. the first target block) is determined from the blocks contained in the i-th target video frame, for example the a_i-th block of the i-th target video frame, and this block is then input to the target classification model. If a_i is 0, the feed-forward process for the current target video frame exits early, and the subsequent frames are no longer input to the spatio-temporal sampling policy module for identification, i.e. the identification process stops.
For example, taking a 5-dimensional q_i as an example, suppose the largest entry of the corresponding softmax result, 0.3, is at position 2. In this case the position of the maximum value 0.3 is output, i.e. a_i = 2; further, the 2nd block in the i-th target video frame is input to the classification model for identification.
Or, continuing with a 5-dimensional q_i whose largest softmax entry, 0.3, is at position 0: in this case the output is a_i = 0, which means that identification needs to stop, and the other frames remaining in the list L are no longer identified.
Therefore, the salient blocks are input into the classification model for identification when identification needs to continue, and the identification process is exited directly when identification does not need to continue, so that the accuracy is effectively improved without changing the calculation cost; meanwhile, the calculation cost can be effectively reduced without changing the accuracy.
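A possible realization of this selection step, assuming the (N+1)-dimensional output q_i defined above, is sketched below; the Gumbel-noise sampling follows the Gumbel Softmax trick named in the text, and the helper name select_action is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def select_action(q_i: torch.Tensor) -> int:
    """Sample the action decision a_i from the (N+1)-dimensional output q_i.
    Index 0 means 'stop recognition'; index k in 1..N names the k-th block."""
    log_pi = F.log_softmax(q_i, dim=-1)                       # log pi_i
    gumbel = torch.distributions.Gumbel(0.0, 1.0).sample(log_pi.shape)
    a_i = int(torch.argmax(log_pi + gumbel))                  # a_i in [0, N]
    return a_i

# e.g. for a 5-entry q_i (1 decision value + 4 block feature values):
#   a_i == 0 -> exit early;  a_i == 2 -> classify the 2nd block of this frame.
```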
In a specific example of the present disclosure, sampling may be performed in a manner that a video to be identified is grouped to obtain at least two groups of sub-videos; and selecting a video frame from at least one group of the sub-videos as a target video frame. Therefore, the video frames sampled from the time dimension are identified, and not all the frames in the video to be identified, so that the identification efficiency is further improved.
In a specific example of the present disclosure, in a case that a plurality of target video frames are obtained by sampling, combining the obtained plurality of target video frames into a video frame set; and then, selecting a target video frame aiming at the video to be identified from the video frame set. Therefore, the sampling task is completed first, the target video frames obtained in the sampling process form a video frame set, and then the target video frames are selected from the video frame set to be identified, for example, the target video frames are identified one by one from the first target video frame in the video frame set until the identification is exited.
Therefore, the scheme only needs to identify the target video frame in the video frame set obtained by sampling from the time dimension, but not all frames in the video to be identified, and therefore, the identification efficiency is further improved.
For example, in time, the video to be identified may be divided into a plurality of coarse-grained chapters, that is, the video to be identified may be divided into a plurality of groups; each coarse-grained chapter may have a significant frame, and therefore the frame sequence may be sampled from coarse (i.e. a small number of chapters) to fine (a large number of chapters). For a video to be identified V = {v_1, v_2, ..., v_T}, where T is a natural number greater than or equal to 1, a list of frame indexes L = {} (initially empty) is initialized. First, the frame at the middle of the video to be identified, i.e. the frame with index ⌈T/2⌉, is taken as a target video frame. Then, the video to be identified is divided into 2 chapters at equal intervals, the middle 1 frame of each chapter is taken as a target video frame and put into the list L. The loop then divides the video into 3 chapters at equal intervals, takes the middle frame of each chapter as a target video frame and continues to put it into L; here, if a frame obtained by the current sampling coincides with a frame already in L, the frame obtained by the current sampling is discarded. And so on, the sampling yields a hierarchical frame index list L = {l_1, l_2, ..., l_M}.
In terms of time, M target video frames are obtained from the video to be identified, wherein M is a natural number which is greater than or equal to 1 and less than or equal to T.
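A plain-Python sketch of this coarse-to-fine index sampling is given below; the exact way the original scheme picks the "middle" frame of a chapter is not spelled out, so the rounding used here is an assumption.

```python
def hierarchical_frame_indices(T: int, max_levels: int) -> list:
    """Coarse-to-fine sampling: split the video into 1, 2, 3, ... chapters,
    take the middle frame of each chapter, and skip indices already chosen."""
    L = []
    for chapters in range(1, max_levels + 1):
        chapter_len = T / chapters
        for c in range(chapters):
            # middle frame of the c-th chapter (rounding is an assumption)
            idx = int(c * chapter_len + chapter_len / 2)
            if idx not in L:
                L.append(idx)
    return L  # hierarchical frame index list, len(L) == M <= T

# Example: hierarchical_frame_indices(T=100, max_levels=3) -> [50, 25, 75, 16, 83]
# (the middle frame of the 2nd chapter at level 3 coincides with 50 and is discarded)
```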
Therefore, with this scheme, the feature values of the blocks contained in the target video frame and the first identification decision information for the target video frame can be obtained; further, when identification needs to continue, namely when the first target value characterizes the feature value of a block, the block corresponding to the first target value is taken as the first target block, which is the salient block obtained by identification.
It can be understood that the scheme disclosed by the invention can be applied to large-scale short video classification and movie and television series classification tasks, and the video classification accuracy can be improved under the condition that the calculated amount is not changed; meanwhile, under the condition that the accuracy rate is not changed, the calculation amount is reduced.
The present disclosure further provides a model training method, as shown in fig. 3, including:
step S301: inputting a target sample frame in a sample video to a block model to be trained to obtain second block feature information of a block contained in the target sample frame and second identification decision information aiming at the target sample frame.
Step S302: and selecting a second target value from the decision value representing the second identification decision information and the characteristic value representing the second block characteristic information.
In step S303, under the condition that the second target value represents the feature value of the block, the block corresponding to the second target value is used as a second target block, and is input to a preset classification model for classification, so as to obtain a classification result.
Step S304: and performing joint training on the block model to be trained and a preset classification model based on the classification result, label information corresponding to the target sample frame and a loss function determined by the control parameter of the identification decision information to obtain the target block model and the target classification model.
Therefore, model support is provided for completing efficient recognition, and meanwhile, model support is provided for improving recognition efficiency and reducing recognition cost.
In a specific example of the disclosed solution, the control parameter of the identification decision information is related to at least one of the following information:
the decision value in case the second target value characterizes the decision value;
the number of the target sample frames;
a preset value.
For example, the loss function of the disclosed solution may be specifically:
L=CELoss(z,label)+penalty;
wherein CELoss is the classification loss, z is the classification decision output by the classification model to be trained, and label is the class label. The penalty term characterizes the control parameter of the recognition decision information; specifically, it is determined by the values π_i[0] of the target sample frames processed before the early exit, scaled by a preset value σ. Here, π_i[0] characterizes the probability that the argmax result a_i for the i-th target sample frame is 0, σ is a preset value, and P is the total number of target sample frames before the early exit.
Therefore, this effectively solves the problem that, in the model training stage, the model keeps identifying and never triggers the stopping (early-exit) process merely in order to reduce the cross-entropy loss, and it provides support for further realizing efficient identification.
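The sketch below shows one way such a loss could be assembled. Because the disclosure gives the penalty only in general terms (a function of π_i[0], σ and P), the concrete penalty used here, the mean of (1 - π_i[0]) over the P frames seen before the exit scaled by σ, is an assumption for illustration, not the patented formula.

```python
import torch
import torch.nn.functional as F

def joint_loss(z, label, exit_probs, sigma: float = 0.1):
    """z: class logits from the classification model; label: class index tensor.
    exit_probs: list of pi_i[0] values (exit probability of each of the P
    target sample frames processed before the early exit).
    The penalty form below is an assumed illustration: it grows when the
    policy keeps the exit probability low, nudging the model to exit early."""
    ce = F.cross_entropy(z, label)            # CELoss(z, label)
    pi0 = torch.stack(exit_probs)             # shape (P,)
    penalty = sigma * (1.0 - pi0).mean()      # assumed penalty(pi_i[0], sigma, P)
    return ce + penalty
```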
In a specific example of the present disclosure, the inputting a target sample frame in a sample video to a block model to be trained to obtain second block feature information of a block included in the target sample frame and second identification decision information for the target sample frame includes:
inputting a target sample frame in the sample video to a first network to be trained in the block model to be trained to obtain global feature information of the target sample frame; inputting the global feature information of the target sample frame and the key feature information of the associated frame of the target sample frame into a second network to be trained in the block model to be trained to obtain the key feature information of the target sample frame; and inputting the key feature information of the target sample frame into a third network to be trained in the block model to be trained to obtain the second block feature information of the blocks contained in the target sample frame and the second identification decision information for the target sample frame.
Therefore, the accuracy and the recognition efficiency of model recognition are further improved by performing combined training on the block model to be trained (comprising the lightweight trunk network, the gated cyclic network and the fully-connected neural network) and the preset classification model.
In a specific example of the disclosed solution, the associated frame of the target sample frame is a previous video frame of the target sample frame. Thus, the accuracy of identification is further improved by utilizing the historical information.
In a specific example of the present disclosure, the first network to be trained is a lightweight network to be trained; and/or the third network to be trained is a fully-connected neural network to be trained; and the characteristic dimension extracted by the lightweight network to be trained is smaller than that extracted by the fully-connected neural network to be trained.
In a specific example, the first network to be trained is a lightweight network to be trained; and the third network to be trained is a fully-connected neural network to be trained.
Therefore, in the process of extracting features from the whole video frame (namely, the target video frame), the scheme adopts the lightweight network, which reduces the amount of calculation; for the part of the model that needs to make the decision, a network with rich feature dimensionality, namely the fully-connected neural network, is selected to ensure the recognition accuracy. Thus, the amount of calculation is effectively reduced without lowering the identification accuracy.
In a specific example of the present disclosure, the second network to be trained is a gated cyclic network to be trained. Therefore, based on the loop structure in the gated loop network, the historical information (namely the state variable of the previous frame) is fully utilized, and network support is provided for more accurate characteristic values and decision values.
It is understood that, in order to distinguish the models of the model training phase from those of the model using phase, the models trained in the model training phase may be respectively referred to as: the block model to be trained and the classification model to be trained (namely the preset classification model); wherein the block model to be trained comprises: the lightweight network to be trained, the gated cyclic network to be trained, and the fully-connected neural network to be trained.
Correspondingly, after the training is finished, the models used by the identification method can be correspondingly obtained, namely a target block model and a target classification model; the target block model obtained after the training comprises: a lightweight backbone network, a gated cyclic network, and a fully-connected neural network.
For example, suppose a sample video includes a plurality of target sample frames and the frame currently being processed is the i-th target sample frame. In particular, to better make recognition decisions, the present example employs the Gumbel Softmax technique to sample a specific action decision (i.e. the second target value) a_i from the output q_i of the fully-connected neural network to be trained. To facilitate the subsequent expressions, the processing result (i.e. the second block feature information and the second identification decision information) for the i-th target sample frame is written as
q_i = [q_i[0], q_i[1], ..., q_i[N]]
That is, q_i is a vector of dimension N+1.
Further, a softmax is taken over the output q_i of the fully-connected neural network to be trained, i.e.
π_i = softmax(q_i), with π_i[k] = exp(q_i[k]) / Σ_{j=0..N} exp(q_i[j]), k = 0, 1, ..., N
Here, π_i is the softmax result corresponding to the i-th target sample frame.
The action decision, namely the second target value a_i, can be obtained based on the following formula:
a_i = argmax_{k ∈ {0, ..., N}} ( log π_i[k] + g_k )
where a_i ∈ [0, N], i.e. a_i is 0 or a natural number from 1 to N, and g_k is a noise value used for back propagation in the training process.
Further, suppose that in the (N+1)-dimensional vector q_i = [q_0, q_1, ..., q_l, ..., q_N], the 0-th position q_0 represents the decision value for continuing or stopping recognition, the 1st position q_1 represents the feature value of the first block, and similarly q_l represents the feature value of the l-th block and q_N the feature value of the N-th block. If a_i is non-zero, i.e. a_i ∈ [1, N], identification needs to continue: the block indicated by a_i (i.e. the second target block) is determined from the blocks contained in the i-th target sample frame, for example the a_i-th block, and this block is then input to the preset classification model. If a_i is 0, the feed-forward process for the current target sample frame exits early, and the subsequent frames are no longer input to the block model to be trained for identification, i.e. the identification process stops.
For example, taking a 5-dimensional q_i as an example, suppose the largest entry of the corresponding softmax result, 0.3, is at position 2. In this case the position of the maximum value 0.3 is output, i.e. a_i = 2; further, the 2nd block in the i-th target sample frame is input to the classification model for identification.
Or, continuing with a 5-dimensional q_i whose largest softmax entry, 0.3, is at position 0: in this case the output is a_i = 0, which means that identification needs to stop, and the other frames remaining in the list L are no longer identified.
Therefore, the scheme of the invention inputs the significant blocks into the classification model to be trained for recognition under the condition of continuous recognition, and directly exits from the recognition process under the condition of no need of continuous recognition, so that the accuracy of the model is effectively improved under the condition of unchanged calculation cost; meanwhile, the calculation cost can be effectively reduced under the condition that the accuracy rate is not changed.
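Because the hard argmax in the formula above is not differentiable, training typically relies on a straight-through Gumbel Softmax estimator. The sketch below uses PyTorch's built-in F.gumbel_softmax for this purpose and is an illustrative assumption about how the noise value g_k can be realized in code, not the exact procedure of the disclosure.

```python
import torch
import torch.nn.functional as F

def sample_action_training(q_i: torch.Tensor, tau: float = 1.0):
    """Straight-through Gumbel-Softmax sampling for the training stage.
    Returns the discrete decision a_i and a one-hot vector over the N+1
    choices (index 0 = exit) whose gradient flows back into q_i."""
    one_hot = F.gumbel_softmax(q_i, tau=tau, hard=True)   # adds Gumbel noise g_k
    a_i = int(one_hot.argmax())                            # discrete decision a_i
    return a_i, one_hot

# The one-hot vector can be used to softly select the chosen block's features,
# so the classification loss can back-propagate into the block model to be trained.
```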
The following describes the present disclosure in further detail with reference to specific examples. In particular, the present disclosure provides an efficient video recognition method based on spatio-temporal hierarchical sampling. In a specific example, the framework mainly comprises three parts: an initial spatio-temporal sampling module, a spatio-temporal sampling policy module, and a classification module. The initial spatio-temporal sampling module is mainly used for hierarchically sampling the video to be identified to obtain the target video frames for subsequent identification; the spatio-temporal sampling policy module is mainly used for identifying, from a target video frame, the part that is significant in space and time, which may also be called the salient block (namely, the first target block in the scheme of the disclosure); the classification module is mainly used for inputting the salient block determined by the spatio-temporal sampling policy module into a high-performance classification model so as to extract details and finish the classification.
The specific description of each part specifically includes:
A first part: the initial spatio-temporal sampling module, specifically as follows:
First, in order to complete spatio-temporal saliency sampling, the present example pre-samples some target video frames with saliency features and the blocks contained in those target video frames. Specifically, the two dimensions, time and space, are introduced separately. In time, based on the hierarchical characteristics of the video to be recognized when it is generated, the video to be recognized can be divided into a plurality of coarse-grained chapters, that is, divided into a plurality of groups; each coarse-grained chapter is likely to have a significant frame, and therefore the frame sequence can be sampled from coarse (i.e. a small number of chapters) to fine (a large number of chapters). As shown in FIG. 4, for a video to be identified V = {v_1, v_2, ..., v_T}, where T is a natural number greater than or equal to 1, a list of frame indexes L = {} (initially empty) is initialized. First, the frame at the middle of the video to be identified, i.e. the frame with index ⌈T/2⌉, is taken as a target video frame. Then, the video to be identified is divided into 2 chapters at equal intervals, the middle 1 frame of each chapter is taken as a target video frame and put into the list L. The loop then divides the video into 3 chapters at equal intervals, takes the middle frame of each chapter as a target video frame and continues to put it into L; here, if a frame obtained by the current sampling coincides with a frame already in L, the frame obtained by the current sampling is discarded. And so on, the sampling yields a hierarchical frame index list L = {l_1, l_2, ..., l_M}.
In terms of time, M target video frames are obtained from the video to be identified, where M is a natural number greater than or equal to 1 and less than or equal to T. Spatially, for each frame j_i, that is, each target video frame, an operation similar to convolution is performed: a 2-dimensional sliding window is set and, starting from the top-left corner, spatial blocks are sampled in order from left to right and from top to bottom according to the row-first principle, so that the blocks contained in the target video frame are obtained and can be written as the block set B_i = {b_i^1, b_i^2, ..., b_i^N}, where N is the number of spatially sampled blocks and is a natural number greater than or equal to 2. Accordingly, each frame gets such a block set.
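The following sketch extracts such a block set from one frame with a 2-dimensional sliding window, scanning row-first from the top-left corner; the block size and stride are free parameters here and are assumptions, since the disclosure does not fix them.

```python
import torch

def extract_blocks(frame: torch.Tensor, block_size: int, stride: int):
    """frame: (C, H, W). Returns a tensor of shape (N, C, block_size, block_size)
    containing the spatial blocks in left-to-right, top-to-bottom order."""
    c, h, w = frame.shape
    blocks = []
    for top in range(0, h - block_size + 1, stride):        # top to bottom
        for left in range(0, w - block_size + 1, stride):   # left to right (row-first)
            blocks.append(frame[:, top:top + block_size, left:left + block_size])
    return torch.stack(blocks)                               # N = number of sampled blocks

# e.g. a 224x224 frame with block_size=112 and stride=56 yields a 3x3 grid of 9 blocks.
```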
A second part: the spatio-temporal sampling policy module (i.e. the target block model described above) can output the specific block sampled in space (i.e. the first target block) and a spatio-temporal sampling strategy (i.e. the first identification decision information) of whether to continue or stop identification in time. Specifically, it mainly comprises: a lightweight backbone network (e.g., MobileNetV2) (i.e. the lightweight network according to the present disclosure), a Gated Recurrent Unit (GRU) (i.e. the gated cyclic network according to the present disclosure), and a fully-connected neural network. The specific steps are as follows:
Firstly, a lightweight backbone network (such as MobileNetV2) is used as a coarse-grained feature extractor to extract the global feature information v_0 of the original image (namely, each target video frame obtained by the sampling in the first part). Here, taking the i-th target video frame in the list L as an example, the global feature information v_0 of the i-th target video frame is obtained.
Secondly, the global feature information v_0 of the i-th target video frame and the state variable (namely, key feature information) of the frame preceding the i-th target video frame are input to the gated cyclic network to obtain the state variable (namely, key feature information) of the i-th target video frame, so that a better decision can be made by utilizing historical information (namely, the state variable of the previous frame) based on the recurrent structure (namely, the gated cyclic network).
Finally, the state variable of the i-th target video frame is input to the fully-connected neural network, which outputs the processing result q_i for the i-th target video frame. Here, the fully-connected neural network is mainly composed of one layer of N+1 neurons, where N is the number of blocks determined by the first part; based on this, q_i is an (N+1)-dimensional vector, where N of the N+1 dimensions represent the feature value of each block, each of those N dimensions corresponding to one block, and the remaining dimension represents a decision value (i.e. the decision value of the first identification decision information described above), so as to provide data support for the subsequent decision of continuing recognition or stopping recognition (also referred to as exiting recognition).
Here, to better make recognition decisions, the present example employs the Gumbel Softmax trick to sample a specific action decision a_i from the output q_i of the fully-connected neural network. To facilitate the subsequent expressions, the processing result for the i-th target video frame is written as
q_i = [q_i[0], q_i[1], ..., q_i[N]]
That is, q_i is a vector of dimension N+1. Further, a softmax is taken over the output q_i of the fully-connected neural network, i.e.
π_i = softmax(q_i), with π_i[k] = exp(q_i[k]) / Σ_{j=0..N} exp(q_i[j]), k = 0, 1, ..., N
Here, π_i is the softmax result corresponding to the i-th target video frame.
The action decision a_i can be obtained based on the following formula:
a_i = argmax_{k ∈ {0, ..., N}} ( log π_i[k] + g_k )
where g_k is the Gumbel noise used for the Gumbel Softmax sampling, and a_i ∈ [0, N], i.e. a_i is 0 or a natural number from 1 to N. Further, suppose that in the (N+1)-dimensional vector q_i = [q_0, q_1, ..., q_l, ..., q_N], the 0-th position q_0 represents the decision value for continuing or stopping recognition, the 1st position q_1 represents the feature value of the first block, and similarly q_l represents the feature value of the l-th block and q_N the feature value of the N-th block. If a_i is non-zero, i.e. a_i ∈ [1, N], identification needs to continue: the block indicated by a_i (i.e. the first target block), for example the a_i-th block of the i-th target video frame, is determined from the blocks contained in that frame and then input to the classification model. If a_i is 0, the feed-forward process for the current video frame exits early, and the subsequent frames in L are no longer input to the spatio-temporal sampling policy module, i.e. the identification process stops.
As shown in fig. 5(a), after the target video frame is input to the spatio-temporal sampling policy module, the feature values of 9 blocks and a decision value can be obtained, and then the feature values of the 9 blocks and the decision value are processed in the above manner, and then the feature values corresponding to the block 5 are output, and then the block 5 is input to the classification model for classification, so as to obtain a prediction result, that is, a classification result.
It should be understood that the spatio-temporal sampling policy module in fig. 5(a) does not output specific block images but specific feature values; the 9 block images are shown only to illustrate that the 9 output feature values correspond to the 9 blocks.
Further, as shown in fig. 5(b), taking 15 sampled target video frames as an example, the 15 target video frames may be numbered target video frame 1 to target video frame 15. Target video frame 1 is input into the spatio-temporal sampling policy model to obtain the feature value of block 1, and block 1 is then input into the classification model for classification processing to obtain a classification result; similarly, target video frames 2 to 4 are processed to obtain classification results. When target video frame 5 is processed, a decision value is output, namely identification stops (i.e. the early exit), and at this moment target video frame 5 and the frames after it are no longer identified; the recognition result obtained before the early exit, that is, the recognition result of target video frame 4, may be used as the final classification result (that is, the final classification decision).
Further, for example, taking a 5-dimensional q_i as an example, suppose the largest entry of the corresponding softmax result, 0.3, is at position 2. In this case the position of the maximum value 0.3 is output, i.e. a_i = 2; further, the 2nd block in the i-th target video frame is input to the classification model for identification.
Or, continuing with a 5-dimensional q_i whose largest softmax entry, 0.3, is at position 0: in this case the output is a_i = 0, which means that identification needs to stop, and the other frames remaining in the list L are no longer identified.
A third part: the classification model (i.e. the target classification model mentioned above), specifically as follows:
the example inputs the more informative blocks determined in the second part (i.e., the first target blocks) into a high performance network (i.e., a classification model) to extract high quality features. Here, a layer of gated round robin unit (GRU) may be further disposed in the classification model, so that a better classification decision may be made by combining the current high-quality features on the basis of the historical information.
It should be noted that only the classification decision of the last frame before exit is used as the final classification decision in this example.
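A possible shape for such a classifier, with a GRU cell carrying the historical information across frames, is sketched below. The disclosure only calls for a "high-performance network", so the ResNet-50 backbone, the feature dimensions and the class count used here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class BlockClassifier(nn.Module):
    """Illustrative sketch of the classification module: a high-performance
    backbone extracts detail features from the salient block, and a GRU cell
    fuses them with the history before the final classification decision."""
    def __init__(self, num_classes: int, hidden_dim: int = 1024):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.gru = nn.GRUCell(2048, hidden_dim)        # 2048 = ResNet-50 feature dim
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, block, prev_state):
        # block: (B, 3, h, w) salient block; prev_state: GRU state from the previous frame
        feat = self.backbone(block).flatten(1)         # high-quality block features
        state = self.gru(feat, prev_state)             # combine with historical information
        logits = self.fc(state)                        # classification decision for this frame
        return logits, state
```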
It should be noted that this example may employ the following training strategy to train the models used above. Here, in order to distinguish the models of the model training phase from those of the model using phase, the models trained in the model training phase may be respectively referred to as: the spatio-temporal sampling policy module to be trained (namely, the block model to be trained) and the classification model to be trained (namely, the preset classification model described above); wherein the spatio-temporal sampling policy module to be trained (i.e. the block model to be trained) comprises: the lightweight backbone network to be trained (namely the lightweight network to be trained), the gated cyclic unit to be trained (namely the gated cyclic network to be trained), and the fully-connected neural network to be trained.
Correspondingly, after the training is finished, the models used by the second part and the third part are obtained, namely the spatio-temporal sampling strategy module (i.e. the target block model) and the classification model (i.e. the target classification model); the spatio-temporal sampling strategy module (i.e. the target block model) obtained after training comprises: the lightweight backbone network, the gated recurrent unit (namely, the gated recurrent network), and the fully connected neural network.
The following gives a specific training strategy of the present example, which is introduced from two perspectives, specifically including:
First, the training sequence:
Firstly, warm-up (preheating) training is performed, that is, the classification model to be trained and the lightweight backbone network to be trained (namely the lightweight network to be trained) are trained separately. For example, blocks randomly sampled from a sample video are input to the classification model to be trained for model training; similarly, video frames (or blocks) randomly sampled from the sample video are input to the lightweight backbone network to be trained for model training, so as to complete the warm-up stage. Here, the warm-up training stage may use all frames in the time sequence as sample data for training, or may be carried out in other manners, which is not limited by the present disclosure. In this way, warm-up training yields a classification module with a certain discrimination ability and a coarse-grained feature extractor (namely, the lightweight backbone network) with a certain feature-extraction capability.
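One possible shape of a single warm-up step is sketched below; classifier, backbone and aux_head, the optimizers and the randomly sampled batches are hypothetical placeholders rather than components named in this disclosure.

import torch.nn.functional as F

def warmup_step(classifier, backbone, aux_head,
                block_batch, block_labels, frame_batch, frame_labels,
                opt_cls, opt_bb):
    # 1) train the classification model on randomly sampled blocks
    loss_cls = F.cross_entropy(classifier(block_batch), block_labels)
    opt_cls.zero_grad(); loss_cls.backward(); opt_cls.step()

    # 2) train the lightweight backbone on randomly sampled frames, using a
    #    throwaway auxiliary head so it learns coarse-grained features
    loss_bb = F.cross_entropy(aux_head(backbone(frame_batch)), frame_labels)
    opt_bb.zero_grad(); loss_bb.backward(); opt_bb.step()
    return loss_cls.item(), loss_bb.item()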
It can be understood that, since the models obtained in the warm-up training stage are not yet fully trained, for convenience of description they are still referred to as models to be trained, i.e., the classification model to be trained and the lightweight backbone network to be trained (i.e., the lightweight network to be trained).
Secondly, joint training is performed; that is, the lightweight backbone network to be trained, the gated recurrent unit to be trained (namely, the gated recurrent network to be trained), the fully-connected neural network to be trained and the classification model to be trained are trained jointly.
Specifically, a plurality of frames for training, i.e., target sample frames, are sampled from a sample video. It is understood that the model training phase and the model using phase are independent phases, and the videos used, and the video frames sampled from them, may be the same or different; the present disclosure is not limited in this respect. Here, in order to reuse the formulas of the model using phase, i.e. the formulas of the second part above, a target sample frame may also be understood as the ith target sample frame.
Further, similarly to the model using stage (that is, the second part described above), the ith target sample frame in the sample video is input to the lightweight backbone network to be trained to obtain global feature information of the ith target sample frame; the global feature information of the ith target sample frame and the key feature information (namely, the state variable) of the associated frame of the ith target sample frame are input to the gated recurrent network to be trained to obtain the key feature information (namely, the state variable) of the ith target sample frame; the key feature information of the ith target sample frame is input to the fully-connected neural network to be trained to obtain second key feature information of the blocks contained in the ith target sample frame and second identification decision information for the ith target sample frame, that is, the training-stage counterpart of the q_i in the second part.
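The forward pass described in this paragraph (lightweight backbone, then GRU, then fully-connected layer producing q_i) might look roughly as follows; the backbone module, the feature dimensions, num_blocks=9 and the softmax normalization of q_i are illustrative assumptions, not details taken from this disclosure.

import torch
from torch import nn

class BlockPolicy(nn.Module):
    """Sketch of the block model: backbone -> GRU -> fully connected layer -> q_i."""
    def __init__(self, backbone, feat_dim=1280, hidden_dim=512, num_blocks=9):
        super().__init__()
        self.backbone = backbone                          # lightweight feature extractor (assumed)
        self.gru = nn.GRUCell(feat_dim, hidden_dim)       # carries key features across frames
        self.fc = nn.Linear(hidden_dim, num_blocks + 1)   # index 0 = exit, 1..num_blocks = blocks

    def forward(self, frame, prev_state=None):
        global_feat = self.backbone(frame)                # global feature information of the frame
        state = self.gru(global_feat, prev_state)         # key feature information (state variable)
        q = torch.softmax(self.fc(state), dim=-1)         # q_i: one exit score plus one score per block
        return q, state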
Further, during the training phase, the q_i obtained in this way (the counterpart of the q_i in the second part) is processed to obtain a_i (the counterpart of the a_i in the second part).
Here, it should be noted that the formula that determines a_i in the training phase differs from that of the model using stage, i.e. differs from the a_i of the second part above. Specifically, the a_i of the training phase is obtained as follows:
[Equation image not reproduced in the text: the training-phase formula for a_i, defined in terms of π_i and the noise values G_k.]
wherein G_k is a noise value used for back propagation in the training process.
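The training-time formula itself is not reproduced above; adding a per-component noise value G_k so that the discrete choice stays differentiable is characteristic of the Gumbel-Softmax (straight-through) trick, so the sketch below assumes that interpretation and should not be read as the exact formula of this disclosure.

import torch.nn.functional as F

def training_time_action(pi_logits, tau=1.0):
    """Assumed Gumbel-Softmax sampling: argmax over logits plus Gumbel noise G_k."""
    # hard=True returns a one-hot sample whose gradient flows through the soft
    # relaxation, which is what makes the noise usable for back propagation.
    one_hot = F.gumbel_softmax(pi_logits, tau=tau, hard=True)
    a_i = int(one_hot.argmax())          # 0 = early exit, k >= 1 = use the k-th block
    return a_i, one_hot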
Further, in the training phase, in the case that the obtained a_i (the counterpart of the a_i in the second part above) is not 0, the block indicated by a_i is input to the classification model to be trained for classification to obtain a classification result;
further, based on the classification result, the label information corresponding to the target sample frame, and the loss function determined by the control parameter of the identification decision information, the lightweight backbone network to be trained, the gated recurrent unit to be trained (i.e., the gated recurrent network to be trained), the fully-connected neural network to be trained, and the classification model to be trained are trained jointly.
Second, the training objective:
It should be made clear that, in addition to the normal classification loss (the cross-entropy loss with the class label), the training objective of the disclosed solution also places some restrictions on the π_i obtained during the training phase. This is because, in order to reduce the cross-entropy loss, the model tends to trigger the a_i = 0 action (i.e., the exit) as rarely as possible and to keep recognizing; this spontaneous behavior defeats the purpose of efficient recognition in this example. Based on this, the present example adds a penalty term (i.e. the control parameter described above) on the component of π_i corresponding to a_i = 0, i.e. on π_i[0], namely:
[Equation image not reproduced in the text: the penalty term defined over π_i[0], the preset value σ and the number P of target sample frames.]
Here, σ is a preset value, and P is the number of target sample frames; more specifically, P is the total number of target sample frames before the early exit. For example, suppose the list L contains 5 elements, i.e.
[Equation image not reproduced in the text: an example list L containing 5 elements.]
In this case, if it is desired to exit from recognition around the 3rd frame, σ may be set to 0.6. In practical application, the model can adaptively use different numbers of frames for videos of different difficulty during learning: for a video that is hard to recognize, more frames are used, and for a video that is easy to recognize, fewer frames are sampled.
Based on this, the loss function of the joint training is:
L=CELoss(z,label)+penalty;
wherein CELoss is the classification loss, z is the classification decision output by the classification model to be trained, and label is the class label.
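Since the penalty formula is only shown as an image in the original, the sketch below assumes one simple possible form — pushing the average exit score π_i[0] over the P frames toward the preset value σ — purely to make the overall loss concrete; the exact penalty used in this disclosure may differ.

import torch
import torch.nn.functional as F

def joint_loss(z, label, exit_scores, sigma=0.6):
    """Sketch of L = CELoss(z, label) + penalty (the penalty form is an assumption)."""
    ce = F.cross_entropy(z, label)                     # classification loss on the final decision
    penalty = (exit_scores.mean() - sigma).abs()       # assumed constraint on the pi_i[0] components
    return ce + penalty

Here z would be the logits of the classification decision for the last frame before the exit, label the class label, and exit_scores a tensor collecting π_i[0] for the P target sample frames.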
Here, it is understood that the sampling mode used in the training phase may be similar to or different from the first part described above, and the present disclosure is not limited thereto.
Thus, the scheme of the disclosure fully considers the chaptered structure that arises in the video generation process and realizes coarse-to-fine sampling in time; it also fully considers the differences in saliency between different blocks in space, so that saliency mining is completed in space and salient blocks are mined out; in addition, a conditional early exit mechanism (namely the exit mechanism described above) is provided. Therefore, when recognition needs to continue, the salient blocks are input to the classification model for recognition, so that accuracy is effectively improved without changing the computational cost; equally, the computational cost is effectively reduced without changing the accuracy.
The present disclosure also provides an identification apparatus, specifically, as shown in fig. 6, including:
a video frame processing unit 601, configured to obtain, based on a target video frame in a video to be identified, first block feature information of a block included in the target video frame and first identification decision information for the target video frame;
a target value determining unit 602, configured to select a first target value from a decision value representing the first identification decision information and a feature value representing the first block feature information;
a target block determining unit 603, configured to, when the first target value represents a feature value of a block, take the block corresponding to the first target value as a first target block.
In a specific example of the present disclosure, the method further includes:
and the target classification unit is used for classifying the first target block to obtain a target classification result.
In a specific example of the disclosure, the target classification unit is specifically configured to input the first target block to a target classification model, so as to obtain a target classification result.
In a specific example of the present disclosure, the target block determining unit is further configured to stop an identification process for the video to be identified if the first target value represents a decision value.
In a specific example of the disclosure, the video frame processing unit is specifically configured to:
inputting a target video frame in a video to be identified into a target block model to obtain first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame.
In a specific example of the disclosure, the video frame processing unit is specifically configured to:
inputting a target video frame in a video to be identified into a first target network in a target block model to obtain global feature information of the target video frame;
inputting the global feature information of the target video frame and the key feature information of the associated frame of the target video frame into a second target network in the target block model to obtain the key feature information of the target video frame;
inputting the key feature information of the target video frame into a third target network in the target block model to obtain first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame.
In a specific example of the disclosed solution, the associated frame of the target video frame is a video frame that is previous to the target video frame.
In a specific example of the disclosed solution, the first target network is a lightweight network; and/or the third target network is a fully connected neural network;
the feature dimension extracted by the lightweight network is smaller than the feature dimension extracted by the fully-connected neural network.
In a specific example of the present disclosure, the second target network is a gated recurrent network, where the gated recurrent network is trained based on key feature information of a video sample frame and key feature information of an associated video frame of the video sample frame.
In a specific example of the present disclosure, the video frame processing unit is further configured to perform grouping processing on videos to be identified to obtain at least two groups of sub-videos; and selecting a video frame from at least one group of the sub-videos as a target video frame.
In a specific example of the present disclosure, the video frame processing unit is further configured to, in a case that a plurality of target video frames are obtained, combine the obtained plurality of target video frames into a video frame set; and selecting a target video frame aiming at the video to be identified from the video frame set.
The specific functions of the units in the above identification apparatus can be described with reference to the above identification method, and are not described herein again.
The present disclosure further provides a model training apparatus, specifically, as shown in fig. 7, including:
a first model processing unit 701, configured to input a target sample frame in a sample video to a block model to be trained, to obtain second block feature information of a block included in the target sample frame, and second identification decision information for the target sample frame;
a result processing unit 702, configured to select a second target value from the decision value representing the second identification decision information and the feature value representing the second block feature information;
the second model processing unit 703 is configured to, under the condition that the second target value represents the feature value of the block, take the block corresponding to the second target value as a second target block, and input the second target block to a preset classification model for classification to obtain a classification result;
and a model training unit 704, configured to perform joint training on the block model to be trained and a preset classification model based on the classification result, the label information corresponding to the target sample frame, and a loss function determined by the control parameter of the identification decision information, so as to obtain the target block model and the target classification model.
In a specific example of the disclosed solution, the control parameter of the identification decision information is related to at least one of the following information:
the decision value in case the second target value characterizes the decision value;
the number of the target sample frames;
a preset value.
In a specific example of the disclosure, the first model processing unit is specifically configured to:
inputting a target sample frame in a sample video to a first network to be trained in a block model to be trained to obtain global feature information of the target sample frame;
inputting the global feature information of the target sample frame and the key feature information of the associated frame of the target sample frame into a second network to be trained in the block model to be trained to obtain the key feature information of the target sample frame;
and inputting the key feature information of the target sample frame into a third network to be trained in the block model to be trained to obtain second key feature information of a block contained in the target sample frame and second identification decision information aiming at the target sample frame.
In a specific example of the disclosed solution, the associated frame of the target sample frame is a previous video frame of the target sample frame.
In a specific example of the present disclosure, the first network to be trained is a lightweight network to be trained; and/or the third network to be trained is a fully-connected neural network to be trained;
the characteristic dimensionality extracted by the lightweight network to be trained is smaller than that extracted by the fully-connected neural network to be trained.
In a specific example of the present disclosure, the second network to be trained is a gated recurrent network to be trained.
The specific functions of each unit in the model training device can be described with reference to the model training method, and are not described herein again.
In the technical solution of the present disclosure, the acquisition, storage, application and the like of the personal information of the users involved all comply with the provisions of the relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the respective methods and processes described above, such as the recognition method or the model training method. For example, in some embodiments, the recognition method or the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the recognition method or the model training method described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the recognition method or the model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (37)

1. An identification method, comprising:
based on a target video frame in a video to be identified, obtaining first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame;
selecting a first target value from a decision value representing the first identification decision information and a characteristic value representing the first block characteristic information;
and if the first target value represents the characteristic value of the block, taking the block corresponding to the first target value as a first target block.
2. The method of claim 1, further comprising:
and classifying the first target block to obtain a target classification result.
3. The method of claim 2, wherein the classifying the first target block to obtain a target classification result comprises:
and inputting the first target block into a target classification model to obtain a target classification result.
4. The method of any of claims 1 to 3, further comprising:
and stopping the identification process aiming at the video to be identified under the condition that the first target value represents a decision value.
5. The method according to any one of claims 1 to 4, wherein the obtaining, based on a target video frame in the video to be identified, first block feature information of a block included in the target video frame and first identification decision information for the target video frame includes:
inputting a target video frame in a video to be identified into a target block model to obtain first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame.
6. The method according to claim 5, wherein the inputting a target video frame in the video to be identified into a target block model, obtaining first block feature information of a block included in the target video frame, and first identification decision information for the target video frame comprises:
inputting a target video frame in a video to be identified into a first target network in a target block model to obtain global feature information of the target video frame;
inputting the global feature information of the target video frame and the key feature information of the associated frame of the target video frame into a second target network in the target block model to obtain the key feature information of the target video frame;
inputting the key feature information of the target video frame into a third target network in the target block model to obtain first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame.
7. The method of claim 6, wherein the associated frame of the target video frame is a video frame that is previous to the target video frame.
8. The method of claim 6 or 7, wherein the first target network is a lightweight network; and/or the third target network is a fully connected neural network;
the feature dimension extracted by the lightweight network is smaller than the feature dimension extracted by the fully-connected neural network.
9. The method of any of claims 6 to 8, wherein the second target network is a gated recurrent network, wherein the gated recurrent network is trained based on key feature information of a video sample frame and key feature information of an associated video frame of the video sample frame.
10. The method of any of claims 1 to 9, further comprising:
grouping the videos to be identified to obtain at least two groups of sub-videos;
and selecting a video frame from at least one group of the sub-videos as a target video frame.
11. The method of claim 10, further comprising:
under the condition of obtaining a plurality of target video frames, combining the obtained target video frames into a video frame set;
and selecting a target video frame aiming at the video to be identified from the video frame set.
12. A model training method, comprising:
inputting a target sample frame in a sample video to a block model to be trained to obtain second block feature information of a block contained in the target sample frame and second identification decision information aiming at the target sample frame;
selecting a second target value from the decision value representing the second identification decision information and the characteristic value representing the second block characteristic information;
under the condition that the second target value represents the characteristic value of the block, taking the block corresponding to the second target value as a second target block, and inputting the second target block into a preset classification model for classification to obtain a classification result;
and performing joint training on the block model to be trained and a preset classification model based on the classification result, label information corresponding to the target sample frame and a loss function determined by the control parameter of the identification decision information to obtain the target block model and the target classification model.
13. The method of claim 12, wherein,
the control parameter of the identification decision information is related to at least one of the following information:
the decision value in case the second target value characterizes the decision value;
the number of the target sample frames;
a preset value.
14. The method according to claim 12 or 13, wherein the inputting a target sample frame in a sample video to a block model to be trained, obtaining second block feature information of a block included in the target sample frame, and second identification decision information for the target sample frame includes:
inputting a target sample frame in a sample video to a first network to be trained in a block model to be trained to obtain global feature information of the target sample frame;
inputting the global feature information of the target sample frame and the key feature information of the associated frame of the target sample frame into a second network to be trained in the block model to be trained to obtain the key feature information of the target sample frame;
and inputting the key feature information of the target sample frame into a third network to be trained in the block model to be trained to obtain second key feature information of a block contained in the target sample frame and second identification decision information aiming at the target sample frame.
15. The method of claim 14, wherein the associated frame of the target sample frame is a video frame that is previous to the target sample frame.
16. The method of claim 14 or 15, wherein the first network to be trained is a lightweight network to be trained; and/or the third network to be trained is a fully-connected neural network to be trained; wherein,
the characteristic dimensionality extracted by the lightweight network to be trained is smaller than that extracted by the fully-connected neural network to be trained.
17. The method of any of claims 14 to 16, wherein the second network to be trained is a gated recurrent network to be trained.
18. An identification device comprising:
the video frame processing unit is used for obtaining first block feature information of a block contained in a target video frame and first identification decision information aiming at the target video frame based on the target video frame in a video to be identified;
a target value determining unit, configured to select a first target value from a decision value representing the first identification decision information and a feature value representing the first block feature information;
and the target block determining unit is used for taking the block corresponding to the first target value as the first target block under the condition that the first target value represents the characteristic value of the block.
19. The identification device of claim 18, further comprising:
and the target classification unit is used for classifying the first target block to obtain a target classification result.
20. The identification device of claim 19, wherein the target classification unit is specifically configured to input the first target block into a target classification model to obtain a target classification result.
21. The identification device according to any one of claims 18 to 20, wherein the target block determination unit is further configured to stop the identification process for the video to be identified if the first target value represents a decision value.
22. The identification apparatus according to any one of claims 18 to 21, wherein the video frame processing unit is specifically configured to:
inputting a target video frame in a video to be identified into a target block model to obtain first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame.
23. An identification device as claimed in claim 22, wherein the video frame processing unit is specifically configured to:
inputting a target video frame in a video to be identified into a first target network in a target block model to obtain global feature information of the target video frame;
inputting the global feature information of the target video frame and the key feature information of the associated frame of the target video frame into a second target network in the target block model to obtain the key feature information of the target video frame;
inputting the key feature information of the target video frame into a third target network in the target block model to obtain first block feature information of a block contained in the target video frame and first identification decision information aiming at the target video frame.
24. The identification device of claim 23, wherein the associated frame of the target video frame is a video frame that is previous to the target video frame.
25. An identification device as claimed in claim 23 or 24 wherein the first target network is a lightweight network; and/or the third target network is a fully connected neural network;
the feature dimension extracted by the lightweight network is smaller than the feature dimension extracted by the fully-connected neural network.
26. The identification device of any one of claims 23 to 25, wherein the second target network is a gated recurrent network, wherein the gated recurrent network is trained based on key feature information of a video sample frame and key feature information of an associated video frame of the video sample frame.
27. The identification apparatus according to any one of claims 18 to 26, wherein the video frame processing unit is further configured to perform grouping processing on the video to be identified to obtain at least two groups of sub-videos; and selecting a video frame from at least one group of the sub-videos as a target video frame.
28. The identification apparatus according to claim 27, wherein the video frame processing unit is further configured to, in a case that a plurality of target video frames are obtained, combine the obtained plurality of target video frames into a video frame set; and selecting a target video frame aiming at the video to be identified from the video frame set.
29. A model training apparatus comprising:
the first model processing unit is used for inputting a target sample frame in a sample video to a block model to be trained to obtain second block characteristic information of a block contained in the target sample frame and second identification decision information aiming at the target sample frame;
a result processing unit, configured to select a second target value from a decision value representing the second identification decision information and a feature value representing the second block feature information;
the second model processing unit is used for taking the block corresponding to the second target value as a second target block under the condition that the second target value represents the characteristic value of the block, and inputting the second target block into a preset classification model for classification to obtain a classification result;
and the model training unit is used for performing combined training on the block model to be trained and a preset classification model based on the classification result, the label information corresponding to the target sample frame and a loss function determined by the control parameter of the identification decision information to obtain the target block model and the target classification model.
30. The model training apparatus of claim 29,
the control parameter of the identification decision information is related to at least one of the following information:
the decision value in case the second target value characterizes the decision value;
the number of the target sample frames;
a preset value.
31. The model training apparatus as claimed in claim 29 or 30, wherein the first model processing unit is specifically configured to:
inputting a target sample frame in a sample video to a first network to be trained in a block model to be trained to obtain global feature information of the target sample frame;
inputting the global feature information of the target sample frame and the key feature information of the associated frame of the target sample frame into a second network to be trained in the block model to be trained to obtain the key feature information of the target sample frame;
and inputting the key feature information of the target sample frame into a third network to be trained in the block model to be trained to obtain second key feature information of a block contained in the target sample frame and second identification decision information aiming at the target sample frame.
32. The model training apparatus as claimed in claim 31, wherein the associated frame of the target sample frame is a video frame previous to the target sample frame.
33. The model training apparatus of claim 31 or 32, wherein the first network to be trained is a lightweight network to be trained; and/or the third network to be trained is a fully-connected neural network to be trained;
the characteristic dimensionality extracted by the lightweight network to be trained is smaller than that extracted by the fully-connected neural network to be trained.
34. The model training apparatus of any one of claims 31 to 33, wherein the second network to be trained is a gated recurrent network to be trained.
35. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the identification method of any one of claims 1-11, or to perform the training method of any one of claims 12-17.
36. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the identification method according to any one of claims 1-11, or to perform the training method according to any one of claims 12-17.
37. A computer program product comprising a computer program which, when executed by a processor, implements the identification method according to any one of claims 1-11, or implements the training method according to any one of claims 12-17.
CN202210059020.0A 2022-01-19 2022-01-19 Recognition method, training method, device, equipment and storage medium Pending CN114419508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210059020.0A CN114419508A (en) 2022-01-19 2022-01-19 Recognition method, training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114419508A true CN114419508A (en) 2022-04-29

Family

ID=81273078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210059020.0A Pending CN114419508A (en) 2022-01-19 2022-01-19 Recognition method, training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114419508A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10289912B1 (en) * 2015-04-29 2019-05-14 Google Llc Classifying videos using neural networks
CN109584276A (en) * 2018-12-04 2019-04-05 北京字节跳动网络技术有限公司 Critical point detection method, apparatus, equipment and readable medium
US20200293783A1 (en) * 2019-03-13 2020-09-17 Google Llc Gating model for video analysis
CN110766010A (en) * 2019-11-05 2020-02-07 上海鲸骞金融信息服务有限公司 Information identification method, model training method and related device
CN112364933A (en) * 2020-11-23 2021-02-12 北京达佳互联信息技术有限公司 Image classification method and device, electronic equipment and storage medium
CN112464814A (en) * 2020-11-27 2021-03-09 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and storage medium
CN112650885A (en) * 2021-01-22 2021-04-13 百度在线网络技术(北京)有限公司 Video classification method, device, equipment and medium
CN112800934A (en) * 2021-01-25 2021-05-14 西北大学 Behavior identification method and device for multi-class engineering vehicle
CN113326767A (en) * 2021-05-28 2021-08-31 北京百度网讯科技有限公司 Video recognition model training method, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LI, XINGZE et al.: "Relation-Guided Spatial Attention and Temporal Refinement for Video-Based Person Re-Identification", Assoc Advancement Artificial Intelligence, vol. 34, 18 August 2021 (2021-08-18) *
SANG Haifeng; ZHAO Ziyu; HE Dakuo: "Design of a video action recognition network based on recurrent region attention and video frame attention", Acta Electronica Sinica, no. 06, 15 June 2020 (2020-06-15) *
XIONG Zhaoxi: "Research on key technologies of intelligent video surveillance for moving targets", China Excellent Master's Theses Full-text Database (Information Science and Technology), no. 6, 15 June 2019 (2019-06-15) *

Similar Documents

Publication Publication Date Title
JP7165731B2 (en) Target detection method and device, training method, electronic device and medium
CN110347873B (en) Video classification method and device, electronic equipment and storage medium
US20230237841A1 (en) Occlusion Detection
CN113379627B (en) Training method of image enhancement model and method for enhancing image
US20160104058A1 (en) Generic object detection in images
CN110622176A (en) Video partitioning
CN111709406A (en) Text line identification method and device, readable storage medium and electronic equipment
CN111968150A (en) Weak surveillance video target segmentation method based on full convolution neural network
CN114092759A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN114020950A (en) Training method, device and equipment of image retrieval model and storage medium
CN112949818A (en) Model distillation method, device, equipment and storage medium
CN115376211A (en) Lip driving method, lip driving model training method, device and equipment
CN112364933A (en) Image classification method and device, electronic equipment and storage medium
CN113011320B (en) Video processing method, device, electronic equipment and storage medium
CN114495101A (en) Text detection method, and training method and device of text detection network
CN113657468A (en) Pre-training model generation method and device, electronic equipment and storage medium
CN116363429A (en) Training method of image recognition model, image recognition method, device and equipment
CN114419508A (en) Recognition method, training method, device, equipment and storage medium
CN116310643A (en) Video processing model training method, device and equipment
CN114882334B (en) Method for generating pre-training model, model training method and device
CN115880506A (en) Image generation method, model training method and device and electronic equipment
CN113963358B (en) Text recognition model training method, text recognition device and electronic equipment
CN113887535B (en) Model training method, text recognition method, device, equipment and medium
CN113139463B (en) Method, apparatus, device, medium and program product for training a model
CN115273148A (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination