CN112307883A - Training method, training device, electronic equipment and computer readable storage medium - Google Patents

Training method, training device, electronic equipment and computer readable storage medium

Info

Publication number
CN112307883A
Authority
CN
China
Prior art keywords
query
code
key value
video
loss function
Prior art date
Legal status
Granted
Application number
CN202010763380.XA
Other languages
Chinese (zh)
Other versions
CN112307883B
Inventor
潘滢炜 (Yingwei Pan)
姚霆 (Ting Yao)
梅涛 (Tao Mei)
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN202010763380.XA
Publication of CN112307883A
Application granted
Publication of CN112307883B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 40/20 Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F 18/241 Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06N 3/045 Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/08 Neural networks; learning methods
    • G06V 40/10 Recognition of biometric, human-related or animal-related patterns in image or video data; human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands


Abstract

The disclosure relates to a training method, a training device, an electronic device and a computer-readable storage medium, and relates to the field of computer technology. The method of the present disclosure comprises: selecting multiple frames of images of each sample video, extracting image blocks from the multiple frames respectively, and taking one of the extracted image blocks as a query image block; inputting each image block into a visual feature extraction model to obtain a code corresponding to each image block, wherein the code corresponding to the query image block is used as a query code; determining a first contrast loss function according to the similarity between the query code of each sample video and the codes corresponding to other image blocks in the same sample video and the similarity between the query code of each sample video and the codes corresponding to image blocks in different sample videos; and adjusting parameters of the visual feature extraction model according to the loss function of the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises the first contrast loss function.

Description

Training method, training device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
In recent years, artificial intelligence technology has developed rapidly. Computer vision is an important branch of the artificial intelligence field and has already achieved certain results. Computer vision covers the understanding and processing of images, videos and the like by a computer, where the understanding and processing of video is more complex.
Extracting visual features of a video is a critical part of video understanding, and the accuracy of visual feature extraction directly affects the accuracy of video understanding and of the results of downstream tasks (e.g., action recognition, object tracking). Visual features can be extracted with deep learning methods. Deep learning includes supervised learning, unsupervised learning and the like. Currently, supervised learning has made significant advances and is dominant in the learning of visual features of video.
Disclosure of Invention
The inventors have found that the results of supervised learning depend to a large extent on the large number of specialized labels required to train a deep neural network, and the labeling process is complex and cumbersome. In addition, supervised learning is carried out for a very specific task, and the resulting visual feature extraction model is difficult to apply to other tasks, so a generalization problem exists.
One technical problem to be solved by the present disclosure is: a new unsupervised training method for a visual feature extraction model is provided.
According to some embodiments of the present disclosure, there is provided a training method comprising: selecting a plurality of frames of images of each sample video, respectively extracting image blocks from the plurality of frames of images, and taking one of the extracted image blocks as a query image block; inputting each image block into a visual feature extraction model to obtain a code corresponding to each image block, wherein the code corresponding to the query image block is used as a query code; determining a first contrast loss function according to the similarity between the query code of each sample video and the codes corresponding to other image blocks in the same sample video and the similarity between the query code of each sample video and the codes corresponding to the image blocks in different sample videos, wherein the higher the similarity between the query code and the codes corresponding to other image blocks in the same sample video and the lower the similarity between the query code and the codes corresponding to the image blocks in different sample videos, the smaller the value of the first contrast loss function; and adjusting parameters of the visual feature extraction model according to the loss function of the visual feature extraction model to train the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises the first contrast loss function.
In some embodiments, a frame in which the query image block is located serves as an anchor frame, the extracted image block further includes another image block extracted from the anchor frame and different from the query image block, and the extracted image block serves as a first key-value image block, and the method further includes: determining a second contrast loss function according to the similarity between the query code of each sample video and the code corresponding to the first key value image block and the similarity between the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video, wherein the higher the similarity between the query code and the code corresponding to the first key value image block is, the lower the similarity between the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video is, and the smaller the value of the second contrast loss function is; wherein the loss function of the visual feature extraction model further comprises a second comparative loss function.
In some embodiments, a frame in which the query image block is located serves as an anchor frame, where the anchor frame is a first frame or a last frame of the multi-frame images arranged in time sequence, and the method further includes: for each sample video, combining the query code and the codes corresponding to image blocks extracted from other frames in the same sample video into a sequence code according to a preset order; inputting the sequence code into a classification model to obtain a prediction time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video; and determining a third loss function according to the prediction time sequence corresponding to each sample video and the real time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video; wherein the loss function of the visual feature extraction model further comprises the third loss function.
In some embodiments, the visual feature extraction model includes a query encoder and a key value encoder, the query encoder is configured to obtain a query code, and the key value encoder is configured to obtain codes corresponding to image blocks other than the query image block; adjusting parameters of the visual feature extraction model according to the loss function of the visual feature extraction model comprises: in each iteration, the parameters of the current iteration of the query encoder are adjusted according to the loss function of the visual feature extraction model, and the parameters of the current iteration of the key value encoder are adjusted according to the parameters of the last iteration of the query encoder and the parameters of the last iteration of the key value encoder.
In some embodiments, a frame in which the query image block is located serves as an anchor frame, the extracted image blocks further include another image block which is extracted from the anchor frame and is different from the query image block, the image block serves as a first key value image block, and one image block is respectively extracted from two other frames of the same sample video and serves as a second key value image block and a third key value image block; determining a first comparison loss function according to the similarity between the query code of each sample video and the codes corresponding to other image blocks in the same sample video and the similarity between the query code of each sample video and the codes corresponding to image blocks in different sample videos comprises: for each sample video, determining an inter-frame loss function corresponding to the sample video according to the similarity of a query code and a first key value code corresponding to a first key value image block, a second key value code corresponding to a second key value image block and a third key value code corresponding to a third key value image block, and the similarity of the query code and each negative key value code, wherein each negative key value code comprises the first key value code, the second key value code and the third key value code corresponding to other sample videos; and determining a first contrast loss function according to the interframe loss function corresponding to each sample video.
In some embodiments, extracting image blocks from other frames in the same sample video includes extracting one image block from two other frames in the same sample video respectively as a second key value image block and a third key value image block corresponding to the sample video; determining a second contrast loss function according to the similarity between the query code of each sample video and the code corresponding to the first key value image block and the similarity between the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video includes: for each sample video, determining an intra-frame loss function corresponding to the sample video according to the similarity of the query code and a first key value code corresponding to the first key value image block, and the similarity of the query code and a second key value code corresponding to the second key value image block and a third key value code corresponding to the third key value image block; and determining a second contrast loss function according to the intra-frame loss function corresponding to each sample video.
In some embodiments, the extracted image blocks further include another image block extracted from the anchor frame, which is different from the query image block, as a first key-value image block, and one image block is respectively extracted from two other frames of the same sample video, as a second key-value image block and a third key-value image block, and combining, in a preset order, codes corresponding to the query code and the image blocks extracted from the other frames of the same sample video into a sequence code includes: generating sequence codes according to the sequence of the query codes, the second key value codes corresponding to the second key value image blocks and the third key value codes corresponding to the third key value image blocks; inputting the sequence codes into a classification model, and obtaining the prediction time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video comprises the following steps: inputting the sequence code into a binary model to obtain a result of the query image block before or after the second key value image block and the third key value image block as a prediction time sequence; according to the corresponding predicted time sequence of each sample video and the real time sequence of the image blocks extracted from the image blocks and other frames in the same sample video in the sample video, determining a third loss function comprises: and determining a cross entropy loss function corresponding to each sample video according to the prediction time sequence and the real time sequence of the query image block, the second key value image block and the third key value image block in the sample video, and determining a third loss function according to the cross entropy loss function corresponding to each sample video.
In some embodiments, the method further comprises: determining similarity of the query code and the first, second and third key value codes according to dot products of the query code and the first, second and third key value codes respectively; and determining the similarity between the query code and each negative key value code according to the dot product of the query code and each negative key value code.
In some embodiments, the inter-frame loss function for each sample video is determined using the following formula:

$$\mathcal{L}_{inter} = -\frac{1}{3}\sum_{i=1}^{3}\log\frac{\exp(s_q\cdot s_k^{i}/\tau)}{\exp(s_q\cdot s_k^{i}/\tau)+\sum_{j=1}^{K}\exp(s_q\cdot s_k^{j-}/\tau)}$$

wherein s_q is the query code; 1 ≤ i ≤ 3 and i is a positive integer; s_k^1 is the first key value code, s_k^2 is the second key value code, and s_k^3 is the third key value code; 1 ≤ j ≤ K, j is a positive integer, and K is the total number of negative key value codes; s_k^{j-} is the j-th negative key value code; and τ is a hyperparameter.
In some embodiments, the intra-frame loss function corresponding to each sample video is determined using the following formula:

$$\mathcal{L}_{intra} = -\log\frac{\exp(s_q\cdot s_k^{1}/\tau)}{\exp(s_q\cdot s_k^{1}/\tau)+\exp(s_q\cdot s_k^{2}/\tau)+\exp(s_q\cdot s_k^{3}/\tau)}$$

wherein s_q is the query code, s_k^1 is the first key value code, s_k^2 is the second key value code, s_k^3 is the third key value code, and τ is a hyperparameter.
In some embodiments, the cross entropy loss function corresponding to each sample video is determined using the following formula:

$$\mathcal{L}_{order} = -\big[\,y\log g([s_q;s_k^{2};s_k^{3}]) + (1-y)\log\big(1-g([s_q;s_k^{2};s_k^{3}])\big)\big]$$

wherein s_q is the query code, s_k^2 is the second key value code, s_k^3 is the third key value code, g(·) is the classification model applied to the sequence code, and y ∈ {0,1} indicates whether, in the real time sequence of the sample video, s_q is before or after the second key value code s_k^2 and the third key value code s_k^3.
In some embodiments, the loss function of the visual feature extraction model is a weighted result of the first, second, and third loss functions.
According to further embodiments of the present disclosure, there is provided an action recognition method including: extracting a first preset number of frames from a video to be identified; determining the code of each frame of image by using a visual feature extraction model obtained by the training method of any of the foregoing embodiments; and inputting the code of each frame of image into an action classification model to obtain the action type in the video to be identified.
According to still other embodiments of the present disclosure, a behavior recognition method is provided, including: extracting a second preset number of frames from the video to be identified; determining the code of each frame of image by using a visual feature extraction model obtained by the training method of any embodiment; and inputting the coding of each frame of image into the behavior classification model to obtain the behavior type in the video to be identified.
According to still further embodiments of the present disclosure, there is provided an object tracking method including: determining the code of each frame of image of a video to be recognized by using a visual feature extraction model obtained by the training method of any of the foregoing embodiments, wherein position information of an object is labeled in the first frame of image of the video to be recognized; and inputting the code of each frame of image into an object tracking model to obtain the position information of the object in each frame of image.
According to still other embodiments of the present disclosure, there is provided a method for extracting features of a video, including: extracting a third preset number of frames from the video; and determining the code of each frame of image by using the visual feature extraction model obtained by the training method of any embodiment.
According to still further embodiments of the present disclosure, there is provided a training device including: an extraction module configured to select a plurality of frames of images of each sample video, extract image blocks from the plurality of frames of images respectively, and take one of the extracted image blocks as a query image block; an encoding module configured to input each image block into the visual feature extraction model to obtain a code corresponding to each image block, wherein the code corresponding to the query image block is used as a query code; a loss function determining module configured to determine a first contrast loss function according to the similarity between the query code of each sample video and the codes corresponding to other image blocks in the same sample video and the similarity between the query code of each sample video and the codes corresponding to the image blocks in different sample videos, wherein the higher the similarity between the query code and the codes corresponding to other image blocks in the same sample video and the lower the similarity between the query code and the codes corresponding to the image blocks in different sample videos, the smaller the value of the first contrast loss function; and a parameter adjusting module configured to adjust parameters of the visual feature extraction model according to a loss function of the visual feature extraction model to train the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises the first contrast loss function.
According to still further embodiments of the present disclosure, there is provided an action recognition apparatus including: an extraction module configured to extract a first preset number of frames from the video to be identified; an encoding module configured to determine the code of each frame of image by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments; and an action classification module configured to input the codes of the frames of images into the action classification model to obtain the action types in the video to be recognized.
According to still other embodiments of the present disclosure, there is provided a behavior recognition apparatus including: the extraction module is configured to extract a second preset number of frames from the video to be identified; the coding module is configured to determine the coding of each frame of image by using the visual feature extraction model obtained by the training method of any of the embodiments; and the behavior classification module is configured to input the codes of the frames of images into the behavior classification model to obtain the behavior types in the video to be recognized.
According to still further embodiments of the present disclosure, there is provided an object tracking apparatus including: the coding module is configured to determine the coding of each frame of image of the video to be recognized by using the visual feature extraction model obtained by the training method in any of the embodiments, wherein the position information of the target is marked in the first frame of image of the video to be recognized; and the object tracking module is configured to input the codes of the frames of images into the object tracking model to obtain the position information of the target in the frames of images.
According to still other embodiments of the present disclosure, there is provided a video feature extraction apparatus including: an extraction module configured to extract a third preset number of frames from the video; and an encoding module configured to determine the code of each frame of image by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments.
According to still further embodiments of the present disclosure, there is provided an electronic device including: a processor; and a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform a training method according to any of the foregoing embodiments, or an action recognition method according to any of the foregoing embodiments, or a behavior recognition method according to any of the foregoing embodiments, or an object tracking method according to any of the foregoing embodiments, or a feature extraction method for a video according to any of the foregoing embodiments.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the training method of any of the foregoing embodiments, or the action recognition method of any of the foregoing embodiments, or the behavior recognition method of any of the foregoing embodiments, or the object tracking method of any of the foregoing embodiments, or the video feature extraction method of any of the foregoing embodiments.
The method extracts image blocks from multiple frames of each sample video without labeling, encodes each image block with a visual feature extraction model, determines a first contrast loss function using the similarity between the query code and the codes corresponding to other image blocks in the same sample video and the similarity between the query code and the codes corresponding to image blocks in different sample videos, and then trains the visual feature extraction model by adjusting its parameters according to the first contrast loss function. The method of the present disclosure omits the labeling process, improves training efficiency, and performs unsupervised training purely from the inherent structure and correlation of the data, so the visual feature extraction model can have good generalization capability. In addition, based on the spatio-temporal coherence of video, the relevance of multiple frames within the same sample video and the irrelevance of frames across different videos are used to construct the loss function for training, so the visual feature extraction model can learn the characteristics of videos well, and the trained model can extract video features more accurately.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 illustrates a flow diagram of a training method of some embodiments of the present disclosure.
Fig. 2 shows a flow diagram of a training method of further embodiments of the present disclosure.
Fig. 3 illustrates a flow diagram of an action recognition method of some embodiments of the present disclosure.
Fig. 4 illustrates a flow diagram of a behavior recognition method of some embodiments of the present disclosure.
Fig. 5 illustrates a flow diagram of an object tracking method of some embodiments of the present disclosure.
Fig. 6 shows a schematic structural diagram of a training device of some embodiments of the present disclosure.
Fig. 7 illustrates a schematic structural diagram of an action recognition device according to some embodiments of the present disclosure.
Fig. 8 illustrates a schematic structural diagram of a behavior recognition device according to some embodiments of the present disclosure.
Fig. 9 illustrates a schematic structural diagram of an object tracking apparatus of some embodiments of the present disclosure.
Fig. 10 shows a schematic structural diagram of an electronic device of some embodiments of the present disclosure.
Fig. 11 shows a schematic structural diagram of an electronic device of further embodiments of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The present disclosure proposes an unsupervised training method for a visual feature extraction model for extracting video features, which is described below with reference to fig. 1 to 2.
Fig. 1 is a flow chart of some embodiments of the training method of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S108.
In step S102, for each sample video, a multi-frame image of the sample video is selected, image blocks are respectively extracted from the multi-frame image, and one of the extracted image blocks is used as a query image block.
A large number of sample videos constitute a training sample set. For each sample video, multiple frames of images, i.e., more than two frames, can be randomly selected. Data augmentation is performed on each frame image to extract image blocks. One of the extracted image blocks is used as a query (Query) image block, which serves as the reference for comparison in the subsequent contrast loss. The frame in which the query image block is located may be used as the anchor frame. For training it may be sufficient to extract only one image block from each frame other than the anchor frame, although multiple image blocks can also be extracted. One additional image block may be extracted from the anchor frame. Image blocks other than the query image block may be used as key-value (Key) image blocks.
In some embodiments, three frames (s1, s2, s3) may be extracted for each sample video. Another image block, different from the query image block x_q, is extracted from the anchor frame as the first key-value image block x_1, and one image block is extracted from each of the two other frames of the same sample video as the second key-value image block x_2 and the third key-value image block x_3.
Each image block is extracted through random data augmentation, i.e., each image block is randomly cropped at a random scale, and random color jittering, random grayscale conversion, random blurring, random mirroring and the like are applied. If multiple image blocks are extracted from one frame image, they are extracted with different augmentation settings, i.e., different random parameters are used during augmentation, for example different cropping positions and sizes for random cropping, different randomly drawn jitter amplitudes for color jittering, and so on.
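As an illustration of this sampling and augmentation step, the following sketch assumes PyTorch/torchvision, 224×224 image blocks and specific jitter/blur parameters; these concrete values and the helper name sample_blocks are illustrative assumptions rather than values specified by the disclosure.

```python
import random
from torchvision import transforms

# Illustrative augmentation pipeline: random crop at a random scale plus
# random color jitter, grayscale, blur and mirroring. The exact parameter
# values here are assumptions, not mandated by the disclosure.
patch_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def sample_blocks(frames):
    """frames: list of at least three PIL images from one sample video.

    Returns (x_q, x_1, x_2, x_3): query block and first key-value block from
    the anchor frame, plus one key-value block from each of two other frames.
    """
    # Pick three distinct frames; use the earliest one as the anchor frame
    # (the disclosure also allows the last frame as the anchor).
    idx = sorted(random.sample(range(len(frames)), 3))
    anchor, f2, f3 = frames[idx[0]], frames[idx[1]], frames[idx[2]]
    x_q = patch_augment(anchor)   # query image block
    x_1 = patch_augment(anchor)   # first key-value block (different augmentation)
    x_2 = patch_augment(f2)       # second key-value block
    x_3 = patch_augment(f3)       # third key-value block
    return x_q, x_1, x_2, x_3
```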
In step S104, each image block is input into the visual feature extraction model, and a code corresponding to each image block is obtained.
The visual feature extraction model may include a query encoder and a key-value encoder. For each sample video, the query image block x_q is input into the query encoder, and the key-value image blocks (e.g., x_1, x_2, x_3) are input into the key-value encoder. The query encoder is used to obtain the code corresponding to the query image block as the query code s_q; the key-value encoder is used to obtain the codes corresponding to the image blocks other than the query image block, i.e., the key-value codes of the key-value image blocks (e.g., s_k^1, s_k^2, s_k^3).
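A minimal sketch of this two-encoder structure, assuming a ResNet-50 backbone from torchvision with a linear projection head and L2-normalized 128-dimensional codes; the backbone, code dimension and initialization scheme are assumptions for illustration, not requirements of the disclosure.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class Encoder(nn.Module):
    """Backbone + projection head producing an L2-normalized code per image block."""
    def __init__(self, dim=128):
        super().__init__()
        backbone = resnet50(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()        # keep pooled features only
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, dim)

    def forward(self, x):                  # x: (B, 3, H, W)
        return F.normalize(self.proj(self.backbone(x)), dim=1)

# Query encoder f_q and key-value encoder f_k share the same architecture;
# f_k starts as a copy of f_q and is later updated by momentum, not by gradients.
f_q = Encoder()
f_k = Encoder()
f_k.load_state_dict(f_q.state_dict())
for p in f_k.parameters():
    p.requires_grad = False

# Codes for one batch of image blocks (x_q, x_1, x_2, x_3 from the sampling step):
# s_q = f_q(x_q); s_k1, s_k2, s_k3 = f_k(x_1), f_k(x_2), f_k(x_3)
```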
In step S106, a first contrast loss function is determined according to the similarity between the query code of each sample video and the codes corresponding to other image blocks in the same sample video, and the similarity between the query code of each sample video and the codes corresponding to image blocks in different sample videos.
The higher the similarity between the query code and the codes corresponding to other image blocks in the same sample video, and the lower the similarity between the query code and the codes corresponding to image blocks in different sample videos, the smaller the value of the first contrast loss function.
Based on the spatio-temporal coherence of video, an inter-frame instance discrimination task is set; this task checks the matching between the query code and the key-value codes at the video level. From a spatio-temporal perspective, the query code s_q is similar to all key-value codes in the same video (e.g., s_k^1, s_k^2, s_k^3) and different from the key-value codes of samples in other videos (e.g., denoted as s_k^{j-}). A determination method of the first contrast loss function is designed based on this inter-frame instance discrimination task.
In some embodiments, where the query code, the first key-value code, the second key-value code and the third key-value code have been obtained for each sample video, the inter-frame loss function corresponding to each sample video is determined according to the similarities between the query code and the first, second and third key-value codes corresponding respectively to the first, second and third key-value image blocks, and the similarities between the query code and each negative key-value code; the first contrast loss function is then determined according to the inter-frame loss function corresponding to each sample video.
Alternatively, for each sample video, a query image block may be extracted from one frame to obtain the query code, and a key-value image block may be extracted from another frame to obtain a key-value code. For each sample video, the inter-frame loss function corresponding to the sample video is then determined according to the similarity between the query code and that key-value code and the similarity between the query code and each negative key-value code. Each negative key-value code comprises the key-value codes corresponding to other sample videos. The number of frames and the number of image blocks extracted from a sample video can be set according to actual requirements; for the inter-frame instance discrimination task, the loss function is constructed following the construction principle of the first contrast loss function in this embodiment.
In some embodiments, the similarity between two codes may be measured by way of a dot product, not limited to the illustrated example. For example, determining similarity between the query code and the first, second and third key value codes according to dot products of the query code and the first, second and third key value codes, respectively; and determining the similarity between the query code and each negative key value code according to the dot product of the query code and each negative key value code.
For example, the query code corresponding to the anchor frame is s_q and the key-value code from the same frame is s_k^1, and the two key-value codes from other frames in the same video are s_k^2 and s_k^3. In the inter-frame instance discrimination task, the goal is to determine whether two image blocks come from the same video. All key-value codes s_k^1, s_k^2, s_k^3 in the same video can be taken as positive key-value codes, and the key-value codes corresponding to image blocks sampled from other videos, as negative samples, can be taken as the negative key-value codes s_k^{j-}. If the sample videos are divided into a number of batches (Batch) during training, with each batch containing a preset number of sample videos and training performed iteratively over multiple batches, image blocks sampled from other videos in adjacent batches can serve as the negative samples corresponding to the negative key-value codes s_k^{j-}; this is not limited to the illustrated example.
The query code s_q needs to be matched against multiple key-value codes (s_k^1, s_k^2, s_k^3). The inter-frame loss function corresponding to each sample video in this task can therefore be defined as the average of the contrast losses over all query/positive key-value pairs (s_q, s_k^i), expressed for example by the following formula:

$$\mathcal{L}_{inter} = -\frac{1}{3}\sum_{i=1}^{3}\log\frac{\exp(s_q\cdot s_k^{i}/\tau)}{\exp(s_q\cdot s_k^{i}/\tau)+\sum_{j=1}^{K}\exp(s_q\cdot s_k^{j-}/\tau)}$$

where s_q is the query code; 1 ≤ i ≤ 3 and i is a positive integer; s_k^1 is the first key-value code, s_k^2 the second key-value code, and s_k^3 the third key-value code; 1 ≤ j ≤ K, j is a positive integer, and K is the total number of negative key-value codes; s_k^{j-} is the j-th negative key-value code; and τ is a hyperparameter. The first contrast loss function may be determined by weighting or summing the inter-frame loss functions corresponding to the sample videos. By minimizing the value of the first contrast loss function, the visual feature extraction model learns to distinguish all positive key-value codes (s_k^1, s_k^2, s_k^3) in the same video as the query code s_q from all negative key-value codes s_k^{j-} of other videos. The inter-frame loss function corresponding to each sample video may also be defined as a weighted result of the contrast losses of the individual query/positive key-value pairs (s_q, s_k^i), and is not limited to the illustrated example.
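The following sketch shows one way to compute this inter-frame loss for a batch. It assumes L2-normalized codes so that the dot product measures similarity, a temperature of 0.07, and a memory queue holding the K negative key-value codes; the queue mechanism and the concrete values are assumptions, since the disclosure only requires negatives sampled from other videos (e.g., from adjacent batches).

```python
import torch
import torch.nn.functional as F

def inter_frame_loss(s_q, s_k1, s_k2, s_k3, neg_queue, tau=0.07):
    """Average InfoNCE-style contrast loss over the three positive pairs.

    s_q, s_k1, s_k2, s_k3: (B, D) normalized codes for one batch.
    neg_queue: (K, D) negative key-value codes from other videos.
    tau: temperature hyperparameter (0.07 is an assumed value).
    """
    l_neg = s_q @ neg_queue.t() / tau                        # (B, K) query vs. negatives
    loss = 0.0
    for s_ki in (s_k1, s_k2, s_k3):                          # the three positive key codes
        l_pos = (s_q * s_ki).sum(dim=1, keepdim=True) / tau  # (B, 1)
        logits = torch.cat([l_pos, l_neg], dim=1)            # positive sits at index 0
        labels = torch.zeros(s_q.size(0), dtype=torch.long, device=s_q.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / 3.0
```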
In step S108, parameters of the visual feature extraction model are adjusted according to the loss function of the visual feature extraction model, and the visual feature extraction model is trained.
The loss function of the visual feature extraction model includes a first comparative loss function. In some embodiments, the visual feature extraction model includes a query encoder and a key-value encoder. The query encoder and the key-value encoder may employ different parameter adjustment strategies. For example, in each iteration, the parameters of the current iteration of the query encoder are adjusted according to the loss function of the visual feature extraction model, and the parameters of the current iteration of the key value encoder are adjusted according to the parameters of the last iteration of the query encoder and the parameters of the last iteration of the key value encoder.
Further, the parameters (weights) of the query encoder can be adjusted and updated using SGD (stochastic gradient descent) by minimizing the value of the loss function of the visual feature extraction model. For the key-value encoder, its parameters can be adjusted and updated with a momentum update (Momentum Update) strategy based on the parameters of the query encoder. The momentum update strategy reduces the loss of feature consistency among different key-value codes that drastic changes of the key-value encoder would cause, while still keeping the key-value encoder continuously updated. The parameters of the key-value encoder may be updated according to the following formula.
$$\theta_{f_k}^{t} = \alpha\,\theta_{f_k}^{t-1} + (1-\alpha)\,\theta_{f_q}^{t-1}$$

where t is the iteration number, θ_{f_k}^t is the parameter of the key-value encoder f_k at the t-th iteration, θ_{f_k}^{t-1} is the parameter of the key-value encoder at the (t-1)-th iteration, θ_{f_q}^{t-1} is the parameter of the query encoder f_q at the (t-1)-th iteration, and α is the momentum coefficient.
The inter-frame instance discrimination task aims at learning the compatibility of query image blocks and key-value image blocks at the video level. In this task, the trained visual feature extraction model can not only distinguish the query image block of a frame in a video from image blocks in other videos (as negative or unmatched samples), but also recognize image blocks from other frames of the same video as positive or matched samples. Such a design goes beyond traditional still-image supervision and captures more positive image blocks within the same video. Through contrast learning, it provides a new idea for learning objects that evolve over time (e.g., new views/poses of an object). The method exploits the advantages of the spatio-temporal structure of video, thereby strengthening unsupervised visual feature learning for video understanding.
The method of this embodiment extracts image blocks from multiple frames of each sample video without labeling and encodes each image block with a visual feature extraction model, where the code corresponding to one image block is used as the query code; it determines a first contrast loss function from the similarity between the query code and the codes corresponding to other image blocks in the same sample video and the similarity between the query code and the codes corresponding to image blocks in different sample videos, and then adjusts the parameters of the visual feature extraction model according to the first contrast loss function to train it. The method of this embodiment omits the labeling process, improves training efficiency, and performs unsupervised training purely from the inherent structure and correlation of the data, so the visual feature extraction model can have good generalization capability. In addition, based on the spatio-temporal coherence of video, the method of this embodiment uses the relevance of multiple frames within the same sample video and the irrelevance of frames across different videos to construct the loss function for training, so the visual feature extraction model can learn the characteristics of videos well, and the trained model can extract video features more accurately.
In addition to having spatial and temporal consistency, the video also has characteristics of cross-frame variation and fixed sequence of frames, and in order to further improve the learning accuracy of the visual feature extraction model, the present disclosure also provides a further improvement of the foregoing training method, which is described below with reference to fig. 2.
FIG. 2 is a flow chart of further embodiments of the training method of the present disclosure. As shown in fig. 2, the method of this embodiment includes: steps S202 to S220.
In step S202, for each sample video, a multi-frame image of the sample video is selected, image blocks are respectively extracted from the multi-frame image, and one of the extracted image blocks is used as a query image block.
In step S204, each image block is input into the visual feature extraction model, and a code corresponding to each image block is obtained.
In step S206, a first contrast loss function is determined according to the similarity between the query encoding of each sample video and the encoding corresponding to other image blocks in the same sample video, and the similarity between the query encoding of each sample video and the encoding corresponding to image blocks in different sample videos.
In step S208, a second contrast loss function is determined according to the similarity between the query code of each sample video and the code corresponding to the first key-value image block, and the similarity between the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video.
The higher the similarity between the query code and the code corresponding to the first key-value image block, the lower the similarity between the query code and the codes corresponding to image blocks extracted from other frames in the same sample video, and the smaller the value of the second contrast loss function.
Based on the cross-frame variation characteristic of video, an intra-frame instance discrimination task is designed; this task determines, from a spatial perspective, whether two image blocks come from the same frame. The query code s_q is similar to the key-value code corresponding to the same frame (e.g., s_k^1), and does not match the key-value codes corresponding to other frames (s_k^2, s_k^3).
In some embodiments, where the query code, the first key-value code, the second key-value code and the third key-value code have been obtained for each sample video, the intra-frame loss function corresponding to each sample video is determined according to the similarity between the query code and the first key-value code corresponding to the first key-value image block, and the similarities between the query code and the second key-value code corresponding to the second key-value image block and the third key-value code corresponding to the third key-value image block; the second contrast loss function is then determined according to the intra-frame loss function corresponding to each sample video.
Alternatively, for each sample video, a query image block and a key-value image block (as the first key-value image block) may be extracted from one frame to obtain the query code and the first key-value code, and a key-value image block (as the second key-value image block) may be extracted from another frame to obtain the second key-value code. For each sample video, the intra-frame loss function corresponding to the sample video is then determined according to the similarity between the query code and the first key-value code and the similarity between the query code and the second key-value code. The intra-frame instance discrimination task requires at least one additional image block to be extracted from the frame where the query image block is located for comparison, and at least one image block to be extracted from at least one other frame of the same video. Beyond that, the number of frames extracted from the same video, the number of image blocks other than the query image block extracted from the same frame, and the number of image blocks extracted from other frames are not limited. For the intra-frame instance discrimination task, the loss function can be constructed following the construction principle of the second contrast loss function in the above embodiment.
In some embodiments, the similarity between two codes may be measured by way of a dot product, and is not limited to the illustrated example. For example, among the codes corresponding to the four image blocks sampled from one video (the query code s_q and the first key-value code s_k^1 corresponding to the same frame, and the two key-value codes s_k^2 and s_k^3 corresponding to the other two frames), s_k^1 is taken as the positive key-value code, and s_k^2 and s_k^3 are taken as negative key-value codes. Since the inter-frame instance discrimination task already utilizes key-value codes derived from other videos, for simplicity the key-value codes of other videos already used in that task are excluded from this contrast learning. Specifically, the intra-frame loss function corresponding to each sample video can be determined using the following formula.
$$\mathcal{L}_{intra} = -\log\frac{\exp(s_q\cdot s_k^{1}/\tau)}{\exp(s_q\cdot s_k^{1}/\tau)+\exp(s_q\cdot s_k^{2}/\tau)+\exp(s_q\cdot s_k^{3}/\tau)}$$

where s_q is the query code, s_k^1 is the first key-value code, s_k^2 is the second key-value code, s_k^3 is the third key-value code, and τ is a hyperparameter. The second contrast loss function may be determined by weighting or summing the intra-frame loss functions corresponding to the sample videos. The second contrast loss function is designed so that the query code s_q stays similar to the positive key-value code s_k^1 extracted from the same frame and remains different from the negative key-value codes s_k^2 and s_k^3 of other frames, thereby obtaining a temporally distinctive visual representation.
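Under the same assumptions as the inter-frame sketch (L2-normalized codes, dot-product similarity, an assumed temperature of 0.07), the intra-frame loss reduces to a three-way softmax in which only the key-value code from the anchor frame counts as the positive.

```python
import torch
import torch.nn.functional as F

def intra_frame_loss(s_q, s_k1, s_k2, s_k3, tau=0.07):
    """-log( exp(s_q.s_k1/tau) / sum_i exp(s_q.s_ki/tau) ), averaged over the batch."""
    logits = torch.stack([
        (s_q * s_k1).sum(dim=1),   # positive: key code from the anchor frame
        (s_q * s_k2).sum(dim=1),   # negative: key code from another frame
        (s_q * s_k3).sum(dim=1),   # negative: key code from another frame
    ], dim=1) / tau                # (B, 3)
    labels = torch.zeros(s_q.size(0), dtype=torch.long, device=s_q.device)
    return F.cross_entropy(logits, labels)
```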
In the inter-frame instance discrimination task, all image blocks sampled at the video level are grouped together as a common class without exploiting the inherent spatial variation between frames within the same video. In order to alleviate this problem, the above-mentioned intra-frame example discrimination task is proposed to distinguish image blocks of the same frame from image blocks of other frames in the video, and clearly display changes from a spatial perspective. In this way, unsupervised feature learning is further guided by spatial supervision between frames, with the expectation that the learned visual representation will be differentiated between frames in the video.
In step S210, for each sample video, combining the query code and codes corresponding to image blocks extracted from other frames in the same sample video into a sequence code according to a preset sequence.
The frame where the query image block is located may be used as the anchor frame; to make it easier to determine the order of the image blocks, the first frame or the last frame, in temporal order, of the multi-frame images extracted from the video may be selected as the anchor frame. In some embodiments, the sequence code is generated according to the query code, the second key value code corresponding to the second key value image block, and the third key value code corresponding to the third key value image block. The query code, the second key value code and the third key value code may be concatenated in that order; of course, the order may also be reversed, and is not limited to the illustrated example.
A time sequence verification task is designed based on the order among the multiple frames of a video; it learns the inherent sequential structure of the video by predicting the correct temporal order of a sequence of image blocks. Specifically, the query code s_q and the two key-value codes s_k^2 and s_k^3 are combined to form the sequence code. The first key-value code is no longer used here, because the query code and the first key-value code belong to the same frame and cannot be distinguished in order.
In step S212, the sequence code is input into the classification model, and a prediction time sequence of the query image block and image blocks extracted from other frames in the same sample video in the sample video is obtained.
In some embodiments, the sequence code is input into a binary classification model to obtain, as the prediction time sequence, a result indicating whether the query image block is before or after the second key-value image block and the third key-value image block. The binary model has two possible outputs: either the query image block is before the second and third key-value image blocks, or it is after them.
In step S214, a third loss function is determined according to the predicted time sequence corresponding to each sample video and the real time sequence of the image blocks extracted from the image blocks and other frames in the same sample video in the sample video.
In some embodiments, a cross entropy loss function corresponding to each sample video is determined according to the prediction time sequence and the real time sequence of the query image block, the second key value image block and the third key value image block in the sample video, and a third loss function is determined according to the cross entropy loss function corresponding to each sample video.
And designing a time sequence verification task from the view point of the sequence between video frames, and aiming at verifying whether a series of image blocks are in a correct time sequence. The underlying rationale behind this is to encourage visual feature extraction models to reason about the temporal order of image blocks, thus making use of the sequential structure of the video for unsupervised feature learning.
For example, three frames are randomly sampled from an unlabeled video, and the first or last frame in temporal order serves as the anchor frame. The query code s_q and the two key-value codes s_k^2 and s_k^3 are concatenated into an overall sequence representation, i.e., the sequence code, and input into a classifier g(·), which predicts whether the query code precedes or follows the key-value codes. The cross entropy loss function corresponding to each sample video can be determined using the following formula:

$$\mathcal{L}_{order} = -\big[\,y\log g([s_q;s_k^{2};s_k^{3}]) + (1-y)\log\big(1-g([s_q;s_k^{2};s_k^{3}])\big)\big]$$

where s_q is the query code, s_k^2 is the second key-value code, s_k^3 is the third key-value code, and y ∈ {0,1} indicates whether, in the real time sequence of the sample video, s_q is before or after the second key-value code s_k^2 and the third key-value code s_k^3. The third loss function may be determined by weighting or summing the cross entropy loss functions corresponding to the sample videos. By minimizing the value of the third loss function, the visual feature extraction model can distinguish the order of different frames.
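A sketch of this temporal order verification branch, assuming the classifier g(·) is a small MLP over the concatenated codes and that the label y = 1 means the query (anchor) frame comes first; the classifier architecture and the label convention are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderClassifier(nn.Module):
    """Binary classifier g([s_q; s_k2; s_k3]) -> logit that s_q comes first."""
    def __init__(self, dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, s_q, s_k2, s_k3):
        seq_code = torch.cat([s_q, s_k2, s_k3], dim=1)   # sequence code
        return self.net(seq_code).squeeze(1)             # logits, shape (B,)

def order_loss(classifier, s_q, s_k2, s_k3, y):
    """Cross entropy between predicted and real temporal order (y in {0, 1})."""
    logits = classifier(s_q, s_k2, s_k3)
    return F.binary_cross_entropy_with_logits(logits, y.float())
```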
The steps S206, S208, and S210 to S214 may be executed in parallel, and S208 and S210 to S214 are optional steps.
In step S216, parameters of the visual feature extraction model are adjusted according to the first and second contrast loss functions, and the visual feature extraction model is trained.
For example, the loss function of the visual feature extraction model is a weighted result of the first and second contrast loss functions.
In step S218, parameters of the visual feature extraction model are adjusted based on the first comparison loss function and the third loss function, and the visual feature extraction model is trained.
For example, the loss function of the visual feature extraction model is a weighted result of the first comparative loss function and the third loss function.
In step S220, parameters of the visual feature extraction model are adjusted according to the first contrast loss function, the second contrast loss function and the third loss function, and the visual feature extraction model is trained.
For example, the loss function of the visual feature extraction model is a weighted result of the first, second, and third loss functions. For example, the loss function of the visual feature extraction model may be determined using the following formula.
$$\mathcal{L} = \lambda_1\,\mathcal{L}_{inter} + \lambda_2\,\mathcal{L}_{intra} + \lambda_3\,\mathcal{L}_{order}$$

where $\mathcal{L}_{inter}$, $\mathcal{L}_{intra}$ and $\mathcal{L}_{order}$ are the first contrast loss function, the second contrast loss function and the third loss function, respectively, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are their weights.
How to update the parameters of the visual feature extraction model is described in the foregoing embodiments and is not repeated here. The inter-frame instance discrimination task, the intra-frame instance discrimination task and the temporal order verification task can be combined to train the visual feature extraction model; when all three tasks are applied, the accuracy of the visual feature extraction model is highest and the effect is best. In addition, because the inherent characteristics of the video are exploited during training, the visual feature extraction model has good generalization capability.
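The following is an illustrative, non-limiting sketch of how the weighted combination might be expressed; the weight values and argument names are assumptions, since this disclosure only states that the total loss is a weighted result of the three loss functions.

```python
def total_loss(loss_inter, loss_intra, loss_order,
               lambda_inter=1.0, lambda_intra=1.0, lambda_order=1.0):
    """Weighted combination of the three task losses; weights are illustrative."""
    return (lambda_inter * loss_inter      # first contrast loss (inter-frame task)
            + lambda_intra * loss_intra    # second contrast loss (intra-frame task)
            + lambda_order * loss_order)   # third loss (temporal order verification)

# In a training loop: total_loss(...).backward() and optimizer.step() update the
# query encoder; the key value encoder is updated by the separate rule described later.
```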
As can be seen from the above embodiments, the inter-frame instance discrimination task, the intra-frame instance discrimination task and the temporal order verification task may use different sampling methods. If the tasks are to be applied in combination, their sampling manners need to be unified. For example, in the above embodiments, three frames are sampled from each video, the first or last frame is used as the anchor frame, the query image block and the first key value image block are extracted from the anchor frame, and the second key value image block and the third key value image block are extracted from the two other frames respectively; a sketch of such a unified sampling step is given below. The sampling manner is not limited to the illustrated example, as long as the determination policy of each loss function is satisfied.
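The following is an illustrative, non-limiting sketch of a unified sampling step; the function name, the random choice of anchor position and the returned label convention are assumptions for illustration.

```python
import random

def sample_frames_for_all_tasks(num_frames: int):
    """Pick three frame indices (num_frames >= 3) so that one sampling pass
    serves the inter-frame, intra-frame and temporal order verification tasks."""
    idx = sorted(random.sample(range(num_frames), 3))
    anchor_first = random.random() < 0.5          # anchor is the first or last sampled frame
    anchor = idx[0] if anchor_first else idx[-1]
    others = [i for i in idx if i != anchor]      # frames for the 2nd/3rd key value blocks
    y = 1 if anchor_first else 0                  # real temporal order label for the 3rd task
    # the query block and the first key value block are two crops of `anchor`
    return anchor, others, y
```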
The trained visual feature extraction model can be used to extract features of a video. In some embodiments, a third preset number of frames is extracted from the video, and the encoding of each frame of image is determined by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments.
Optionally, the method may further include determining the features of the video according to the encoding of each frame of image. For example, the average of the per-frame encodings may be used as the feature of the video, or the per-frame encodings may be used directly as the features of the video; the present disclosure is not limited to the illustrated examples.
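The following is an illustrative, non-limiting sketch of this feature extraction step, assuming the trained query encoder is used and frames are already preprocessed; names and tensor layout are assumptions.

```python
import torch

@torch.no_grad()
def extract_video_feature(query_encoder, frames):
    """frames: (T, C, H, W) tensor of T preprocessed frames from one video."""
    query_encoder.eval()
    codes = query_encoder(frames)     # (T, dim) per-frame encodings
    return codes.mean(dim=0)          # (dim,) averaged video-level feature
```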
In the above embodiments, an inter-frame instance discrimination task, an intra-frame instance discrimination task and a temporal order verification task are designed, and the visual feature extraction model is trained based on at least one of the characteristics of temporal-spatial continuity, cross-frame variability and inter-frame order, so that the visual feature extraction model can learn the most representative features in the video. For example, the inter-frame instance discrimination task requires that image blocks from different frames of the same video be similar while image blocks from frames of different videos be dissimilar, so the visual feature extraction model can learn the main features of the subject (target) in each video. For instance, given a video of a person cycling and other videos (a person walking, skating, etc.), the visual feature extraction model learns through training to distinguish the contents of the different videos and thereby extracts the features that best express each video.
For another example, the intra-frame instance discrimination task requires that image blocks within the same frame be similar while image blocks from different frames be dissimilar, so the visual feature extraction model can learn the detailed, frame-level variations of the subject (target); these detail features further improve the accuracy of the extracted features on top of the inter-frame instance discrimination task. For yet another example, the temporal order verification task requires the order between frames to be predicted correctly, so the visual feature extraction model can learn how the features of the subject (target) change over time, which enriches the learned features on top of the other two tasks and makes the extracted features more accurate. By applying the three tasks together, the visual feature extraction model can accurately capture what the whole video is intended to express, regardless of its content. On this basis, the visual feature extraction model can deliver very good performance when combined with any downstream task (such as action recognition, behavior recognition, object tracking, etc.).
Some embodiments of how the trained visual feature extraction model according to the previous embodiments can be applied are described below with reference to fig. 3-5.
Fig. 3 is a flow diagram of some embodiments of an Action Recognition (Action Recognition) method of the present disclosure. As shown in fig. 3, the method of this embodiment includes: steps S302 to S306.
In step S302, a first preset number of frames of the video to be recognized are extracted.
For example, 30 or 50 frames may be extracted from the video to be recognized. Image blocks may also be extracted from each frame in a fixed manner, for example, each frame is resized to a preset size and an image block of preset height and width is cropped from the center.
In step S304, the coding of each frame of image is determined using the pre-trained visual feature extraction model.
The visual feature extraction model comprises a query encoder and a key value encoder. During training the outputs of the two encoders need to be compared, whereas when the visual feature extraction model is used for inference only the query encoder is needed to encode each frame of image (or each image block).
In step S306, the encoding of each frame of image is input into the action classification model to obtain the action type in the video to be recognized.
The encodings of the frames may be averaged and then input into the action classification model. The action recognition model may be a combination of the visual feature extraction model and the action classification model, where the action classification model may be a simple linear model; the present disclosure is not limited to the illustrated example.
The visual feature extraction model can accurately extract the features of the video through the training of the method of the embodiment, so that the accuracy of final action recognition is improved.
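The following is an illustrative, non-limiting sketch of this recognition pipeline, assuming a frozen query encoder and a linear classifier on averaged frame encodings; all names are assumptions. The behavior recognition flow of fig. 4 follows the same pattern with a behavior classification model in place of the action classification model.

```python
import torch

@torch.no_grad()
def recognize_action(query_encoder, action_classifier, frames):
    """frames: (T, C, H, W) tensor of resized, center-cropped frames."""
    codes = query_encoder(frames)                    # (T, dim) frame encodings
    video_code = codes.mean(dim=0, keepdim=True)     # (1, dim) averaged encoding
    logits = action_classifier(video_code)           # linear classification head
    return logits.argmax(dim=-1).item()              # predicted action class index
```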
FIG. 4 is a flow diagram of some embodiments of an Activity Recognition method of the present disclosure. As shown in fig. 4, the method of this embodiment includes: steps S402 to S406.
In step S402, a second preset number of frames are extracted from the video to be recognized.
For example, 30 or 50 frames may be extracted from the video to be recognized. Image blocks may also be extracted from each frame in a fixed manner, for example, each frame is resized to a preset size and an image block of preset height and width is cropped from the center.
In step S404, the coding of each frame of image is determined using the pre-trained visual feature extraction model.
The visual feature extraction model comprises a query encoder and a key value encoder. During training the outputs of the two encoders need to be compared, whereas when the visual feature extraction model is used for inference only the query encoder is needed to encode each frame of image (or each image block).
In step S406, the codes of the images of each frame are input into the behavior classification model, so as to obtain the behavior type in the video to be identified.
The encodings of the frames may be averaged and then input into the behavior classification model. The behavior recognition model may be a combination of the visual feature extraction model and the behavior classification model, where the behavior classification model may be a simple linear model; the present disclosure is not limited to the illustrated example.
The visual feature extraction model can accurately extract the features of the video through the training of the method of the embodiment, so that the accuracy of final behavior recognition is improved.
Since the methods and models of motion recognition and behavior recognition are similar, the two methods are described in the same application example.
Some embodiments of the visual feature extraction model are first described. The visual feature extraction model comprises a query encoder and a key value encoder, and the two encoders may adopt similar neural network structures. For example, both encoders adopt a ResNet50 (residual network with 50 layers) + MLP (multilayer perceptron) structure; a global pooling layer may further be added between ResNet50 and the MLP. The MLP only affects the training process and does not participate in downstream tasks. During training, the MLP is followed by the discrimination network structures of the three tasks of the foregoing embodiments, namely the inter-frame instance discrimination task, the intra-frame instance discrimination task and the temporal order verification task. When the visual feature extraction model is used as the feature extraction part of the action recognition model or the behavior recognition model, only the ResNet50 + MLP structure may be applied. The visual feature extraction model can be pre-trained with a training set containing various types of sample videos, so that it learns the features of various types of videos without any annotation during training.
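The following is an illustrative, non-limiting sketch of such an encoder using torchvision; the two-layer MLP projection head and the output dimension are assumptions, and torchvision's resnet50 already contains the global average pooling layer.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_encoder(feature_dim: int = 128) -> nn.Module:
    """ResNet50 backbone + global average pooling + MLP projection head."""
    encoder = resnet50(weights=None)              # randomly initialized backbone
    hidden_dim = encoder.fc.in_features           # 2048 features after global pooling
    encoder.fc = nn.Sequential(                   # replace the classifier with an MLP head
        nn.Linear(hidden_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, feature_dim),
    )
    return encoder

# Both the query encoder and the key value encoder may be built with this structure.
```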
Further, the classification parts of the action recognition model and the behavior recognition model, i.e., the action classification model and the behavior classification model, may employ a linear model, for example an SVM (support vector machine). The overall structure of the action recognition model and the behavior recognition model may then be ResNet50 + MLP + SVM. The visual feature extraction model may be pre-trained according to the method of the foregoing embodiments and then combined with such linear models to obtain the action recognition model and the behavior recognition model.
The action classification model and the behavior classification model need to be trained with their own training sets so that the complete models can perform action recognition or behavior recognition. The action classification model may be trained with an action-class training set such as the Kinetics400 data set, and the behavior classification model may be trained with a behavior-class training set such as the ActivityNet data set, without being limited to the illustrated examples. In this process the visual feature extraction model does not need to be trained again, and the training sets of the action classification model and the behavior classification model can be far smaller than the training set of the visual feature extraction model, which greatly reduces the annotation effort and improves efficiency. Taking the action classification model as an example, during its training a preset number of frames is extracted from each sample video, image blocks are extracted in a preset manner (the whole frame may be used as the image block, depending on the training requirements of the specific action classification model), the image blocks are input into the visual feature extraction model to obtain their encodings, the encodings are averaged and input into the action classification model to obtain a classification result, a loss function is determined from the classification result and the annotated action type, and the parameters of the action classification model are adjusted according to the loss function until a convergence condition is reached, completing the training; a sketch is given after this paragraph. The specific loss function determination and parameter adjustment methods may follow the prior art and are not described again here.
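The following is an illustrative, non-limiting sketch of training such a linear classifier on frozen features with scikit-learn's LinearSVC; the stand-in data, hyperparameters and variable names are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for averaged per-video encodings from the frozen query encoder
# and their annotated action labels (shapes and class count are illustrative).
train_features = np.random.randn(1000, 128)
train_labels = np.random.randint(0, 400, size=1000)    # e.g. 400 Kinetics classes

clf = LinearSVC(C=1.0, max_iter=10000)                 # illustrative hyperparameters
clf.fit(train_features, train_labels)                  # only the SVM is trained here

test_features = np.random.randn(100, 128)
predicted_classes = clf.predict(test_features)         # predicted action classes
```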
The training of the action classification model and the behavior classification model is relatively simple. After being trained once, the visual feature extraction model can be combined with various downstream tasks without being retrained for each task, which improves efficiency when there are multiple applications.
FIG. 5 is a flow diagram of some embodiments of an Object Tracking (Object Tracking) method of the present disclosure. As shown in fig. 5, the method of this embodiment includes: steps S502 to S504.
In step S502, the pre-trained visual feature extraction model is used to determine the code of each frame of image of the video to be recognized.
For object tracking, the position information of the target is marked in the first frame image, for example, the position of the bounding box of the target. Each frame image may be preprocessed, for example by adjusting its spatial resolution to a preset resolution, and then input into the visual feature extraction model.
In step S504, the code of each frame image is input into the object tracking model to obtain the position information of the target in each frame image.
Object tracking may be based on SiamFC (a target tracking algorithm based on a fully convolutional Siamese network). To adapt to the SiamFC algorithm and evaluate the effect of the visual feature extraction model more accurately, the ResNet50 + MLP of the previous embodiments can be used as the encoder of the visual feature extraction model, and the query encoder is used to determine the encoding of each frame of image. A 1x1 convolution is added after the query encoder of the visual feature extraction model; during training, learning of the tracking features is completed by optimizing only the parameters of the 1x1 convolution. The query encoder followed by the 1x1 convolution can serve as the feature extraction part of the SiamFC algorithm. Meanwhile, the configuration of ResNet50 can be modified to better suit the SiamFC algorithm: the stride-2 convolutions in {res4, res5} of ResNet50 are changed to stride 1, and the dilation rates of the 3x3 convolutions in res4 and res5 are changed from 1 to 2 and 4, respectively. The query encoder and the 1x1 convolution are used to transform the first frame image and the other frame images, and the transformed code of the first frame image together with the transformed codes of the other frame images are then input into the object tracking part (i.e., the object tracking model) of the SiamFC algorithm. For the specific SiamFC algorithm and the training method of its object tracking part, reference may be made to the prior art, and details are not described here.
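The following is an illustrative, non-limiting sketch of these backbone modifications in torchvision; its `replace_stride_with_dilation` option applies the stride-1 / dilation-2 and dilation-4 changes to res4 (layer3) and res5 (layer4). The output channels of the 1x1 convolution, the state-dict handling and all names are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_tracking_backbone(query_encoder_state=None) -> nn.Module:
    """ResNet50 adapted for SiamFC-style tracking plus a trainable 1x1 conv."""
    # layer3 (res4) and layer4 (res5): stride 2 -> 1, dilation -> 2 and 4
    net = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
    if query_encoder_state is not None:
        net.load_state_dict(query_encoder_state, strict=False)  # reuse pre-trained weights
    backbone = nn.Sequential(*list(net.children())[:-2])        # drop avgpool and fc
    head = nn.Conv2d(2048, 256, kernel_size=1)                  # only this layer is trained
    return nn.Sequential(backbone, head)
```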
The visual feature extraction model can accurately extract the features of the video through the training of the method of the embodiment, so that the accuracy of final object tracking is improved.
The inventors carried out comparison experiments between a visual feature extraction model trained with the training method of the present disclosure and visual feature extraction models trained with several existing training methods; the accuracy is improved in a variety of downstream task scenarios.
The present disclosure also provides a training device, described below in conjunction with fig. 6.
FIG. 6 is a block diagram of some embodiments of a training device of the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes: an extraction module 610, an encoding module 620, a loss function determination module 630, and a parameter adjustment module 640.
The extracting module 610 is configured to, for each sample video, select a multi-frame image of the sample video, and extract image blocks from the multi-frame image respectively, and use one of the extracted image blocks as a query image block.
The encoding module 620 is configured to input each image block into the visual feature extraction model, and obtain a code corresponding to each image block, where the code corresponding to the query image block is used as a query code.
The loss function determining module 630 is configured to determine a first contrast loss function according to a similarity between the query encoding of each sample video and the encoding corresponding to the other image blocks in the same sample video and a similarity between the query encoding of each sample video and the encoding corresponding to the image blocks in the different sample videos, wherein the higher the similarity between the query encoding and the encoding corresponding to the other image blocks in the same sample video, the lower the similarity between the query encoding and the encoding corresponding to the image blocks in the different sample videos, the smaller the value of the first contrast loss function.
In some embodiments, a frame in which the query image block is located serves as an anchor frame, the extracted image blocks further include another image block extracted from the anchor frame and different from the query image block, which serves as a first key value image block, and one image block is respectively extracted from two other frames of the same sample video to serve as a second key value image block and a third key value image block. The loss function determining module 630 is configured to determine, for each sample video, an inter-frame loss function corresponding to the sample video according to the similarity between the query code and the first key value code corresponding to the first key value image block, the similarity between the query code and the second key value code corresponding to the second key value image block, the similarity between the query code and the third key value code corresponding to the third key value image block, and the similarity between the query code and each negative key value code, wherein the negative key value codes include the first key value codes, the second key value codes and the third key value codes corresponding to other sample videos; and to determine the first contrast loss function according to the inter-frame loss functions corresponding to the respective sample videos.
In some embodiments, the similarity of the query code and the first, second and third key value codes is determined according to the dot products of the query code and the first, second and third key value codes, respectively; and determining the similarity between the query code and each negative key value code according to the dot product of the query code and each negative key value code.
In some embodiments, the inter-frame loss function for each sample video is determined using the following formula:
$$\mathcal{L}_{inter} = -\sum_{i=1}^{3} \log \frac{\exp\!\big(s_q \cdot s_{k_i}/\tau\big)}{\exp\!\big(s_q \cdot s_{k_i}/\tau\big) + \sum_{j=1}^{K} \exp\!\big(s_q \cdot s_{n_j}/\tau\big)}$$

where $s_q$ is the query code; $s_{k_i}$ ($1 \le i \le 3$, $i$ a positive integer) denotes the first, second and third key value codes $s_{k_1}$, $s_{k_2}$ and $s_{k_3}$; $s_{n_j}$ ($1 \le j \le K$, $j$ a positive integer, $K$ the total number of negative key value codes) is the $j$-th negative key value code; and $\tau$ is a hyperparameter.
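The following is an illustrative, non-limiting PyTorch sketch of such an inter-frame loss, assuming L2-normalized codes, a queue of negative key value codes and batch-first tensors; shapes, names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def inter_frame_loss(s_q, s_k1, s_k2, s_k3, negatives, tau=0.07):
    """s_q, s_k1..s_k3: (B, dim) codes from one video; negatives: (K, dim)."""
    l_neg = s_q @ negatives.t() / tau                         # (B, K) query-negative similarities
    labels = torch.zeros(s_q.size(0), dtype=torch.long, device=s_q.device)
    loss = 0.0
    for s_k in (s_k1, s_k2, s_k3):                            # sum over the three positives
        l_pos = (s_q * s_k).sum(dim=1, keepdim=True) / tau    # (B, 1) query-positive similarity
        logits = torch.cat([l_pos, l_neg], dim=1)             # the positive sits at index 0
        loss = loss + F.cross_entropy(logits, labels)         # -log softmax of the positive
    return loss
```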
In some embodiments, a frame in which the query image block is located serves as an anchor frame, and the extracted image blocks further include another image block, which is different from the query image block and extracted from the anchor frame, as the first key value image block. The loss function determining module 630 is further configured to determine a second contrast loss function according to a similarity between the query code of each sample video and the code corresponding to the first key value image block and a similarity between the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video, wherein the higher the similarity between the query code and the code corresponding to the first key value image block, and the lower the similarity between the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video, the smaller the value of the second contrast loss function. The loss function of the visual feature extraction model further includes the second contrast loss function.
In some embodiments, extracting the image blocks of the other frames in the same sample video includes extracting one image block from two other frames of the same sample video respectively as the second key-value image block and the third key-value image block corresponding to the sample video. The loss function determining module 630 is configured to determine, for each sample video, an intra-frame loss function corresponding to the sample video according to the similarity between the query code and the first key value code corresponding to the first key value image block, and the similarity between the query code and the second key value code corresponding to the second key value image block and the third key value code corresponding to the third key value image block, respectively; and determining a second contrast loss function according to the intra-frame loss function corresponding to each sample video.
In some embodiments, the intra-frame loss function corresponding to each sample video is determined using the following formula:
$$\mathcal{L}_{intra} = -\log \frac{\exp\!\big(s_q \cdot s_{k_1}/\tau\big)}{\exp\!\big(s_q \cdot s_{k_1}/\tau\big) + \exp\!\big(s_q \cdot s_{k_2}/\tau\big) + \exp\!\big(s_q \cdot s_{k_3}/\tau\big)}$$

where $s_q$ is the query code, $s_{k_1}$ is the first key value code, $s_{k_2}$ is the second key value code, $s_{k_3}$ is the third key value code, and $\tau$ is a hyperparameter.
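The following is a matching illustrative sketch of the intra-frame loss under the same assumptions as the inter-frame sketch above; names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def intra_frame_loss(s_q, s_k1, s_k2, s_k3, tau=0.07):
    """The same-frame code s_k1 is the positive; s_k2 and s_k3 act as negatives."""
    logits = torch.stack([
        (s_q * s_k1).sum(dim=1),   # positive: image block from the anchor frame
        (s_q * s_k2).sum(dim=1),   # negative: image block from another frame
        (s_q * s_k3).sum(dim=1),   # negative: image block from another frame
    ], dim=1) / tau                # (B, 3) similarity logits
    labels = torch.zeros(s_q.size(0), dtype=torch.long, device=s_q.device)
    return F.cross_entropy(logits, labels)
```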
In some embodiments, the anchor frame is the first or last frame, in temporal order, of the multi-frame images. The loss function determining module 630 is further configured to combine, for each sample video, the query code and the codes corresponding to the image blocks extracted from other frames of the same sample video into a sequence code in a preset order; to input the sequence code into a classification model to obtain the predicted temporal order, in the sample video, of the query image block and the image blocks extracted from other frames of the same sample video; and to determine the third loss function according to the predicted temporal order corresponding to each sample video and the real temporal order, in the sample video, of the query image block and the image blocks extracted from other frames of the same sample video. The loss function of the visual feature extraction model further includes the third loss function.
In some embodiments, the extracted image blocks further include another image block extracted from the anchor frame and different from the query image block, which serves as the first key value image block, and one image block is respectively extracted from two other frames of the same sample video to serve as the second key value image block and the third key value image block. The loss function determining module 630 is configured to generate the sequence code according to the order of the query code, the second key value code corresponding to the second key value image block, and the third key value code corresponding to the third key value image block; to input the sequence code into a binary classification model to obtain a result indicating whether the query image block is before or after the second key value image block and the third key value image block, as the predicted temporal order; and to determine a cross entropy loss function corresponding to each sample video according to the predicted temporal order and the real temporal order, in the sample video, of the query image block, the second key value image block and the third key value image block, and determine the third loss function according to the cross entropy loss functions corresponding to the respective sample videos.
In some embodiments, the cross entropy loss function corresponding to each sample video is determined using the following formula:
$$\mathcal{L}_{order} = -\Big(y\,\log g\big([s_q;\,s_{k_2};\,s_{k_3}]\big) + (1-y)\,\log\big(1 - g([s_q;\,s_{k_2};\,s_{k_3}])\big)\Big)$$

where $s_q$ is the query code, $s_{k_2}$ is the second key value code, $s_{k_3}$ is the third key value code, and $y \in \{0,1\}$ indicates whether, in the real temporal order of the sample video, the query code $s_q$ comes before or after the second key value code $s_{k_2}$ and the third key value code $s_{k_3}$.
In some embodiments, the loss function of the visual feature extraction model is a weighted result of the first, second, and third loss functions.
The parameter adjusting module 640 is configured to adjust parameters of the visual feature extraction model according to a loss function of the visual feature extraction model, and train the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises a first comparative loss function.
In some embodiments, the visual feature extraction model includes a query encoder for obtaining a query encoding and a key-value encoder for obtaining encodings corresponding to image blocks other than the query image block. The parameter adjusting module 640 is configured to, in each iteration, adjust a parameter of the current iteration of the query encoder according to the loss function of the visual feature extraction model, and adjust a parameter of the current iteration of the key value encoder according to a parameter of the last iteration of the query encoder and a parameter of the last iteration of the key value encoder.
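The following is an illustrative, non-limiting sketch of such a parameter update in the style of a momentum (moving-average) rule, which matches the description that the key value encoder's current parameters are derived from the previous parameters of both encoders; the momentum value is an assumption.

```python
import torch

@torch.no_grad()
def update_key_encoder(query_encoder, key_encoder, m=0.999):
    """key_params <- m * key_params + (1 - m) * query_params."""
    for q_param, k_param in zip(query_encoder.parameters(),
                                key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

# Typical use in the training loop: after optimizer.step() has updated the
# query encoder, call update_key_encoder(query_encoder, key_encoder).
```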
The present disclosure also provides a motion recognition apparatus, described below in conjunction with fig. 7.
Fig. 7 is a block diagram of some embodiments of the motion recognition device of the present disclosure. As shown in fig. 7, the apparatus 70 of this embodiment includes: an extraction module 710, an encoding module 720, and an action classification module 730.
The extraction module 710 is configured to extract a first preset number of frames from the video to be recognized.
The encoding module 720 is configured to determine the encoding of each frame of image by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments.
The motion classification module 730 is configured to input the coding of each frame of image into the motion classification model, and obtain the motion type in the video to be recognized.
The present disclosure also provides a behavior recognition apparatus, which is described below with reference to fig. 8.
Fig. 8 is a block diagram of some embodiments of a behavior recognition device of the present disclosure. As shown in fig. 8, the apparatus 80 of this embodiment includes: an extraction module 810, an encoding module 820, and a behavior classification module 830.
The extraction module 810 is configured to extract a second preset number of frames from the video to be recognized.
The encoding module 820 is configured to determine the encoding of each frame of image by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments.
The behavior classification module 830 is configured to input the codes of the frames of images into the behavior classification model, so as to obtain the behavior types in the video to be recognized.
The present disclosure also provides an object tracking apparatus, described below in conjunction with fig. 9.
FIG. 9 is a block diagram of some embodiments of an object tracking device of the present disclosure. As shown in fig. 9, the apparatus 90 of this embodiment includes: an encoding module 910 and an object tracking module 920.
The encoding module 910 is configured to determine, by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments, an encoding of each frame of image of the video to be recognized, where position information of the target is marked in a first frame of image of the video to be recognized.
The object tracking module 920 is configured to input the code of each frame of image into the object tracking model, and obtain the position information of the target in each frame of image.
The present disclosure also provides a video feature extraction device, including: an extraction module configured to extract the video by a third preset number of frames; and the coding module is configured to determine the coding of each frame of image by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments. Optionally, the apparatus may further include a feature determination module configured to determine a feature of the video according to the encoding of each frame of image.
The electronic devices in the embodiments of the present disclosure may each be implemented by various computing devices or computer systems, which are described below in conjunction with fig. 10 and 11.
Fig. 10 is a block diagram of some embodiments of an electronic device of the present disclosure. As shown in fig. 10, the electronic apparatus 100 of this embodiment includes: a memory 1010 and a processor 1020 coupled to the memory 1010, the processor 1020 being configured to execute the training method, the action recognition method, the behavior recognition method, the object tracking method, or the feature extraction method of a video in any of the embodiments of the present disclosure based on instructions stored in the memory 1010.
Memory 1010 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), a database, and other programs.
Fig. 11 is a block diagram of further embodiments of an electronic device of the present disclosure. As shown in fig. 11, the electronic apparatus 110 of this embodiment includes: the memory 1110 and the processor 1120 are similar to the memory 1010 and the processor 1020, respectively. Input-output interfaces 1130, network interfaces 1140, storage interfaces 1150, etc. may also be included. These interfaces 1130, 1140, 1150 and the memory 1110 and the processor 1120 may be connected via a bus 1160, for example. The input/output interface 1130 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 1140 provides a connection interface for various networked devices, such as may connect to a database server or a cloud storage server, etc. The storage interface 1150 provides a connection interface for external storage devices such as an SD card and a usb disk.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (23)

1. A method of training, comprising:
selecting a plurality of frames of images of each sample video, respectively extracting image blocks from the plurality of frames of images, and taking one of the extracted image blocks as a query image block;
inputting each image block into a visual feature extraction model to obtain a code corresponding to each image block, wherein the code corresponding to the query image block is used as a query code;
determining a first contrast loss function according to the similarity between the query code of each sample video and codes corresponding to other image blocks in the same sample video and the similarity between the query code of each sample video and codes corresponding to image blocks in different sample videos, wherein the higher the similarity between the query code and the codes corresponding to other image blocks in the same sample video is, the lower the similarity between the query code and the codes corresponding to image blocks in different sample videos is, the smaller the value of the first contrast loss function is;
and adjusting parameters of the visual feature extraction model according to a loss function of the visual feature extraction model, and training the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises the first comparison loss function.
2. The training method according to claim 1, wherein a frame in which the query image block is located serves as an anchor frame, the extracted image blocks further include another image block extracted from the anchor frame, which is different from the query image block, as a first key value image block, and the method further includes:
determining a second contrast loss function according to the similarity between the query code of each sample video and the code corresponding to the first key value image block and the similarity between the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video, wherein the higher the similarity between the query code and the code corresponding to the first key value image block is, the lower the similarity between the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video is, and the smaller the value of the second contrast loss function is;
wherein the loss function of the visual feature extraction model further comprises a second comparative loss function.
3. The training method according to claim 1 or 2, wherein a frame in which the query image block is located is used as an anchor frame, and the anchor frame is a first frame or a last frame of the multi-frame image arranged in time sequence, and the method further comprises:
aiming at each sample video, combining the query codes and codes corresponding to image blocks extracted from other frames in the same sample video into sequence codes according to a preset sequence;
inputting the sequence codes into a classification model to obtain the predicted time sequence of the query image block and image blocks extracted from other frames in the same sample video in the sample video;
determining a third loss function according to the predicted time sequence corresponding to each sample video and the real time sequence of the image blocks extracted from the query image block and other frames in the same sample video in the sample video;
wherein the loss function of the visual feature extraction model further comprises the third loss function.
4. The training method of claim 1, wherein the visual feature extraction model comprises a query encoder and a key-value encoder, the query encoder is configured to obtain the query encoding, and the key-value encoder is configured to obtain encodings corresponding to image blocks other than the query image block;
the adjusting the parameters of the visual feature extraction model according to the loss function of the visual feature extraction model comprises:
and in each iteration, adjusting the parameters of the current iteration of the query encoder according to the loss function of the visual feature extraction model, and adjusting the parameters of the current iteration of the key value encoder according to the parameters of the last iteration of the query encoder and the parameters of the last iteration of the key value encoder.
5. The training method according to claim 1, wherein a frame in which the query image block is located serves as an anchor frame, the extracted image blocks further include another image block extracted from the anchor frame and different from the query image block, the another image block is used as a first key value image block, and one image block is extracted from two other frames of the same sample video respectively and serves as a second key value image block and a third key value image block;
determining a first contrast loss function according to the similarity between the query code of each sample video and the codes corresponding to other image blocks in the same sample video and the similarity between the query code of each sample video and the codes corresponding to image blocks in different sample videos includes:
for each sample video, determining an inter-frame loss function corresponding to the sample video according to the similarity between the query code and a first key value code corresponding to the first key value image block, the similarity between the query code and a second key value code corresponding to the second key value image block, the similarity between the query code and a third key value code corresponding to the third key value image block, and the similarity between the query code and each negative key value code, wherein each negative key value code comprises the first key value code, the second key value code and the third key value code corresponding to other sample videos;
and determining a first contrast loss function according to the interframe loss function corresponding to each sample video.
6. The training method according to claim 2, wherein extracting image blocks from other frames in the same sample video comprises extracting one image block from two other frames in the same sample video respectively as a second key-value image block and a third key-value image block corresponding to the sample video;
determining a second contrast loss function according to the similarity between the query code of each sample video and the code corresponding to the first key-value image block and the similarity between the query code and the codes corresponding to image blocks extracted from other frames in the same sample video includes:
for each sample video, determining an intra-frame loss function corresponding to the sample video according to the similarity between the query code and a first key value code corresponding to a first key value image block, and the similarity between the query code and a second key value code corresponding to a second key value image block and a third key value code corresponding to a third key value image block;
and determining the second contrast loss function according to the intra-frame loss function corresponding to each sample video.
7. The training method of claim 3, wherein the extracted image blocks further include another image block extracted from the anchor frame that is different from the query image block, as a first key value image block, and one image block is extracted from each of two other frames of the same sample video, as a second key value image block and a third key value image block, respectively;
the combining the query code and codes corresponding to image blocks extracted from other frames in the same sample video according to the preset sequence into a sequence code comprises:
generating sequence codes according to the query codes, the second key value codes corresponding to the second key value image blocks and the third key value codes corresponding to the third key value image blocks;
the step of inputting the sequence codes into a classification model to obtain the prediction time sequence of the query image block and image blocks extracted from other frames in the same sample video in the sample video comprises the following steps:
inputting the sequence code into a binary model to obtain a result of the query image block before or after the second key value image block and the third key value image block as the prediction time sequence;
determining a third loss function according to the predicted time sequence corresponding to each sample video and the real time sequence of the image blocks extracted from the query image block and other frames in the same sample video in the sample video, wherein the determining the third loss function comprises:
and determining a cross entropy loss function corresponding to each sample video according to the prediction time sequence and the real time sequence of the query image block, the second key value image block and the third key value image block in the sample video, and determining a third loss function according to the cross entropy loss function corresponding to each sample video.
8. The training method of claim 5 or 6, further comprising:
determining similarity between the query code and the first, second and third key value codes according to dot products of the query code and the first, second and third key value codes, respectively;
and determining the similarity between the query code and each negative key value code according to the dot product of the query code and each negative key value code.
9. The training method of claim 8, wherein the interframe loss function corresponding to each sample video is determined using the following formula:
$$\mathcal{L}_{inter} = -\sum_{i=1}^{3} \log \frac{\exp\!\big(s_q \cdot s_{k_i}/\tau\big)}{\exp\!\big(s_q \cdot s_{k_i}/\tau\big) + \sum_{j=1}^{K} \exp\!\big(s_q \cdot s_{n_j}/\tau\big)}$$

wherein $s_q$ is the query code; $s_{k_i}$ ($1 \le i \le 3$, $i$ a positive integer) denotes the first key value code $s_{k_1}$, the second key value code $s_{k_2}$ and the third key value code $s_{k_3}$; $s_{n_j}$ ($1 \le j \le K$, $j$ a positive integer, $K$ the total number of negative key value codes) is the $j$-th negative key value code; and $\tau$ is a hyperparameter.
10. The training method of claim 8, wherein the intra-frame loss function corresponding to each sample video is determined using the following formula:
$$\mathcal{L}_{intra} = -\log \frac{\exp\!\big(s_q \cdot s_{k_1}/\tau\big)}{\exp\!\big(s_q \cdot s_{k_1}/\tau\big) + \exp\!\big(s_q \cdot s_{k_2}/\tau\big) + \exp\!\big(s_q \cdot s_{k_3}/\tau\big)}$$

wherein $s_q$ is the query code, $s_{k_1}$ is the first key value code, $s_{k_2}$ is the second key value code, $s_{k_3}$ is the third key value code, and $\tau$ is a hyperparameter.
11. The training method according to claim 7, wherein the cross entropy loss function corresponding to each sample video is determined by using the following formula:
$$\mathcal{L}_{order} = -\Big(y\,\log g\big([s_q;\,s_{k_2};\,s_{k_3}]\big) + (1-y)\,\log\big(1 - g([s_q;\,s_{k_2};\,s_{k_3}])\big)\Big)$$

wherein $s_q$ is the query code, $s_{k_2}$ is the second key value code, $s_{k_3}$ is the third key value code, and $y \in \{0,1\}$ indicates whether, in the real temporal order of the sample video, the query code $s_q$ comes before or after the second key value code $s_{k_2}$ and the third key value code $s_{k_3}$.
12. The training method of claim 3,
and the loss function of the visual feature extraction model is a weighted result of the first contrast loss function, the second contrast loss function and the third loss function.
13. A motion recognition method, comprising:
extracting a first preset number of frames from a video to be identified;
determining the code of each frame of image by using a visual feature extraction model obtained by the training method of any one of claims 1 to 12;
and inputting the coding of each frame of image into the action classification model to obtain the action type in the video to be identified.
14. A behavior recognition method, comprising:
extracting a second preset number of frames from the video to be identified;
determining the code of each frame of image by using a visual feature extraction model obtained by the training method of any one of claims 1 to 12;
and inputting the coding of each frame of image into a behavior classification model to obtain the behavior type in the video to be identified.
15. An object tracking method, comprising:
determining the code of each frame of image of the video to be identified by using the visual feature extraction model obtained by the training method of any one of claims 1 to 12, wherein position information of a target is marked in the first frame of image of the video to be identified;
and inputting the code of each frame image into the object tracking model to obtain the position information of the object in each frame image.
16. A method for extracting features of a video comprises the following steps:
extracting a third preset number of frames from the video;
determining the coding of each frame of image by using the visual feature extraction model obtained by the training method of any one of claims 1 to 12.
17. A training device, comprising:
the extraction module is configured to select a plurality of frames of images of each sample video, extract image blocks from the plurality of frames of images respectively, and take one of the extracted image blocks as a query image block;
the encoding module is configured to input each image block into the visual feature extraction model to obtain a code corresponding to each image block, wherein the code corresponding to the query image block is used as a query code;
a loss function determination module configured to determine a first contrast loss function according to a similarity between the query code of each sample video and codes corresponding to other image blocks in the same sample video and a similarity between the query code of each sample video and codes corresponding to image blocks in different sample videos, wherein the higher the similarity between the query code and the codes corresponding to other image blocks in the same sample video, the lower the similarity between the query code and the codes corresponding to image blocks in different sample videos, the smaller the value of the first contrast loss function;
a parameter adjusting module configured to adjust parameters of the visual feature extraction model according to a loss function of the visual feature extraction model, and train the visual feature extraction model, wherein the loss function of the visual feature extraction model includes the first contrast loss function.
18. A motion recognition device comprising:
the extraction module is configured to extract a first preset number of frames from the video to be identified;
a coding module configured to determine a code of each frame of image by using the visual feature extraction model obtained by the training method of any one of claims 1 to 12;
and the action classification module is configured to input the codes of the frames of images into the action classification model to obtain the action types in the video to be identified.
19. A behavior recognition device comprising:
the extraction module is configured to extract a second preset number of frames from the video to be identified;
a coding module configured to determine a code of each frame of image by using the visual feature extraction model obtained by the training method of any one of claims 1 to 12;
and the behavior classification module is configured to input the codes of the frames of images into the behavior classification model to obtain the behavior types in the video to be recognized.
20. An object tracking apparatus, comprising:
the coding module is configured to determine the coding of each frame of image of the video to be recognized by using the visual feature extraction model obtained by the training method of any one of claims 1 to 12, wherein the position information of the target is marked in the first frame of image of the video to be recognized;
and the object tracking module is configured to input the codes of the frames of images into the object tracking model to obtain the position information of the target in the frames of images.
21. A feature extraction apparatus of a video, comprising:
an extraction module configured to extract the video by a third preset number of frames;
an encoding module configured to determine an encoding of each frame of image using the visual feature extraction model obtained by the training method of any one of claims 1 to 12.
22. An electronic device, comprising:
a processor; and
a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the training method of any one of claims 1-12, or the action recognition method of claim 13, or the behavior recognition method of claim 14, or the object tracking method of claim 15, or the feature extraction method of video of claim 16.
23. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the training method of any one of claims 1 to 12, or the action recognition method of claim 13, or the behavior recognition method of claim 14, or the object tracking method of claim 15, or the feature extraction method of a video of claim 16.
CN202010763380.XA 2020-07-31 2020-07-31 Training method, training device, electronic equipment and computer readable storage medium Active CN112307883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010763380.XA CN112307883B (en) 2020-07-31 2020-07-31 Training method, training device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010763380.XA CN112307883B (en) 2020-07-31 2020-07-31 Training method, training device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112307883A true CN112307883A (en) 2021-02-02
CN112307883B CN112307883B (en) 2023-11-07

Family

ID=74483267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010763380.XA Active CN112307883B (en) 2020-07-31 2020-07-31 Training method, training device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112307883B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239855A (en) * 2021-05-27 2021-08-10 北京字节跳动网络技术有限公司 Video detection method and device, electronic equipment and storage medium
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113837260A (en) * 2021-09-17 2021-12-24 北京百度网讯科技有限公司 Model training method, object matching method, device and electronic equipment
CN114020950A (en) * 2021-11-03 2022-02-08 北京百度网讯科技有限公司 Training method, device and equipment of image retrieval model and storage medium
CN114283350A (en) * 2021-09-17 2022-04-05 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2109047A1 (en) * 2008-04-07 2009-10-14 Global Digital Technologies SA Video characterization, identification and search system
US20090259653A1 (en) * 2008-04-15 2009-10-15 Sony Corporation Information processing apparatus, method, and program
WO2010011344A1 (en) * 2008-07-23 2010-01-28 Ltu Technologies S.A.S. Frame based video matching
CN104166685A (en) * 2014-07-24 2014-11-26 北京捷成世纪科技股份有限公司 Video clip detecting method and device
JP2016014990A (en) * 2014-07-01 2016-01-28 学校法人早稲田大学 Moving image search method, moving image search device, and program thereof
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2109047A1 (en) * 2008-04-07 2009-10-14 Global Digital Technologies SA Video characterization, identification and search system
US20090259653A1 (en) * 2008-04-15 2009-10-15 Sony Corporation Information processing apparatus, method, and program
WO2010011344A1 (en) * 2008-07-23 2010-01-28 Ltu Technologies S.A.S. Frame based video matching
JP2016014990A (en) * 2014-07-01 2016-01-28 学校法人早稲田大学 Moving image search method, moving image search device, and program thereof
CN104166685A (en) * 2014-07-24 2014-11-26 北京捷成世纪科技股份有限公司 Video clip detecting method and device
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE et al.: "Momentum Contrast for Unsupervised Visual Representation Learning", ARXIV
OORD et al.: "Representation learning with contrastive predictive coding", ARXIV

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239855A (en) * 2021-05-27 2021-08-10 北京字节跳动网络技术有限公司 Video detection method and device, electronic equipment and storage medium
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113837260A (en) * 2021-09-17 2021-12-24 北京百度网讯科技有限公司 Model training method, object matching method, device and electronic equipment
CN114283350A (en) * 2021-09-17 2022-04-05 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium
CN113837260B (en) * 2021-09-17 2024-05-28 北京百度网讯科技有限公司 Model training method, object matching device and electronic equipment
CN114283350B (en) * 2021-09-17 2024-06-07 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium
CN114020950A (en) * 2021-11-03 2022-02-08 北京百度网讯科技有限公司 Training method, device and equipment of image retrieval model and storage medium

Also Published As

Publication number Publication date
CN112307883B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN112307883A (en) Training method, training device, electronic equipment and computer readable storage medium
CN109891897B (en) Method for analyzing media content
CN110309732B (en) Behavior identification method based on skeleton video
CN110929622A (en) Video classification method, model training method, device, equipment and storage medium
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
Yang et al. An improving faster-RCNN with multi-attention ResNet for small target detection in intelligent autonomous transport with 6G
CN114492992A (en) Self-adaptive space-time graph neural network traffic flow prediction method and system based on Transformer
US20220391611A1 (en) Non-linear latent to latent model for multi-attribute face editing
CN114926770A (en) Video motion recognition method, device, equipment and computer readable storage medium
Wei et al. Compact MQDF classifiers using sparse coding for handwritten Chinese character recognition
CN114266897A (en) Method and device for predicting pox types, electronic equipment and storage medium
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
CN116090504A (en) Training method and device for graphic neural network model, classifying method and computing equipment
CN114140831B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN117690178B (en) Face image recognition method and system based on computer vision
CN113343020B (en) Image processing method and device based on artificial intelligence and electronic equipment
Wang et al. Trajectory forecasting with neural networks: An empirical evaluation and a new hybrid model
CN113076963B (en) Image recognition method and device and computer readable storage medium
CN115705706A (en) Video processing method, video processing device, computer equipment and storage medium
Rao et al. Multi-level graph encoding with structural-collaborative relation learning for skeleton-based person re-identification
Zheng et al. Edge-labeling based modified gated graph network for few-shot learning
CN113297964A (en) Video target recognition model and method based on deep migration learning
CN113408721A (en) Neural network structure searching method, apparatus, computer device and storage medium
CN110135253B (en) Finger vein authentication method based on long-term recursive convolutional neural network
CN114882288B (en) Multi-view image classification method based on hierarchical image enhancement stacking self-encoder

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant