CN113822134A - Video-based instance tracking method, apparatus, device, and storage medium

Info

Publication number
CN113822134A
Authority
CN
China
Prior art keywords
instance, prediction, ROI, detection results, bounding box
Prior art date
Legal status
Pending
Application number
CN202110813442.8A
Other languages
Chinese (zh)
Inventor
杨澍生
李昱
单瀛
方羽新
王兴刚
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110813442.8A
Publication of CN113822134A

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a video-based instance tracking method implemented with artificial intelligence technology, which includes the following steps: obtaining a target feature map through a backbone network based on a target video frame in a video to be detected; obtaining N bounding box regions of interest (ROIs) from the target feature map according to N instance bounding boxes; obtaining N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs; determining at least one instance similarity according to the N first detection results and M second detection results; and determining an instance tracking result of the target video frame according to the at least one instance similarity. The application also provides an apparatus, a device, and a storage medium. By constructing an end-to-end instance detection framework, instance detection no longer depends on post-processing methods such as non-maximum suppression, and instance targets are tracked by instance identifier, thereby improving the efficiency of video-based instance tracking.

Description

Video-based instance tracking method, apparatus, device, and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a method, an apparatus, a device, and a storage medium for video-based instance tracking.
Background
Instance segmentation is a vital preprocessing step for image recognition and computer vision, and is widely applied across fields. For example, instance segmentation can support tasks such as object recognition, object detection, and object tracking. In the object detection task, it is necessary to detect not only the class of an object in an image but also its bounding box.
Currently, instance segmentation algorithms generally follow a "detect-then-segment" pipeline, i.e., instances of interest in a video are detected and segmented through object detection based on prior (anchor) boxes. Specifically, when positive samples are screened during training, matching must be performed based on the intersection-over-union (IoU) between prior boxes and ground-truth boxes.
However, since prior boxes follow a one-to-many principle during training (one ground-truth box corresponds to multiple prior boxes), the testing stage must rely on Non-Maximum Suppression (NMS) and other post-processing methods to remove duplicate instance predictions. This makes end-to-end inference difficult, resulting in low instance tracking efficiency.
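For reference, the IoU used in this prior-box matching can be computed as in the following minimal sketch (illustrative only, not part of the patent; the (x1, y1, x2, y2) corner format is an assumption):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```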
Disclosure of Invention
The embodiments of the present application provide a video-based instance tracking method, apparatus, device, and storage medium. By constructing an end-to-end instance detection framework, instance detection no longer depends on post-processing methods such as NMS, and instance targets are tracked by instance identifier, thereby improving the efficiency of video-based instance tracking.
In view of the above, an aspect of the present application provides a video-based instance tracking method, including:
obtaining a target feature map through a backbone network based on a target video frame in a video to be detected, where the target video frame is the T-th video frame in the video to be detected, and T is an integer greater than 1;
obtaining N bounding box regions of interest (ROIs) from the target feature map according to N instance bounding boxes, where each instance bounding box is used to extract one corresponding bounding box ROI, and N is an integer greater than or equal to 1;
obtaining N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs, where each first detection result includes a first class probability value, a first instance bounding box, and a first instance embedding vector;
determining at least one instance similarity according to the N first detection results and M second detection results, where each second detection result includes a second class probability value, a second instance bounding box, and a second instance embedding vector, the M second detection results are obtained from the first (T-1) video frames in the video to be detected, each second detection result corresponds to an instance identifier, and M is an integer greater than or equal to 1;
and determining an instance tracking result of the target video frame according to the at least one instance similarity, where the instance tracking result includes at least one instance identifier, and the same instance identifier represents the same instance in the video to be detected. (A sketch of how these steps compose is given after this list.)
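For illustration only, the following Python sketch shows how the five claimed steps might compose at inference time. Every name here is a hypothetical placeholder for the components described above, passed in as parameters; this is not an API defined by the patent:

```python
def track_frame(frame_t, memory, backbone, roi_extractor, instance_boxes,
                query_vectors, seg_net, similarity_fn, assign_fn):
    """One tracking step for the T-th frame (T > 1); `memory` holds the M
    second detection results accumulated from the first (T-1) frames."""
    feat_map = backbone(frame_t)                        # target feature map
    box_rois = roi_extractor(feat_map, instance_boxes)  # N bounding box ROIs
    first_results = seg_net(query_vectors, box_rois)    # N first detection results
    sims = similarity_fn(first_results, memory)         # instance similarities
    return assign_fn(sims, first_results, memory)       # instance tracking result
```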
Another aspect of the present application provides an instance tracking apparatus, comprising:
the obtaining module is configured to obtain a target feature map through a backbone network based on a target video frame in a video to be detected, where the target video frame is the T-th video frame in the video to be detected, and T is an integer greater than 1;
the obtaining module is further configured to obtain N bounding box regions of interest (ROIs) from the target feature map according to N instance bounding boxes, where each instance bounding box is used to extract one corresponding bounding box ROI, and N is an integer greater than or equal to 1;
the obtaining module is further configured to obtain N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs, where each first detection result includes a first class probability value, a first instance bounding box, and a first instance embedding vector;
the determining module is configured to determine at least one instance similarity according to the N first detection results and M second detection results, where each second detection result includes a second class probability value, a second instance bounding box, and a second instance embedding vector, the M second detection results are obtained from the first (T-1) video frames in the video to be detected, each second detection result corresponds to an instance identifier, and M is an integer greater than or equal to 1;
and the determining module is further configured to determine an instance tracking result of the target video frame according to the at least one instance similarity, where the instance tracking result includes at least one instance identifier, and the same instance identifier represents the same instance in the video to be detected.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically configured to perform dot multiplication, along the feature dimension, between the N instance query vectors and each of the N bounding box ROIs to obtain N enhanced bounding box ROIs;
obtain N first class probability values through a class discrimination network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first class probability values are included in the N first detection results;
obtain N first instance bounding boxes through a bounding box regression network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance bounding boxes are included in the N first detection results;
and obtain N first instance embedding vectors through an embedding vector network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance embedding vectors are included in the N first detection results.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically configured to obtain at least one set of bounding box dynamic parameters through a fully connected layer based on the N instance query vectors;
perform dot multiplication, along the feature dimension, between the at least one set of bounding box dynamic parameters and each of the N bounding box ROIs to obtain N enhanced bounding box ROIs;
obtain N first class probability values through a class discrimination network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first class probability values are included in the N first detection results;
obtain N first instance bounding boxes through a bounding box regression network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance bounding boxes are included in the N first detection results;
and obtain N first instance embedding vectors through an embedding vector network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance embedding vectors are included in the N first detection results.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is further configured to obtain N mask ROIs from the target video frame according to the N instance bounding boxes, where each instance bounding box is further configured to extract one corresponding mask ROI;
the obtaining module is specifically configured to obtain N first detection results through an instance segmentation network based on the N instance query vectors, the N bounding box ROIs, and the N mask ROIs, where each first detection result further includes a first instance foreground mask.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically configured to perform dot multiplication, along the feature dimension, between the N instance query vectors and each of the N bounding box ROIs to obtain N enhanced bounding box ROIs;
perform dot multiplication, along the feature dimension, between the N instance query vectors and each of the N mask ROIs to obtain N enhanced mask ROIs;
obtain N first class probability values through a class discrimination network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first class probability values are included in the N first detection results;
obtain N first instance bounding boxes through a bounding box regression network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance bounding boxes are included in the N first detection results;
obtain N first instance embedding vectors through an embedding vector network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance embedding vectors are included in the N first detection results;
and obtain N first instance foreground masks through a mask generation network included in the instance segmentation network based on the N enhanced mask ROIs, where the N first instance foreground masks are included in the N first detection results.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically configured to obtain at least one set of bounding box dynamic parameters and at least one set of mask dynamic parameters through a fully connected layer based on the N instance query vectors, where each set of mask dynamic parameters includes N mask dynamic sub-parameters;
perform dot multiplication, along the feature dimension, between the at least one set of bounding box dynamic parameters and each of the N bounding box ROIs to obtain N enhanced bounding box ROIs;
perform dot multiplication, along the feature dimension, between the at least one set of mask dynamic parameters and each of the N mask ROIs to obtain N enhanced mask ROIs;
obtain N first class probability values through a class discrimination network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first class probability values are included in the N first detection results;
obtain N first instance bounding boxes through a bounding box regression network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance bounding boxes are included in the N first detection results;
obtain N first instance embedding vectors through an embedding vector network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance embedding vectors are included in the N first detection results;
and obtain N first instance foreground masks through a mask generation network included in the instance segmentation network based on the N enhanced mask ROIs, where the N first instance foreground masks are included in the N first detection results.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the determining module is specifically configured to sort the N first detection results in descending order of first class probability value to obtain N sorted first detection results;
select the first K first detection results from the sorted N first detection results, where K is an integer greater than or equal to 1 and less than or equal to N;
and determine the instance similarity between each first detection result and each second detection result according to the K first detection results and the M second detection results, obtaining (K × M) instance similarities, as sketched below.
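A minimal sketch of this top-K selection (illustrative only; representing each detection result as a (class_prob, box, embedding) tuple is an assumption):

```python
def select_top_k(first_results, k):
    """Keep the K first detection results with the highest first class
    probability values; each result is a (class_prob, box, embedding) tuple."""
    ranked = sorted(first_results, key=lambda r: r[0], reverse=True)
    return ranked[:k]  # similarities are then computed against all M second results
```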
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the determining module is specifically configured to determine an instance embedding vector similarity between each first detection result and each second detection result according to the first instance embedding vector included in each first detection result and the second instance embedding vector included in each second detection result;
determine a spatial similarity between each first detection result and each second detection result according to the first instance bounding box included in each first detection result and the second instance bounding box included in each second detection result;
determine a category similarity between each first detection result and each second detection result according to the first class probability value included in each first detection result and the second class probability value included in each second detection result;
and determine the instance similarity between each first detection result and each second detection result according to the instance embedding vector similarity, the spatial similarity, and the category similarity between them, together with the first class probability value included in each first detection result, as in the sketch after this list.
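The patent does not fix the exact formulas for the three component similarities or their fusion, so the sketch below makes illustrative choices: cosine similarity for the embeddings, IoU (reusing the iou() helper sketched in the Background section) for the spatial term, label agreement for the category term, and a simple product weighted by the first class probability value. The attribute names are hypothetical:

```python
import numpy as np

def instance_similarity(first, second):
    """Combined instance similarity between one first and one second detection
    result; `first` and `second` carry .embedding, .box, .label, .class_prob."""
    emb_sim = float(np.dot(first.embedding, second.embedding) /
                    (np.linalg.norm(first.embedding) * np.linalg.norm(second.embedding)))
    spatial_sim = iou(first.box, second.box)              # IoU of the two boxes
    class_sim = 1.0 if first.label == second.label else 0.0
    return emb_sim * spatial_sim * class_sim * first.class_prob
```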
In one possible design, in another implementation of another aspect of an embodiment of the present application, K is less than or equal to M;
the determining module is specifically configured to construct mapping relationships between the K first detection results and the M second detection results according to the (K × M) instance similarities based on a bipartite graph matching algorithm, obtaining K mapping relationships;
if the instance similarities corresponding to P of the K mapping relationships are less than or equal to an instance similarity threshold, delete those P mapping relationships to obtain (K-P) mapping relationships, where P is an integer greater than or equal to 1 and less than or equal to K;
and determine the instance tracking result of the target video frame according to the second detection result corresponding to each of the (K-P) mapping relationships, where each second detection result corresponds to an instance identifier;
and the determining module is further configured to take the first detection results corresponding to the P deleted mapping relationships as new second detection results, obtaining (M+P) second detection results.
In one possible design, in another implementation of another aspect of an embodiment of the present application, K is greater than or equal to M;
the determining module is specifically configured to construct mapping relationships between the K first detection results and the M second detection results according to the (K × M) instance similarities based on a bipartite graph matching algorithm, obtaining M mapping relationships;
if the instance similarities corresponding to Q of the M mapping relationships are less than or equal to the instance similarity threshold, delete those Q mapping relationships to obtain (M-Q) mapping relationships, where Q is an integer greater than or equal to 1 and less than or equal to M;
and determine the instance tracking result of the target video frame according to the second detection result corresponding to each of the (M-Q) mapping relationships, where each second detection result corresponds to an instance identifier;
and the determining module is further configured to take the first detection results corresponding to the Q deleted mapping relationships, together with the (K-M) unmatched first detection results, as new second detection results, obtaining (Q+K) second detection results. A sketch of this threshold-gated matching is given below.
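Both cases above can be sketched with the Hungarian algorithm from SciPy. The threshold value and the bookkeeping details are assumptions; `sim` is the (K × M) matrix of instance similarities:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_match(sim, threshold=0.5):
    """Match K first results (rows) to M second results (columns)."""
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    matched = []
    new_instances = list(set(range(sim.shape[0])) - set(rows))  # unmatched firsts
    for r, c in zip(rows, cols):
        if sim[r, c] > threshold:
            matched.append((r, c))        # r inherits the instance id of c
        else:
            new_instances.append(r)       # r is registered as a new instance
    return matched, new_instances
```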
In one possible design, in another implementation of another aspect of the embodiments of the present application, the instance tracking apparatus further includes a training module;
the obtaining module is further configured to obtain a sample feature map through a to-be-trained backbone network based on a first sample video frame in a video to be trained, where the first sample video frame carries an annotated class, an annotated instance bounding box, and an annotated instance identifier;
the obtaining module is further configured to obtain N predicted bounding box ROIs from the sample feature map according to N to-be-trained instance bounding boxes, where each to-be-trained instance bounding box is used to extract one corresponding predicted bounding box ROI;
the obtaining module is further configured to obtain N first prediction results through a to-be-trained instance segmentation network based on N to-be-trained instance query vectors and the N predicted bounding box ROIs, where each first prediction result includes a first prediction class probability value, a first prediction instance bounding box, and a first prediction instance embedding vector;
the determining module is further configured to determine at least one prediction instance similarity according to the N first prediction results and N second prediction results, where each second prediction result includes a second prediction class probability value, a second prediction instance bounding box, and a second prediction instance embedding vector, the N second prediction results are derived from a second sample video frame in the video to be trained, and each second prediction result corresponds to an instance identifier;
the determining module is further configured to determine a prediction instance tracking result of the first sample video frame according to the at least one prediction instance similarity;
and the training module is configured to update, through a loss function, the parameters of the to-be-trained backbone network, the N to-be-trained instance bounding boxes, the N to-be-trained instance query vectors, and the to-be-trained instance segmentation network according to the prediction instance tracking result, the annotated instance identifier, the N first prediction results, the annotated class, and the annotated instance bounding box.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is further configured to obtain a sample feature map through a to-be-trained backbone network based on a first sample video frame in a video to be trained, where the first sample video frame carries an annotated class, an annotated instance bounding box, an annotated instance identifier, and an annotated instance foreground mask;
the obtaining module is further configured to obtain N predicted bounding box ROIs from the sample feature map according to N to-be-trained instance bounding boxes, where each to-be-trained instance bounding box is used to extract one corresponding predicted bounding box ROI;
the obtaining module is further configured to obtain N prediction mask ROIs from the sample feature map according to the N to-be-trained instance bounding boxes, where each to-be-trained instance bounding box is further used to extract one corresponding prediction mask ROI;
the obtaining module is further configured to obtain N first prediction results through a to-be-trained instance segmentation network based on N to-be-trained instance query vectors, the N predicted bounding box ROIs, and the N prediction mask ROIs, where each first prediction result includes a first prediction class probability value, a first prediction instance bounding box, a first prediction instance embedding vector, and a prediction instance foreground mask;
the determining module is further configured to determine at least one prediction instance similarity according to the N first prediction results and N second prediction results, where each second prediction result includes a second prediction class probability value, a second prediction instance bounding box, and a second prediction instance embedding vector, the N second prediction results are derived from a second sample video frame in the video to be trained, and each second prediction result corresponds to an instance identifier;
the determining module is further configured to determine a prediction instance tracking result of the first sample video frame according to the at least one prediction instance similarity;
and the training module is further configured to update, through a loss function, the parameters of the to-be-trained backbone network, the N to-be-trained instance bounding boxes, the N to-be-trained instance query vectors, and the to-be-trained instance segmentation network according to the prediction instance tracking result, the annotated instance identifier, the N first prediction results, the annotated class, the annotated instance bounding box, and the annotated instance foreground mask.
Another aspect of the present application provides a computer device, comprising: a memory, a processor, and a bus system;
where the memory is configured to store a program;
the processor is configured to execute the program in the memory and perform the method of the above aspects according to instructions in the program code;
and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
In this embodiment of the application, a video-based instance tracking method is provided. A target feature map is obtained through a backbone network based on a target video frame in a video to be detected; N bounding box regions of interest (ROIs) are then obtained from the target feature map according to N instance bounding boxes; N first detection results are obtained through an instance segmentation network based on N instance query vectors and the N bounding box ROIs; at least one instance similarity is determined based on the N first detection results and M second detection results; and finally the instance tracking result of the target video frame can be determined according to the at least one instance similarity, where the instance tracking result includes at least one instance identifier, and the same instance identifier represents the same instance in the video to be detected. In this manner, during video-based instance tracking, instances of interest in a video frame can be detected directly using the instance query vectors, the instances in the current frame are matched by similarity against the instances of the previous video frames, and instance tracking of the video frame is achieved according to the matching result. By constructing an end-to-end instance detection framework, instance detection no longer depends on post-processing methods such as NMS, and instance targets are tracked by instance identifier, thereby improving the efficiency of video-based instance tracking.
Drawings
FIG. 1 is a schematic diagram of an environment of an instance tracking system in an embodiment of the present application;
FIG. 2 is a schematic diagram of an end-to-end video instance segmentation framework based on instance queries in an embodiment of the present application;
FIG. 3 is a schematic flow chart of the video-based instance tracking method in an embodiment of the present application;
FIG. 4 is a schematic diagram of an instance tracking result for a target video frame in an embodiment of the present application;
FIG. 5 is a schematic diagram of another instance tracking result for a target video frame in an embodiment of the present application;
FIG. 6 is a schematic diagram of implementing dynamic convolution in an embodiment of the present application;
FIG. 7 is a schematic diagram of a bounding box ROI and a mask ROI in an embodiment of the present application;
FIG. 8 is a schematic diagram of another instance tracking result for a target video frame in an embodiment of the present application;
FIG. 9 is a schematic diagram of another instance tracking result for a target video frame in an embodiment of the present application;
FIG. 10 is another schematic diagram of implementing dynamic convolution in an embodiment of the present application;
FIG. 11 is a schematic diagram of implementing matching based on a bipartite graph matching algorithm in an embodiment of the present application;
FIG. 12 is another schematic diagram of implementing matching based on a bipartite graph matching algorithm in an embodiment of the present application;
FIG. 13 is a schematic diagram of an instance tracking apparatus in an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a video-based instance tracking method, apparatus, device, and storage medium. By constructing an end-to-end instance detection framework, instance detection no longer depends on post-processing methods such as NMS, and instance targets are tracked by instance identifier, thereby improving the efficiency of video-based instance tracking.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the continuous development of computer technology, image processing technology based on artificial intelligence has become increasingly mature. In the field of image recognition, a neural network generally needs to learn image features in order to detect instance targets in an image, where an instance target may be a human face, an animal, an object, a scene, a building, a subtitle, and so on. For instance target detection, coarse position detection is usually performed first to determine the features of candidate rectangular boxes; these features serve as the input of fine detection, which determines the specific position of the instance target, while classification determines its class; the instance target in the image is then detected according to its class and specific position. In the field of video recognition, instance targets can be recognized in each video frame and then compared with the instance targets of previous video frames to generate the instance tracking result of that frame. Several applications of video instance detection and segmentation are described below; in practice, other specific scenarios may be involved, which are not exhaustively listed here.
Firstly, subtitle detection.
Based on video instance tracking, target instances (e.g., subtitles) in a video are detected and tracked; target instances with the same instance identifier are then extracted, and Optical Character Recognition (OCR) is used to recognize the subtitle target instances.
Secondly, autonomous driving.
Based on video instance tracking, target instances (e.g., obstacles such as vehicles and pedestrians) in a video are detected and tracked; if target instances with the same instance identifier are located in the lane, the vehicle is controlled to avoid them; otherwise, the vehicle continues along its trajectory.
Thirdly, scene replacement.
Based on video instance tracking, a target instance (e.g., person A) in video A is segmented and tracked, and the segmented target instance is then placed as foreground into video B, thereby achieving background replacement.
To achieve more efficient instance tracking in the above scenarios, the present application proposes a video-based instance tracking method applied to the instance tracking system shown in fig. 1. As shown in the figure, the instance tracking system includes a terminal device, or a terminal device and a server. A client is deployed on the terminal device and may run on it in the form of a browser or of an independent application (APP); the specific form of the client is not limited here. The server involved in the application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms. The terminal device may be a smartphone, tablet computer, notebook computer, palmtop computer, personal computer, smart television, smart watch, vehicle-mounted device, wearable device, and the like, but is not limited thereto. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited here; the numbers of servers and terminal devices are likewise not limited. The scheme provided by the application may be completed independently by the terminal device, independently by the server, or by the terminal device and the server in cooperation, which is not specifically limited here.
Illustratively, in one case, the user selects a video to be detected. The terminal device can then call a local network model to perform instance recognition on the video, thereby outputting and displaying the instance tracking result of each video frame.
Illustratively, in another case, the user selects a video to be detected and uploads it to the server. The server can then call a local network model to perform instance recognition on the video, output the instance tracking result of each video frame, and feed the result back to the terminal device, which displays the instance tracking result of each video frame.
It should be noted that the process of detecting and segmenting video frames involves Computer Vision (CV) and Machine Learning (ML). Computer vision is the science of studying how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to recognize, track, and measure targets, with further graphics processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specifically studies how a computer can simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Both computer vision technology and machine learning belong to Artificial Intelligence (AI) technology. Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The query-based video instance segmentation (QueryVIS) framework employed in the training process is described below. Referring to fig. 2, fig. 2 is a schematic diagram of an end-to-end video instance segmentation framework based on instance queries in this embodiment of the present application. As shown in the figure, in the training phase, the QueryVIS framework takes a sample pair (i.e., a reference video frame and an auxiliary video frame) sampled from the same video to be trained as input. The to-be-trained backbone network then performs feature extraction on the reference video frame and the auxiliary video frame, and Region of Interest (ROI) features are extracted from the extracted features. The instance query vectors directly enhance the ROI features, which are then input into the task heads of the different tasks, so that the related tasks are distinguished.
With reference to fig. 3, the video-based instance tracking method in the present application is described below. One embodiment of the instance tracking method in the embodiments of the present application includes:
110. Obtain a target feature map through a backbone network based on a target video frame in a video to be detected, where the target video frame is the T-th video frame in the video to be detected, and T is an integer greater than 1.
In one or more embodiments, the instance tracking apparatus obtains a target video frame from the video to be detected, where the target video frame is the T-th frame of the video. The target video frame is then input into a backbone network (Backbone) to obtain the target feature map corresponding to the target video frame. The backbone network is a feature extraction network whose network parameters can be initialized with weights pre-trained on the MS-COCO instance segmentation dataset.
It should be noted that the backbone network may adopt a residual network (ResNet), such as ResNet50, ResNet101, or ResNeXt101. Alternatively, the backbone network may adopt a shifted-window-based hierarchical vision Transformer (Swin Transformer), such as Swin Transformer Tiny, Small, Base, or Large.
It should be noted that the instance tracking apparatus may be deployed in the terminal device, in the server, or in an instance tracking system composed of the terminal device and the server, which is not limited here.
120. Obtain N bounding box regions of interest (ROIs) from the target feature map according to N instance bounding boxes, where each instance bounding box is used to extract one corresponding bounding box ROI, and N is an integer greater than or equal to 1.
In one or more embodiments, the instance tracking apparatus uses N trained instance bounding boxes to extract the corresponding N bounding box ROIs from the target feature map, where an ROI is a region of interest for the algorithm. Each instance bounding box is used to extract one corresponding bounding box ROI, and each bounding box ROI typically has a fixed resolution, e.g., 7 × 7.
It should be noted that N is an integer greater than or equal to 1. In the present application, N may be set to 300; optionally, N may also be set to 100 or another value, which is not limited here. (A sketch of this ROI extraction follows.)
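For illustration, the ROI extraction can be sketched with torchvision's RoIAlign. The feature map shape, the 1/16 stride, and the random boxes below are assumptions:

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 64, 64)      # target feature map (B, C, H, W)
x1y1 = torch.rand(300, 2) * 512         # N = 300 learned instance bounding boxes
wh = torch.rand(300, 2) * 256           # in image coordinates (x1, y1, x2, y2)
boxes = torch.cat([x1y1, x1y1 + wh], dim=-1)
rois = roi_align(feat, [boxes], output_size=(7, 7),  # fixed 7 x 7 resolution
                 spatial_scale=1.0 / 16, sampling_ratio=2)
print(rois.shape)                       # torch.Size([300, 256, 7, 7])
```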
130. Obtain N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs, where each first detection result includes a first class probability value, a first instance bounding box, and a first instance embedding vector.
In one or more embodiments, instance bounding boxes correspond one-to-one with instance query vectors. Based on this, the instance tracking apparatus convolves each bounding box ROI with its instance query vector and then inputs the convolved bounding box ROIs into the instance segmentation network. The instance segmentation network outputs the first detection result corresponding to each convolved bounding box ROI, yielding N first detection results. Each first detection result includes a first class probability value, a first instance bounding box, and a first instance embedding vector. The first class probability value is the maximum probability value in the class probability distribution. The first instance bounding box is denoted (x1, y1, x2, y2), where x1 and y1 may be the top-left corner coordinates and x2 and y2 may be the bottom-right corner coordinates. The first instance embedding vector may be represented as a 1 × 256 vector.
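A first detection result can be pictured as the following record (field names are hypothetical; the shapes follow the text):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FirstDetectionResult:
    class_prob: float      # first class probability value (max of the distribution)
    box: tuple             # (x1, y1, x2, y2): top-left and bottom-right corners
    embedding: np.ndarray  # first instance embedding vector, e.g. shape (256,)
```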
140. Determine at least one instance similarity according to the N first detection results and M second detection results, where each second detection result includes a second class probability value, a second instance bounding box, and a second instance embedding vector, the M second detection results are obtained from the first (T-1) video frames in the video to be detected, each second detection result corresponds to an instance identifier, and M is an integer greater than or equal to 1.
In one or more embodiments, the instance tracking apparatus matches the N first detection results against the M second detection results, where the M second detection results were detected from the first (T-1) video frames of the video to be detected. Assuming T is 100, M second detection results, each including a second class probability value, a second instance bounding box, and a second instance embedding vector, may have been detected from those 99 video frames based on steps 110 to 130. The instance similarity between each first detection result and each second detection result can thus be calculated.
150. Determine an instance tracking result of the target video frame according to the at least one instance similarity, where the instance tracking result includes at least one instance identifier, and the same instance identifier represents the same instance in the video to be detected.
In one or more embodiments, after obtaining the at least one instance similarity, the instance tracking apparatus may match first and second detection results using a matching algorithm (e.g., a nearest-neighbor matching algorithm or a bipartite graph matching algorithm). Assuming that first detection result A matches second detection result B, and first detection result C matches second detection result D (i.e., the instance targets are tracked), it may be determined that the instance tracking result of the target video frame includes two instance identifiers: the identifier corresponding to second detection result B and the identifier corresponding to second detection result D.
In this embodiment of the application, a video-based instance tracking method is provided. In the above manner, during video-based instance tracking, instances of interest in a video frame can be detected directly using the instance query vectors; the instances in the current frame are then matched by similarity against the instances of the previous video frames, and instance tracking of the video frame is finally achieved according to the matching result. By constructing an end-to-end instance detection framework, instance detection no longer depends on post-processing methods such as NMS, and instance targets are tracked by instance identifier, thereby improving the efficiency of video-based instance tracking.
Optionally, on the basis of the embodiments corresponding to fig. 3, in another optional embodiment provided in this embodiment of the application, obtaining the N first detection results through the instance segmentation network based on the N instance query vectors and the N bounding box ROIs may specifically include:
performing dot multiplication, along the feature dimension, between the N instance query vectors and each of the N bounding box ROIs to obtain N enhanced bounding box ROIs;
obtaining N first class probability values through a class discrimination network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first class probability values are included in the N first detection results;
obtaining N first instance bounding boxes through a bounding box regression network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance bounding boxes are included in the N first detection results;
and obtaining N first instance embedding vectors through an embedding vector network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance embedding vectors are included in the N first detection results.
In one or more embodiments, a manner of implementing instance target tracking based on the instance segmentation network is described. For each bounding box ROI, a multi-head attention mechanism can be adopted to extract features over the entire bounding box ROI, so that a better model effect can be obtained within a large receptive field.
Specifically, for ease of understanding, please refer to fig. 4, which is a schematic diagram of an instance tracking result for a target video frame in this embodiment of the application. As shown in the figure, the target video frame is input into the backbone network to extract the target feature map. The ROIs are then aligned using the N instance bounding boxes, e.g., each bounding box ROI is extracted with a RoIAlign operation, yielding N bounding box ROIs. The instance query vector is dot-multiplied with the bounding box ROI along the feature dimension to obtain the enhanced bounding box ROI; since instance query vectors correspond one-to-one with instance bounding boxes, N enhanced bounding box ROIs are obtained.
It should be noted that the instance segmentation network includes a class discrimination network, a bounding box regression network, and an embedding vector network. Based on this, the N enhanced bounding box ROIs are input into the class discrimination network, which outputs the first class probability value corresponding to each enhanced bounding box ROI; into the bounding box regression network, which outputs the first instance bounding box corresponding to each enhanced bounding box ROI; and into the embedding vector network, which outputs the first instance embedding vector corresponding to each enhanced bounding box ROI.
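A compact sketch of the enhancement step and the three task heads follows. The layer sizes and the single-linear-layer heads are simplifying assumptions; the networks in the patent may be deeper:

```python
import torch
import torch.nn as nn

class InstanceSegmentationHeads(nn.Module):
    def __init__(self, dim=256, num_classes=80):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)  # class discrimination network
        self.box_head = nn.Linear(dim, 4)            # bounding box regression network
        self.emb_head = nn.Linear(dim, 256)          # embedding vector network

    def forward(self, queries, box_rois):
        # queries: (N, dim); box_rois: (N, dim, 7, 7)
        enhanced = box_rois * queries[:, :, None, None]  # dot-multiply along features
        pooled = enhanced.flatten(2).mean(-1)            # (N, dim)
        cls_prob = self.cls_head(pooled).softmax(-1).max(-1).values  # first class prob
        return cls_prob, self.box_head(pooled), self.emb_head(pooled)
```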
Based on this, the N first detection results predicted for the target video frame are matched against the M second detection results predicted from the first (T-1) video frames, and the instance tracking result of the target video frame is generated from the matching result. Additionally, the instance tracking result may also be displayed, e.g., showing the instance class "car", the position of the instance bounding box, and the instance identifier "101". It should be noted that the same instance identifier represents the same instance in the video to be detected.
Secondly, in this embodiment of the application, a method for tracking instance targets based on the instance segmentation network is provided. In this way, instance targets in a video can be detected and tracked without extracting instance foreground masks, and an end-to-end video instance segmentation framework is designed, which simplifies the complexity of the video instance segmentation model and helps improve model inference speed.
Optionally, on the basis of the embodiments corresponding to fig. 3, in another optional embodiment provided in this embodiment of the application, obtaining the N first detection results through the instance segmentation network based on the N instance query vectors and the N bounding box ROIs may specifically include:
obtaining at least one set of bounding box dynamic parameters through a fully connected layer based on the N instance query vectors;
performing dot multiplication, along the feature dimension, between the at least one set of bounding box dynamic parameters and each of the N bounding box ROIs to obtain N enhanced bounding box ROIs;
obtaining N first class probability values through a class discrimination network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first class probability values are included in the N first detection results;
obtaining N first instance bounding boxes through a bounding box regression network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance bounding boxes are included in the N first detection results;
and obtaining N first instance embedding vectors through an embedding vector network included in the instance segmentation network based on the N enhanced bounding box ROIs, where the N first instance embedding vectors are included in the N first detection results.
In one or more embodiments, a manner of implementing instance target tracking based on the instance segmentation network is described. For each bounding box ROI, a multi-head attention mechanism can be adopted to extract features over the entire bounding box ROI, so that a better model effect can be obtained within a large receptive field.
Specifically, for ease of understanding, please refer to fig. 5, which is a schematic diagram of another instance tracking result for a target video frame in this embodiment of the application. As shown in the figure, the target video frame is input into the backbone network to extract the target feature map. The ROIs are then aligned using the N instance bounding boxes, e.g., each bounding box ROI is extracted with a RoIAlign operation, yielding N bounding box ROIs. Dynamic parameters are generated from the instance query vectors, and the bounding box ROIs are then dot-multiplied with the dynamic parameters along the feature dimension to obtain the enhanced bounding box ROIs; since instance query vectors correspond one-to-one with instance bounding boxes, N enhanced bounding box ROIs are obtained.
It should be noted that the instance segmentation network includes a class discrimination network, a bounding box regression network, and an embedding vector network. Based on this, the N enhanced bounding box ROIs are input into the class discrimination network, which outputs the first class probability value corresponding to each enhanced bounding box ROI; into the bounding box regression network, which outputs the first instance bounding box corresponding to each enhanced bounding box ROI; and into the embedding vector network, which outputs the first instance embedding vector corresponding to each enhanced bounding box ROI.
Based on this, the N first detection results predicted for the target video frame are matched against the M second detection results predicted from the first (T-1) video frames, and the instance tracking result of the target video frame is generated from the matching result. Additionally, the instance tracking result may also be displayed.
The process of dynamic convolution will be described below with reference to fig. 6. In the following, an example of one bounding box ROI will be described, and it can be understood that N bounding boxes ROIs are all processed in a similar manner, which is not described herein. Referring to FIG. 6, FIG. 6 is a diagram illustrating an embodiment of the present application for implementing dynamic convolution, wherein after the ROI of the bounding box is extracted, the example query vector is dynamically convolved with the features of the ROI of the bounding box. Taking the dynamic convolution twice as an example, the example query vector is input into the full connection layer, and two sets of bounding box dynamic parameters are output by the full connection layer, wherein one set of bounding box dynamic parameters includes N dynamic parameters a, and the other set of bounding box dynamic parameters includes N dynamic parameters B. Based on this, the bounding box ROI is subjected to dot multiplication (i.e., convolution operation of 1 × 1) using the dynamic parameter a, and then the dynamically convolved bounding box ROI is subjected to dot multiplication (i.e., convolution operation of 1 × 1) using the dynamic parameter B, thereby obtaining the enhanced bounding box ROI.
In practical applications, it should be noted that a single dynamic convolution or more than two dynamic convolutions may also be performed; the present application takes two dynamic convolutions as an example, which should not be construed as limiting the present application.
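For ease of understanding, the two-step dynamic convolution described above can be sketched as follows. This is an illustrative PyTorch sketch only: realizing each 1 × 1 dynamic convolution as a per-instance matrix product over the channel dimension, the ReLU activations between the two steps, and the channel sizes are all assumptions rather than limitations of the present application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, D, C, C_mid, S = 300, 256, 256, 64, 7        # illustrative sizes
queries = torch.randn(N, D)                     # N instance query vectors
fc = nn.Linear(D, C * C_mid + C_mid * C)        # fully connected layer producing parameters A and B

params = fc(queries)
param_a = params[:, : C * C_mid].view(N, C, C_mid)   # dynamic parameters A (one per instance)
param_b = params[:, C * C_mid :].view(N, C_mid, C)   # dynamic parameters B (one per instance)

bbox_rois = torch.randn(N, C, S, S)             # N bounding box ROIs
feats = bbox_rois.flatten(2).permute(0, 2, 1)   # (N, S*S, C): one feature vector per position

feats = F.relu(torch.bmm(feats, param_a))       # first 1x1 dynamic convolution (ReLU is assumed)
feats = F.relu(torch.bmm(feats, param_b))       # second 1x1 dynamic convolution (ReLU is assumed)
enhanced_rois = feats.permute(0, 2, 1).reshape(N, C, S, S)  # enhanced bounding box ROIs
```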
Secondly, in the embodiment of the application, a method for tracking an instance target based on an instance segmentation network is provided. In this way, instance targets in a video can be detected and tracked without extracting instance foreground masks, and an end-to-end video instance segmentation framework is designed, which simplifies the complexity of the video instance segmentation model and helps improve model inference speed. Meanwhile, after ROI feature extraction, dynamic convolution is performed between the instance query vectors and the ROI features to generate enhanced instance features, thereby obtaining a better model effect.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided in the embodiments of the present application may further include:
acquiring N mask ROIs from the target video frame according to the N example bounding boxes, wherein each example bounding box is further used for extracting a corresponding mask ROI;
based on the N instance query vectors and the N bounding box ROIs, N first detection results are obtained through an instance segmentation network, which may specifically include:
based on the N instance query vectors, the N bounding box ROIs, and the N mask ROIs, N first detection results are obtained by an instance segmentation network, wherein each first detection result further includes a first instance foreground mask.
In one or more embodiments, a way of extracting mask ROIs for tracking is presented. The instance tracking device may further extract N mask ROIs from the target video frame using the N instance bounding boxes; that is, each instance bounding box is used not only to extract one corresponding bounding box ROI but also to provide one corresponding mask ROI, where the size of the mask ROI is larger than that of the bounding box ROI.
Specifically, the N instance query vectors are used to convolve the N bounding box ROIs and the N mask ROIs respectively, and instance segmentation is then performed based on the convolved results to obtain N first instance foreground masks. For ease of understanding, please refer to fig. 7, which is a schematic diagram of a bounding box ROI and a mask ROI in the embodiment of the present application. As shown in fig. 7 (A), taking the instance target "car" as an example, the corresponding bounding box ROI is the minimum bounding box of the instance target. As shown in fig. 7 (B), taking the instance target "car" as an example, the corresponding mask ROI is the foreground segmentation result of the instance target.
Secondly, in the embodiment of the application, a way of extracting mask ROIs for tracking is provided. In this way, instance targets in the video can be further segmented, which improves the single-stage instance segmentation network and extends it to the field of video instance segmentation, reducing the number of model parameters while improving segmentation fineness.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, based on the N instance query vectors, the N bounding box ROIs, and the N mask ROIs, the obtaining N first detection results through the instance segmentation network may specifically include:
performing point multiplication on each bounding box ROI of the N bounding box ROIs in the feature dimension using the N instance query vectors to obtain N enhanced bounding box ROIs;
performing point multiplication on each mask ROI of the N mask ROIs in the feature dimension using the N instance query vectors to obtain N enhanced mask ROIs;
based on the N enhanced bounding box ROIs, obtaining N first category probability values through a category discrimination network included in the instance segmentation network, wherein the N first category probability values are included in the N first detection results;
based on the N enhanced bounding box ROIs, obtaining N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, wherein the N first instance bounding boxes are included in the N first detection results;
based on the N enhanced bounding box ROIs, obtaining N first instance embedding vectors through an embedding vector network included in the instance segmentation network, wherein the N first instance embedding vectors are included in the N first detection results;
based on the N enhanced mask ROIs, obtaining N first instance foreground masks through a mask generation network included in the instance segmentation network, wherein the N first instance foreground masks are included in the N first detection results.
In one or more embodiments, a manner of implementing instance target tracking based on an instance segmentation network is described. For each bounding box ROI and each mask ROI, a multi-head attention mechanism can be adopted to extract features over the whole bounding box ROI and the whole mask ROI respectively, so that a better model effect can be obtained over a large perception range.
Specifically, for convenience of understanding, please refer to fig. 8, which is a schematic diagram of obtaining an example tracking result for a target video frame in the embodiment of the present application. The target video frame is input to a backbone network to extract a target feature map. ROI alignment is then performed using the N instance bounding boxes, e.g., the bounding box ROIs and the mask ROIs are extracted using a RoIAlign operation, thereby obtaining N bounding box ROIs and N mask ROIs. The instance query vectors are used to perform point multiplication on the bounding box ROIs in the feature dimension to obtain enhanced bounding box ROIs. Similarly, the instance query vectors are used to perform point multiplication on the mask ROIs in the feature dimension to obtain enhanced mask ROIs. Since the instance query vectors and the instance bounding boxes have a one-to-one correspondence, N enhanced bounding box ROIs and N enhanced mask ROIs are obtained.
It should be noted that the instance segmentation network includes a category discrimination network, a bounding box regression network, an embedding vector network, and a mask generation network. On this basis, the N enhanced bounding box ROIs are input to the category discrimination network, which outputs a first category probability value corresponding to each enhanced bounding box ROI. The N enhanced bounding box ROIs are input to the bounding box regression network, which outputs a first instance bounding box corresponding to each enhanced bounding box ROI. The N enhanced bounding box ROIs are input to the embedding vector network, which outputs a first instance embedding vector corresponding to each enhanced bounding box ROI. The N enhanced mask ROIs are input to the mask generation network, which outputs a first instance foreground mask corresponding to each enhanced mask ROI.
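An illustrative sketch of the four parallel heads just described is given below. The internal architectures (single linear layers, a 1 × 1 convolution, mean pooling of the enhanced bounding box ROI, and a sigmoid over class scores) are assumptions for illustration; the present application specifies only the inputs and outputs of each head, and the class and embedding dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class InstanceSegmentationHeads(nn.Module):
    def __init__(self, channels=256, num_classes=40, embed_dim=128):
        super().__init__()
        self.cls_head = nn.Linear(channels, num_classes)   # category discrimination network
        self.box_head = nn.Linear(channels, 4)             # bounding box regression network
        self.emb_head = nn.Linear(channels, embed_dim)     # embedding vector network
        self.mask_head = nn.Conv2d(channels, 1, 1)         # mask generation network

    def forward(self, enhanced_bbox_rois, enhanced_mask_rois):
        pooled = enhanced_bbox_rois.mean(dim=(2, 3))       # (N, C) per-instance feature (assumed pooling)
        cls_prob = self.cls_head(pooled).sigmoid()         # first category probability values
        boxes = self.box_head(pooled)                      # first instance bounding boxes
        embeds = self.emb_head(pooled)                     # first instance embedding vectors
        masks = self.mask_head(enhanced_mask_rois)         # first instance foreground masks
        return cls_prob, boxes, embeds, masks

heads = InstanceSegmentationHeads()
out = heads(torch.randn(300, 256, 7, 7), torch.randn(300, 256, 14, 14))
```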
On this basis, the N first detection results predicted for the target video frame are matched with the M second detection results predicted from the previous (T-1) video frames, and an instance tracking result of the target video frame is generated based on the matching result. Additionally, the instance tracking result may also be displayed.
In the embodiment of the application, a method for tracking an instance target based on an instance segmentation network is provided. In this way, instance targets in a video can be detected and tracked while extracting instance foreground masks, and an end-to-end video instance segmentation framework is designed, which simplifies the complexity of the video instance segmentation model and helps improve model inference speed.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, based on the N instance query vectors, the N bounding box ROIs, and the N mask ROIs, the obtaining N first detection results through the instance segmentation network may specifically include:
obtaining at least one set of bounding box dynamic parameters and at least one set of mask dynamic parameters through a fully connected layer based on the N instance query vectors, wherein each set of mask dynamic parameters includes N mask dynamic sub-parameters;
performing point multiplication on each bounding box ROI of the N bounding box ROIs in the feature dimension using the at least one set of bounding box dynamic parameters to obtain N enhanced bounding box ROIs;
performing point multiplication on each mask ROI of the N mask ROIs in the feature dimension using the at least one set of mask dynamic parameters to obtain N enhanced mask ROIs;
based on the N enhanced bounding box ROIs, obtaining N first category probability values through a category discrimination network included in the instance segmentation network, wherein the N first category probability values are included in the N first detection results;
based on the N enhanced bounding box ROIs, obtaining N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, wherein the N first instance bounding boxes are included in the N first detection results;
based on the N enhanced bounding box ROIs, obtaining N first instance embedding vectors through an embedding vector network included in the instance segmentation network, wherein the N first instance embedding vectors are included in the N first detection results;
based on the N enhanced mask ROIs, obtaining N first instance foreground masks through a mask generation network included in the instance segmentation network, wherein the N first instance foreground masks are included in the N first detection results.
In one or more embodiments, a manner of implementing instance target tracking based on an instance segmentation network is described. For each bounding box ROI and each mask ROI, a multi-head attention mechanism can be adopted to extract features over the whole bounding box ROI and the whole mask ROI respectively, so that a better model effect can be obtained over a large perception range.
Specifically, for convenience of understanding, please refer to fig. 9, which is a schematic diagram of obtaining an example tracking result for a target video frame in the embodiment of the present application. As shown in the figure, the target video frame is input to a backbone network to extract a target feature map. ROI alignment is then performed using the N instance bounding boxes, e.g., the bounding box ROIs and the mask ROIs are extracted using a RoIAlign operation, thereby obtaining N bounding box ROIs and N mask ROIs. One set of dynamic parameters is generated from the instance query vector, and this set of dynamic parameters is used to perform point multiplication on the bounding box ROI in the feature dimension to obtain the enhanced bounding box ROI. Similarly, another set of dynamic parameters is generated from the instance query vector, and that set is used to perform point multiplication on the mask ROI in the feature dimension to obtain the enhanced mask ROI.
It should be noted that the instance segmentation network includes a category discrimination network, a bounding box regression network, an embedding vector network, and a mask generation network. On this basis, the N enhanced bounding box ROIs are input to the category discrimination network, which outputs a first category probability value corresponding to each enhanced bounding box ROI. The N enhanced bounding box ROIs are input to the bounding box regression network, which outputs a first instance bounding box corresponding to each enhanced bounding box ROI. The N enhanced bounding box ROIs are input to the embedding vector network, which outputs a first instance embedding vector corresponding to each enhanced bounding box ROI. The N enhanced mask ROIs are input to the mask generation network, which outputs a first instance foreground mask corresponding to each enhanced mask ROI.
On this basis, the N first detection results predicted for the target video frame are matched with the M second detection results predicted from the previous (T-1) video frames, and an instance tracking result of the target video frame is generated based on the matching result. Additionally, the instance tracking result may also be displayed.
The process of dynamic convolution will be described below with reference to fig. 10. One bounding box ROI and one mask ROI are taken as examples; it can be understood that all N bounding box ROIs and N mask ROIs are processed in a similar manner, which is not repeated herein. Referring to fig. 10, which is a schematic diagram of implementing dynamic convolution in an embodiment of the present application, as shown in the figure, after the bounding box ROI is extracted, the instance query vector is dynamically convolved with the features of the bounding box ROI. Taking two dynamic convolutions as an example, the instance query vector is input to fully connected layer 1, which outputs two sets of bounding box dynamic parameters: one set includes N dynamic parameters A, and the other set includes N dynamic parameters B. On this basis, the bounding box ROI is point-multiplied by dynamic parameter A (i.e., a 1 × 1 convolution), and the dynamically convolved bounding box ROI is then point-multiplied by dynamic parameter B (i.e., another 1 × 1 convolution), thereby obtaining the enhanced bounding box ROI.
Similarly, after the mask ROI is extracted, the instance query vector is dynamically convolved with the features of the mask ROI. Taking two dynamic convolutions as an example, the instance query vector is input to fully connected layer 2, which outputs two sets of mask dynamic parameters: one set includes N dynamic parameters C, and the other set includes N dynamic parameters D. On this basis, the mask ROI is point-multiplied by dynamic parameter C (i.e., a 1 × 1 convolution), and the dynamically convolved mask ROI is then point-multiplied by dynamic parameter D (i.e., another 1 × 1 convolution), thereby obtaining the enhanced mask ROI.
In practical applications, it should be noted that a single dynamic convolution or more than two dynamic convolutions may also be performed; the present application takes two dynamic convolutions as an example, which should not be construed as limiting the present application.
In the embodiment of the application, a method for tracking an instance target based on an instance segmentation network is provided. In this way, instance targets in a video can be detected and tracked while extracting instance foreground masks, and an end-to-end video instance segmentation framework is designed, which simplifies the complexity of the video instance segmentation model and helps improve model inference speed. Meanwhile, after ROI feature extraction, dynamic convolution is performed between the instance query vectors and the ROI features to generate enhanced instance features, thereby obtaining a better model effect.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, determining at least one example similarity according to the N first detection results and the M second detection results may specifically include:
sorting the N first detection results in descending order of the first category probability value to obtain N sorted first detection results;
selecting the top K first detection results from the N sorted first detection results, wherein K is an integer greater than or equal to 1 and less than or equal to N;
and determining the instance similarity between each first detection result and each second detection result according to the K first detection results and the M second detection results to obtain (K × M) instance similarities.
In one or more embodiments, a way of screening out K first detection results for instance matching is presented. To improve the instance matching effect, online instance connection can be adopted, and its effect is further improved by using bidirectional softmax together with the instance similarity. Since the calculated (K × M) instance similarities are represented in matrix form, they can be normalized using the bidirectional softmax.
Specifically, for each video frame in the video to be detected, a corresponding detection result (including a category probability value, an instance bounding box, and an instance embedding vector) can be output through QueryVIS, and the detection result of each frame is then stored in a candidate pool. When the target video frame is detected, there are M second detection results in the candidate pool; therefore, similarity calculation needs to be performed between the K first detection results of the target video frame and the M second detection results in the candidate pool. How the K first detection results are screened out of the N first detection results is described below.
Illustratively, assume that N is 300 and K is 10. Based on the first category probability value included in each of the 300 first detection results, the N first detection results are reordered in descending order of the first category probability values, thereby obtaining N sorted first detection results. Then, the top K first detection results are selected from the N sorted first detection results, for example, 10 first detection results. Finally, pairwise similarity calculation is performed between the K first detection results and the M second detection results to obtain (K × M) instance similarities, which can be expressed in the form of a similarity matrix.
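The screening step with N = 300 and K = 10 can be sketched as follows; this is an illustrative PyTorch sketch, and the tensor layout of a detection result is a hypothetical convenience rather than something fixed by the present application.

```python
import torch

N, K = 300, 10
class_probs = torch.rand(N)        # first category probability value of each first detection result
boxes = torch.rand(N, 4)           # first instance bounding boxes (hypothetical layout)
embeds = torch.randn(N, 128)       # first instance embedding vectors (hypothetical layout)

# topk returns the K largest values, i.e., the detections sorted in descending order.
topk_probs, topk_idx = class_probs.topk(K)
topk_boxes = boxes[topk_idx]       # the K first detection results kept for matching
topk_embeds = embeds[topk_idx]
```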
Secondly, in the embodiment of the application, a way of screening K first detection results for instance matching is provided. Considering that matching all N first detection results consumes more resources and lowers the efficiency of instance target tracking, the K first detection results with the largest category probability values are screened out of the N first detection results and used for subsequent matching, thereby improving matching efficiency while improving matching accuracy.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, determining an example similarity between each first detection result and each second detection result according to the K first detection results and the M second detection results may specifically include:
determining the example embedding vector similarity between each first detection result and each second detection result according to the first example embedding vector included by each first detection result and the second example embedding vector included by each second detection result;
determining the spatial similarity between each first detection result and each second detection result according to the first instance bounding box included in each first detection result and the second instance bounding box included in each second detection result;
determining the category similarity between each first detection result and each second detection result according to the first category probability value included in each first detection result and the second category probability value included in each second detection result;
and determining the instance similarity between each first detection result and each second detection result according to the instance embedding vector similarity, the spatial similarity, and the category similarity between each first detection result and each second detection result, together with the first category probability value included in each first detection result.
In one or more embodiments, a manner of calculating instance similarity is presented. As can be seen from the foregoing embodiments, based on the K first detection results and the M second detection results, (K × M) example similarities can be calculated. For convenience of introduction, the following description will use example similarity between a first detection result and a second detection result as an example, and example similarities between other detection results are also calculated in a similar manner, which is not described herein again.
Specifically, the example similarity can be calculated as follows:
Similarity = A × B × C × D;    (Formula 1)
where Similarity represents the instance similarity between the two detection results, A represents the instance embedding vector similarity between the two detection results, B represents the spatial similarity between the two detection results, C represents the category similarity between the two detection results, and D represents the first category probability value included in the first detection result.
The way of calculating the example embedding vector similarity, spatial similarity and category similarity will be described in turn below.
First, the instance embedding vector similarity.
The instance embedding vector similarity between the first instance embedding vector and the second instance embedding vector may be calculated using cosine similarity; alternatively, the inner product of the first instance embedding vector and the second instance embedding vector may be taken as the instance embedding vector similarity, or other similarity algorithms may be used.
Second, the spatial similarity.
The spatial similarity is the Intersection over Union (IoU) value of the first instance bounding box and the second instance bounding box.
Third, the category similarity.
The category similarity ensures that instance connection is only performed between instances of the same category. That is, a first category may be determined based on the first category probability value; for example, the first category probability value is 0.9 and the category corresponding to 0.9 is "puppy". The second category may be determined from the second category probability value; for example, the second category probability value is 0.8 and the category corresponding to 0.8 is "kitten". The first category and the second category are different, so the category similarity of the two is "0".
For another example, the first category probability value is 0.9, and the category corresponding to 0.9 is "puppy"; the second category probability value is 0.8, and the category corresponding to 0.8 is also "puppy". The first category and the second category are the same, so the category similarity of the two is "1".
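Combining the three components above with the category probability value, formula 1 can be sketched for a single pair of detection results as follows. Cosine similarity for the embedding term (one of the options listed above) and axis-aligned IoU for the spatial term are assumed, and the dictionary field names are hypothetical.

```python
import torch
import torch.nn.functional as F

def box_iou(b1, b2):
    # b1, b2: tensors of (x1, y1, x2, y2)
    lt = torch.max(b1[:2], b2[:2])
    rb = torch.min(b1[2:], b2[2:])
    inter = (rb - lt).clamp(min=0).prod()
    area1 = (b1[2:] - b1[:2]).prod()
    area2 = (b2[2:] - b2[:2]).prod()
    return inter / (area1 + area2 - inter)

def instance_similarity(det1, det2):
    a = F.cosine_similarity(det1["embed"], det2["embed"], dim=0)   # A: embedding vector similarity
    b = box_iou(det1["box"], det2["box"])                          # B: spatial similarity (IoU)
    c = 1.0 if det1["label"] == det2["label"] else 0.0             # C: category similarity
    d = det1["class_prob"]                                         # D: first category probability value
    return a * b * c * d                                           # formula 1
```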
In the embodiment of the application, a way of calculating the instance similarity is provided. In this way, the instance similarity between two detection results can be calculated directly as the product of the instance embedding vector similarity, the spatial similarity, the category similarity, and the category probability value, so no additional weight parameters need to be trained, which reduces computational complexity and the amount of parameter training. Meanwhile, the (K × M) instance similarities can be normalized using the bidirectional softmax, ensuring that the values fall between 0 and 1, which improves calculation precision and accelerates convergence.
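The bidirectional softmax normalization mentioned above can be sketched as follows. Realizing it as the mean of a row-wise and a column-wise softmax is an assumption, since the present application does not spell out the exact form.

```python
import torch

def bidirectional_softmax(sim):
    # sim: (K, M) instance similarity matrix
    row = sim.softmax(dim=1)       # normalize each row over the M second detection results
    col = sim.softmax(dim=0)       # normalize each column over the K first detection results
    return 0.5 * (row + col)       # values fall between 0 and 1

sim = torch.randn(10, 25)          # K = 10, M = 25 (illustrative)
normalized = bidirectional_softmax(sim)
```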
Optionally, on the basis of the respective embodiments corresponding to fig. 3, in another optional embodiment provided in the embodiments of the present application, K is less than or equal to M;
determining an instance tracking result of the target video frame according to at least one instance similarity, which may specifically include:
based on a bipartite graph matching algorithm, according to (K × M) example similarities, constructing mapping relations between K first detection results and M second detection results to obtain K mapping relations;
if the example similarity corresponding to the P mapping relations in the K mapping relations is smaller than or equal to the example similarity threshold, deleting the P mapping relations from the K mapping relations to obtain (K-P) mapping relations, wherein P is an integer which is larger than or equal to 1 and smaller than or equal to K;
determining an example tracking result of the target video frame according to a second detection result corresponding to each mapping relation in the (K-P) mapping relations, wherein each second detection result corresponds to an example identifier;
the method can also comprise the following steps:
and taking the first detection results corresponding to the P mapping relations as second detection results to obtain (M + P) second detection results.
In one or more embodiments, a way of matching the K first detection results with the M second detection results is presented. As can be seen from the foregoing embodiments, the value K may be an integer less than or equal to the value M, and an instance similarity is calculated between each first detection result and each second detection result. According to the (K × M) instance similarities, an optimal matching result is calculated based on a bipartite graph matching algorithm; under the optimal matching result, the total matching cost is minimized, i.e., the sum of the corresponding instance similarities is maximized. On this basis, mapping relationships between the K first detection results and the M second detection results can be constructed, thereby obtaining K mapping relationships.
Specifically, for convenience of understanding, please refer to fig. 11, where fig. 11 is a schematic diagram of an implementation example matching based on a bipartite graph matching algorithm in the embodiment of the present disclosure, and as shown in the figure, it is assumed that K first detection results include a first detection result a, a first detection result B, and a first detection result C, and M second detection results include a second detection result V, a second detection result W, a second detection result X, a second detection result Y, and a second detection result Z. It is assumed that K mapping relationships obtained after matching are shown in table 1.
TABLE 1

Mapping relationship    Instance similarity
A-X                     0.9
B-V                     0.3
C-Y                     0.8
It can be seen that, in the case where K is 3 and M is 5, 3 mapping relationships can be obtained, and each mapping relationship has a corresponding instance similarity. Assuming that the instance similarity threshold is 0.4, only the instance similarity corresponding to the "B-V" mapping relationship is smaller than the instance similarity threshold (i.e., P is 1); therefore, this mapping relationship is removed from the K mapping relationships, and the remaining 2 mapping relationships (i.e., the (K-P) mapping relationships) are obtained.
Based on this, the instance identifier corresponding to the second detection result X is used as the instance identifier corresponding to the first detection result a, and the instance identifier corresponding to the second detection result Y is used as the instance identifier corresponding to the first detection result C, so as to obtain the instance tracking result of the target video frame.
Further, since the P mapping relationships are not successfully matched, the first detection results corresponding to the P mapping relationships may also be added to the candidate pool as the second detection results for subsequent matching. For example, the first detection result B is taken as the second detection result B and added to the candidate pool. Therefore, the candidate pool includes the original M second detection results and P newly added second detection results, thereby obtaining (M + P) second detection results.
It should be noted that the bipartite graph matching algorithm adopted in the present application may be the Hungarian algorithm, a maximum matching algorithm for bipartite graphs, a perfect matching algorithm, or the like, which is not limited herein.
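For ease of understanding, the matching and thresholding steps can be sketched as follows, assuming SciPy's Hungarian implementation (linear_sum_assignment) maximizing the total instance similarity; the similarity values and the 0.4 threshold follow the example above and are illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# (K x M) instance similarity matrix: K = 3 first detection results (rows),
# M = 5 second detection results (columns); values are illustrative.
sim = np.array([[0.9, 0.1, 0.2, 0.0, 0.1],
                [0.2, 0.3, 0.1, 0.1, 0.0],
                [0.1, 0.0, 0.1, 0.8, 0.2]])

rows, cols = linear_sum_assignment(sim, maximize=True)  # optimal bipartite matching

threshold = 0.4
matched, unmatched = [], []
for r, c in zip(rows, cols):
    if sim[r, c] > threshold:
        matched.append((r, c))   # first result r inherits the instance ID of second result c
    else:
        unmatched.append(r)      # added to the candidate pool as a new second detection result
```

Unmatched second detection results simply remain in the candidate pool for later frames.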
In the embodiment of the application, a way of matching the K first detection results with the M second detection results is provided. In this way, when K is less than or equal to M, a bipartite graph matching algorithm is adopted to match the K first detection results preferentially, obtaining a result with a better overall matching degree and thereby improving the accuracy and consistency of instance target tracking.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, in another optional embodiment provided in the embodiments of the present application, K is greater than or equal to M;
determining an instance tracking result of the target video frame according to at least one instance similarity, which may specifically include:
based on a bipartite graph matching algorithm, according to (K × M) example similarities, constructing mapping relations between K first detection results and M second detection results to obtain M mapping relations;
if the example similarity corresponding to Q mapping relations in the M mapping relations is smaller than or equal to the example similarity threshold, deleting the Q mapping relations from the M mapping relations to obtain (M-Q) mapping relations, wherein Q is an integer which is larger than or equal to 1 and smaller than or equal to M;
determining an example tracking result of the target video frame according to a second detection result corresponding to each mapping relation in the (M-Q) mapping relations, wherein each second detection result corresponds to an example identifier;
the method can also comprise the following steps:
and taking the first detection results corresponding to the Q mapping relations and the (K-M) unmatched first detection results as second detection results to obtain (Q + K) second detection results.
In one or more embodiments, a way of matching the K first detection results with the M second detection results is presented. As can be seen from the foregoing embodiments, the value K may be an integer greater than or equal to the value M, and an instance similarity is calculated between each first detection result and each second detection result. According to the (K × M) instance similarities, an optimal matching result is calculated based on a bipartite graph matching algorithm; under the optimal matching result, the total matching cost is minimized, i.e., the sum of the corresponding instance similarities is maximized. On this basis, mapping relationships between the K first detection results and the M second detection results can be constructed, thereby obtaining M mapping relationships.
Specifically, for convenience of understanding, please refer to fig. 12, where fig. 12 is another schematic diagram of implementing example matching based on a bipartite graph matching algorithm in the embodiment of the present disclosure, and as shown in the figure, it is assumed that K first detection results include a first detection result a, a first detection result B, a first detection result C, a first detection result D, and a first detection result E, and M second detection results include a second detection result X, a second detection result Y, and a second detection result Z. It is assumed that M mapping relationships obtained after matching are shown in table 2.
TABLE 2

Mapping relationship    Instance similarity
A-Z                     0.7
B-Y                     0.9
D-X                     0.1
It can be seen that, in the case where K is 5 and M is 3, 3 mapping relationships can be obtained, and each mapping relationship has a corresponding instance similarity. Assuming that the threshold value of the example similarity is 0.4, only the example similarity corresponding to the "D-X" mapping relationship is smaller than the threshold value of the example similarity (i.e., Q is 1), and therefore, this mapping relationship is eliminated from the M mapping relationships, and the remaining 2 mapping relationships (i.e., the remaining M-Q mapping relationships) are obtained.
Based on this, the instance identifier corresponding to the second detection result Z is used as the instance identifier corresponding to the first detection result a, and the instance identifier corresponding to the second detection result Y is used as the instance identifier corresponding to the first detection result B, so as to obtain the instance tracking result of the target video frame.
Further, since the Q mapping relationships are not successfully matched, the first detection results corresponding to the Q mapping relationships may also be added to the candidate pool as second detection results for subsequent matching; for example, the first detection result D is taken as the second detection result D and added to the candidate pool. In addition, the first detection result C and the first detection result E are not matched at all, so the remaining (K-M) first detection results may also be added to the candidate pool as second detection results for subsequent matching. Based on this, the candidate pool includes the original M second detection results, Q newly added second detection results, and (K-M) newly added second detection results, thereby obtaining (Q + K) (i.e., M + Q + K - M) second detection results.
It should be noted that the bipartite graph matching algorithm adopted in the present application may be the Hungarian algorithm, a maximum matching algorithm for bipartite graphs, a perfect matching algorithm, or the like, which is not limited herein.
In the embodiment of the application, a way of matching the K first detection results with the M second detection results is provided. In this way, when K is greater than or equal to M, a bipartite graph matching algorithm is adopted to match the M second detection results preferentially, obtaining a result with a better overall matching degree and thereby improving the accuracy and consistency of instance target tracking.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided in the embodiments of the present application may further include:
obtaining a sample feature map through a backbone network to be trained based on a first sample video frame in a video to be trained, wherein the first sample video frame is provided with an annotated category, an annotated instance bounding box, and an annotated instance identifier;
acquiring N predicted bounding box ROIs from the sample feature map according to the N to-be-trained example bounding boxes, wherein each to-be-trained example bounding box is used for extracting a corresponding predicted bounding box ROI;
acquiring N first prediction results through a to-be-trained example segmentation network based on N to-be-trained example query vectors and N prediction bounding boxes ROI, wherein each first prediction result comprises a first prediction category probability value, a first prediction example bounding box and a first prediction example embedding vector;
determining at least one prediction instance similarity according to the N first prediction results and N second prediction results, wherein each second prediction result comprises a second prediction category probability value, a second prediction instance surrounding box and a second prediction instance embedding vector, each second prediction result corresponds to an instance identifier, and the N second prediction results are derived from a second sample video frame in the video to be trained;
determining a prediction instance tracking result of the first sample video frame according to at least one prediction instance similarity;
and updating parameters of the backbone network to be trained, the N to-be-trained instance bounding boxes, the N to-be-trained instance query vectors, and the to-be-trained instance segmentation network through a loss function according to the prediction instance tracking result, the annotated instance identifier, the N first prediction results, the annotated category, and the annotated instance bounding box.
In one or more embodiments, a way of training the backbone network, the instance bounding boxes, the instance query vectors, and the instance segmentation network is presented. The QueryVIS to be trained comprises a backbone network to be trained, N to-be-trained instance bounding boxes, N to-be-trained instance query vectors, and a to-be-trained instance segmentation network. In training, sample video frames from the same video to be trained may be used. Each sample video frame may be scaled to a certain size; for example, the short edge of the sample video frame is scaled to a value in the range (320, 800) pixels and the long edge is limited to 1333 pixels. In addition, AdamW (adaptive momentum estimation with decoupled weight decay) can be used as the optimizer for training on 8 graphics cards, where the number of frames used for one gradient descent step is 32. During training, 12 full iterations may be performed on the video instance segmentation dataset, with the initial learning rate set to 0.000025 and divided by 10 after the 8th and 11th full iterations, respectively.
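A sketch of this training schedule is given below. The use of PyTorch's AdamW and MultiStepLR is an assumption of convenience, and the model, data, and loss are hypothetical stand-ins for the QueryVIS networks and the composite loss described in this embodiment.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the QueryVIS networks being trained
optimizer = torch.optim.AdamW(model.parameters(), lr=0.000025)
# Learning rate divided by 10 after the 8th and 11th full iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):                   # 12 full iterations over the dataset
    for step in range(100):               # stand-in for iterating over the training batches
        x = torch.randn(32, 8)            # 32 frames per gradient step (across 8 GPUs combined)
        loss = model(x).pow(2).mean()     # stand-in for the composite loss value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```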
In particular, QueryVIS has N instance query vectors, and when performing sample association and loss function computation, each instance query vector needs to be responsible for one real bounding box (i.e., an annotated instance bounding box). On this basis, in the training stage, after the instance query vectors produce bounding box ROI predictions, the losses between the instance query vector predictions and the real bounding boxes are calculated one by one, yielding a two-dimensional loss matrix between the instance query vector predictions and the real bounding boxes. Then, one-to-one matching between the instance query vectors and the real bounding boxes is performed by a bipartite graph matching algorithm, giving a matching result between the instance queries and the real bounding boxes. The instance query vector matched to a real bounding box is responsible for predicting the category of that real bounding box (i.e., the first prediction category probability value), the bounding box (i.e., the first prediction instance bounding box), and the instance embedding vector (i.e., the first prediction instance embedding vector); that is, the loss function measures the instance-level loss between each instance query vector's prediction and its corresponding real bounding box. It should be noted that, for instance query vectors not matched to any real bounding box, the corresponding loss function only contains a classification loss in which the prediction is treated as a negative sample.
The QueryVIS training process is specifically as follows: the first sample video frame in the video to be trained is taken as the input of the backbone network to be trained, and the backbone network to be trained extracts features to obtain a sample feature map. Then, N predicted bounding box ROIs are extracted from the sample feature map using the N to-be-trained instance bounding boxes. Next, the N to-be-trained instance query vectors are used to perform a dynamic convolution operation or an ordinary convolution operation on the N predicted bounding box ROIs, generating N enhanced predicted bounding box ROIs. The N enhanced predicted bounding box ROIs are taken as the input of the to-be-trained instance segmentation network, which outputs N first prediction category probability values, N first prediction instance bounding boxes, and N first prediction instance embedding vectors.
It should be noted that, since training uses pairs of sample video frames, the second sample video frame also needs to be processed in the same way to obtain N second prediction results, where the second sample video frame is taken from any frame of the same video to be trained. Then, the prediction instance similarity between each first prediction result and each second prediction result can be calculated using formula 1 above, thereby obtaining at least one prediction instance similarity. In this way, a prediction instance tracking result of the first sample video frame, i.e., the instance identifiers predicted for the frame, can be determined.
In designing the loss function, for example, a focal loss function may be used to calculate the loss values between the N first prediction category probability values and the true annotated category, and the calculated loss may be scaled using the number of positive samples as the scaling factor. Illustratively, the loss values between the N first prediction instance bounding boxes and the true annotated instance bounding box may be calculated using a Generalized IoU (GIoU) loss function and an L1 loss function. Illustratively, a classification loss can be employed to compute the loss value between the instance identifiers included in the prediction instance tracking result and the actual annotated instance identifiers.
The loss values are then summed or weighted and summed to obtain a composite loss value. Based on the composite loss value, QueryVIS is trained using back propagation and gradient descent, i.e., the parameters of the backbone network to be trained, the N to-be-trained instance bounding boxes, the N to-be-trained instance query vectors, and the to-be-trained instance segmentation network are updated.
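A sketch of the composite detection loss is given below, assuming torchvision's focal and GIoU loss implementations and plain summation of the terms; the present application leaves the weighting as a design choice, and the argument shapes are hypothetical.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

def detection_loss(cls_logits, cls_targets, pred_boxes, gt_boxes, num_pos):
    # cls_logits, cls_targets: (N, num_classes) logits and one-hot float targets.
    # pred_boxes, gt_boxes:    (P, 4) matched boxes in (x1, y1, x2, y2) form.
    # num_pos:                 number of positive samples, used as the scaling factor.

    # Focal loss on the class probability values, scaled by the positive count.
    loss_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="sum") / num_pos
    # GIoU loss plus L1 loss on the matched instance bounding boxes.
    loss_giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="sum") / num_pos
    loss_l1 = F.l1_loss(pred_boxes, gt_boxes, reduction="sum") / num_pos
    # The instance-identifier term (a classification loss) would be added here as well.
    return loss_cls + loss_giou + loss_l1   # plain sum; weighting is a design choice
```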
Secondly, in the embodiment of the application, a method for training the backbone network, the instance bounding boxes, the instance query vectors, and the instance segmentation network is provided. In this way, samples are associated one-to-one with ground-truth values, so that instances of interest are detected and segmented, and complete end-to-end training and inference can be performed.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided in the embodiments of the present application may further include:
obtaining a sample feature map through a backbone network to be trained based on a first sample video frame in a video to be trained, wherein the first sample video frame is provided with an annotated category, an annotated instance bounding box, an annotated instance identifier, and an annotated instance foreground mask;
acquiring N predicted bounding box ROIs from the sample feature map according to the N to-be-trained example bounding boxes, wherein each to-be-trained example bounding box is used for extracting a corresponding predicted bounding box ROI;
acquiring N prediction mask ROIs from the sample feature map according to the N to-be-trained example enclosing frames, wherein each to-be-trained example enclosing frame is also used for extracting a corresponding prediction mask ROI;
acquiring N first prediction results through a to-be-trained example segmentation network based on N to-be-trained example query vectors, N prediction bounding boxes ROI and N prediction masks ROI, wherein each first prediction result comprises a first prediction category probability value, a first prediction example bounding box, a first prediction example embedding vector and a prediction example foreground mask;
determining at least one prediction instance similarity according to the N first prediction results and N second prediction results, wherein each second prediction result comprises a second prediction category probability value, a second prediction instance surrounding box and a second prediction instance embedding vector, the N second prediction results are derived from a second sample video frame in the video to be trained, and each second prediction result corresponds to an instance identifier;
determining a prediction instance tracking result of the first sample video frame according to at least one prediction instance similarity;
and updating parameters of the backbone network to be trained, the N to-be-trained instance bounding boxes, the N to-be-trained instance query vectors, and the to-be-trained instance segmentation network through a loss function according to the prediction instance tracking result, the annotated instance identifier, the N first prediction results, the annotated category, the annotated instance bounding box, and the annotated instance foreground mask.
In one or more embodiments, a way of training the backbone network, the instance bounding boxes, the instance query vectors, and the instance segmentation network is presented. The training-related parameters are as described in the foregoing embodiments and are therefore not repeated herein.
In particular, similar to the previous embodiment, QueryVIS has N instance query vectors, and when performing sample association and loss function computation, each instance query vector needs to be responsible for one real bounding box (i.e., an annotated instance bounding box). In the training phase, the instance query vector matched to a real bounding box is also responsible for predicting the instance mask of that real bounding box (i.e., the prediction instance foreground mask). It should be noted that the rest of the training process is similar to that described in the foregoing embodiment and is not repeated herein.
The QueryVIS training process is specifically as follows: the first sample video frame in the video to be trained is taken as the input of the backbone network to be trained, and the backbone network to be trained extracts features to obtain a sample feature map. Then, N predicted bounding box ROIs and N predicted mask ROIs are extracted from the sample feature map using the N to-be-trained instance bounding boxes. Next, the N to-be-trained instance query vectors are used to perform a dynamic convolution operation or an ordinary convolution operation on the N predicted bounding box ROIs and the N predicted mask ROIs, generating N enhanced predicted bounding box ROIs and N enhanced predicted mask ROIs. The N enhanced predicted bounding box ROIs and the N enhanced predicted mask ROIs are taken as the input of the to-be-trained instance segmentation network, which outputs N first prediction category probability values, N first prediction instance bounding boxes, N first prediction instance embedding vectors, and N prediction instance foreground masks.
It should be noted that, since training uses pairs of sample video frames, the second sample video frame also needs to be processed in the same way to obtain N second prediction results, where the second sample video frame is taken from any frame of the same video to be trained. Then, the prediction instance similarity between each first prediction result and each second prediction result can be calculated using formula 1 above, thereby obtaining at least one prediction instance similarity. In this way, a prediction instance tracking result of the first sample video frame, i.e., the instance identifiers predicted for the frame, can be determined.
In designing the loss function, for example, a focal loss function may be used to calculate the loss values between the N first prediction category probability values and the true annotated category, and the calculated loss may be scaled using the number of positive samples as the scaling factor. Illustratively, the loss values between the N first prediction instance bounding boxes and the true annotated instance bounding box may be calculated using a Generalized IoU (GIoU) loss function and an L1 loss function. Illustratively, a classification loss can be employed to compute the loss value between the instance identifiers included in the prediction instance tracking result and the actual annotated instance identifiers. Illustratively, the loss values between the N prediction instance foreground masks and the true annotated instance foreground mask may be calculated using a Dice loss function and an L1 loss function.
The loss values are then summed or weighted and summed to obtain a composite loss value. Based on the composite loss value, QueryVIS is trained using back propagation and gradient descent, i.e., the parameters of the backbone network to be trained, the N to-be-trained instance bounding boxes, the N to-be-trained instance query vectors, and the to-be-trained instance segmentation network are updated.
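A sketch of the mask loss term added in this embodiment is given below, assuming a standard soft Dice formulation combined with L1, as named above; the epsilon term is a common numerical safeguard rather than something specified here, and applying L1 to the sigmoid of the mask logits is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_mask, gt_mask, eps=1.0):
    pred = pred_mask.sigmoid().flatten(1)   # (N, H*W) predicted foreground masks
    gt = gt_mask.flatten(1)                 # (N, H*W) annotated foreground masks in {0, 1}
    inter = (pred * gt).sum(dim=1)
    dice = (2 * inter + eps) / (pred.sum(dim=1) + gt.sum(dim=1) + eps)
    return 1 - dice.mean()

def mask_loss(pred_mask, gt_mask):
    # Dice loss plus L1 loss between predicted and annotated foreground masks.
    return dice_loss(pred_mask, gt_mask) + F.l1_loss(pred_mask.sigmoid(), gt_mask)
```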
Secondly, in the embodiment of the application, a method for training the backbone network, the instance bounding boxes, the instance query vectors, and the instance segmentation network is provided. In this way, samples are associated one-to-one with ground-truth values, so that instances of interest are detected and segmented, and complete end-to-end training and inference can be performed.
Based on the instance tracking method provided by the application, instance detection, segmentation, and tracking can be performed accurately and quickly on an input video. Leading results are obtained on multiple open-source datasets, including the YouTube Video Instance Segmentation 2019 dataset (YouTube-VIS (2019)) and the YouTube Video Instance Segmentation 2021 dataset (YouTube-VIS (2021)). Referring specifically to Table 3, Table 3 shows the results of a system-level comparison performed on the YouTube-VIS (2019) dataset.
TABLE 3
[Table 3 is provided as an image in the original publication; it reports the system-level comparison on the YouTube-VIS (2019) dataset.]
Here, mean Average Precision (mAP) is the evaluation metric for video instance segmentation algorithms. In summary, the instance query (Query) based VIS method is proposed to construct an end-to-end video instance segmentation model that requires no post-processing; meanwhile, constructing a unified tracking head reduces the number of hand-crafted parameters in the tracking task head while giving good tracking performance on different tracking tasks. On the public YouTube-VIS validation set, the method surpasses the current most advanced video instance segmentation algorithms in both speed and accuracy.
Referring to fig. 13, fig. 13 is a schematic view of an embodiment of an example tracking device in an embodiment of the present application, and the example tracking device 20 includes:
an obtaining module 210, configured to obtain a target feature map through a backbone network based on a target video frame in a video to be detected, where the target video frame is a tth video frame in the video to be detected, and T is an integer greater than 1;
the obtaining module 210 is further configured to obtain N bounding box regions of interest ROIs from the target feature map according to N example bounding boxes, where each example bounding box is used to extract a corresponding bounding box ROI, and N is an integer greater than or equal to 1;
the obtaining module 210 is further configured to obtain N first detection results through the instance segmentation network based on the N instance query vectors and the N bounding box ROIs, where each first detection result includes a first class probability value, a first instance bounding box, and a first instance embedding vector;
a determining module 220, configured to determine at least one instance similarity according to the N first detection results and M second detection results, where each second detection result includes a second category probability value, a second instance bounding box, and a second instance embedding vector, the M second detection results are obtained according to previous (T-1) video frames in the video to be detected, each second detection result corresponds to an instance identifier, and M is an integer greater than or equal to 1;
the determining module 220 is further configured to determine an instance tracking result of the target video frame according to at least one instance similarity, where the instance tracking result includes at least one instance identifier, and the same instance identifier represents the same instance in the video to be detected.
In the embodiment of the application, an instance tracking device is provided. With this device, in the video-based instance tracking process, instances of interest in a video frame can be detected directly using the instance query vectors; similarity matching is then performed between the instances in the current video frame and the instances in the previous video frames of the video, and instance tracking of the video frame is finally achieved according to the matching result. By constructing an end-to-end instance detection framework, the present application achieves instance detection that does not depend on post-processing methods such as NMS, and tracks instance targets based on instance identifiers, thereby improving the efficiency of video-based instance tracking.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the example tracking device 20 provided in the embodiment of the present application,
the obtaining module 210 is specifically configured to perform point multiplication on each bounding box ROI of the N bounding box ROIs in the feature dimension using the N instance query vectors to obtain N enhanced bounding box ROIs;
based on the N enhanced bounding box ROIs, obtain N first category probability values through a category discrimination network included in the instance segmentation network, wherein the N first category probability values are included in the N first detection results;
based on the N enhanced bounding box ROIs, obtain N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, wherein the N first instance bounding boxes are included in the N first detection results;
based on the N enhanced bounding box ROIs, obtain N first instance embedding vectors through an embedding vector network included in the instance segmentation network, wherein the N first instance embedding vectors are included in the N first detection results.
In the embodiment of the application, an instance tracking device is provided. With this device, instance targets in a video can be detected and tracked without extracting instance foreground masks, and an end-to-end video instance segmentation framework is designed, which simplifies the complexity of the video instance segmentation model and helps improve model inference speed.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the example tracking device 20 provided in the embodiment of the present application,
an obtaining module 210, specifically configured to obtain at least one set of bounding box dynamic parameters through a fully connected layer based on the N instance query vectors;
perform point multiplication on each bounding box ROI of the N bounding box ROIs in the feature dimension using the at least one set of bounding box dynamic parameters to obtain N enhanced bounding box ROIs;
based on the N enhanced bounding box ROIs, obtain N first category probability values through a category discrimination network included in the instance segmentation network, wherein the N first category probability values are included in the N first detection results;
based on the N enhanced bounding box ROIs, obtain N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, wherein the N first instance bounding boxes are included in the N first detection results;
based on the N enhanced bounding box ROIs, obtain N first instance embedding vectors through an embedding vector network included in the instance segmentation network, wherein the N first instance embedding vectors are included in the N first detection results.
In the embodiment of the application, an example tracking device is provided, and by adopting the device, the detection and tracking of the example target in the video can be realized under the condition that the example foreground mask is not extracted, and an end-to-end video example segmentation framework is designed, so that the complexity of a video example segmentation model is simplified, and the speed of model reasoning is favorably improved. Meanwhile, after ROI feature extraction is carried out, dynamic convolution is carried out between the instance query vector and the ROI feature, so that enhanced instance features are generated, and a better model effect is obtained.
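Purely as an illustration of this variant, the dynamic parameters could be generated by a fully connected layer as sketched below; all tensor sizes are assumptions.

```python
import torch
import torch.nn as nn

feat_dim, N, S = 256, 100, 7                 # assumed sizes, for illustration only
query_vectors = torch.randn(N, feat_dim)     # N instance query vectors
box_rois = torch.randn(N, feat_dim, S, S)    # N bounding box ROIs

fc = nn.Linear(feat_dim, feat_dim)           # fully connected layer
dyn_params = fc(query_vectors)               # one group of bounding box dynamic parameters

# Dynamic convolution between queries and ROI features: dot-multiply each
# bounding box ROI with its dynamic parameters along the feature dimension.
enhanced_rois = box_rois * dyn_params[:, :, None, None]
```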
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application,
the obtaining module 210 is further configured to obtain N mask ROIs from the target video frame according to the N instance bounding boxes, where each instance bounding box is further used to extract one corresponding mask ROI;
the obtaining module 210 is specifically configured to obtain N first detection results through the instance segmentation network based on the N instance query vectors, the N bounding box ROIs, and the N mask ROIs, where each first detection result further includes a first instance foreground mask.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, the instance targets in a video can be further segmented; the single-stage instance segmentation network is thus improved and extended to video instance segmentation, the number of model parameters is reduced, and the fineness of the segmentation can be improved.
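One plausible realization of the two ROI extractions is RoIAlign over the same feature map, with a larger output size for the mask branch. RoIAlign is an assumption here, as are the shapes and box values in this sketch.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 64, 64)            # (batch, channels, H, W), assumed
xy1 = torch.rand(100, 2) * 32                        # 100 instance bounding boxes ...
wh = torch.rand(100, 2) * 16 + 1
instance_boxes = torch.cat([xy1, xy1 + wh], dim=1)   # ... in valid x1y1x2y2 form

# Each instance bounding box extracts one bounding box ROI and one mask ROI;
# the mask ROI is typically sampled at a higher resolution.
box_rois = roi_align(feature_map, [instance_boxes], output_size=7)    # (100, 256, 7, 7)
mask_rois = roi_align(feature_map, [instance_boxes], output_size=14)  # (100, 256, 14, 14)
```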
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application,
the obtaining module 210 is specifically configured to dot-multiply each of the N bounding box ROIs with the corresponding instance query vector along the feature dimension to obtain N enhanced bounding box ROIs;
dot-multiply each of the N mask ROIs with the corresponding instance query vector along the feature dimension to obtain N enhanced mask ROIs;
obtain, based on the N enhanced bounding box ROIs, N first class probability values through a class discrimination network included in the instance segmentation network, where the N first class probability values are included in the N first detection results;
obtain, based on the N enhanced bounding box ROIs, N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, where the N first instance bounding boxes are included in the N first detection results;
obtain, based on the N enhanced bounding box ROIs, N first instance embedding vectors through an embedding vector network included in the instance segmentation network, where the N first instance embedding vectors are included in the N first detection results;
obtain, based on the N enhanced mask ROIs, N first instance foreground masks through a mask generation network included in the instance segmentation network, where the N first instance foreground masks are included in the N first detection results.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, instance targets in a video can be detected and tracked while instance foreground masks are extracted, and an end-to-end video instance segmentation framework is designed, which simplifies the video instance segmentation model and helps speed up model inference.
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application,
the obtaining module 210 is specifically configured to obtain, based on the N instance query vectors, at least one group of bounding box dynamic parameters and at least one group of mask dynamic parameters through a fully connected layer, where each group of mask dynamic parameters includes N mask dynamic sub-parameters;
dot-multiply each of the N bounding box ROIs with the corresponding bounding box dynamic parameters along the feature dimension to obtain N enhanced bounding box ROIs;
dot-multiply each of the N mask ROIs with the corresponding mask dynamic parameters along the feature dimension to obtain N enhanced mask ROIs;
obtain, based on the N enhanced bounding box ROIs, N first class probability values through a class discrimination network included in the instance segmentation network, where the N first class probability values are included in the N first detection results;
obtain, based on the N enhanced bounding box ROIs, N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, where the N first instance bounding boxes are included in the N first detection results;
obtain, based on the N enhanced bounding box ROIs, N first instance embedding vectors through an embedding vector network included in the instance segmentation network, where the N first instance embedding vectors are included in the N first detection results;
obtain, based on the N enhanced mask ROIs, N first instance foreground masks through a mask generation network included in the instance segmentation network, where the N first instance foreground masks are included in the N first detection results.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, instance targets in a video can be detected and tracked while instance foreground masks are extracted, and an end-to-end video instance segmentation framework is designed, which simplifies the video instance segmentation model and helps speed up model inference. In addition, after the ROI features are extracted, a dynamic convolution is performed between the instance query vectors and the ROI features, so that enhanced instance features are generated and a better model effect is obtained.
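The mask side of this variant might be sketched as follows. Emitting both parameter groups from a single fully connected layer and the small convolutional mask generation network are assumptions, as are all shapes.

```python
import torch
import torch.nn as nn

feat_dim, N, S_mask = 256, 100, 14                   # assumed sizes
query_vectors = torch.randn(N, feat_dim)
mask_rois = torch.randn(N, feat_dim, S_mask, S_mask)

# One fully connected layer yields both the bounding box dynamic parameters
# and the mask dynamic parameters (one sub-parameter vector per instance).
fc = nn.Linear(feat_dim, 2 * feat_dim)
box_params, mask_params = fc(query_vectors).split(feat_dim, dim=1)

enhanced_mask_rois = mask_rois * mask_params[:, :, None, None]

mask_net = nn.Sequential(                            # mask generation network
    nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
    nn.Conv2d(feat_dim, 1, 1),
)
fg_masks = mask_net(enhanced_mask_rois).sigmoid()    # N first instance foreground masks
```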
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application,
the determining module 220 is specifically configured to sort the N first detection results in descending order of the first class probability value to obtain N sorted first detection results;
select the first K first detection results from the N sorted first detection results, where K is an integer greater than or equal to 1 and less than or equal to N;
and determine, according to the K first detection results and the M second detection results, the instance similarity between each first detection result and each second detection result, to obtain (K × M) instance similarities.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, since matching all N first detection results would both consume more resources and lower the efficiency of instance target tracking, the K first detection results with the largest class probability values are screened out of the N first detection results and used for the subsequent matching, which improves the matching accuracy while improving the matching efficiency.
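Purely as an illustration, the top-K screening amounts to a sort and slice over assumed per-detection tensors (the names and layouts are assumptions):

```python
import torch

def select_top_k(cls_probs, boxes, embeddings, k=10):
    # cls_probs: (N,) highest class probability per first detection result;
    # keep the K results with the largest class probability values.
    order = cls_probs.argsort(descending=True)[:k]
    return cls_probs[order], boxes[order], embeddings[order]
```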
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application,
the determining module 220 is specifically configured to determine the instance embedding vector similarity between each first detection result and each second detection result according to the first instance embedding vector included in each first detection result and the second instance embedding vector included in each second detection result;
determine the spatial similarity between each first detection result and each second detection result according to the first instance bounding box included in each first detection result and the second instance bounding box included in each second detection result;
determine the class similarity between each first detection result and each second detection result according to the first class probability value included in each first detection result and the second class probability value included in each second detection result;
and determine the instance similarity between each first detection result and each second detection result according to the instance embedding vector similarity, the spatial similarity, the class similarity, and the first class probability value included in each first detection result.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, the instance similarity between two detection results can be computed directly as the product of the instance embedding vector similarity, the spatial similarity, the class similarity, and the class probability value, so no additional weight parameters need to be trained, which reduces the computational complexity and the amount of parameter training. Meanwhile, the (K × M) instance similarities can be normalized with a bidirectional softmax, which keeps each value in the interval from 0 to 1, improving computational precision and accelerating convergence.
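The following sketch illustrates one way such a product-form similarity with bidirectional softmax normalization could be computed. The concrete factor definitions (dot-product embedding affinity, IoU as spatial similarity, class-probability dot product) are assumptions, since the application does not fix them here.

```python
import torch
from torchvision.ops import box_iou

def instance_similarity(emb1, emb2, boxes1, boxes2, cls1, cls2, top_probs):
    # Inputs (assumed layouts): (K, D)/(M, D) embeddings, (K, 4)/(M, 4)
    # boxes in x1y1x2y2 form, (K, C)/(M, C) class probabilities, and
    # (K,) first class probability values.
    logits = emb1 @ emb2.T                                           # (K, M) affinities
    # Bidirectional softmax keeps each embedding similarity in [0, 1]
    # and sharpens the matching distribution in both directions.
    emb_sim = 0.5 * (logits.softmax(dim=1) + logits.softmax(dim=0))
    spatial_sim = box_iou(boxes1, boxes2)                            # (K, M)
    class_sim = cls1 @ cls2.T                                        # (K, M)
    return emb_sim * spatial_sim * class_sim * top_probs[:, None]    # (K, M)
```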
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application, K is less than or equal to M;
the determining module 220 is specifically configured to construct, based on a bipartite graph matching algorithm, mapping relationships between the K first detection results and the M second detection results according to the (K × M) instance similarities, to obtain K mapping relationships;
if the instance similarities corresponding to P of the K mapping relationships are less than or equal to an instance similarity threshold, delete the P mapping relationships from the K mapping relationships to obtain (K-P) mapping relationships, where P is an integer greater than or equal to 1 and less than or equal to K;
and determine the instance tracking result of the target video frame according to the second detection result corresponding to each of the (K-P) mapping relationships, where each second detection result corresponds to an instance identifier;
the determining module is further configured to take the first detection results corresponding to the P mapping relationships as second detection results, to obtain (M + P) second detection results.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, when K is less than or equal to M, a bipartite graph matching algorithm is adopted and the K first detection results are matched preferentially, so a result with a better overall matching degree is obtained, which improves the accuracy and consistency of instance target tracking.
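As an illustration of the matching step, here is a sketch using SciPy's Hungarian solver over an assumed (K, M) NumPy similarity matrix; the threshold value is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(sim, threshold=0.3):
    # sim: (K, M) NumPy array of instance similarities.
    # Bipartite matching maximizes the overall similarity of the mapping.
    rows, cols = linear_sum_assignment(-sim)   # negate: the solver minimizes cost
    matched, unmatched = [], []
    for r, c in zip(rows, cols):
        if sim[r, c] > threshold:
            matched.append((r, c))    # detection r inherits instance identifier c
        else:
            unmatched.append(r)       # mapping deleted; r becomes a new instance
    return matched, unmatched
```

Because `linear_sum_assignment` accepts rectangular matrices, the same sketch also covers the case described next, where K is greater than or equal to M: at most M mappings are produced and the leftover first detection results are registered as new instances.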
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application, K is greater than or equal to M;
the determining module 220 is specifically configured to construct, based on a bipartite graph matching algorithm, mapping relationships between the K first detection results and the M second detection results according to the (K × M) instance similarities, to obtain M mapping relationships;
if the instance similarities corresponding to Q of the M mapping relationships are less than or equal to the instance similarity threshold, delete the Q mapping relationships from the M mapping relationships to obtain (M-Q) mapping relationships, where Q is an integer greater than or equal to 1 and less than or equal to M;
and determine the instance tracking result of the target video frame according to the second detection result corresponding to each of the (M-Q) mapping relationships, where each second detection result corresponds to an instance identifier;
the determining module is further configured to take the first detection results corresponding to the Q mapping relationships and the (K-M) unmatched first detection results as second detection results, to obtain (Q + K) second detection results.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, when K is greater than or equal to M, a bipartite graph matching algorithm is adopted and the M second detection results are matched preferentially, so a result with a better overall matching degree is obtained, which improves the accuracy and consistency of instance target tracking.
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application, the instance tracking apparatus 20 further includes a training module 230;
the obtaining module 210 is further configured to obtain a sample feature map through a backbone network to be trained based on a first sample video frame in a video to be trained, where the first sample video frame has an annotated class, an annotated instance bounding box, and an annotated instance identifier;
the obtaining module 210 is further configured to obtain N prediction bounding box ROIs from the sample feature map according to N instance bounding boxes to be trained, where each instance bounding box to be trained is used to extract one corresponding prediction bounding box ROI;
the obtaining module 210 is further configured to obtain N first prediction results through an instance segmentation network to be trained based on N instance query vectors to be trained and the N prediction bounding box ROIs, where each first prediction result includes a first prediction class probability value, a first prediction instance bounding box, and a first prediction instance embedding vector;
the determining module 220 is further configured to determine at least one prediction instance similarity according to the N first prediction results and N second prediction results, where each second prediction result includes a second prediction class probability value, a second prediction instance bounding box, and a second prediction instance embedding vector, the N second prediction results are derived from a second sample video frame in the video to be trained, and each second prediction result corresponds to an instance identifier;
the determining module 220 is further configured to determine a prediction instance tracking result of the first sample video frame according to the at least one prediction instance similarity;
and the training module 230 is configured to update, through a loss function, the parameters of the backbone network to be trained, the N instance bounding boxes to be trained, the N instance query vectors to be trained, and the instance segmentation network to be trained according to the prediction instance tracking result, the annotated instance identifier, the N first prediction results, the annotated class, and the annotated instance bounding box.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, predictions are connected one-to-one with ground-truth values, so that the instances of interest are detected and segmented, and fully end-to-end training and inference can be performed.
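The one-to-one connection between predictions and ground truth suggests a set-based label assignment; purely as an illustration, such an assignment could be sketched as follows. The Hungarian solver, the cost terms, and the omitted weights are assumptions, not the application's actual loss.

```python
import torch
from scipy.optimize import linear_sum_assignment

def assign_targets(pred_probs, pred_boxes, gt_labels, gt_boxes):
    # Shapes (assumed): pred_probs (N, C) class probabilities,
    # pred_boxes (N, 4), gt_labels (G,) long, gt_boxes (G, 4).
    cls_cost = -pred_probs[:, gt_labels]                # (N, G) classification cost
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, G) L1 box distance
    cost = (cls_cost + box_cost).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)      # one prediction per target
    return pred_idx, gt_idx  # the loss is computed only over these pairs
```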
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the instance tracking apparatus 20 provided in this embodiment of the application,
the obtaining module 210 is further configured to obtain a sample feature map through a backbone network to be trained based on a first sample video frame in a video to be trained, where the first sample video frame has an annotated class, an annotated instance bounding box, an annotated instance identifier, and an annotated instance foreground mask;
the obtaining module 210 is further configured to obtain N prediction bounding box ROIs from the sample feature map according to the N instance bounding boxes to be trained, where each instance bounding box to be trained is used to extract one corresponding prediction bounding box ROI;
the obtaining module 210 is further configured to obtain N prediction mask ROIs from the sample feature map according to the N instance bounding boxes to be trained, where each instance bounding box to be trained is further used to extract one corresponding prediction mask ROI;
the obtaining module 210 is further configured to obtain N first prediction results through the instance segmentation network to be trained based on the N instance query vectors to be trained, the N prediction bounding box ROIs, and the N prediction mask ROIs, where each first prediction result includes a first prediction class probability value, a first prediction instance bounding box, a first prediction instance embedding vector, and a prediction instance foreground mask;
the determining module 220 is further configured to determine at least one prediction instance similarity according to the N first prediction results and N second prediction results, where each second prediction result includes a second prediction class probability value, a second prediction instance bounding box, and a second prediction instance embedding vector, the N second prediction results are derived from a second sample video frame in the video to be trained, and each second prediction result corresponds to an instance identifier;
the determining module 220 is further configured to determine a prediction instance tracking result of the first sample video frame according to the at least one prediction instance similarity;
the training module 230 is further configured to update, through a loss function, the parameters of the backbone network to be trained, the N instance bounding boxes to be trained, the N instance query vectors to be trained, and the instance segmentation network to be trained according to the prediction instance tracking result, the annotated instance identifier, the N first prediction results, the annotated class, the annotated instance bounding box, and the annotated instance foreground mask.
In this embodiment of the application, an instance tracking apparatus is provided. With this apparatus, predictions are connected one-to-one with ground-truth values, so that the instances of interest are detected and segmented, and fully end-to-end training and inference can be performed.
Fig. 14 is a schematic structural diagram of a computer device 30 according to an embodiment of the present application. The computer device 30 may include an input device 310, an output device 320, a processor 330, and a memory 340. The output device in this embodiment of the application may be a display device.
The memory 340 may include a read-only memory and a random access memory, and provides instructions and data to the processor 330. A portion of the memory 340 may also include a non-volatile random access memory (NVRAM).
The memory 340 stores the following elements, executable modules, or data structures, or a subset or an extended set thereof:
operation instructions: including various operation instructions for performing various operations; and
an operating system: including various system programs for implementing various basic services and handling hardware-based tasks.
The processor 330 controls the operation of the computer device 30 and may also be referred to as a central processing unit (CPU). In a specific application, the components of the computer device 30 are coupled together by a bus system 350, where the bus system 350 may include a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled in the figure as the bus system 350.
The method disclosed in the embodiments of the present application can be applied to, or implemented by, the processor 330. The processor 330 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 330 or by instructions in the form of software. The processor 330 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or EPROM, or a register. The storage medium is located in the memory 340, and the processor 330 reads the information in the memory 340 and performs the steps of the above method in combination with its hardware.
The related description of fig. 14 can be understood with reference to the related description and effects of the method portion of fig. 3, and will not be described in detail herein.
Embodiments of the present application also provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method for video-based instance tracking, comprising:
acquiring a target feature map through a backbone network based on a target video frame in a video to be detected, wherein the target video frame is the Tth video frame in the video to be detected, and T is an integer greater than 1;
acquiring N bounding box regions of interest (ROIs) from the target feature map according to N instance bounding boxes, wherein each instance bounding box is used for extracting one corresponding bounding box ROI, and N is an integer greater than or equal to 1;
acquiring N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs, wherein each first detection result comprises a first class probability value, a first instance bounding box, and a first instance embedding vector;
determining at least one instance similarity according to the N first detection results and M second detection results, wherein each second detection result comprises a second class probability value, a second instance bounding box, and a second instance embedding vector, the M second detection results are obtained from the first (T-1) video frames of the video to be detected, each second detection result corresponds to an instance identifier, and M is an integer greater than or equal to 1;
and determining an instance tracking result of the target video frame according to the at least one instance similarity, wherein the instance tracking result comprises at least one instance identifier, and identical instance identifiers denote the same instance in the video to be detected.
2. The instance tracking method according to claim 1, wherein the acquiring N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs comprises:
dot-multiplying each of the N bounding box ROIs with the corresponding instance query vector along the feature dimension to obtain N enhanced bounding box ROIs;
acquiring, based on the N enhanced bounding box ROIs, N first class probability values through a class discrimination network included in the instance segmentation network, wherein the N first class probability values are included in the N first detection results;
acquiring, based on the N enhanced bounding box ROIs, N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, wherein the N first instance bounding boxes are included in the N first detection results;
and acquiring, based on the N enhanced bounding box ROIs, N first instance embedding vectors through an embedding vector network included in the instance segmentation network, wherein the N first instance embedding vectors are included in the N first detection results.
3. The instance tracking method according to claim 1, wherein the acquiring N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs comprises:
acquiring at least one group of bounding box dynamic parameters through a fully connected layer based on the N instance query vectors;
dot-multiplying each of the N bounding box ROIs with the at least one group of bounding box dynamic parameters along the feature dimension to obtain N enhanced bounding box ROIs;
acquiring, based on the N enhanced bounding box ROIs, N first class probability values through a class discrimination network included in the instance segmentation network, wherein the N first class probability values are included in the N first detection results;
acquiring, based on the N enhanced bounding box ROIs, N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, wherein the N first instance bounding boxes are included in the N first detection results;
and acquiring, based on the N enhanced bounding box ROIs, N first instance embedding vectors through an embedding vector network included in the instance segmentation network, wherein the N first instance embedding vectors are included in the N first detection results.
4. The instance tracking method according to claim 1, further comprising:
acquiring N mask ROIs from the target video frame according to the N instance bounding boxes, wherein each instance bounding box is further used for extracting one corresponding mask ROI;
wherein the acquiring N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs comprises:
acquiring N first detection results through the instance segmentation network based on the N instance query vectors, the N bounding box ROIs, and the N mask ROIs, wherein each first detection result further comprises a first instance foreground mask.
5. The instance tracking method according to claim 4, wherein the acquiring N first detection results through the instance segmentation network based on the N instance query vectors, the N bounding box ROIs, and the N mask ROIs comprises:
dot-multiplying each of the N bounding box ROIs with the corresponding instance query vector along the feature dimension to obtain N enhanced bounding box ROIs;
dot-multiplying each of the N mask ROIs with the corresponding instance query vector along the feature dimension to obtain N enhanced mask ROIs;
acquiring, based on the N enhanced bounding box ROIs, N first class probability values through a class discrimination network included in the instance segmentation network, wherein the N first class probability values are included in the N first detection results;
acquiring, based on the N enhanced bounding box ROIs, N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, wherein the N first instance bounding boxes are included in the N first detection results;
acquiring, based on the N enhanced bounding box ROIs, N first instance embedding vectors through an embedding vector network included in the instance segmentation network, wherein the N first instance embedding vectors are included in the N first detection results;
and acquiring, based on the N enhanced mask ROIs, N first instance foreground masks through a mask generation network included in the instance segmentation network, wherein the N first instance foreground masks are included in the N first detection results.
6. The instance tracking method according to claim 4, wherein the acquiring N first detection results through the instance segmentation network based on the N instance query vectors, the N bounding box ROIs, and the N mask ROIs comprises:
acquiring at least one group of bounding box dynamic parameters and at least one group of mask dynamic parameters through a fully connected layer based on the N instance query vectors, wherein each group of mask dynamic parameters comprises N mask dynamic sub-parameters;
dot-multiplying each of the N bounding box ROIs with the at least one group of bounding box dynamic parameters along the feature dimension to obtain N enhanced bounding box ROIs;
dot-multiplying each of the N mask ROIs with the at least one group of mask dynamic parameters along the feature dimension to obtain N enhanced mask ROIs;
acquiring, based on the N enhanced bounding box ROIs, N first class probability values through a class discrimination network included in the instance segmentation network, wherein the N first class probability values are included in the N first detection results;
acquiring, based on the N enhanced bounding box ROIs, N first instance bounding boxes through a bounding box regression network included in the instance segmentation network, wherein the N first instance bounding boxes are included in the N first detection results;
acquiring, based on the N enhanced bounding box ROIs, N first instance embedding vectors through an embedding vector network included in the instance segmentation network, wherein the N first instance embedding vectors are included in the N first detection results;
and acquiring, based on the N enhanced mask ROIs, N first instance foreground masks through a mask generation network included in the instance segmentation network, wherein the N first instance foreground masks are included in the N first detection results.
7. The instance tracking method according to any one of claims 1 to 6, wherein the determining at least one instance similarity according to the N first detection results and M second detection results comprises:
sorting the N first detection results in descending order of the first class probability value to obtain N sorted first detection results;
selecting the first K first detection results from the N sorted first detection results, wherein K is an integer greater than or equal to 1 and less than or equal to N;
and determining, according to the K first detection results and the M second detection results, the instance similarity between each first detection result and each second detection result, to obtain (K × M) instance similarities.
8. The instance tracking method according to claim 7, wherein the determining the instance similarity between each first detection result and each second detection result according to the K first detection results and the M second detection results comprises:
determining the instance embedding vector similarity between each first detection result and each second detection result according to the first instance embedding vector included in each first detection result and the second instance embedding vector included in each second detection result;
determining the spatial similarity between each first detection result and each second detection result according to the first instance bounding box included in each first detection result and the second instance bounding box included in each second detection result;
determining the class similarity between each first detection result and each second detection result according to the first class probability value included in each first detection result and the second class probability value included in each second detection result;
and determining the instance similarity between each first detection result and each second detection result according to the instance embedding vector similarity, the spatial similarity, the class similarity, and the first class probability value included in each first detection result.
9. The instance tracking method according to claim 7, wherein K is less than or equal to M;
the determining an instance tracking result of the target video frame according to the at least one instance similarity comprises:
constructing, based on a bipartite graph matching algorithm, mapping relationships between the K first detection results and the M second detection results according to the (K × M) instance similarities, to obtain K mapping relationships;
if the instance similarities corresponding to P of the K mapping relationships are less than or equal to an instance similarity threshold, deleting the P mapping relationships from the K mapping relationships to obtain (K-P) mapping relationships, wherein P is an integer greater than or equal to 1 and less than or equal to K;
and determining the instance tracking result of the target video frame according to the second detection result corresponding to each of the (K-P) mapping relationships, wherein each second detection result corresponds to an instance identifier;
and the method further comprises:
taking the first detection results corresponding to the P mapping relationships as second detection results, to obtain (M + P) second detection results.
10. The instance tracking method according to claim 7, wherein K is greater than or equal to M;
the determining an instance tracking result of the target video frame according to the at least one instance similarity comprises:
constructing, based on a bipartite graph matching algorithm, mapping relationships between the K first detection results and the M second detection results according to the (K × M) instance similarities, to obtain M mapping relationships;
if the instance similarities corresponding to Q of the M mapping relationships are less than or equal to an instance similarity threshold, deleting the Q mapping relationships from the M mapping relationships to obtain (M-Q) mapping relationships, wherein Q is an integer greater than or equal to 1 and less than or equal to M;
and determining the instance tracking result of the target video frame according to the second detection result corresponding to each of the (M-Q) mapping relationships, wherein each second detection result corresponds to an instance identifier;
and the method further comprises:
taking the first detection results corresponding to the Q mapping relationships and the (K-M) unmatched first detection results as second detection results, to obtain (Q + K) second detection results.
11. The instance tracking method according to claim 1, further comprising:
obtaining a sample feature map through a backbone network to be trained based on a first sample video frame in a video to be trained, wherein the first sample video frame has an annotated class, an annotated instance bounding box, and an annotated instance identifier;
acquiring N prediction bounding box ROIs from the sample feature map according to N instance bounding boxes to be trained, wherein each instance bounding box to be trained is used for extracting one corresponding prediction bounding box ROI;
acquiring N first prediction results through an instance segmentation network to be trained based on N instance query vectors to be trained and the N prediction bounding box ROIs, wherein each first prediction result comprises a first prediction class probability value, a first prediction instance bounding box, and a first prediction instance embedding vector;
determining at least one prediction instance similarity according to the N first prediction results and N second prediction results, wherein each second prediction result comprises a second prediction class probability value, a second prediction instance bounding box, and a second prediction instance embedding vector, the N second prediction results are derived from a second sample video frame in the video to be trained, and each second prediction result corresponds to an instance identifier;
determining a prediction instance tracking result of the first sample video frame according to the at least one prediction instance similarity;
and updating, through a loss function, parameters of the backbone network to be trained, the N instance bounding boxes to be trained, the N instance query vectors to be trained, and the instance segmentation network to be trained according to the prediction instance tracking result, the annotated instance identifier, the N first prediction results, the annotated class, and the annotated instance bounding box.
12. The instance tracking method according to claim 4, further comprising:
obtaining a sample feature map through a backbone network to be trained based on a first sample video frame in a video to be trained, wherein the first sample video frame has an annotated class, an annotated instance bounding box, an annotated instance identifier, and an annotated instance foreground mask;
acquiring N prediction bounding box ROIs from the sample feature map according to N instance bounding boxes to be trained, wherein each instance bounding box to be trained is used for extracting one corresponding prediction bounding box ROI;
acquiring N prediction mask ROIs from the sample feature map according to the N instance bounding boxes to be trained, wherein each instance bounding box to be trained is further used for extracting one corresponding prediction mask ROI;
acquiring N first prediction results through an instance segmentation network to be trained based on N instance query vectors to be trained, the N prediction bounding box ROIs, and the N prediction mask ROIs, wherein each first prediction result comprises a first prediction class probability value, a first prediction instance bounding box, a first prediction instance embedding vector, and a prediction instance foreground mask;
determining at least one prediction instance similarity according to the N first prediction results and N second prediction results, wherein each second prediction result comprises a second prediction class probability value, a second prediction instance bounding box, and a second prediction instance embedding vector, the N second prediction results are derived from a second sample video frame in the video to be trained, and each second prediction result corresponds to an instance identifier;
determining a prediction instance tracking result of the first sample video frame according to the at least one prediction instance similarity;
and updating, through a loss function, parameters of the backbone network to be trained, the N instance bounding boxes to be trained, the N instance query vectors to be trained, and the instance segmentation network to be trained according to the prediction instance tracking result, the annotated instance identifier, the N first prediction results, the annotated class, the annotated instance bounding box, and the annotated instance foreground mask.
13. An instance tracking apparatus, comprising:
an obtaining module, configured to obtain a target feature map through a backbone network based on a target video frame in a video to be detected, wherein the target video frame is the Tth video frame in the video to be detected, and T is an integer greater than 1;
the obtaining module is further configured to obtain N bounding box regions of interest (ROIs) from the target feature map according to N instance bounding boxes, wherein each instance bounding box is used for extracting one corresponding bounding box ROI, and N is an integer greater than or equal to 1;
the obtaining module is further configured to obtain N first detection results through an instance segmentation network based on N instance query vectors and the N bounding box ROIs, wherein each first detection result comprises a first class probability value, a first instance bounding box, and a first instance embedding vector;
a determining module, configured to determine at least one instance similarity according to the N first detection results and M second detection results, wherein each second detection result comprises a second class probability value, a second instance bounding box, and a second instance embedding vector, the M second detection results are obtained from the first (T-1) video frames of the video to be detected, each second detection result corresponds to an instance identifier, and M is an integer greater than or equal to 1;
and the determining module is further configured to determine an instance tracking result of the target video frame according to the at least one instance similarity, wherein the instance tracking result comprises at least one instance identifier, and identical instance identifiers denote the same instance in the video to be detected.
14. A computer device, comprising: a memory, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory and to perform the instance tracking method according to any one of claims 1 to 12 according to instructions in the program code;
and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.
15. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the instance tracking method according to any one of claims 1 to 12.
CN202110813442.8A 2021-07-19 2021-07-19 Instance tracking method, device, equipment and storage medium based on video Pending CN113822134A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110813442.8A CN113822134A (en) 2021-07-19 2021-07-19 Instance tracking method, device, equipment and storage medium based on video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110813442.8A CN113822134A (en) 2021-07-19 2021-07-19 Instance tracking method, device, equipment and storage medium based on video

Publications (1)

Publication Number Publication Date
CN113822134A true CN113822134A (en) 2021-12-21

Family

ID=78912686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110813442.8A Pending CN113822134A (en) 2021-07-19 2021-07-19 Instance tracking method, device, equipment and storage medium based on video

Country Status (1)

Country Link
CN (1) CN113822134A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114448867A (en) * 2022-02-23 2022-05-06 百果园技术(新加坡)有限公司 Route visualization method, device, equipment and storage medium
CN114448867B (en) * 2022-02-23 2024-04-09 百果园技术(新加坡)有限公司 Route visualization method, device, equipment and storage medium
CN114782901A (en) * 2022-06-21 2022-07-22 深圳市禾讯数字创意有限公司 Sand table projection method, device, equipment and medium based on visual change analysis
CN114782901B (en) * 2022-06-21 2022-09-09 深圳市禾讯数字创意有限公司 Sand table projection method, device, equipment and medium based on visual change analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination