CN114240992A - Method and system for labeling target object in frame sequence - Google Patents

Method and system for labeling target object in frame sequence

Info

Publication number
CN114240992A
Authority
CN
China
Prior art keywords
position information
target
target object
current frame
frame image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111565626.3A
Other languages
Chinese (zh)
Inventor
昝智
Current Assignee
Beijing Anjie Zhihe Technology Co ltd
Original Assignee
Beijing Anjie Zhihe Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Anjie Zhihe Technology Co ltd filed Critical Beijing Anjie Zhihe Technology Co ltd
Priority to CN202111565626.3A priority Critical patent/CN114240992A/en
Publication of CN114240992A publication Critical patent/CN114240992A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for labeling a target object in a frame sequence, which comprises the following steps: pre-training a target tracking model and a target detection model; inputting the frame sequence into the target tracking model to obtain first associated position information of the target object on each frame image; and performing the following steps for each frame image: detecting, with the target detection model, at least one piece of second associated position information of at least one object of the same category as the target object in the current frame image; and determining target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one piece of second associated position information. Because the target detection model assists in verifying the associated position information that the target tracking model determines for the object on each frame image, the accuracy of target object detection is improved.

Description

Method and system for labeling target object in frame sequence
Technical Field
The invention relates to the technical field of image processing, in particular to a method and a system for labeling a target object in a frame sequence.
Background
When continuous-frame pictures are labeled, group-by-group labeling is adopted: the same annotator labels a group of continuous-frame pictures in which the time interval between adjacent frames is relatively large (for example, more than 0.5 s, i.e., at most two pictures sampled per second). However, because there are many pictures and many objects to label, this process often suffers from slow labeling speed and low accuracy.
Label consistency for the same object across continuous-frame pictures can be achieved in the industry's continuous-frame labeling process. The continuous-frame pictures are set up as one labeling task and labeled by the same annotator. After an object is labeled on the first picture, its labeling box, including the label and the position and size, is automatically copied onto the second picture; the annotator then adjusts the box to the object's actual position in the second picture, the third picture automatically copies the labeling box of the second picture, and so on until the object is labeled on all pictures. The whole process still requires the annotator to go through the pictures one by one, so labeling efficiency is low, and manually adjusting many pictures easily causes fatigue and therefore labeling errors.
In the related art, a video object tracking method is used to automatically track an object present in each frame of a video. An example proceeds as follows. Step 1: with a basic convolutional neural network, select the high-probability target corresponding to the current frame and the previous frame, together with the targets (including distractors) around it. Step 2: extract features of these targets with a convolutional neural network. Step 3: compute a similarity matrix from the features of the current frame and the previous frame. Step 4: use the similarity matrix to estimate the true target position in the current frame.
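The matching step of this related-art pipeline can be sketched in a minimal form as follows. This is only an illustration: the related art uses learned CNN features and a learned similarity, whereas here plain cosine similarity over plain Python lists stands in for them, and all function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two (assumed nonzero) feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(prev_feats, cur_feats):
    # Row i, column j: similarity between candidate i of the previous frame
    # and candidate j of the current frame (step 3 of the example).
    return [[cosine_similarity(p, c) for c in cur_feats] for p in prev_feats]

def match_target(prev_feats, cur_feats, target_idx):
    # Estimate which current-frame candidate corresponds to the tracked
    # target by taking the most similar column in the target's row (step 4).
    row = similarity_matrix(prev_feats, cur_feats)[target_idx]
    return max(range(len(row)), key=row.__getitem__)
```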
However, in the process of implementing the present invention, the inventor found that although the video object tracking methods in the related art perform well on ordinary video tracking, on sequences extracted at large intervals (low frequency, such as one or two frames per second) they can compute the approximate position of the target but produce bounding boxes of poor accuracy, and therefore cannot be used on a labeling platform. The reason is that the features used in the related art depend heavily on the sample distribution of the training data set: the training samples are all adjacent frames, with no frame extraction.
For scenarios that require large-interval frame extraction, the features learned by such a model, especially motion-amplitude features, are not applicable, so the accuracy of the bounding box is greatly reduced.
Disclosure of Invention
An embodiment of the present invention provides a method and a system for labeling a target object in a frame sequence, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for labeling a target object in a frame sequence, including:
pre-training a target tracking model and a target detection model;
inputting the frame sequence into a target tracking model to acquire first associated position information of the target object on each frame of image;
for each frame image, the following steps are performed:
detecting at least one second associated position information of at least one object of the same category as the target object in the current frame image by adopting the target detection model;
and determining target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one piece of second associated position information.
In some embodiments, the first associated position information of the target object on each frame of image comprises: first position information of a labeling frame of the target object on each frame of image;
the at least one second associated position information of the at least one object in the current frame image comprises: second position information of at least one labeling frame of the at least one object in the current frame image;
the target associated position information of the target object on the current frame image comprises: and target position information of a labeling frame of the target object on the current frame image.
In some embodiments, determining the target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one second associated position information comprises:
and determining the target position information of the labeling frame of the target object on the current frame image according to the first position information corresponding to the current frame image and the at least one piece of second position information.
In some embodiments, the determining the target position information of the target object in the annotation box on the current frame image according to the first position information corresponding to the current frame image and the at least one second position information includes:
determining at least one intersection ratio value according to the first position information corresponding to the current frame image and the at least one second position information;
and taking the second position information corresponding to the maximum value in the at least one intersection ratio value as the target position information of the target object in the labeling frame on the current frame image.
In some embodiments, the target tracking model and/or the target detection model employ a lightweight network architecture.
In some embodiments, the lightweight network structure comprises one of SqueezeNet, MobileNet, ShuffleNet, and Xception.
In some embodiments, the pre-training of the target tracking model and the target detection model comprises: performing quantization processing on the target tracking model and the target detection model.
In a second aspect, an embodiment of the present invention further provides a system for labeling a target object in a frame sequence, including:
the pre-training program module is used for pre-training the target tracking model and the target detection model;
the target tracking program module is used for inputting the frame sequence into a target tracking model so as to acquire first associated position information of the target object on each frame of image;
a position information determination program module for performing the following steps for each frame of image:
detecting at least one second associated position information of at least one object of the same category as the target object in the current frame image by adopting the target detection model;
and determining target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one piece of second associated position information.
In a third aspect, an embodiment of the present invention provides a storage medium, where one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above methods for labeling a target object in a frame sequence according to the present invention.
In a fourth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform any one of the above methods for labeling a target object in a frame sequence.
In a fifth aspect, the embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute any one of the above methods for labeling a target object in a frame sequence.
In this embodiment, a pre-trained target tracking model and a pre-trained target detection model are used together when labeling a target object in a frame sequence. The target tracking model is configured to track a target object (e.g., a selected vehicle) in the frame sequence to determine the corresponding first associated position information; the target detection model is used to detect objects of the same category as the target object in each frame image and to acquire second associated position information of those objects, of which there is at least one. The target associated position information of the target object in the current frame is then determined from two factors: the first associated position information and the second associated position information of the current frame. Because the target detection model assists in verifying the associated position information that the target tracking model determines for the object on each frame image, the accuracy of target object detection is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating a method for labeling a target object in a frame sequence according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for labeling a target object in a frame sequence according to another embodiment of the present invention;
FIG. 3 is a precision plot of the prior-art KeepTrack algorithm;
FIG. 4 is a precision plot of the method for labeling a target object in a frame sequence according to the present invention;
FIG. 5 is a success plot of the prior-art KeepTrack algorithm;
FIG. 6 is a success plot of the method for labeling a target object in a frame sequence according to the present invention;
FIG. 7 is a schematic block diagram of an embodiment of a system for labeling a target object in a frame sequence according to the present invention;
fig. 8 is a schematic structural diagram of an embodiment of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should also be noted that, in this document, the terms "comprises" and "comprising", or any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in the process, method, article, or device that comprises the element.
In order to solve the technical problems in the related art, the following approach is generally adopted: given the requirement of class-agnostic, large-interval target tracking, a corresponding training data set can be re-created to retrain the tracking model.
In the process of implementing the invention, this goal was initially pursued by building a large-interval frame-extraction training data set, but that scheme has a high cost and a long construction period.
In order to solve the technical problems in the related art, the inventor develops and designs a labeling method for a target object in a sequence frame according to an embodiment of the invention.
Referring to fig. 1, which is a flowchart illustrating a method for labeling a target object in a frame sequence according to an embodiment of the present invention, the frame sequence may be extracted from the multiple frame images of a surveillance video, for example at an interval of 0.5 s per frame. The surveillance video may be a traffic surveillance video, in which case the method for labeling the target object in the frame sequence according to this embodiment may be implemented as a method for labeling a vehicle in the frame sequence. The labeling method provided by the embodiment of the invention can be applied to labeling platforms. The method in this embodiment comprises:
and S10, pre-training a target tracking model and a target detection model.
And S20, inputting the frame sequence into a target tracking model to acquire first associated position information of the target object on each frame of image.
S30, executing the following steps for each frame image:
S31, detecting, by using the target detection model, at least one piece of second associated position information of at least one object of the same category as the target object in the current frame image.
S32, determining the target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one second associated position information.
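As a preliminary to these steps, the large-interval frame extraction that produces the input frame sequence (e.g., one frame every 0.5 s, as mentioned above) can be sketched as follows. This is a minimal illustration; the fps and interval values are example parameters, not values fixed by the patent.

```python
def sample_indices(total_frames, fps, interval_s=0.5):
    """Indices of the frames kept when sampling a video every `interval_s` seconds."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))

# A 30 fps video sampled every 0.5 s keeps every 15th frame.
```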
In this embodiment, a pre-trained target tracking model and a pre-trained target detection model are used together when labeling a target object in a frame sequence. The target tracking model is configured to track a target object (e.g., a selected vehicle) in the frame sequence to determine the corresponding first associated position information; the target detection model is used to detect objects of the same category as the target object in each frame image and to acquire second associated position information of those objects, of which there is at least one. The target associated position information of the target object in the current frame is then determined from two factors: the first associated position information and the second associated position information of the current frame. Because the target detection model assists in verifying the associated position information that the target tracking model determines for the object on each frame image, the accuracy of target object detection is improved.
Because two network models are applied to the large-interval frame-extraction tracking task, the inference latency of the whole tracking process must be substantially optimized. The embodiment of the invention therefore performs the following optimizations:
In some embodiments, the method for labeling a target object in a frame sequence of the present invention further includes: optimizing the inference latency of the target detection model and the target tracking model.
Illustratively, for step S10: a target tracking model and a target detection model are trained in advance, and the target tracking model and/or the target detection model adopt a lightweight network structure. Illustratively, the lightweight network structure is one of SqueezeNet, MobileNet, ShuffleNet, and Xception.
In this embodiment, the backbone networks of the target detection model and the target tracking model are replaced with lighter network structures that have fewer parameters (e.g., SqueezeNet, MobileNet, ShuffleNet, or Xception); the embodiment of the invention adopts MobileNetV3. The replaced target tracking model is retrained on a public data set until optimal, and the target detection model is retrained on a self-made multi-class data set until optimal.
In some embodiments, in the method for labeling a target object in a frame sequence of the present invention, the pre-training of the target tracking model and the target detection model includes: performing quantization processing on the target tracking model and the target detection model.
In this embodiment, an Nvidia TensorRT based technique is used to accelerate inference. Illustratively, the target detection model and the target tracking model of the invention are deployed on a GPU, which may be an Nvidia P4. Because this GPU supports int8-precision inference, the invention performs int8 quantization on the target tracking model and the target detection model based on the TensorRT 8 inference library.
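One way such an int8 conversion can look in practice is with NVIDIA's `trtexec` tool. This is a hypothetical invocation, not the patent's workflow: the patent does not describe its conversion commands, the model file names are assumptions, and accuracy-preserving int8 deployment normally also requires a calibration dataset or cache.

```shell
# Build an int8 TensorRT engine from an exported ONNX model (hypothetical file names).
# Without a calibration cache, trtexec assigns placeholder dynamic ranges, which is
# suitable for latency measurement but not for accuracy-preserving deployment.
trtexec --onnx=tracker.onnx \
        --int8 \
        --saveEngine=tracker_int8.plan
```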
The methods of the above embodiments greatly reduce inference latency.
In some embodiments, for step S20: inputting the frame sequence into a target tracking model to acquire first associated position information of the target object on each frame of image.
The target objects may be different vehicles in the image, all vehicles belonging to one category. Each frame image in the frame sequence is processed by the target tracking model to obtain the first associated position information of the target object on that frame. For example, if the frame sequence includes n frame images, the target tracking model processes the n frame images to obtain the corresponding n pieces of first associated position information.
Illustratively, the first associated position information of the target object on each frame image comprises the first position information of the labeling frame of the target object on that image. The labeling frame on each frame image may be a box framing the target object, and the first position information is the coordinate information of the center point of the corresponding box.
In some embodiments, for S31: and detecting at least one second associated position information of at least one object of the same category as the target object in the current frame image by adopting the target detection model. Wherein the at least one second associated position information of the at least one object in the current frame image comprises: and second position information of at least one labeling frame of the at least one object in the current frame image.
In some embodiments, the target associated position information of the target object on the current frame image comprises: and target position information of a labeling frame of the target object on the current frame image.
In some embodiments, determining the target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one piece of second associated position information comprises: determining the target position information of the labeling frame of the target object on the current frame image according to the first position information corresponding to the current frame image and the at least one piece of second position information.
Fig. 2 is a flowchart illustrating a method for labeling a target object in a frame sequence according to an embodiment of the present invention. In this embodiment, the determining the target position information of the target object in the annotation box on the current frame image according to the first position information corresponding to the current frame image and the at least one second position information includes:
s321, determining at least one intersection ratio according to the first position information corresponding to the current frame image and the at least one second position information;
and S322, taking the second position information corresponding to the maximum value in the at least one intersection ratio value as the target position information of the target object in the labeling frame on the current frame image.
In this embodiment, a target detection model is introduced to assist in detecting the bounding box after the target tracking model has located the target. The target tracking model locates the target position in the sequence frames, but the accuracy of its bounding boxes (t_bboxes) is low. Since the object category in the bounding box is known, objects of the same category in the whole image are detected in advance with the target detection model, yielding several bounding boxes (d_bboxes, screened with a confidence threshold of 0.5). For each frame, IoU (Intersection over Union) values are computed between t_bbox and each box in d_bboxes, and the box in d_bboxes with the largest IoU is selected as the desired bounding box.
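The per-frame selection just described can be sketched as follows. The corner-coordinate box format and the fallback to the tracker box when no detection survives the threshold are assumptions for the sketch; the 0.5 confidence threshold comes from the description above.

```python
def iou(a, b):
    # Intersection over Union for boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def refine_bbox(t_bbox, detections, conf_thresh=0.5):
    # detections: list of (bbox, confidence) pairs from the detection model.
    d_bboxes = [box for box, conf in detections if conf >= conf_thresh]
    if not d_bboxes:
        return t_bbox  # assumed fallback: keep the tracker's box
    # Keep the detection box that overlaps most with the tracker's box.
    return max(d_bboxes, key=lambda box: iou(t_bbox, box))
```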
In some embodiments, the method for labeling a target object in a frame sequence of the present invention comprises the following steps:
step 1: a lightweight target tracking model is designed, the parameter quantity of the model is small, and the training and reasoning speed is high.
Step 2: the target tracking model is suitable for general targets (without limitation to categories), so good results can be achieved by using public data set training. After training is completed, the model needs to be further quantified using a TensorRT8 inference library.
And step 3: and a lightweight target detection model is designed, the parameter quantity of the model is small, and the training and reasoning speed is higher.
And 4, step 4: the target detection model is suitable for targets of limited classes, and the method is trained by using a pre-constructed multi-class target data set. After training is completed, the model needs to be further quantified using a TensorRT8 inference library.
And 5: and inputting the extracted video frame sequence, and acquiring a surrounding frame coordinate t _ bbox of the target object on each frame by using a target tracking model.
Step 6: and acquiring a bounding box d _ bboxes of the target class of the target object on each frame by using the target detection model.
And 7: and (5) solving the iou by using t _ bbox and d _ bboxes, and obtaining the bounding box corresponding to the maximum iou value.
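Steps 5 through 7 can be combined into a sketch like the following. The tracker and detector are stand-in callables (the real models are neural networks), `iou` is the standard intersection-over-union, and all function names are hypothetical.

```python
def iou(a, b):
    # Intersection over Union for boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def label_sequence(frames, track, detect):
    """track(frame) -> t_bbox; detect(frame) -> list of d_bboxes of the target class."""
    results = []
    for frame in frames:
        t_bbox = track(frame)        # step 5: coarse localization by the tracker
        d_bboxes = detect(frame)     # step 6: same-class detections
        # step 7: detection box with max IoU; fall back to t_bbox if none (assumption)
        best = max(d_bboxes, key=lambda b: iou(t_bbox, b), default=t_bbox)
        results.append(best)
    return results
```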
The invention downsamples the UAV123 public data set by keeping one frame out of every five, and then evaluates the model on the bike category of the downsampled data set.
1. A target detection model is introduced to assist in detecting the bounding box after the target tracking model has located the target.
Performance is as follows. Two evaluation metrics commonly used in the target tracking field are reported:
1) precision plot.
Fig. 3 shows a precision plot of the prior-art KeepTrack algorithm, and fig. 4 shows a precision plot of the method for labeling a target object in a frame sequence according to the present invention. The precision plot reports the percentage of video frames in which the distance between the center point of the target position (bounding box) estimated by the algorithm and the center point of the manually labeled ground-truth target is smaller than a given threshold. The abscissa is the threshold and the ordinate is the percentage; varying the threshold yields a curve, and averaging the precision over all thresholds gives the average precision.
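The precision metric just defined can be sketched in a few lines. This is a generic illustration of the standard metric, not the benchmark's reference implementation; the threshold values used are examples.

```python
import math

def precision_curve(pred_centers, gt_centers, thresholds):
    # For each threshold, the fraction of frames whose predicted box center
    # lies within that distance of the ground-truth box center.
    dists = [math.dist(p, g) for p, g in zip(pred_centers, gt_centers)]
    return [sum(d <= t for d in dists) / len(dists) for t in thresholds]

def average_precision(pred_centers, gt_centers, thresholds):
    # Average of the curve over all thresholds.
    curve = precision_curve(pred_centers, gt_centers, thresholds)
    return sum(curve) / len(curve)
```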
As can be seen from fig. 3 and 4, the prior-art KeepTrack performs very poorly in the large-interval frame-extraction tracking scenario, with an average tracking precision of only 6.7, whereas the algorithm of the present invention reaches 75.6 in this scenario and is thus more accurate.
2) Success Plot.
Fig. 5 shows a success plot of the prior-art KeepTrack algorithm, and fig. 6 shows a success plot of the method for labeling a target object in a frame sequence according to the present invention. The abscissa is the overlap threshold and the ordinate is the success rate. The IoU is computed between the target box (bounding box) produced by the algorithm and the ground truth. A frame whose IoU is greater than the set threshold is counted as a success, and the success rate is the percentage of successful frames among all frames.
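The success metric just defined can likewise be sketched directly from its definition. Again a generic illustration of the standard metric, with example thresholds, not the benchmark's own code.

```python
def iou(a, b):
    # Intersection over Union for boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def success_curve(pred_boxes, gt_boxes, thresholds):
    # For each overlap threshold, the fraction of frames whose IoU with the
    # ground truth exceeds that threshold.
    ious = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return [sum(v > t for v in ious) / len(ious) for t in thresholds]
```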
As can be seen from fig. 5 and 6, the mainstream tracking algorithm KeepTrack performs very poorly in the large-interval frame-extraction tracking scenario, with an average tracking success rate of only 2.7, whereas the algorithm of the embodiment of the present invention reaches 47.5 in this scenario and is thus more accurate.
2. Optimizing inference latency of the target detection model and the target tracking model.
Performance is as follows:
(Table present in the original as image "Figure BDA0003421902050000101": per-frame latency comparison.)
KeepTrack, a mainstream tracking algorithm, takes 114 ms per frame when averaged over many pictures, whereas the tracking and detection modules of the present invention take 93 ms in total after int8 quantization optimization, showing faster inference than KeepTrack.
Fig. 7 is a schematic block diagram of an embodiment of a system for labeling a target object in a frame sequence according to the present invention, which includes:
a pre-training program module 710 for pre-training the target tracking model and the target detection model;
a target tracking program module 720, configured to input the frame sequence into a target tracking model to obtain first associated position information of the target object on each frame of image;
a position information determination program module 730, configured to perform the following steps for each frame of image:
detecting, using the target detection model, at least one piece of second associated position information of at least one object of the same category as the target object in the current frame image; and
determining target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one piece of second associated position information.
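The module structure above can be summarized in a hypothetical pipeline sketch. Here `tracker`, `detector`, and `fuse` are placeholder callables standing in for the pre-trained tracking model, the detection model, and the position-fusion step; they are illustrative names, not APIs from the patent.

```python
def label_sequence(frames, tracker, detector, fuse):
    """Run the tracker over the whole sequence, then refine each frame's
    tracked box against same-category detections on that frame."""
    first_positions = tracker(frames)        # first associated position info per frame
    target_positions = []
    for frame, tracked_box in zip(frames, first_positions):
        detections = detector(frame)         # second associated position info
        target_positions.append(fuse(tracked_box, detections))
    return target_positions
```

The design point is that tracking runs once over the sequence while detection and fusion run per frame, matching the division between program modules 720 and 730.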
In some embodiments, the first associated position information of the target object on each frame of image comprises: first position information of a labeling frame of the target object on each frame of image;
the at least one second associated position information of the at least one object in the current frame image comprises: second position information of at least one labeling frame of the at least one object in the current frame image;
the target associated position information of the target object on the current frame image comprises: and target position information of a labeling frame of the target object on the current frame image.
In some embodiments, determining the target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one second associated position information comprises:
determining the target position information of the labeling frame of the target object on the current frame image according to the first position information corresponding to the current frame image and the at least one piece of second position information.
In some embodiments, determining the target position information of the labeling frame of the target object on the current frame image according to the first position information corresponding to the current frame image and the at least one piece of second position information includes:
determining at least one intersection-over-union (IoU) value according to the first position information corresponding to the current frame image and the at least one piece of second position information; and
taking the second position information corresponding to the maximum of the at least one IoU value as the target position information of the labeling frame of the target object on the current frame image.
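The max-IoU selection rule just described can be sketched as follows. This is an illustrative reading of the embodiment, not the patent's actual implementation; in particular, the fallback to the tracker's box when no detections are found is an assumption.

```python
def _iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def pick_target_box(tracked_box, detected_boxes):
    """Second position info with the largest IoU against the tracker's first
    position info becomes the target position info for the frame."""
    if not detected_boxes:
        return tracked_box   # assumption: keep the tracker output if nothing was detected
    return max(detected_boxes, key=lambda d: _iou(tracked_box, d))
```

The detector thus corrects the tracker's drift: among same-category detections, the one best overlapping the tracked box is taken as the final labeling-frame position.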
In some embodiments, the target tracking model and/or the target detection model employ a lightweight network architecture.
In some embodiments, the lightweight network architecture comprises one of SqueezeNet, MobileNet, ShuffleNet, and Xception.
In some embodiments, the pre-training of the target tracking model and the target detection model comprises: performing quantization on the target tracking model and the target detection model.
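As a rough illustration of the quantization step, the following sketch performs symmetric per-tensor int8 quantization of a weight tensor with NumPy. Real model quantization, such as the int8 optimization mentioned earlier, would be done with a deployment toolchain; the helper names here are invented for illustration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0 or 1e-12  # guard against an all-zero tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float tensor from the int8 values and the scale."""
    return q.astype(np.float32) * scale
```

Storing weights as int8 plus one float scale shrinks them roughly 4x versus float32 and enables faster integer arithmetic, which is the source of the latency reduction reported above.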
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention. The descriptions of the respective embodiments each have their own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above methods for labeling a target object in a frame sequence according to the present invention.
In some embodiments, the present invention further provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, and the computer program includes program instructions, when the program instructions are executed by a computer, the computer executes any one of the above methods for labeling a target object in a frame sequence.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: the device comprises at least one processor and a memory which is connected with the at least one processor in a communication mode, wherein the memory stores instructions which can be executed by the at least one processor, and the instructions are executed by the at least one processor so as to enable the at least one processor to execute a labeling method for a target object in a frame sequence.
Fig. 8 is a schematic hardware structure diagram of an electronic device for performing a method for labeling a target object in a frame sequence according to another embodiment of the present application, as shown in fig. 8, the device includes:
one or more processors 810 and a memory 820, with one processor 810 being an example in FIG. 8.
The apparatus for performing a method of labeling a target object in a frame sequence may further include: an input device 830 and an output device 840.
The processor 810, the memory 820, the input device 830, and the output device 840 may be connected by a bus or other means, such as the bus connection in fig. 8.
The memory 820, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the labeling method for the target object in the frame sequence in the embodiment of the present application. The processor 810 executes various functional applications and data processing of the server by executing nonvolatile software programs, instructions and modules stored in the memory 820, namely, implementing the method for labeling the target object in the frame sequence according to the above method embodiment.
The memory 820 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created from use of the device for labeling a target object in a frame sequence, and the like. Further, the memory 820 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 820 may optionally include memory located remotely from the processor 810, and these remote memories may be connected over a network to the device for labeling a target object in a frame sequence. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 830 can receive input numeric or character information and generate signals related to user settings and function control of the device for labeling a target object in a frame sequence. The output device 840 may include a display device such as a display screen.
The one or more modules are stored in the memory 820 and when executed by the one or more processors 810 perform the method of tagging a target object in a sequence of frames in any of the method embodiments described above.
The above product can execute the method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects. For technical details not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capability and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, electronic book readers, smart toys, and portable car navigation devices.
(4) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the related art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments or parts thereof.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method for labeling a target object in a frame sequence comprises the following steps:
pre-training a target tracking model and a target detection model;
inputting the frame sequence into a target tracking model to acquire first associated position information of the target object on each frame of image;
for each frame image, the following steps are performed:
detecting, using the target detection model, at least one piece of second associated position information of at least one object of the same category as the target object in the current frame image; and
determining target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one piece of second associated position information.
2. The method of claim 1,
the first associated position information of the target object on each frame of image comprises: first position information of a labeling frame of the target object on each frame of image;
the at least one second associated position information of the at least one object in the current frame image comprises: second position information of at least one labeling frame of the at least one object in the current frame image;
the target associated position information of the target object on the current frame image comprises: target position information of a labeling frame of the target object on the current frame image.
3. The method of claim 2, wherein determining the target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one second associated position information comprises:
determining the target position information of the labeling frame of the target object on the current frame image according to the first position information corresponding to the current frame image and the at least one piece of second position information.
4. The method of claim 3, wherein the determining the target position information of the labeling frame of the target object on the current frame image according to the first position information corresponding to the current frame image and the at least one piece of second position information comprises:
determining at least one intersection-over-union (IoU) value according to the first position information corresponding to the current frame image and the at least one piece of second position information; and
taking the second position information corresponding to the maximum of the at least one IoU value as the target position information of the labeling frame of the target object on the current frame image.
5. The method of claim 1, wherein the target tracking model and/or the target detection model employ a lightweight network architecture.
6. The method of claim 5, wherein the lightweight network architecture comprises one of SqueezeNet, MobileNet, ShuffleNet, and Xception.
7. The method of claim 5, wherein the pre-training of the target tracking model and the target detection model comprises: performing quantization on the target tracking model and the target detection model.
8. A system for labeling a target object in a sequence of frames, comprising:
the pre-training program module is used for pre-training the target tracking model and the target detection model;
the target tracking program module is used for inputting the frame sequence into a target tracking model so as to acquire first associated position information of the target object on each frame of image;
a position information determination program module for performing the following steps for each frame of image:
detecting, using the target detection model, at least one piece of second associated position information of at least one object of the same category as the target object in the current frame image; and
determining target associated position information of the target object on the current frame image according to the first associated position information corresponding to the current frame image and the at least one piece of second associated position information.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111565626.3A 2021-12-20 2021-12-20 Method and system for labeling target object in frame sequence Pending CN114240992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111565626.3A CN114240992A (en) 2021-12-20 2021-12-20 Method and system for labeling target object in frame sequence

Publications (1)

Publication Number Publication Date
CN114240992A true CN114240992A (en) 2022-03-25

Family

ID=80759720

Country Status (1)

Country Link
CN (1) CN114240992A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147738A (en) * 2022-06-24 2022-10-04 中国人民公安大学 Positioning method, device, equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN108509921A (en) * 2018-04-04 2018-09-07 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN112561959A (en) * 2020-12-08 2021-03-26 佛山市南海区广工大数控装备协同创新研究院 Online vehicle multi-target tracking method based on neural network
CN113705510A (en) * 2021-09-02 2021-11-26 广州市奥威亚电子科技有限公司 Target identification tracking method, device, equipment and storage medium
WO2021237678A1 (en) * 2020-05-29 2021-12-02 深圳市大疆创新科技有限公司 Target tracking method and device

Non-Patent Citations (1)

Title
LIAO Yikui: "Mobile Software Development for the Internet of Things" (物联网移动软件开发), Beihang University Press

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination