CN114117128A - Method, system and equipment for video annotation - Google Patents

Method, system and equipment for video annotation

Info

Publication number
CN114117128A
CN114117128A (application number CN202010890640.XA)
Authority
CN
China
Prior art keywords
video
user
frame
annotation
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010890640.XA
Other languages
Chinese (zh)
Inventor
谢凯源
姚亚强
白小龙
戴宗宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to CN202010890640.XA
Publication of CN114117128A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/74 Browsing; Visualisation therefor
    • G06F16/743 Browsing; Visualisation therefor a collection of video files or sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a video annotation method comprising: extracting a plurality of video frames from an unannotated video, displaying at least one of the video frames to a user through a display interface, obtaining the user annotation result produced when the user annotates the at least one video frame in the display interface, and obtaining annotation results for the remaining video frames according to the user annotation result.

Description

Method, system and equipment for video annotation
Technical Field
The present application relates to the field of Artificial Intelligence (AI), and more particularly, to a method, system, and apparatus for video annotation.
Background
In the current AI field, training and optimizing an AI model requires a large number of labeled picture and video samples. For example, for model training in image tasks such as classification, detection and segmentation, frames are usually first extracted from a video and each frame is then labeled; the labeled video stream, image set or audio set can then be used to train the AI model.
Video labeling is currently performed manually. A single video can contain tens of thousands of video frames, so manual labeling is extremely time-consuming and labor-intensive, and because human effort is limited, both the precision and the efficiency of video labeling are low.
Disclosure of Invention
The application provides a video annotation method, system and device to address the low efficiency and low precision of manual video annotation.
In a first aspect, a video annotation method is provided, comprising the following steps: extracting a plurality of video frames from an unannotated video; presenting at least one of the video frames (referred to as a key frame) to a user through a display interface so that the user labels the desired targets in the key frame, yielding a user annotation result; and finally, automatically labeling those targets in the remaining non-key frames according to the user annotation result, thereby obtaining an annotation result for the whole video.
Optionally, the annotation result is used for learning by the AI model.
In this method, the non-key frames are labeled automatically from the annotation result of the key frame, yielding an annotation result for the whole video. The key-frame annotation can be drawn by the user, or merely confirmed by the user after the video annotation system recommends it automatically. Throughout the entire labeling process the user therefore only needs to label the key frame once, or even only confirm once whether the key frame is labeled correctly, to obtain a complete annotation result for the video. This greatly reduces the user's labeling operations and improves both video annotation efficiency and the user experience.
In a possible implementation of the first aspect, the image similarity between the plurality of video frames extracted from the unlabeled video is lower than a first threshold, or the amount of object change between the plurality of video frames is higher than a second threshold.
Alternatively, multiple video frames can be extracted from the unlabeled video at a fixed frame rate.
Alternatively, a plurality of video frames may be extracted from the unlabeled video according to the frame rate manually set by the user.
Optionally, the frame-extraction rate may be dynamically adjusted according to the video content of the unannotated video before the video frames are extracted. In a specific implementation, the image similarity between adjacent frames is determined first and then compared against a pre-stored similarity mapping to obtain the corresponding frame rate, for example a frame rate of 1 when the similarity is 0.1 and a frame rate of 2 when the similarity is 0.2. Similarly, the amount of change of objects between frames may be determined and compared against a pre-stored change-amount mapping to obtain the corresponding frame rate. The pre-stored mappings may take the form of a mathematical formula, a mapping table, or the like, which is not limited in this application. It should be understood that when a moving object is present in a video, adjacent frames (or several adjacent frames) differ in gray level, from which the inter-frame change rate can be derived. Alternatively, known video segments and their known inter-frame object changes can be used as training samples to train a deep neural network, and the trained model can infer the inter-frame object change for an input video segment. As a further alternative, historical unannotated videos and their corresponding frame rates can be assembled into a sample set to train an AI model, and the unannotated video currently being processed can be input into the trained model to obtain a frame rate for each time period of the video.
It should be understood that dynamically adjusting the frame-extraction rate according to the video content avoids extracting too many redundant frames, and avoids failing to extract the video frames that contain targets, so that the frames finally labeled can serve as high-quality training samples for AI model training, improving the user experience.
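As a minimal sketch of the similarity-to-frame-rate mapping described above, assuming OpenCV is available and measuring similarity from the gray-level difference between adjacent frames; the mapping values follow the illustrative example in the text and are not prescriptive:

```python
import cv2

# Pre-stored similarity-to-frame-rate mapping; the values follow the illustrative
# example above (similarity 0.1 -> 1 fps, 0.2 -> 2 fps) and are not prescriptive.
SIMILARITY_TO_FPS = {0.1: 1, 0.2: 2, 0.3: 3}

def inter_frame_similarity(frame_a, frame_b):
    """Rough similarity in [0, 1] from the mean absolute gray-level difference (assumption)."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    return 1.0 - float(cv2.absdiff(gray_a, gray_b).mean()) / 255.0

def frame_rate_for(similarity):
    """Pick the frame rate whose stored similarity is closest to the measured one."""
    nearest = min(SIMILARITY_TO_FPS, key=lambda s: abs(s - similarity))
    return SIMILARITY_TO_FPS[nearest]
```

The same lookup scheme could be driven by an inter-frame object-change measure or by a trained model instead of the gray-level heuristic shown here.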
In a possible implementation manner of the first aspect, the key frame may be selected manually: the video frames obtained after frame extraction are presented to the user, and the user selects one or more of them as key frames in preparation for key-frame annotation. The key frame may also be determined from the video content: a fixed position among the video frames may be used, for example the first frame or the last frame, or the key frame may be determined by a key-frame selection model. The key-frame selection model can be obtained by training an AI model on training samples consisting of known video frames and their corresponding known key frames; the trained model outputs the corresponding key frame for an input set of video frames.
It should be understood that automatically recommending key frames to the user reduces the number of operations the user must perform, improving both the user experience and the efficiency of video annotation.
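A minimal illustration, under assumptions, of the fixed-position key-frame selection described above; a trained key-frame selection model, passed in as a callable, could replace the fixed strategies:

```python
def select_key_frames(frames, strategy="first", model=None):
    """Pick key frames from the extracted frames: a fixed position ("first"/"last"),
    or a key-frame selection model when one is supplied (the model is an assumption)."""
    if not frames:
        return []
    if model is not None:
        return model(frames)          # trained key-frame selection model
    if strategy == "first":
        return [frames[0]]
    if strategy == "last":
        return [frames[-1]]
    raise ValueError(f"unknown strategy: {strategy}")
```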
In one possible implementation manner of the first aspect, the user annotation result can be obtained in the following three ways.
In the first way, the user annotation result for the key frame is obtained through automatic labeling: after the key frame is presented to the user, several target boxes (or center points, masks and the like) are recommended at the same time and the targets are labeled automatically, so that the user annotation result is generated without any operation by the user, improving the user experience and the labeling efficiency.
In a specific implementation, the recommended annotation for the key frame can be based on the video content. For example, a surveillance video of a highway usually has vehicles as the labeling targets, so when the unannotated video is a highway surveillance video, the vehicles can be selected in the key frame and recommended to the user as the annotation result. A small number of keywords can also be obtained from the user, for example the input "vehicles", and the system can then label the key frame according to those keywords. Further, the unannotated video may be input into a recommended-annotation model to obtain the recommended annotation result. The recommended-annotation model may be an AI model; specifically, known unannotated videos and their known annotation results can be used as training samples to train a deep neural network with a computer-vision algorithm such as an objectness algorithm, or other mature computer-vision algorithms in the industry can be used to implement the same function. It should be understood that the above examples are illustrative and do not constitute specific limitations.
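A possible sketch, under assumptions, of the keyword-based recommendation just described: model-proposed targets (for example from an objectness-style detector) are reduced to the classes the user named; the proposal dictionary format shown is hypothetical, not the patent's data format:

```python
def filter_recommended_targets(proposals, keywords):
    """Keep only the model-proposed targets whose class matches a user keyword such as
    "vehicles"; the proposal format {"label": ..., "box": (x, y, w, h)} is hypothetical."""
    wanted = {keyword.strip().lower() for keyword in keywords}
    return [proposal for proposal in proposals if proposal["label"].lower() in wanted]
```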
In the second way, the user annotation result is obtained through human-computer interaction: after the key frame is presented to the user, several target boxes, center points or masks are recommended at the same time for the user to choose from, and the user can select a target simply by clicking anywhere inside one of the target boxes, without dragging out or tracing a box, which improves the user experience and the labeling efficiency. In a specific implementation, after the annotation is recommended to the user, the user can also correct it manually through the display interface, further improving the labeling precision.
In the third way, the user annotation result is obtained through manual labeling: after the key frame is presented to the user, the user draws the target box, center point or mask by hand, which is not limited in this application.
It can be understood that if the user selects "automatic labeling", the video annotation system can generate the user annotation result on the key frame without any labeling action by the user, greatly improving the user experience and labeling efficiency. If the user selects "human-computer interaction", the system recommends the annotation result and the user only has to confirm it; for example, moving the mouse to any position on an object automatically generates its center point, so the user does not have to locate the center manually, which preserves labeling precision while improving efficiency. If the user selects "manual labeling", the user labels the key frame by hand, for example by drawing the target box on the object; the key-frame labeling unit can record the user's drawing information and use it as new training samples for the recommended-annotation model, so that the user annotation results obtained through automatic labeling and human-computer interaction become more accurate, improving the user experience.
In one possible implementation manner of the first aspect, the annotation result includes one or more of a target box, a center point and a mask. The shape of the target box may be a preset shape such as a rectangle, a circle or an ellipse; when the video annotation system uses an AI model based on Mask R-CNN to locate the target precisely, the target box takes a shape close to the target, for example the shape of the target's contour, i.e. the target box is the contour box of the target. Regardless of the algorithm used for video annotation, the target box may be a line box or a scatter box composed of a number of scattered points, which is not specifically limited in this application.
It should be understood that the video annotation method provided by this application supports multiple annotation formats, in particular mask and center-point annotation, which would consume a great deal of time if the video were annotated in the traditional manual way.
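For illustration, one annotation record supporting the three formats above could be represented roughly as follows; this schema is an assumption, not the patent's storage format:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Annotation:
    """One labeled target in one video frame (an illustrative schema, not the patent's format)."""
    frame_index: int
    label: str                                          # e.g. "vehicle", "pedestrian"
    box: Optional[Tuple[int, int, int, int]] = None     # line/rectangular frame as (x, y, w, h)
    center: Optional[Tuple[int, int]] = None            # center point as (x, y)
    mask: Optional[List[Tuple[int, int]]] = None        # contour/scatter points of a mask
```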
In one possible implementation manner of the first aspect, a plurality of computing units may be invoked to label the remaining non-key frames according to the user labeling result on the key frames. Several ways of invoking multiple compute units to label non-key frames in parallel are described below.
In the first way, the video frames to be annotated are divided into several video segments and the segments are randomly distributed among the computing units, with one computing unit processing one segment (a segment containing at least two video frames). For example, computing unit 1 processes video frames 1 to 10 and generates their annotation results, while computing unit 2 processes video frames 11 to 20 and generates theirs; the segments processed by each computing unit may be assigned randomly by the automatic labeling unit, which is not limited in this application. The automatic labeling unit can thus process several video segments at the same time, improving the processing efficiency of video annotation.
In the second way, the video frames to be annotated are randomly distributed among the computing units, with one computing unit processing one video frame, i.e. one computing unit runs the automatic-annotation model on one frame and outputs its annotation result. The automatic labeling unit can therefore process several video frames at the same time, improving processing efficiency; and because each computing unit processes its frames independently, both forward and reverse annotation become possible, improving the user experience.
In the third way, at least one computing unit processes one video frame, each of those units generates one annotation result, and the outputs of the computing units are then superimposed to produce the final annotation result. For example, suppose computing unit 1 is loaded with an algorithm that labels target boxes and computing unit 2 with an algorithm that labels masks; both process frame a1 at the same time, computing unit 1 outputs frame a11 labeled with a target box, computing unit 2 outputs frame a12 labeled with a mask, and the automatic labeling unit finally superimposes frame a11 and frame a12 to obtain frame a1 labeled with both a target box and a mask. It should be understood that this example is illustrative; the automatic labeling unit may assign video frames to the computing units at random, which is not limited here.
In a specific implementation, a computing unit may be a program or a process, or an independent processor with its own software and hardware; the computing units may be connected in a wired or wireless manner, and the physical form of a computing unit can be chosen according to the processing capability required.
It should be understood that invoking several computing units to perform automatic annotation in parallel improves both the efficiency of video annotation and the user experience.
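A rough sketch of the first parallel mode (one computing unit per video segment), using Python's multiprocessing pool as a stand-in for the computing units; `auto_label` is a placeholder for the automatic-annotation model's per-frame inference and is an assumption:

```python
from multiprocessing import Pool

def auto_label(frame, user_annotation):
    """Placeholder for the automatic-annotation model's per-frame inference (assumption)."""
    return None

def annotate_segment(args):
    """Label one video segment with one computing unit (the first mode described above)."""
    segment_start, frames, user_annotation = args
    return segment_start, [auto_label(frame, user_annotation) for frame in frames]

def annotate_in_parallel(frames, user_annotation, num_units=3, segment_len=10):
    """Split the non-key frames into segments and hand them to a pool of computing units."""
    segments = [(start, frames[start:start + segment_len], user_annotation)
                for start in range(0, len(frames), segment_len)]
    with Pool(processes=num_units) as pool:
        per_segment = pool.map(annotate_segment, segments)
    per_segment.sort(key=lambda item: item[0])        # restore frame order
    return [result for _, results in per_segment for result in results]
```

The second mode corresponds to `segment_len=1`, and the third mode would run several differently loaded `auto_label` variants over the same frame and merge their outputs.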
In a possible implementation manner of the first aspect, obtaining the annotation results of the other video frames according to the user annotation result includes: labeling the video frames after or before the key frame according to the user annotation result. In brief, the user can choose forward annotation or reverse annotation for the automatic step, where forward annotation automatically labels the video frames after the key frame and reverse annotation automatically labels the video frames before it.
For example, a user who uses manual frame extraction may, while watching the video, select a frame near its end as the key frame and label the target manually (or through human-computer interaction or automatic labeling); the user can then choose reverse annotation, and the target is labeled automatically in the frames before the key frame. Similarly, if the user selects a frame near the start of the video as the key frame, forward annotation can be chosen and the target is labeled automatically in the frames after it. A key frame at any position in the video can therefore drive automatic annotation of the remaining non-key frames, which greatly improves the user's annotation efficiency and experience.
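A minimal sketch of forward and reverse annotation relative to the key-frame index, reusing the `auto_label` placeholder from the earlier sketch; this is illustrative, not the patent's implementation:

```python
def propagate_annotation(frames, key_index, user_annotation, direction="forward"):
    """Label the non-key frames after ("forward") or before ("reverse") the key frame,
    reusing the auto_label placeholder from the previous sketch."""
    if direction == "forward":
        targets = range(key_index + 1, len(frames))
    else:  # "reverse": annotate the frames that precede the key frame
        targets = range(key_index - 1, -1, -1)
    return {index: auto_label(frames[index], user_annotation) for index in targets}
```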
In a possible implementation manner of the first aspect, the method further includes: receiving modification information for the annotation result, generating a new key frame and a new user annotation result from the modification information, and automatically re-annotating the non-key frames according to the new key frame and new user annotation result to obtain a modified annotation result.
Optionally, the modification information may be generated by the user through the display interface. Specifically, the automatically generated annotation result is presented to the user on the display interface; if no modification request is received, the next video frame continues to be processed until the video finishes playing. If the user decides that the annotation of some frame is wrong, the user can pause playback, switch to the annotation-modification mode, and adjust the position or size of the target box, center point or mask by dragging the mouse or similar operations; the video frame corresponding to the modification information can then be used as a new key frame for another round of automatic annotation.
Alternatively, the modification information may be obtained by inputting the annotation result into an annotation-modification model, which can be obtained in advance by training an AI model on training samples consisting of known annotation results and their corresponding known modification information.
It can be understood that automatically checking whether the annotation result is accurate through the annotation-modification model improves annotation accuracy, reduces the number of user operations, and further improves the user experience. In this cyclic modify-and-re-annotate manner, the automatic frame extraction unit, the key frame processing unit and the annotation modification unit form a closed loop that continuously refines the annotation result and raises its precision. The user may also decide, according to the training objective, whether to use annotation modification at all, and may choose the number of optimization rounds: if the training objective only needs low annotation precision, the user can skip annotation modification and obtain the video annotation result quickly, whereas if high precision is required, the user can enable annotation modification and, for example, set five rounds of refinement. These examples are illustrative and do not constitute specific limitations.
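The closed correction loop could be organised roughly as follows; `suggest_corrections` stands in for the annotation-modification model (or for corrections collected from the user through the display interface) and, like `auto_label`, is a placeholder rather than the patent's implementation:

```python
def suggest_corrections(results):
    """Placeholder for the annotation-modification model (or user corrections from the
    display interface); returns {frame_index: corrected_annotation}."""
    return {}

def annotate_with_correction(frames, key_index, user_annotation, max_rounds=5):
    """Closed loop of automatic annotation and correction; auto_label and
    propagate_annotation are the placeholders from the earlier sketches."""
    results = {i: auto_label(frame, user_annotation) for i, frame in enumerate(frames)}
    results[key_index] = user_annotation
    for _ in range(max_rounds):                       # user-selectable number of optimisation rounds
        corrections = suggest_corrections(results)    # empty when every frame looks correct
        if not corrections:
            break
        for index, corrected in corrections.items():
            results[index] = corrected                # the corrected frame acts as a new key frame
            results.update(propagate_annotation(frames, index, corrected))
    return results
```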
In a second aspect, a video annotation system is provided, comprising: an automatic frame extraction unit configured to extract a plurality of video frames from an unannotated video; a key frame processing unit configured to display at least one of the video frames to a user through a display interface; an automatic labeling unit configured to obtain the user annotation result produced when the user annotates the at least one video frame in the display interface, the user annotation result comprising the image area of a target in the at least one video frame; and an annotation modification unit configured to obtain, according to the user annotation result, the annotation results of the other video frames, each comprising the image area of the target in those frames.
In a possible implementation manner of the second aspect, the automatic frame extraction unit is configured to extract a plurality of video frames from the unlabeled video according to the video content of the unlabeled video, where an image similarity between the plurality of video frames is lower than a first threshold, or an object variation between the plurality of video frames is higher than a second threshold.
In one possible embodiment of the second aspect, the at least one video frame is a leading frame or a trailing frame of the plurality of video frames; or at least one video frame is obtained by inputting a plurality of video frames into a key frame selection model, and the key frame selection model is obtained by training a neural network model by using a plurality of known video frames and corresponding known key frames as training samples.
In a possible implementation manner of the second aspect, the key frame processing unit is configured to input the at least one video frame into a recommended annotation model, and obtain a recommended annotation result, where the recommended annotation result includes an image area of the at least one recommended target in the at least one video frame; the key frame marking unit is used for displaying the recommended marking result to the user through the display interface and acquiring the user marking result selected by the user in the recommended marking result.
In one possible implementation manner of the second aspect, the labeling result includes one or more of a target box, a center point, and a mask.
In a possible implementation manner of the second aspect, the automatic labeling unit is configured to invoke a plurality of computing units according to a user labeling result, and process other video frames in parallel to obtain labeling results of the other video frames, where one computing unit processes one video frame, or at least one computing unit processes one video frame, and each computing unit in the at least one computing unit generates one labeling result.
In a possible implementation manner of the second aspect, the automatic labeling unit is configured to label, according to the user labeling result, the video frame after or before the key frame to obtain a labeling result.
In a possible implementation manner of the second aspect, the system further includes an annotation modification unit. The key frame processing unit is configured to receive, through the display interface, the user's modification information for the annotation result, the modification information resulting from the user's modification of the image area of the target in the other video frames; or the annotation modification unit is configured to obtain the modification information for the annotation result from an annotation-modification model, which is obtained by training a neural network model on training samples consisting of known annotation results and their corresponding known modification information. The automatic labeling unit is configured to modify the annotation results of the other video frames according to the modification information.
In one possible implementation of the second aspect, the annotation result is used for learning by an Artificial Intelligence (AI) model.
It should be understood that the video annotation system in this embodiment may be a physical machine such as an X86 server, a virtual machine, or a computer cluster composed of several physical machines or virtual machines. The unit modules inside the video annotation system can also be divided in multiple ways, and each module may be a software module, a hardware module, or partly software and partly hardware, which is not limited in this application.
In summary, the video annotation system provided by this application automatically labels the non-key frames according to the annotation result of the key frame, yielding an annotation result for the whole video. The key-frame annotation can be drawn by the user or merely confirmed by the user after the system recommends it automatically, so during the entire labeling process the user only needs to label the key frame once, or even only confirm once whether the key frame is labeled correctly, to obtain a complete video annotation result, greatly reducing the user's labeling operations and improving video annotation efficiency and the user experience.
In a third aspect, a computer program product is provided, comprising a computer program which, when read and executed by a computing device, implements the method as described in the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium comprising instructions which, when run on a computing device, cause the computing device to carry out the method as described in the first aspect.
In a fifth aspect, there is provided a computing device comprising a processor and a memory, the processor executing code in the memory to implement the method as described in the first aspect.
The implementations provided by the above aspects can be further combined to provide additional implementations.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a system architecture diagram provided herein;
FIG. 2 is a schematic deployment diagram of a video annotation system provided herein;
FIG. 3 is a flowchart illustrating steps of a video annotation method provided herein;
FIG. 4 is a schematic diagram of an interface for manually extracting frames in a video annotation method provided in the present application;
FIG. 5 is a schematic interface diagram of key frame annotation in a video annotation method provided in the present application;
FIG. 6 is a schematic flow chart of automatic annotation in a video annotation method provided by the present application;
FIG. 7 is a schematic flow chart illustrating annotation modification in a video annotation method provided in the present application;
FIG. 8 is a schematic view of an exemplary display interface provided herein;
FIG. 9 is a schematic structural diagram of a video annotation system provided in the present application;
fig. 10 is a schematic structural diagram of a computing device provided in the present application.
Detailed Description
To facilitate understanding of the technical solutions of the present invention, some terms involved are explained first. It should be understood that the terms used in the embodiments are only intended to explain specific examples of this application and are not intended to limit it.
AI: AI is a theory, method, technique and application system that uses a digital computer or a computing device controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. The application scenarios of artificial intelligence are very wide, such as face recognition, vehicle recognition, pedestrian re-recognition, data processing application, and the like.
Labeling: in the AI field, labeling (annotation) refers to the process of adding, to unlabeled data, the labels appropriate to the corresponding scenario. For example, when the unlabeled data is an unlabeled image, the class to which the image belongs is added in an image-classification scenario, while position information and classes are added to the targets in the image in a target-detection scenario.
Mask (mask): a mask is obtained by occluding the image to be processed with a selected image, graphic or object, and is used to control the image-processing region or process. In computer vision, the image region selected by the mask is blocked out during processing. A mask can be used to segment an image, for example to extract a region of interest so that image processing is applied only to that region; a mask can also act purely as an occlusion, so that the region it selects does not take part in processing.
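For illustration only, assuming OpenCV and arbitrary coordinates, extracting a region of interest with a mask might look like this:

```python
import cv2
import numpy as np

image = cv2.imread("frame.jpg")                        # any BGR frame; the path is illustrative
mask = np.zeros(image.shape[:2], dtype=np.uint8)
cv2.rectangle(mask, (100, 100), (300, 260), 255, -1)   # region of interest selected by the mask
roi_only = cv2.bitwise_and(image, image, mask=mask)    # pixels outside the mask do not take part
```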
Cloud computing: the core attribute of cloud computing is shared resource service, which means that a third-party provider provides a cloud infrastructure and services that can be used through a public network (such as the Internet) for a user, and the user obtains the use authority of the cloud infrastructure and services by paying. The video annotation method provided by the application can be used by a user in a cloud computing service mode.
Next, an explanation will be given of a "video annotation" application scenario related to the present application.
An AI model is a set of mathematical methods that implement AI. Training an AI model on a large number of training samples gives it predictive capability. For example, to train a model that classifies spam, a neural network is trained on a sample set labeled with spam and non-spam labels; the network continually captures the relationship between mails and labels and adjusts its parameters accordingly, and in the prediction stage it can then classify new, unlabeled mails as spam or not. The above description is illustrative rather than restrictive. Note that "AI model" is used here in a broad sense and includes deep learning models, machine learning models, and the like.
The collection and processing of training samples is therefore one of the important links in AI technology, and collecting training samples efficiently has become a focus for many developers. In computer vision, the training samples for an AI model are often obtained by processing captured video: the video is first framed at a certain rate to obtain a set of pictures, and each picture is then labeled manually. For example, a labeling tool presents the image to be labeled to the user through a graphical user interface (GUI); the user labels the objects in the image by entering content or dragging the mouse, and the tool's back end generates the labels for the objects in the image. The labeled video stream, image set or audio set can then serve as training samples for AI models such as target-detection, target-recognition or classification models.
Two problems arise. First, frame extraction usually uses a fixed rate. If the rate is too high, too many pictures are produced and many similar, redundant frames are extracted; training an AI model on a large number of near-identical pictures does not improve its prediction accuracy. If the rate is too low, the video frames that contain targets, which are exactly the samples the AI model needs, may not be extracted; for example, when training an AI model to recognize vehicles from roadside surveillance video, too low a rate may yield mostly frames showing only the road with no vehicles, and training on such frames does not improve the model's accuracy at recognizing vehicles. Moreover, the information in a video is not evenly distributed; for instance, during a red light most vehicles are stationary while during a green light most are moving, so fixed-rate extraction can both miss the frames containing targets and extract large numbers of redundant frames. Second, after frame extraction a huge number of pictures must be labeled; a one-hour video framed at 10 fps yields 36,000 pictures, and for every image the user must still perform intricate operations to guarantee labeling precision, so a great deal of time and manpower is needed, raising the cost of labeling and leaving its efficiency short of business requirements.
To address the high labor cost, low efficiency and poor precision of video annotation, the video annotation system provided by this application can dynamically adjust the frame-extraction rate according to the video content so that suitable video frames are extracted for labeling; it can recommend objects to be labeled when the user annotates a key frame; and after the user has labeled the objects in the key frame, it can automatically label those objects in the remaining non-key frames. This reduces the workload and time of manual video annotation and improves both annotation precision and efficiency.
Fig. 1 is an architecture diagram of the video annotation system provided by this application. As shown in Fig. 1, the architecture includes a data acquisition node 110, a video annotation system 120 and a storage node 130, which are connected to one another; the connections may be over wired or wireless networks, and over external or internal networks, which is not limited in this application.
The data collection node 110 is configured to collect various unlabeled videos, where the unlabeled videos include unlabeled video files, video streams, image sets, or point cloud (point cloud) sets, which is not limited in this application. The data collection node 110 may specifically be a surveillance camera, an electronic police, a depth camera, a drone, etc., and may also be a radar or a satellite, and of course, the data collection node may also be a cloud server for storing unlabeled videos, and the cloud server may be deployed with specific services, such as Kafka and/or Flume, where Kafka is used to provide a high-throughput, high-scalability distributed message queuing service, and Flume is a high-reliability, high-availability, distributed mass log collection, aggregation, and transmission system.
The storage node 130 is configured to store the annotated video stream, image set or audio set output by the video annotation system 120. It may be a physical server such as an X86 server; it may be a virtual machine (VM) implemented on a general-purpose physical server using network functions virtualization (NFV) technology, a VM being a complete, software-simulated computer system with full hardware functionality running in an isolated environment, such as a virtual machine in cloud data; or it may be a computer cluster formed from such physical servers and/or virtual machines, for example a cluster deployed with a distributed file system (HDFS), which is not limited in this application. The annotated video stream, image set or audio set stored by the storage node 130 can be used to train AI models.
The video annotation system 120 is configured to annotate the unannotated video sent by the data acquisition node 110 and then send the annotated video stream, image set and/or audio set to the storage node 130.
The video annotation system 120 can be deployed flexibly. It can be deployed in an edge environment, specifically as one or more edge computing devices there or as a software system running on them. An edge environment is a cluster of edge computing devices geographically close to the data acquisition node 110 that provides computing, storage and communication resources, such as edge computing kiosks or computing boxes placed along a road. For example, the video annotation system 120 may be deployed on one or more edge computing devices near an intersection, or as a software system running on them; two cameras, camera 1 and camera 2, monitor the intersection and act as the data acquisition node 110. After they capture and send the surveillance video (the unannotated video) to the edge computing device, the device annotates it and sends the annotated surveillance video to the storage node 130 for storage.
The video annotation system 120 can also be deployed on an end device, including but not limited to user terminals such as desktop computers, notebook computers and smartphones; video annotation is achieved by running the video annotation system on these terminals. The end device may also serve as the device that provides the video to be annotated, and in some possible embodiments the video annotation system 120 and the data acquisition node may even be the same device, for example both deployed in the same camera, which is not specifically limited in this application.
The video annotation system 120 can also be deployed in a cloud environment, which is an entity that utilizes underlying resources to provide cloud services to users in a cloud computing mode. The cloud environment includes a cloud data center including a large number of infrastructure resources (including computing resources, storage resources, and network resources) owned by a cloud service provider and a cloud service platform. The video annotation system 120 may be a server of a cloud data center, a virtual machine created in the cloud data center, or a software system deployed on the server or the virtual machine in the cloud data center, where the software system may be deployed in a distributed manner on multiple servers, or in a distributed manner on multiple virtual machines, or in a distributed manner on the virtual machine and the server, and the present application is not limited in particular. For example, the video tagging system 120 may be deployed in a cloud data center far away from an intersection, two cameras, namely, a camera 1 and a camera 2, are disposed in the intersection to monitor the intersection, the camera 1 and the camera 2 serve as the data acquisition node 110 to acquire and send a monitoring video (video not tagged) to the cloud data center, and the cloud data center may tag the monitoring video and send the tagged monitoring video to the storage node 130 for storage. It should be appreciated that when the videotagging system 120 is deployed in a cloud environment, the method of videotagging may be provided for use by a user in the form of a cloud service.
It should be understood that, as shown in fig. 1, the video annotation system 120 includes a plurality of unit modules, and thus, the unit modules of the video annotation system 120 can also be distributively deployed in different environments, such as in any two or three of a cloud environment, an edge environment, and an end device environment, to respectively deploy some of the unit modules in the video annotation system 120. For example, as shown in fig. 2, part of the units (such as the automatic frame extraction unit 121 in fig. 1) of the video annotation system 120 are deployed in the edge computing device, part of the units (such as the key frame processing unit 122, the automatic annotation unit 123 and the annotation modification unit 124 in fig. 1) are deployed in the cloud data center, and part of the units (such as the display unit 125 in fig. 1) are deployed in the end device, it should be understood that fig. 2 is merely illustrative and should not be construed as being particularly limited.
It should be noted that the unit modules inside the video annotation system 120 may also be divided into multiple parts, and each module may be a software module, a hardware module, or a part of a software module and a part of a hardware module, which are not limited in this application. Fig. 1 is an exemplary partitioning manner, as shown in fig. 1, a video annotation system 120 includes an automatic frame extraction unit 121, a key frame processing unit 122, an automatic annotation unit 123, an annotation modification unit 124, and a display unit 125, where the automatic frame extraction unit 121, the key frame processing unit 122, the automatic annotation unit 123, the annotation modification unit 124, and the display unit 125 may be connected through a communication link 126, and the communication link 126 may be an internal bus, or may be another communication link such as the internet, and the present application is not limited in particular.
The automatic frame extraction unit 121 includes a frame extraction unit 1211 and a frame rate adjustment unit 1212. The frame rate adjustment unit 1212 dynamically adjusts the frame-extraction rate for each video segment according to the content of the unannotated video, and the frame extraction unit 1211 extracts frames from the unannotated video at that rate to obtain a plurality of video frames; the rate used by the frame extraction unit 1211 may be the one output by the frame rate adjustment unit 1212 or one set manually by the user.
The key frame processing unit 122 includes a key frame selection unit 1221 and a key frame labeling unit 1222. The key frame selection unit 1221 determines at least one of the video frames to be a key frame and presents it to the user through the display unit 125. The user can label the targets in the key frame manually to produce the user annotation result, or the key frame labeling unit 1222 can label the key frame automatically and the user confirms whether its labels are correct, which likewise yields the user annotation result. The user annotation result shows the targets the user has labeled in the key frame: if the user needs to train a network model for pedestrian recognition, the key frame processing unit 122 labels the pedestrians in the video frame, and if the user needs to train a network model for vehicle recognition, it labels the vehicles.
The automatic labeling unit 123 includes a labeling unit 1231 and a plurality of computing units 1232; Fig. 1 shows three computing units (1232a to 1232c), and this application does not limit their number. The labeling unit 1231 invokes the computing units 1232 to mark the same targets in the remaining non-key frames according to the key frame and the user annotation result: if the user has marked all pedestrians in the key frame with target boxes, the automatic labeling unit 123 marks all pedestrians with target boxes in the other video frames, and if the user has marked pedestrian A with a center point, it likewise marks pedestrian A with a center point in the other frames. A computing unit can be understood as a program, or as a processing unit with independent software and hardware and computing capability such as a processor, a virtual machine or a physical machine; its concrete form can be chosen according to the service requirements. If the automatic labeling unit 123 is deployed in a cloud data center, a computing unit 1232 may be any physical or virtual machine in that data center, and if it is deployed across multiple multi-core computing devices, a computing unit 1232 may be a processor of such a device. These examples are illustrative, and the form of the computing unit 1232 is not limited in this application.
The annotation modification unit 124 generates, through the annotation-modification model, the modification information corresponding to an annotation result and modifies that result to obtain a new one. The new annotation result produced by the annotation modification unit 124 can serve as a new key frame and new user annotation result: fed to the key frame processing unit 122 it corrects the recommended user annotation result, and fed to the automatic labeling unit 123 it corrects the automatically generated annotation result, and so on, continuously improving annotation precision. The annotation-modification model can be obtained by training an AI model on training samples consisting of known annotation results and their corresponding known modification information.
The display unit 125 presents parts of the interface to the user in order to capture the user's requirements, such as the desired frame-extraction method, the desired annotation format, the desired key frame, and the targets to be labeled in the key frame. The automatic frame extraction unit 121, the key frame processing unit 122 and the automatic labeling unit 123 are connected to the display unit 125; each can receive, over the network, the requirements collected by the display unit 125 and perform the corresponding operation in response. For example, the display unit 125 can show the key frame to the user, receive the user annotation result the user draws by hand, and feed it back to the automatic labeling unit 123 so that the non-key frames are labeled according to the key frame and that result; it can also show an annotation result to the user, receive the user's manual corrections, and feed them back to the automatic labeling unit 123 so that the non-key frames are labeled again according to the corrected result. Through the display unit 125 the video annotation system can continuously obtain the user's modification information and thus continuously improve annotation precision.
In summary, the video annotation system provided by this application automatically labels the non-key frames according to the annotation result of the key frame, yielding an annotation result for the whole video; the key-frame annotation can be drawn by the user or merely confirmed after the system recommends it automatically, so the user only needs to label, or even only confirm, the key frame once to obtain the complete annotation result, greatly reducing the user's labeling operations and improving annotation efficiency and the user experience.
The specific steps by which the video annotation system performs video annotation are described in detail below with reference to the accompanying drawings.
As shown in fig. 3, the present application provides a method for video annotation, which includes the following steps:
s310: a plurality of video frames are extracted from the unlabeled video. The description of the unlabeled video may refer to the embodiment in fig. 1, which is not repeated here, and this step may be implemented by the automatic frame extracting unit 121 in the embodiment in fig. 1, and the automatic frame extracting unit 121 may extract a plurality of video frames from the unlabeled video in a variety of ways.
Optionally, frames may be extracted at a "fixed frame rate", i.e. a plurality of video frames are extracted from the unannotated video at a fixed rate. The fixed rate may be set by the user, for example extracting at 10 fps from an unannotated video one hour long yields 36,000 video frames; it may also be determined from a historical frame rate, for example if the unannotated video is intersection surveillance footage and the history shows that 20 fps was used for such footage, the video frames can be extracted at 20 fps. The above description is illustrative and does not limit this application.
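A minimal fixed-rate frame-extraction sketch, assuming OpenCV; the 10 fps default mirrors the example above and is not part of the patent text:

```python
import cv2

def extract_frames(video_path, target_fps=10):
    """Extract frames from the unannotated video at a fixed rate; at 10 fps a one-hour
    video yields 36,000 frames, matching the example above."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)   # keep every `step`-th frame
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```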
Alternatively, frames may be extracted by "manual frame extraction", i.e. the frame-extraction rate is entered by the user. Specifically, a frame-extraction interface can be presented through the display unit 125, for example a video playback window; the user can set a frame rate for any video segment and extract frames segment by segment by entering the segment's start time, end time and extraction rate.
For example, Fig. 4 shows an interface the display unit 125 might present after the user selects "manual frame extraction". The interface may include a video area 410, a playback control 420 and a frame-extraction input box 430: the video area 410 displays a frame of the unannotated video, the playback control 420 controls the playback progress, and the frame-extraction input box 430 obtains the start time, end time and corresponding extraction rate entered by the user for a video segment. In one scenario, a user who needs to label non-motor vehicles can play the video through the playback control 420; if a non-motor vehicle first appears at 00:00:01 and last appears at 00:00:06, the user can enter a start time of 00:00:01, an end time of 00:00:06 and a rate of 10 fps, while the segment from 00:00:00 to 00:00:01 is given a rate of 0, i.e. no frames are extracted from it. The interface in Fig. 4 is shown for illustration and does not limit this application.
Optionally, frames may be extracted by "automatic frame extraction", i.e. the frame-extraction rate is dynamically adjusted according to the video content and the unannotated video is framed dynamically. In a specific implementation, the image similarity between adjacent frames is determined first and compared against a pre-stored similarity mapping to obtain the corresponding frame rate, for example a rate of 1 when the similarity is 0.1 and a rate of 2 when the similarity is 0.2; similarly, the amount of change of objects between frames can be determined and compared against a pre-stored change-amount mapping to obtain the corresponding rate. The pre-stored mappings may be mathematical formulas, mapping tables or the like, which is not limited in this application. When a moving object is present, adjacent frames (or several adjacent frames) differ in gray level, from which the inter-frame change rate can be derived; alternatively, known video segments and their known inter-frame object changes can be used to train a deep neural network, and the trained model infers the inter-frame object change for an input segment; or known unannotated videos and their corresponding frame rates can form a sample set to train an AI model, and the unannotated video currently being processed is input into the trained model to obtain the frame rate for each time period. Extracting frames automatically in this way, with the rate dynamically adjusted to the video content, avoids extracting too many redundant frames or missing the frames that contain targets, so that the frames finally labeled can serve as high-quality training samples for AI model training, improving the user experience.
S320: at least one video frame of the plurality of video frames is displayed to the user through the display interface. For ease of understanding, the at least one video frame is collectively referred to as a key frame hereinafter. It should be understood that the key frame may be determined by the key frame extracting unit 1221 in the key frame processing unit 122 of the embodiment of fig. 1 from the plurality of video frames extracted by the automatic frame extracting unit 121.
In an embodiment, the key frame may be selected manually by the user: the display unit 125 may present the plurality of video frames obtained after frame extraction to the user, and the user selects one or more of them as key frames in preparation for key frame annotation. The key frame may also be selected automatically by the key frame selecting unit 1221: a fixed frame position among the plurality of video frames may be used as the key frame, for example the first frame or the last frame; or the key frame may be determined from the plurality of video frames by a key frame selection model, where the key frame selection model may be obtained by training an AI model using known video frames and their corresponding known key frames as training samples, so that the trained key frame selection model outputs the corresponding key frame for the input video frames.
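For illustration only, a minimal sketch of key frame selection under the two automatic strategies described above (fixed position or a selection model); the `score_frame` callback stands in for the key frame selection model and is an assumption of this sketch.

```python
def select_key_frames(frames, mode="first", score_frame=None, k=1):
    """Pick key frames from the extracted frames by fixed position or by a scoring model."""
    if mode == "first":
        return frames[:1]
    if mode == "last":
        return frames[-1:]
    # "model" mode: rank frames by a learned score and keep the k best
    scored = sorted(frames, key=score_frame, reverse=True)
    return scored[:k]
```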
S330: a user annotation result of the user annotating the at least one video frame in the display interface is acquired, where the user annotation result includes an image area of the target in the at least one video frame.
Optionally, the annotation result may include a target frame. Specifically, when video annotation is performed with different AI algorithms, the representation of the target frame may differ: the target frame may be a preset shape such as a rectangle, a circle, or an ellipse, and when the video annotation system finely positions the target with an AI model based on Mask R-CNN, the target frame may be a shape close to the target, for example the shape of the target's outline, that is, the target frame is an outline frame of the target. Regardless of which algorithm is used for video annotation, the target frame may be a line frame, composed of solid or dashed lines, such as the solid-line rectangular frame 510 shown in fig. 5, or a scatter frame composed of a plurality of discrete points; this is not specifically limited in the embodiments of the present application.
Optionally, the annotation result may include a center point, such as the center point 520 shown in fig. 5. In a specific implementation, the target may first be selected with a rectangular frame, and the center point of the target is then obtained in combination with other information. For example, the center point may be determined by an object-centroid detection method: based on the video frame in which the target is framed and other information fed back by wireless sensors, methods such as maximum-likelihood-estimation weighting detect a unique point (particle) whose position does not change with the rigid motion of the target, and the particle position represents the position of the target in the video frame. The center point may also be determined by 3D detection: the original 2D object detection is converted into 3D object detection by means of a point cloud image, the added height or depth of the object, and the like, a 3D model of the target object is obtained, and a certain position of that 3D model is taken as the center point representing the position of the target. The center point may also be determined directly from the rectangular frame on the 2D pixel image in combination with the video content. For example, when the target is a motor vehicle: a straight-going vehicle keeps a nearly constant horizontal or vertical direction, so the midpoint of the lower edge of the rectangular frame is often selected as its center point; a close-range vehicle is large and subject to front-back perspective deformation, so the lower-right corner of the rectangular frame is often selected as its center point; a long-range vehicle is small and its rectangular frame is small, so the center of the rectangular frame is often selected as its center point. It should be understood that the above methods for obtaining the center point are listed only for illustration; other methods may also be used, and the present application is not specifically limited in this respect.
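For illustration only, a minimal sketch of deriving a center point from a rectangular frame on the 2D image using the vehicle heuristics above; the distance category is assumed to be supplied by the caller rather than inferred here.

```python
def center_from_box(x1, y1, x2, y2, category="straight"):
    """Pick a representative center point for a box according to an assumed vehicle category."""
    if category == "straight":      # straight-going vehicle: midpoint of the lower edge
        return ((x1 + x2) / 2, y2)
    if category == "near":          # close-range vehicle with perspective deformation
        return (x2, y2)             # lower-right corner of the box
    return ((x1 + x2) / 2, (y1 + y2) / 2)   # long-range vehicle: center of the box
```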
Alternatively, the annotation result may include a mask, such as the mask 530 shown in fig. 5. The mask has been explained in the foregoing description of terms and is not repeated here.
Optionally, the labeling result may also include one or more of the target box, the center point and the mask, such as the labeling result 540 shown in fig. 5, which includes the rectangular box, the center point and the mask.
It should be understood that, since there are many annotation formats, the annotation result is described below by taking the annotation box as an example for ease of understanding; the annotation format is not limited in the present application.
In an embodiment, the user annotation result may be obtained in an automatic annotation mode. Specifically, when the key frame is presented to the user, a recommended annotation result for the key frame can be generated at the same time, automatically marking the targets that the user is likely to need. The user then does not need to perform any operation; the recommended annotation result is used directly as the user annotation result, which improves the user experience and the annotation efficiency.
In a specific implementation, the recommended annotation result for the key frame may be generated according to the video content. For example, a surveillance video of a highway usually takes vehicles as annotation targets; when the unlabeled video is such a surveillance video, the vehicles in the key frame may be framed and recommended to the user. Alternatively, a small number of keywords may be obtained from the user, for example "vehicle", and the system annotates according to the keywords, framing all vehicles in the key frame to obtain the user annotation result. Further, the key frame of the unlabeled video may be input into a recommended annotation model to obtain the recommended annotation result, where the recommended annotation model may be an AI model. Specifically, key frames of known unlabeled videos and their corresponding known user annotation results may be used as training samples, and a deep neural network may be trained with a computer vision algorithm such as an objectness algorithm to obtain the trained recommended annotation model; other mature computer vision algorithms in the industry may also be used to implement the above functions.
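For illustration only, a minimal sketch of generating recommended annotations on a key frame with an off-the-shelf detector and filtering by a user keyword; the use of torchvision's pretrained Faster R-CNN and the keyword-to-COCO-label mapping are assumptions of this sketch, not the recommended annotation model of the application.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# assumed mapping from a user keyword to COCO label ids (3=car, 6=bus, 8=truck)
KEYWORD_TO_LABEL_IDS = {"vehicle": {3, 6, 8}}

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

def recommend_boxes(frame_rgb, keyword="vehicle", score_thresh=0.5):
    """Return detector boxes on the key frame whose class matches the user keyword."""
    wanted = KEYWORD_TO_LABEL_IDS.get(keyword, set())
    with torch.no_grad():
        out = detector([to_tensor(frame_rgb)])[0]
    return [box.tolist()
            for box, label, score in zip(out["boxes"], out["labels"], out["scores"])
            if float(score) >= score_thresh and int(label) in wanted]
```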
In an embodiment, the user annotation result may be obtained in a human-computer interaction manner. Specifically, the recommended annotation result may be presented to the user through the display unit for selection, and the user annotation result is the part of the recommended annotation result selected by the user. For example, as shown in fig. 5, the key frame processing unit 122 may generate a plurality of recommended targets on the key frame and present the recommended annotation result containing them to the user through the display interface shown in fig. 5; the user can select a target simply by clicking anywhere inside its target frame, without having to draw or drag a frame with the mouse or by touch, which improves the user experience and the annotation efficiency. In a specific implementation, after the annotation result is recommended to the user, the user may also manually correct the recommended annotation result through the display interface, further improving the annotation precision.
In an embodiment, the user annotation result may be obtained by manual annotation. Specifically, after the key frame is presented to the user, the target frame, the center point, or the mask may be drawn manually by the user, which is not limited in the present application.
For example, before the display unit 125 displays the key frame to the user, it may first obtain from the user the annotation mode for the current key frame. If the user selects "automatic annotation", the video annotation system automatically generates the user annotation result on the key frame without any annotation action by the user, greatly improving the user experience and the annotation efficiency. If the user selects "human-computer interaction", the video annotation system recommends an annotation result to the user, and the user only needs to confirm it manually; for example, the center point can be generated automatically when the user moves the mouse to any position on the object, so the user does not need to find the center of the object manually, which guarantees annotation precision while improving efficiency. If the user selects "manual annotation", the user annotates the key frame himself, for example by manually drawing a target frame on the object; the key frame annotation unit can record the user's drawing information and use it as a new sample for training the recommended annotation model.
S340: the annotation results of the other video frames in the plurality of video frames are obtained according to the user annotation result, that is, the remaining non-key frames are annotated, where each annotation result includes the image area of the target required by the user in the corresponding video frame. It should be understood that this step can be implemented by the automatic labeling unit 123 in the embodiment of fig. 1. In a specific implementation, the user may choose to export the annotation results of the other video frames in a certain format, for example exporting the video frames in jpeg format together with annotation files in xml format, and then use the exported result as training samples for training an AI model. It should be understood that the above formats are for illustration and should not be construed as limiting.
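For illustration only, a minimal sketch of exporting one annotated frame as a jpeg image plus a Pascal-VOC-style xml file; the exact schema and file names are assumptions, since the application only states that frames and annotation files may be exported in formats such as jpeg and xml.

```python
import cv2
import xml.etree.ElementTree as ET

def export_frame(frame, boxes, stem):
    """Write `<stem>.jpeg` and a simple xml annotation file listing named boxes."""
    cv2.imwrite(f"{stem}.jpeg", frame)
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = f"{stem}.jpeg"
    for name, (x1, y1, x2, y2) in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        bb = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"), (x1, y1, x2, y2)):
            ET.SubElement(bb, tag).text = str(int(val))
    ET.ElementTree(root).write(f"{stem}.xml")

# e.g. export_frame(frame, [("truck", (120, 80, 360, 240))], "frame_0001")
```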
In an embodiment, the user can select a forward annotation mode or a reverse annotation mode for automatic annotation, where forward annotation refers to automatically annotating the video frames after the key frame, and reverse annotation refers to automatically annotating the video frames before the key frame. For example, if a user uses the manual frame extraction mode, selects a frame near the end of the video as the key frame, and annotates the target in a manual annotation mode (or a human-computer interaction mode or an automatic annotation mode), the user can then select the reverse annotation mode so that the target is automatically annotated in the frames before the key frame. Similarly, if the user selects a frame near the beginning of the video as the key frame, the user can select the forward annotation mode so that the target is automatically annotated in the frames after the key frame. It can be understood that the user performs video annotation by annotating one or more key frames and letting the remaining frames be annotated automatically, which greatly improves the user's video annotation efficiency and the user experience.
For example, as shown in fig. 6, assume that the user takes frame 1 as the key frame, marks a truck in the key frame with a rectangular frame, marks the center point of a bus, and selects the forward annotation mode. The automatic labeling unit 123 may then automatically annotate the subsequent frames 2 and 3: the truck in frame 2 is marked with a rectangular frame and the center point of the bus in frame 2 is marked; likewise, the truck in frame 3 is marked with a rectangular frame and the center point of the bus in frame 3 is marked. It should be understood that fig. 6 is used for illustration, and the present application does not limit the automatic annotation interface.
In an embodiment, the automatic labeling unit may input the key frame and the user annotation result into an automatic labeling model, and the automatic labeling model performs inference on the non-key frames according to the input key frame and user annotation result to obtain the image area of the target that meets a preset condition, thereby determining the position and size of the target frame, the center point, or the mask and obtaining the annotation results of the other video frames. The automatic labeling model can be obtained by training an AI model in combination with mature computer vision algorithms for target tracking, target recognition, target detection, segmentation, and the like; the training method of the automatic labeling model is not limited in the present application.
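For illustration only, a minimal sketch of propagating a key-frame target frame to the following (or preceding) frames with an off-the-shelf single-object tracker; OpenCV's CSRT tracker (available via opencv-contrib-python) is used here as a stand-in for the automatic labeling model and is an assumption of this sketch. The `forward` flag corresponds to the forward/reverse annotation modes described above.

```python
import cv2

def propagate_box(frames, key_index, key_box, forward=True):
    """frames: list of BGR images; key_box: (x, y, w, h) annotated on the key frame."""
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frames[key_index], key_box)
    order = range(key_index + 1, len(frames)) if forward else range(key_index - 1, -1, -1)
    results = {key_index: key_box}
    for i in order:
        ok, box = tracker.update(frames[i])
        if not ok:
            break                              # target lost: stop and let the user correct
        results[i] = tuple(int(v) for v in box)
    return results
```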
In a specific implementation, as shown in fig. 1, the automatic labeling unit 123 may invoke a plurality of operation units 1232 at the same time, execute the inference algorithm on a plurality of video frames simultaneously, and then superimpose the operation results to generate the annotation result. It can be understood that using a plurality of operation units to run the inference algorithm in parallel can improve the processing efficiency of the automatic labeling unit 123, and thus the video annotation efficiency and the user experience.
Optionally, the automatic labeling unit 123 may divide the video frames to be annotated into a plurality of video segments and randomly allocate the segments to the operation units, each operation unit processing one video segment, where one video segment may include at least two video frames. For example, operation unit 1 processes video frames 1 to 10 and generates their annotation results, while operation unit 2 processes video frames 11 to 20 and generates theirs. The video segments processed by each operation unit may be allocated randomly by the automatic labeling unit 123, which is not limited in this application. In this way, the automatic labeling unit 123 can process multiple video segments simultaneously, improving the processing efficiency of video annotation.
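For illustration only, a minimal sketch of splitting the frames to be annotated into segments and processing them in parallel; worker processes stand in for the operation units, and `annotate_segment` is an assumed callback representing inference with the automatic labeling model.

```python
from concurrent.futures import ProcessPoolExecutor

def split_into_segments(frame_indices, segment_len=10):
    """Group frame indices into consecutive segments of `segment_len` frames."""
    return [frame_indices[i:i + segment_len]
            for i in range(0, len(frame_indices), segment_len)]

def annotate_all(frame_indices, annotate_segment, workers=4):
    """Annotate all segments in parallel and superimpose the per-segment results."""
    segments = split_into_segments(frame_indices)
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for seg_result in pool.map(annotate_segment, segments):
            results.update(seg_result)
    return results
```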
Optionally, the automatic labeling unit 123 may randomly allocate the video frames to be annotated to the operation units, with one operation unit processing one video frame, that is, one operation unit performs inference on one video frame with the automatic labeling model and outputs the annotation result of that frame. The automatic labeling unit can thus process multiple video frames at the same time, improving the processing efficiency of video annotation; and because each operation unit processes its video frame independently, both forward annotation and reverse annotation can be achieved, improving the user experience.
Optionally, the automatic labeling unit 123 may allocate the same video frame to multiple operation units according to the user annotation result, where each operation unit loads one algorithm, multiple operation units process the same video frame simultaneously, and the outputs of the multiple operation units are finally superimposed to produce the annotation result. For example, assume that operation unit 1 loads an algorithm that annotates target frames and operation unit 2 loads an algorithm that annotates masks; operation unit 1 and operation unit 2 process frame a1 at the same time, operation unit 1 outputs frame a11 annotated with a target frame, operation unit 2 outputs frame a12 annotated with a mask, and the automatic labeling unit 123 finally superimposes frame a11 and frame a12 to obtain frame a1 annotated with both a target frame and a mask. It should be understood that the above example is for illustration; the automatic labeling unit 123 may allocate video frames to the operation units at random, and this is not specifically limited.
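For illustration only, a minimal sketch of running two differently loaded operation units on the same frame and superimposing their outputs into a single annotation record; both worker callbacks are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def annotate_frame(frame, box_worker, mask_worker):
    """Run a box-producing and a mask-producing worker on the same frame, then merge."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        boxes_future = pool.submit(box_worker, frame)
        masks_future = pool.submit(mask_worker, frame)
    return {"boxes": boxes_future.result(), "masks": masks_future.result()}
```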
In a specific implementation, an operation unit may be a program or a process, or an independent processor with its own software and hardware, and the operation units may be connected in a wired or wireless manner; the physical form of an operation unit may be determined according to the processing capability it requires.
It should be noted that, in the embodiments of the present application, computer vision algorithms such as the target detection algorithm, the target tracking algorithm, and the segmentation algorithm may use any neural network model with proven performance in the industry, for example: a one-stage You Only Look Once: Unified, Real-Time Object Detection (YOLO) model, a Single Shot MultiBox Detector (SSD) model, a Region-based Convolutional Neural Network (RCNN) model, or a Fast Region-based Convolutional Neural Network (Fast-RCNN) model, and the like, which is not limited in this application.
In an embodiment, after step S340, the key frame processing unit 122 may receive modification information for the annotation result and generate a new key frame and a new user annotation result according to the modification information, and the automatic labeling unit 123 may then modify the annotation results of the non-key frames according to the new key frame and new user annotation result; repeating this process continuously improves the annotation precision.
Optionally, the modification information may be generated by the user through the display unit 125. Specifically, after the automatic labeling unit 123 processes a video frame, the automatically generated annotation result may be presented to the user through the display unit 125. If no modification request from the user is received, the automatic labeling unit 123 continues to process the next video frame until the video finishes playing. If the user sees that the annotation result of a certain frame is wrong and needs modification, the user can pause playback, switch to an annotation modification mode, and then adjust the position or size of the target frame, the center point, or the mask, for example by dragging with the mouse. After the display unit receives the user's modification information, the automatic labeling unit may take the video frame corresponding to the modification information as a new key frame and continue automatic annotation according to the modified user annotation result.
Still taking the application scenario shown in fig. 6 as an example, assume that the automatically annotated frame 3 is as shown in fig. 7, where both the rectangular frame marking the truck and the center point marking the bus are shifted. After the user corrects frame 3 to obtain a new frame 3, this frame may be taken as a new key frame, and the automatic labeling unit 123 annotates frame 4 according to the new user annotation result to obtain the annotation result of frame 4 shown in fig. 7. It should be understood that fig. 7 is used for illustration, and the interface form is not limited in the present application.
Alternatively, the modification information may also be generated by the annotation correcting unit 124 shown in fig. 1. The annotation correcting unit 124 may obtain the modification information for the annotation result through an annotation correction model, where the annotation correction model may be obtained in advance by training an AI model using known annotation results and corresponding known modification information as training samples. In brief, if the user chooses to use the "automatic optimization" function, the automatic labeling unit 123 inputs the annotation results generated by automatically annotating the non-key frames into the annotation correcting unit 124, which automatically detects whether an annotation is wrong; if so, it generates the corresponding modification information and feeds it back to the key frame processing unit 122 to generate a new key frame and a new user annotation result, from which the automatic labeling unit 123 can generate a new annotation result. In this way the automatic labeling unit, the key frame processing unit, and the annotation correcting unit form a closed loop that continuously optimizes the annotation result and improves the annotation precision. In a specific implementation, the user can decide according to the training target whether to use the annotation correcting unit 124 for annotation correction, and can also select the number of optimization passes. If the training target requires only low annotation precision, the user may skip the annotation correcting unit 124 so as to obtain the video annotation result quickly; if the training target requires higher annotation precision, the user may choose to use the annotation correcting unit 124 and increase the number of correction passes, for example setting it to 5. It should be understood that the above example is for illustration and is not a specific limitation.
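For illustration only, a minimal sketch of the correction closed loop: whenever a frame's result is corrected (by the user or by an annotation correction model), that frame is treated as a new key frame and propagation is rerun, up to a bounded number of optimization passes; `propagate` and `detect_errors` are assumed callbacks, not components defined by the application.

```python
def optimize_annotations(frames, key_index, key_annotation, propagate, detect_errors,
                         max_rounds=5):
    """Iteratively refine annotations: propagate, detect errors, re-propagate from corrections."""
    results = propagate(frames, key_index, key_annotation)
    for _ in range(max_rounds):
        corrections = detect_errors(results)          # {frame_index: corrected annotation}
        if not corrections:
            break
        new_key, new_annotation = next(iter(corrections.items()))
        results.update(propagate(frames, new_key, new_annotation))
    return results
```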
To facilitate understanding of the beneficial effects of the video annotation method provided by the embodiments of the present application, fig. 8 shows an exemplary display interface of the video annotation system provided by the present application. Before step S310, the display unit may present the interface shown in fig. 8 to the user; the interface may exemplarily include a title bar 810, a file loading column 820, a frame extraction setting column 830, a key frame selection setting column 840, a key frame labeling setting column 850, an annotation format setting column 860, an automatic optimization setting column 870, and a save control 880.
The title bar 810 includes an interface name and window adjustment controls. For example, the title bar 810 shown in fig. 8 includes a closing control for closing the window, a minimizing control for shrinking the window, and a full-screen control for maximizing the window. In a specific implementation, the title bar 810 may include more elements, and its presentation form may be text, icons, or other forms, which is not specifically limited in this application.
The file loading column 820 is used to find the video file matching the characters input by the user, namely the unlabeled video described above.
The frame extracting setting column 830 is used to extract frames from the video file according to the frame extraction mode selected by the user, thereby obtaining a plurality of video frames. The display unit 125 may send the frame extraction mode selected by the user to the automatic frame extracting unit 121, so that the automatic frame extracting unit 121 performs frame extraction accordingly. Specifically, if the user selects "automatic frame extraction", the automatic frame extracting unit 121 may dynamically adjust the frame-extraction rate according to the content of the video and dynamically sample the unlabeled video; if the user selects "fixed frame rate", the automatic frame extracting unit 121 performs frame extraction at that fixed rate; and if the user selects "manual frame extraction", the display unit 125 may first display a video playing window to the user, such as the video playing interface shown in fig. 4, and then send the start time, end time, and frame-extraction rate of the video segment input by the user to the automatic frame extracting unit 121 for frame extraction. It should be understood that, for the specific implementation of the above frame extraction modes, reference may be made to step S310 and its optional steps, which are not repeated here.
The key frame selection setting column 840 is used to select key frames from the plurality of video frames obtained after frame extraction according to the key frame selection mode chosen by the user. The display unit 125 may send the selected mode to the key frame processing unit 122, so that the key frame processing unit 122 selects one or more key frames from the plurality of video frames for annotation. Specifically, if the user selects "automatic selection", the key frame processing unit 122 may use the video frame at a fixed position, such as the first frame or the last frame, as the key frame, or input the plurality of video frames obtained after frame extraction into the key frame selection model and determine the key frame from the model output; reference may be made to step S320 and its optional steps, which are not repeated here.
The key frame labeling setting column 850 is used to annotate the key frame according to the key frame annotation mode selected by the user and obtain the user annotation result. Specifically, the display unit 125 sends the selected key frame annotation mode to the key frame processing unit 122, so that the key frame processing unit 122 annotates the key frame in that mode. If the user selects "automatic annotation", the video annotation system automatically generates the user annotation result on the key frame without any annotation action by the user, greatly improving the user experience and the annotation efficiency; if the user selects "human-computer interaction", the video annotation system recommends an annotation result to the user, and the user only needs to confirm it manually, for example the center point can be generated automatically when the user moves the mouse to any position on the object, so the user does not need to find the center of the object manually, which guarantees annotation precision while improving efficiency; if the user selects "manual annotation", the user annotates the key frame himself, for example by manually drawing a target frame on the object or a center point at the center of the object.
The annotation format setting column 860 is configured to automatically annotate, according to an annotation format selected by a user, a non-key frame in a plurality of video frames obtained after frame extraction, where the annotation format may include, for example, a rectangular frame, a central point, a mask, a polygon, and the like, and of course, other annotation formats may also be included, which is not specifically limited in this application.
The automatic optimization setting column 870 is used to correct the annotation result according to the user's selection. If the user selects "no", the annotation result of the automatic labeling unit 123 is output directly; if the user selects "yes", the video annotation system calls the annotation correcting unit 124 to perform annotation correction. Further, the automatic optimization setting column 870 may also include an annotation correction level, allowing the user to select the number of correction passes according to the training target; if the training target requires samples with higher annotation precision, the number of correction passes can be increased. The specific form of the automatic optimization setting column 870 is not limited.
The save control 880 is configured to receive a user operation (e.g., a mouse click or a touch operation), and in response to the detected user operation, the display unit 125 may send the options input by the user in the file loading column 820, the frame extracting setting column 830, the key frame selection setting column 840, the key frame labeling setting column 850, the annotation format setting column 860, and the automatic optimization setting column 870 to the corresponding unit modules for processing.
It should be understood that after the user clicks the save control 880, the display unit 125 may continue to display the next interface according to the user's selections. For example, if the user selects "manual selection" in the key frame selection setting column 840, the display unit 125 may then display the interface shown in fig. 4; if the user selects "human-computer interaction" in the key frame labeling setting column 850, the display unit 125 may further display the interface shown in fig. 5; after the user annotation result is obtained, the display unit 125 may display the automatic annotation interface shown in fig. 6 and present the automatically generated annotation result to the user; and if the user selects "yes" in the automatic optimization setting column 870, the display unit may further display the annotation modification interface shown in fig. 7, on which the user can modify the annotation result to generate a new key frame and a new user annotation result. Reference may be made to the foregoing for details, which are not repeated here.
It can be understood that fig. 8 only illustrates the interface presented by the display unit 125 to the user before step S310 and should not constitute a limitation on the embodiments of the present application.
In summary, the video annotation method provided by the present application can automatically annotate the non-key frames according to the annotation result of the key frame, thereby obtaining the annotation result of the entire video. The annotation result of the key frame may be produced by the user, or confirmed by the user after the video annotation system automatically recommends it. In the whole annotation process, the user therefore only needs to annotate the key frame once, or even merely confirm once whether the key frame is annotated correctly, to obtain the complete video annotation result, which greatly reduces the user's annotation operations and improves video annotation efficiency and the user experience.
The method of the embodiments of the present application is explained in detail above, and in order to better implement the above-mentioned aspects of the embodiments of the present application, the following also provides related equipment for implementing the above-mentioned aspects.
Fig. 9 is a video annotation system provided in the present application, and as shown in fig. 9, the system includes:
an automatic frame extracting unit 121, configured to extract a plurality of video frames from an unlabeled video;
a key frame processing unit 122 for displaying at least one video frame of the plurality of video frames to a user through a display interface;
the automatic labeling unit 123 is configured to obtain a user labeling result that a user labels at least one video frame in the display interface, where the user labeling result includes an image area of a target in the at least one video frame;
and the annotation correcting unit 124 is configured to obtain an annotation result of another video frame in the plurality of video frames according to the user annotation result, where the annotation result includes an image area of the target in the another video frame.
In an embodiment, the automatic frame extracting unit 121 is configured to extract a plurality of video frames from the unlabeled video according to the video content of the unlabeled video, where image similarity between the plurality of video frames is lower than a first threshold, or an object variation between the plurality of video frames is higher than a second threshold.
In one embodiment, the at least one video frame is a leading frame or a trailing frame of the plurality of video frames; or at least one video frame is obtained by inputting a plurality of video frames into a key frame selection model, and the key frame selection model is obtained by training a neural network model by using a plurality of known video frames and corresponding known key frames as training samples.
In an embodiment, the key frame processing unit 122 is configured to input the at least one video frame into a recommended annotation model to obtain a recommended annotation result, where the recommended annotation result includes an image area of at least one recommended target in the at least one video frame; the key frame processing unit 122 is further configured to display the recommended annotation result to the user through the display interface and obtain the user annotation result selected by the user from the recommended annotation result.
In one embodiment, the annotation result includes one or more of a target box, a center point, and a mask.
In an embodiment, the automatic labeling unit 123 is configured to invoke a plurality of computing units according to the user labeling result, and process other video frames in parallel to obtain the labeling result of the other video frames, where one computing unit processes one video frame, or at least one computing unit processes one video frame, and each computing unit in the at least one computing unit generates one labeling result.
In an embodiment, the automatic labeling unit 123 is configured to label, according to the labeling result of the user, the video frame after or before the key frame to obtain a labeling result.
In an embodiment, the system further includes an annotation modification unit 124, where the key frame processing unit 122 is configured to receive, through the display interface, modification information of the annotation result from a user's modification of an image area of the target in another video frame; or, the label modification unit 124 is configured to obtain modification information of the label result by labeling a modification model, where the label modification model is obtained by training a neural network model using a plurality of known label results and corresponding known modification information as training samples; the automatic labeling unit 123 is configured to modify the labeling result of other video frames in the plurality of video frames according to the modification information.
In one embodiment, the annotation result is used to be learned by the AI model.
It should be understood that the video annotation system 120 of this embodiment may be a physical machine, such as an X86 server, a virtual machine, or a computer cluster composed of a plurality of physical machines or virtual machines. The internal unit modules of the video annotation system 120 may also be divided in various ways, and each module may be a software module, a hardware module, or partly software and partly hardware, which is not limited in this application; fig. 9 shows an exemplary division and is not a specific limitation. The video annotation system 120 of the embodiments of the present application may correspondingly perform the methods described in the foregoing embodiments, and the above and other operations and/or functions of its units are respectively used to implement the corresponding flows of the methods in fig. 3 to fig. 8, which are not repeated here for brevity.
In summary, the video annotation system provided by the present application can automatically annotate the non-key frames according to the annotation result of the key frame, thereby obtaining the annotation result of the entire video. The annotation result of the key frame may be produced by the user, or confirmed by the user after the video annotation system automatically recommends it. In the whole annotation process, the user therefore only needs to annotate the key frame once, or even merely confirm once whether the key frame is annotated correctly, to obtain a complete video annotation result, which greatly reduces the user's annotation operations and improves video annotation efficiency and the user experience.
Fig. 10 is a schematic structural diagram of a computing device 1000 provided in the present application, where the computing device 1000 may be the video annotation system 120 described above. As shown in fig. 10, the computing device 1000 includes a processor 1010, a communication interface 1020, and a memory 1030, which may be connected to each other via an internal bus 1040 or may communicate by other means such as wireless transmission. In the embodiment of the present application, the bus 1040 may be, for example, a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus 1040 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean that there is only one bus or one type of bus.
The processor 1010 may be composed of at least one general-purpose processor, such as a Central Processing Unit (CPU), or a combination of a CPU and hardware chips. The hardware chip may be an Application-Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a Field-Programmable Gate Array (FPGA), Generic Array Logic (GAL), or any combination thereof. The processor 1010 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 1030, which enable the computing device 1000 to provide a variety of services.
The memory 1030 is configured to store program codes and is controlled by the processor 1010 to execute the processing steps of the video annotation system in the above embodiments. The program code may include one or more software modules, which may be software modules provided in the embodiment of fig. 9, such as an automatic frame extraction unit, a key frame processing unit, and an automatic annotation unit, where the automatic frame extraction unit is configured to extract a plurality of video frames from an unlabeled video, the key frame processing unit is configured to display at least one of the plurality of video frames to a user through a display interface and obtain a user annotation result indicating that the user annotates the at least one video frame in the display interface, and the automatic annotation unit is configured to obtain an annotation result of another video frame in the plurality of video frames according to the annotation result. Specifically, the method may be used to execute steps S310 to S340 in the embodiment of fig. 3 and optional steps thereof, and may also be used to execute other steps executed by the video annotation system described in the embodiment of fig. 3 to 8, which are not described herein again.
Memory 1030 may include Volatile Memory (Volatile Memory), such as Random Access Memory (RAM); the Memory 1030 may also include a Non-Volatile Memory (Non-Volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, HDD), or a Solid-State Drive (SSD); memory 1030 may also include combinations of the above. The memory 1030 may store program codes, and may specifically include program codes for performing other steps described in the embodiments of fig. 3 to fig. 8, which are not described herein again.
The communication interface 1020 may be a wired interface (e.g., an Ethernet interface), an internal interface (e.g., a Peripheral Component Interconnect Express (PCIe) bus interface), or a wireless interface (e.g., a cellular network interface or a wireless LAN interface), and is used to communicate with other devices or modules.
It should be noted that fig. 10 shows only one possible implementation of the embodiment of the present application; in practical applications, the computing device 1000 may include more or fewer components, which is not limited herein. For content not shown or described in this embodiment, reference may be made to the related explanations in the foregoing embodiments of fig. 3 to 8, which are not repeated here.
It should be understood that the computing device shown in fig. 10 may be a general-purpose physical server, for example an ARM server or an X86 server, or a virtual machine implemented on a general-purpose physical server in combination with NFV technology, where a virtual machine is a complete computer system with full hardware functions that runs in a completely isolated environment; it may also be a computer cluster composed of at least one physical server or virtual machine, and the present application is not specifically limited in this respect.
Embodiments of the present application also provide a computer-readable storage medium that stores instructions which, when run on a processor, implement the method flows shown in fig. 3 to 8.
Embodiments of the present application also provide a computer program product, and when the computer program product is run on a processor, the method flows shown in fig. 3-8 are implemented.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

1. A method for video annotation, the method comprising:
extracting a plurality of video frames from the unlabeled video;
displaying at least one video frame of the plurality of video frames to a user through a display interface;
acquiring a user annotation result for the user to annotate the at least one video frame in the display interface, wherein the user annotation result comprises an image area of a target in the at least one video frame;
and obtaining the labeling result of other video frames in the plurality of video frames according to the user labeling result, wherein the labeling result comprises the image areas of the target in the other video frames.
2. The method of claim 1, wherein the extracting the plurality of video frames from the unlabeled video comprises:
extracting a plurality of video frames from the unlabeled video according to the video content of the unlabeled video, wherein the image similarity among the plurality of video frames is lower than a first threshold, or the object variation among the plurality of video frames is higher than a second threshold.
3. The method according to claim 1 or 2, wherein the at least one video frame is a leading frame or a trailing frame of the plurality of video frames; or,
the at least one video frame is obtained by inputting the plurality of video frames into a key frame selection model, and the key frame selection model is obtained by training a neural network model by using a plurality of known video frames and corresponding known key frames as training samples.
4. The method according to any one of claims 1 to 3, wherein the obtaining of the user annotation result of the user annotating the at least one video frame in the display interface comprises:
inputting the at least one video frame into a recommended annotation model to obtain a recommended annotation result, wherein the recommended annotation result comprises an image area of at least one recommended target in the at least one video frame;
and displaying the recommended marking result to the user through the display interface, and acquiring the user marking result selected by the user in the recommended marking result.
5. The method according to any one of claims 1 to 4, wherein the labeling result comprises one or more of a target box, a center point, and a mask.
6. The method of claim 5, wherein obtaining annotation results for other video frames of the plurality of video frames according to the user annotation result comprises:
and calling a plurality of computing units according to the user annotation result, and processing the other video frames in parallel to obtain the annotation result of the other video frames, wherein one computing unit processes one video frame, or at least one computing unit processes one video frame, and each computing unit in the at least one computing unit generates an annotation result.
7. The method of claim 6, wherein obtaining annotation results for other video frames of the plurality of video frames according to the user annotation result comprises:
and according to the user labeling result, labeling the video frame behind or in front of the key frame to obtain a labeling result.
8. The method according to any one of claims 1 to 7, further comprising:
receiving modification information of the annotation result from the user through the display interface, wherein the modification information is from modification of the image area of the target in the other video frames by the user; or,
obtaining modification information of the labeling result through a labeling modification model, wherein the labeling modification model is obtained by training a neural network model by using a plurality of known labeling results and corresponding known modification information as training samples;
and modifying the labeling results of other video frames in the plurality of video frames according to the modification information.
9. The method according to any one of claims 1 to 8, wherein the labeling result is used for learning by an artificial intelligence AI model.
10. A video annotation system, said system comprising:
the automatic frame extracting unit is used for extracting a plurality of video frames from the unlabeled video;
a key frame processing unit for displaying at least one of the plurality of video frames to a user through a display interface;
the automatic labeling unit is used for acquiring a user labeling result of the user labeling the at least one video frame in the display interface, wherein the user labeling result comprises an image area of a target in the at least one video frame;
and the annotation correcting unit is used for obtaining the annotation result of other video frames in the plurality of video frames according to the user annotation result, wherein the annotation result comprises the image area of the target in the other video frames.
11. The system according to claim 10, wherein the automatic frame extraction unit is configured to extract a plurality of video frames from the unlabeled video according to the video content of the unlabeled video, wherein the image similarity between the plurality of video frames is lower than a first threshold, or the object variation between the plurality of video frames is higher than a second threshold.
12. The system according to claim 10 or 11, wherein the at least one video frame is a leading frame or a trailing frame of the plurality of video frames; or,
the at least one video frame is obtained by inputting the plurality of video frames into a key frame selection model, and the key frame selection model is obtained by training a neural network model by using a plurality of known video frames and corresponding known key frames as training samples.
13. The system according to any one of claims 10 to 12, wherein the key frame processing unit is configured to input the at least one video frame into a recommended annotation model to obtain a recommended annotation result, wherein the recommended annotation result includes an image area of at least one recommended target in the at least one video frame;
the key frame processing unit is used for displaying the recommended marking result to the user through the display interface and acquiring the user marking result selected by the user in the recommended marking result.
14. The system according to any one of claims 10 to 13, wherein the labeling result comprises one or more of a target box, a center point, and a mask.
15. The system according to claim 14, wherein the automatic labeling unit is configured to invoke a plurality of computing units according to the user labeling result, and process the other video frames in parallel to obtain the labeling result of the other video frames, wherein one computing unit processes one video frame, or at least one computing unit processes one video frame, and each computing unit in the at least one computing unit generates one labeling result.
16. The system of claim 15, wherein the automatic labeling unit is configured to label the video frame after or before the key frame according to the user labeling result to obtain a labeling result.
17. The system according to any of the claims 10 to 16, characterized in that the system further comprises an annotation modification unit,
the key frame processing unit is used for receiving modification information of the user on the annotation result through the display interface, wherein the modification information is from the user's modification of the image area of the target in the other video frames; or,
the label correction unit is used for obtaining the correction information of the label result through a label correction model, and the label correction model is obtained after training a neural network model by using a plurality of known label results and corresponding known correction information as training samples;
and the automatic labeling unit is used for modifying the labeling results of other video frames in the plurality of video frames according to the modification information.
18. The system of any one of claims 10 to 17, wherein the annotation result is used for learning by an AI model.
19. A computer-readable storage medium comprising instructions that, when executed on a computing device, cause the computing device to perform the method of any of claims 1 to 9.
20. A computing device comprising a processor and a memory, the processor executing code in the memory to perform the method of any of claims 1 to 9.
21. A computer program product comprising a computer program that, when read and executed by a computing device, causes the computing device to perform the method of any of claims 1 to 9.
CN202010890640.XA 2020-08-29 2020-08-29 Method, system and equipment for video annotation Pending CN114117128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010890640.XA CN114117128A (en) 2020-08-29 2020-08-29 Method, system and equipment for video annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010890640.XA CN114117128A (en) 2020-08-29 2020-08-29 Method, system and equipment for video annotation

Publications (1)

Publication Number Publication Date
CN114117128A true CN114117128A (en) 2022-03-01

Family

ID=80359857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010890640.XA Pending CN114117128A (en) 2020-08-29 2020-08-29 Method, system and equipment for video annotation

Country Status (1)

Country Link
CN (1) CN114117128A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973056A (en) * 2022-03-28 2022-08-30 华中农业大学 Information density-based fast video image segmentation and annotation method
CN115689277A (en) * 2022-10-12 2023-02-03 北京思路智园科技有限公司 Chemical industry park risk early warning system under cloud limit collaborative technology
CN115689277B (en) * 2022-10-12 2024-05-07 北京思路智园科技有限公司 Chemical industry garden risk early warning system under cloud edge cooperation technology
CN115757871A (en) * 2022-11-15 2023-03-07 北京字跳网络技术有限公司 Video annotation method, device, equipment, medium and product
WO2024104272A1 (en) * 2022-11-15 2024-05-23 北京字跳网络技术有限公司 Video labeling method and apparatus, and device, medium and product
WO2024104239A1 (en) * 2022-11-15 2024-05-23 北京字跳网络技术有限公司 Video labeling method and apparatus, and device, medium and product
CN116484091A (en) * 2023-03-10 2023-07-25 湖北天勤伟业企业管理有限公司 Card information program interaction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination