CN116721419A - Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)

Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)

Info

Publication number
CN116721419A
Authority
CN
China
Prior art keywords
model
large model
sam
target block
visual large
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310767430.5A
Other languages
Chinese (zh)
Inventor
栾博恒
吕宽
李雨雨
徐楚量
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Godes Hangzhou Intelligent Technology Co ltd
Original Assignee
Godes Hangzhou Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Godes Hangzhou Intelligent Technology Co ltd filed Critical Godes Hangzhou Intelligent Technology Co ltd
Priority to CN202310767430.5A
Publication of CN116721419A
Legal status: Pending (current)

Classifications

    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an auxiliary labeling method combined with the visual large model SAM, which comprises the following steps: step a, picture segmentation; step b, calculating a result according to the mouse coordinates; step c, generating a labeling frame; step d, confirming whether the labeling frame meets the requirement, and once it does, repeating step b until all pictures are labeled. The invention combines the visual large model SAM with a traditional labeling tool: SAM divides the picture to be labeled into a plurality of target blocks, which are then displayed on the web page, realizing an efficient image labeling process and reducing the workload of manual labeling. A target block is displayed when the mouse hovers over it, and the correct target block is established by the user clicking the prompt area, so mouse clicks and movements are reduced from more than two actions to a single click, greatly reducing the user's workload.

Description

Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an auxiliary labeling method combined with the visual large model SAM.
Background
In the prior art, a traditional browser web page can only display pictures, and the canvas element provides only basic graphics drawing, so picture editing and in-page picture drawing are inconvenient and cannot be linked with mouse operations. In the field of deep learning in particular, the coordinates of graphics must be recorded and the graphics must be labeled.
The prior art discloses a web page image labeling method and apparatus, an electronic device and a storage medium (application publication number CN112346809A) that link operation events with an operation canvas: after graphics labeling is performed on multiple target objects in the operation canvas through operation events, the coordinate information of the target objects can be recorded in real time, so pictures in the operation canvas can be operated on conveniently and rapidly.
However, existing labeling schemes generally require all data to be labeled manually, even for large batches of highly repetitive image data, and manual labeling requires the user to drag out a drawing frame by hand. This causes the following technical problems: the drawn frame is often in a state unsuitable for AI learning, and its size must be repeatedly trimmed and modified until it fits, so considerable extra time is spent correcting labeling frames before the AI can recognize and learn well; the labeling process therefore depends heavily on manpower, consists largely of repetitive work, and is inefficient.
Disclosure of Invention
The invention aims to solve the above technical problems in the prior art by providing an auxiliary labeling method combined with the visual large model SAM, which realizes an efficient image labeling process and reduces the workload of manual labeling.
In order to solve the technical problems, the invention adopts the following technical scheme:
the auxiliary labeling method combined with the visual large model SAM is characterized by comprising the following steps of:
step a, picture segmentation: the user opens a picture marking tool, divides a picture to be marked by the user into a plurality of image embedding masks through a visual large model SAM, and integrates the images to generate a model which can be displayed at a webpage end.
Step b, calculating a result according to the mouse coordinates: and decoding the model, thereby finding out a target block which accords with the position of the mouse, displaying the target block on a webpage, suspending the target block by a user through the mouse, generating a prompt area, and establishing a correct target block by clicking the prompt area by the user.
Step c, generating a labeling frame: and generating a label frame to wrap the target block according to the coordinates and the length and the width of the displayed target block when the user clicks the correct target block.
And d, confirming whether the labeling frame meets the requirement, and repeating the step b until all the pictures are labeled after the labeling frame meets the requirement.
Further, the visual large model comprises an encoder and a decoder, and picture segmentation specifically comprises: (1) extracting image features with the encoder; (2) restoring the feature map to the original image size with the decoder and generating the segmentation result.
Further, the visual large model uses a cross-entropy-based multi-task loss function comprising a pixel-level classification loss and a bounding-box-level regression loss. The classification loss measures the class to which each pixel belongs, and the regression loss adjusts the bounding box position of each pixel.
Further, the visual large model employs a data enhancement module that includes random rotation, scaling, cropping, flipping, color space transformation and noise addition.
Further, the visual large model uses a pre-trained model as the initial weights of the encoder, to accelerate model training and improve segmentation accuracy.
Further, the pre-trained model is pre-trained using MAE and ViT.
Further, the picture segmented by the visual large model is processed into an embedding model file; the embedding model file is then run with ONNX and processed, the corresponding mask is obtained according to the mouse coordinates, the mask is decoded into a picture file serving as the target block, and the picture file is overlaid at the corresponding position of the original picture.
Further, the prompt area is a blue area covering the labeling target; the size of the blue area is adjusted by switching it with the mouse wheel; the target blocks matching the mouse coordinates are assembled into an array, and the user switches among the target blocks displayed from the array with the wheel.
Due to the adoption of the technical scheme, the invention has the following beneficial effects:
according to the invention, the visual large model SAM is combined with the traditional marking tool, the visual large model SAM divides the picture to be marked by the user into a plurality of target blocks, and then the target blocks are displayed on the webpage, so that an efficient image marking process is realized, and the workload of manual marking is reduced.
According to the invention, the target block is displayed through the suspension of the mouse, and the correct target block is established through the click prompt area of the user, so that the click and displacement of the mouse are reduced from more than two times to one click, and the operation amount of the user is greatly reduced.
Drawings
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of the operation of the present invention;
FIG. 2 is an image of the invention during mouse hover;
FIG. 3 is a diagram of the invention when a labeling frame is generated.
Detailed Description
As shown in FIGS. 1 to 3, the auxiliary labeling method combines the SAM (Segment Anything Model) visual large model with a traditional labeling tool, realizing an efficient image labeling process and reducing the workload of manual labeling.
The auxiliary labeling method combined with the visual large model SAM comprises the following steps:
step a, picture segmentation: the method comprises the steps of opening a picture marking tool by a user, extracting ten pictures, cutting the ten pictures through a visual large model SAM, dividing the pictures required to be marked by the user into a plurality of image embedding masks, and integrating to generate an embedding model file which can be called at a webpage end through onnx.
The visual large model comprises an encoder and a decoder: the encoder section is composed of a plurality of convolution layers and pooling layers for extracting image features; the decoder section is composed of a plurality of deconvolution layers and up-sampling layers for restoring the feature map to the original image size and generating the segmentation result. Specifically:
Encoder: composed of multiple convolution layers and pooling layers for extracting image features. Each convolution layer typically includes operations such as convolution kernels, activation functions and batch normalization, used for feature extraction and dimensionality reduction of the input image. The pooling layers down-sample the feature map to reduce computation and memory consumption.
Decoder: composed of multiple deconvolution layers and up-sampling layers for restoring the feature map to the original image size and generating the segmentation result. Each deconvolution layer typically includes operations such as deconvolution kernels, activation functions and batch normalization, used for up-sampling and feature fusion of the feature maps. The up-sampling layers up-sample the feature map back to the original image size.
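For concreteness, here is a minimal sketch of the encoder-decoder shape just described: convolution plus pooling on the way down, deconvolution plus up-sampling back to the input size. It illustrates the text only and is not SAM's actual architecture.

```python
# Minimal encoder-decoder segmentation network matching the description:
# conv + pool layers extract features, deconv + upsampling layers restore
# the original resolution and emit per-pixel class logits.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 1/2 resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 1/4 resolution
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2),           # back to 1/2
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 2, stride=2),  # original size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

logits = TinySegNet()(torch.randn(1, 3, 256, 256))  # -> (1, 2, 256, 256)
```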
Loss function: the visual large model uses a cross-entropy-based multi-task loss function that includes a pixel-level classification loss and a bounding-box-level regression loss. The classification loss measures which class (e.g., foreground or background) each pixel belongs to, and the regression loss adjusts the bounding box position of each pixel to better match the target.
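A minimal sketch of such a multi-task loss, assuming per-pixel cross entropy for the classification term and a smooth-L1 term for bounding-box regression; the tensor shapes and the equal weighting are illustrative assumptions, not values from the patent.

```python
# Multi-task loss: pixel-level cross entropy + bounding-box regression.
import torch.nn.functional as F

def multitask_loss(class_logits, class_targets, box_preds, box_targets,
                   box_weight: float = 1.0):
    """class_logits: (N, C, H, W); class_targets: (N, H, W) int64 labels;
    box_preds / box_targets: (N, 4) box coordinates."""
    cls_loss = F.cross_entropy(class_logits, class_targets)  # which class per pixel
    reg_loss = F.smooth_l1_loss(box_preds, box_targets)      # box position fit
    return cls_loss + box_weight * reg_loss
```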
Data enhancement: to improve the robustness and generalization ability of the model, the visual large model employs a variety of data enhancement techniques, such as random rotation, scaling, cropping and flipping, as well as color space transformation and noise addition.
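The listed augmentations could be assembled with torchvision as below; the specific parameter ranges are illustrative assumptions.

```python
# Data enhancement pipeline: rotation, scaling, cropping, flipping,
# color space transformation, and additive noise.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                # random rotation
    transforms.RandomResizedCrop(256, scale=(0.8, 1.0)),  # scaling + cropping
    transforms.RandomHorizontalFlip(),                    # flipping
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.05),          # color space transform
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),  # noise addition
])
```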
Pre-training model: to accelerate model training and improve segmentation accuracy, the visual large model typically uses a pre-trained image classification model as the initial weights of the encoder so that image features are extracted better; the pre-trained model is pre-trained using MAE and ViT.
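A sketch of initializing such an encoder from an MAE-pretrained ViT, assuming the timm library; the vit_base_patch16_224.mae weight tag is an assumption, not something the patent specifies.

```python
# Load an MAE-pretrained ViT backbone to use as the segmentation encoder.
import timm
import torch

encoder = timm.create_model(
    "vit_base_patch16_224.mae",  # ViT weights pre-trained with MAE (assumed tag)
    pretrained=True,
    num_classes=0,               # drop the classification head, keep features
)
tokens = encoder.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # patch-token features consumed by the decoder
```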
Step b, calculating a result according to the mouse coordinates: the embedding model file is run with ONNX and processed; the corresponding mask is obtained according to the mouse coordinates, decoded into a picture file, and that picture file, which is the target block, is overlaid at the corresponding position of the original image so the target block is displayed on the web page. When the user hovers the mouse, a prompt area is generated; the prompt area is a blue area covering the labeling target, and its size is adjusted by switching it with the mouse wheel. The target blocks matching the mouse coordinates are assembled into an array, and the user switches among the target blocks displayed from the array with the wheel. When satisfied, the user clicks the prompt area to establish the correct target block.
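A sketch of the decoding in step b: the precomputed image embedding and the mouse position are fed to a SAM mask decoder exported to ONNX. The input names follow segment-anything's ONNX export example; the file names are illustrative, and for brevity the click coordinates are assumed to be already transformed into the model's input frame (SamPredictor.transform.apply_coords performs that transform).

```python
# Decode the mask under the mouse with ONNX Runtime and pick the best one.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("sam_decoder.onnx")
embedding = np.load("to_label_embedding.npy")          # (1, 256, 64, 64)

coords = np.array([[[412.0, 230.0], [0.0, 0.0]]], dtype=np.float32)
labels = np.array([[1.0, -1.0]], dtype=np.float32)     # hover point + padding point
masks, scores, _ = session.run(None, {
    "image_embeddings": embedding,
    "point_coords": coords,
    "point_labels": labels,
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
    "orig_im_size": np.array([1080.0, 1920.0], dtype=np.float32),
})
target_block = masks[0, int(np.argmax(scores[0]))] > 0.0  # boolean target block
```

In the labeling tool itself this decoding would run in the browser (for example with onnxruntime-web), which is why only the lightweight decoder is exported.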
Step c, generating a labeling frame: when the user clicks the correct target block, a labeling frame wrapping the target block is generated from the coordinates and the length and width of the displayed target block.
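Step c then reduces to reading the bounding box off the clicked target block's mask, for example:

```python
# Derive the labeling frame (x, y, width, height) from the mask's pixels.
import numpy as np

def labeling_frame(mask: np.ndarray) -> tuple[int, int, int, int]:
    """mask: (H, W) boolean target block."""
    ys, xs = np.nonzero(mask)
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)
```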
Step d, confirming whether the labeling frame meets the requirement; once it does, step b is repeated until all pictures are labeled. If the requirement is not met, the frame is deleted, or the user switches to a selection tool and fine-tunes it.
The above is only a specific embodiment of the present invention, but the technical features of the present invention are not limited thereto. Any simple change, equivalent substitution or modification made on the basis of the present invention to solve substantially the same technical problem and achieve substantially the same technical effect falls within the protection scope of the present invention.

Claims (8)

1. The auxiliary labeling method combined with the visual large model SAM, characterized by comprising the following steps:
step a, picture segmentation:
the user opens a picture labeling tool, the visual large model SAM divides the picture to be labeled into a plurality of image embedding masks, and the masks are integrated to generate a model that can be displayed on the web page;
step b, calculating a result according to the mouse coordinates:
decoding the model to find the target block matching the mouse position, displaying the target block on the web page, the user hovering the mouse to generate a prompt area, and establishing the correct target block by the user clicking the prompt area;
step c, generating a labeling frame:
when the user clicks the correct target block, generating a labeling frame wrapping the target block from the coordinates and the length and width of the displayed target block;
step d, confirming whether the labeling frame meets the requirement, and once it does, repeating step b until all pictures are labeled.
2. The auxiliary labeling method combined with the visual large model SAM according to claim 1, wherein: the visual large model comprises an encoder and a decoder, and the picture segmentation specifically comprises:
(1) extracting image features with the encoder;
(2) restoring the feature map to the original image size with the decoder and generating the segmentation result.
3. The auxiliary labeling method combined with the visual large model SAM according to claim 2, wherein: the visual large model uses a cross-entropy-based multi-task loss function, including a pixel-level classification loss and a bounding-box-level regression loss;
the classification loss measures the class to which each pixel belongs, and the regression loss adjusts the bounding box position of each pixel.
4. The auxiliary labeling method combined with the visual large model SAM according to claim 2, wherein: the visual large model adopts a data enhancement module comprising random rotation, scaling, cropping, flipping, color space transformation and noise addition.
5. The auxiliary labeling method combined with the visual large model SAM according to claim 2, wherein: the visual large model uses a pre-trained model as the initial weights of the encoder, to accelerate model training and improve segmentation accuracy.
6. The auxiliary labeling method combined with the visual large model SAM according to claim 5, wherein: the pre-trained model is pre-trained using MAE and ViT.
7. The auxiliary labeling method combined with the visual large model SAM according to claim 1, wherein: the picture segmented by the visual large model is processed into an embedding model file; the embedding model file is then run with ONNX and processed, the corresponding mask is obtained according to the mouse coordinates, the mask is decoded into a picture file serving as the target block, and the picture file is overlaid at the corresponding position of the original picture.
8. The auxiliary labeling method combined with the visual large model SAM according to claim 1, wherein: the prompt area is a blue area covering the labeling target; the size of the blue area is adjusted by switching it with the mouse wheel; the target blocks matching the mouse coordinates are assembled into an array, and the user switches among the target blocks displayed from the array with the wheel.
CN202310767430.5A 2023-06-26 2023-06-26 Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model) Pending CN116721419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310767430.5A CN116721419A (en) 2023-06-26 2023-06-26 Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310767430.5A CN116721419A (en) 2023-06-26 2023-06-26 Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)

Publications (1)

Publication Number Publication Date
CN116721419A 2023-09-08

Family

ID=87873172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310767430.5A Pending CN116721419A (en) 2023-06-26 2023-06-26 Auxiliary labeling method combined with SAM (self-contained imaging) of visual large model

Country Status (1)

Country Link
CN (1) CN116721419A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935418A (en) * 2023-09-15 2023-10-24 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system
CN116935418B (en) * 2023-09-15 2023-12-05 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system


Legal Events

Date Code Title Description
PB01 Publication