CN116721419A - Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)

Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)

Info

Publication number
CN116721419A
Authority
CN
China
Prior art keywords
model
large model
sam
target block
visual large
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310767430.5A
Other languages
Chinese (zh)
Inventor
栾博恒
吕宽
李雨雨
徐楚量
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Godes Hangzhou Intelligent Technology Co ltd
Original Assignee
Godes Hangzhou Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Godes Hangzhou Intelligent Technology Co ltd filed Critical Godes Hangzhou Intelligent Technology Co ltd
Priority to CN202310767430.5A
Publication of CN116721419A
Legal status: Pending (current)

Classifications

    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an auxiliary labeling method combined with the visual large model SAM, which comprises the following steps: step a, picture segmentation; step b, calculating a result according to the mouse coordinates; step c, generating a labeling frame; step d, confirming whether the labeling frame meets the requirement, and once it does, repeating step b until all pictures are labeled. The invention combines the visual large model SAM with a traditional labeling tool: SAM divides the picture to be labeled into a plurality of target blocks, which are then displayed on the web page, realizing an efficient image labeling process and reducing the workload of manual labeling. A target block is displayed when the mouse hovers over it, and the correct target block is established by the user clicking the prompt area, so mouse clicks and movements are reduced from more than two actions to a single click, greatly reducing the user's workload.

Description

Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an auxiliary labeling method combined with the visual large model SAM.
Background
In the prior art, a traditional browser web page can only display pictures, and the canvas element provides only basic graphics drawing, so picture editing and in-page picture drawing are inconvenient and cannot be linked with mouse operations. In the field of deep learning in particular, the coordinates of graphics must be recorded and the graphics must be labeled.
The prior art discloses a web page image labeling method and apparatus, an electronic device and a storage medium (application publication number CN112346809A) that link operation events with an operation canvas: after graphics labeling is performed on multiple target objects in the operation canvas through operation events, the coordinate information of the target objects can be recorded in real time, so pictures in the operation canvas can be operated on conveniently and rapidly.
However, existing labeling schemes generally require all data to be labeled manually, even for large batches of highly repetitive image data, and manual labeling requires the user to drag out a drawing frame by hand. This causes the following technical problems: the drawn frame is often in a state unsuitable for AI learning, and its size must be repeatedly trimmed and modified until it fits, so considerable extra time is spent correcting labeling frames before the AI can recognize and learn well; the labeling process therefore depends heavily on manpower, consists largely of repetitive work, and is inefficient.
Disclosure of Invention
The invention aims to solve the above technical problems in the prior art by providing an auxiliary labeling method combined with the visual large model SAM, which realizes an efficient image labeling process and reduces the workload of manual labeling.
In order to solve the technical problems, the invention adopts the following technical scheme:
the auxiliary labeling method combined with the visual large model SAM is characterized by comprising the following steps of:
step a, picture segmentation: the user opens a picture marking tool, divides a picture to be marked by the user into a plurality of image embedding masks through a visual large model SAM, and integrates the images to generate a model which can be displayed at a webpage end.
Step b, calculating a result according to the mouse coordinates: and decoding the model, thereby finding out a target block which accords with the position of the mouse, displaying the target block on a webpage, suspending the target block by a user through the mouse, generating a prompt area, and establishing a correct target block by clicking the prompt area by the user.
Step c, generating a labeling frame: and generating a label frame to wrap the target block according to the coordinates and the length and the width of the displayed target block when the user clicks the correct target block.
And d, confirming whether the labeling frame meets the requirement, and repeating the step b until all the pictures are labeled after the labeling frame meets the requirement.
Further, the visual large model comprises an encoder and a decoder, and picture segmentation specifically comprises: (1) extracting image features with the encoder; (2) restoring the feature map to the original image size with the decoder and generating the segmentation result.
Further, the visual large model uses a cross-entropy-based multi-task loss function comprising a pixel-level classification loss and a bounding-box-level regression loss. The classification loss measures the class to which each pixel belongs, and the regression loss adjusts the bounding box position of each pixel.
Further, the visual large model employs a data enhancement module that includes random rotation, scaling, cropping, flipping, color space transformation and noise addition.
Further, the visual large model uses a pre-trained model as the initial weights of the encoder, to accelerate model training and improve segmentation accuracy.
Further, the pre-trained model is pre-trained using MAE and ViT.
Further, the picture segmented by the visual large model is processed into an embedding model file; the embedding model file is then run with ONNX and processed, the corresponding mask is obtained according to the mouse coordinates, the mask is decoded into a picture file serving as the target block, and the picture file is overlaid at the corresponding position of the original picture.
Further, the prompt area is a blue area covering the labeling target; the size of the blue area is adjusted by switching it with the mouse wheel; the target blocks matching the mouse coordinates are assembled into an array, and the user switches among the target blocks displayed from the array with the wheel.
Due to the adoption of the technical scheme, the invention has the following beneficial effects:
according to the invention, the visual large model SAM is combined with the traditional marking tool, the visual large model SAM divides the picture to be marked by the user into a plurality of target blocks, and then the target blocks are displayed on the webpage, so that an efficient image marking process is realized, and the workload of manual marking is reduced.
According to the invention, the target block is displayed through the suspension of the mouse, and the correct target block is established through the click prompt area of the user, so that the click and displacement of the mouse are reduced from more than two times to one click, and the operation amount of the user is greatly reduced.
Drawings
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of the operation of the present invention;
FIG. 2 is an image of the invention during mouse hover;
FIG. 3 is a diagram of the invention when a labeling frame is generated.
Detailed Description
As shown in FIGS. 1 to 3, the auxiliary labeling method combines the SAM (Segment Anything Model) visual large model with a traditional labeling tool, realizing an efficient image labeling process and reducing the workload of manual labeling.
The auxiliary labeling method combined with the visual large model SAM comprises the following steps:
step a, picture segmentation: the method comprises the steps of opening a picture marking tool by a user, extracting ten pictures, cutting the ten pictures through a visual large model SAM, dividing the pictures required to be marked by the user into a plurality of image embedding masks, and integrating to generate an embedding model file which can be called at a webpage end through onnx.
The visual large model comprises an encoder and a decoder: the encoder section is composed of a plurality of convolution layers and pooling layers for extracting image features; the decoder section is composed of a plurality of deconvolution layers and up-sampling layers for restoring the feature map to the original image size and generating the segmentation result. Specifically:
Encoder: composed of multiple convolution layers and pooling layers for extracting image features. Each convolution layer typically includes operations such as convolution kernels, activation functions and batch normalization, used for feature extraction and dimensionality reduction of the input image. The pooling layers down-sample the feature map to reduce computation and memory consumption.
Decoder: composed of multiple deconvolution layers and up-sampling layers for restoring the feature map to the original image size and generating the segmentation result. Each deconvolution layer typically includes operations such as deconvolution kernels, activation functions and batch normalization, used for up-sampling and feature fusion of the feature maps. The up-sampling layers up-sample the feature map back to the original image size.
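For concreteness, here is a minimal sketch of the encoder-decoder shape just described: convolution plus pooling on the way down, deconvolution plus up-sampling back to the input size. It illustrates the text only and is not SAM's actual architecture.

```python
# Minimal encoder-decoder segmentation network matching the description:
# conv + pool layers extract features, deconv + upsampling layers restore
# the original resolution and emit per-pixel class logits.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 1/2 resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 1/4 resolution
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2),           # back to 1/2
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 2, stride=2),  # original size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

logits = TinySegNet()(torch.randn(1, 3, 256, 256))  # -> (1, 2, 256, 256)
```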
Loss function: the visual large model uses a cross-entropy-based multi-task loss function that includes a pixel-level classification loss and a bounding-box-level regression loss. The classification loss measures which class (e.g., foreground or background) each pixel belongs to, and the regression loss adjusts the bounding box position of each pixel to better match the target.
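A minimal sketch of such a multi-task loss, assuming per-pixel cross entropy for the classification term and a smooth-L1 term for bounding-box regression; the tensor shapes and the equal weighting are illustrative assumptions, not values from the patent.

```python
# Multi-task loss: pixel-level cross entropy + bounding-box regression.
import torch.nn.functional as F

def multitask_loss(class_logits, class_targets, box_preds, box_targets,
                   box_weight: float = 1.0):
    """class_logits: (N, C, H, W); class_targets: (N, H, W) int64 labels;
    box_preds / box_targets: (N, 4) box coordinates."""
    cls_loss = F.cross_entropy(class_logits, class_targets)  # which class per pixel
    reg_loss = F.smooth_l1_loss(box_preds, box_targets)      # box position fit
    return cls_loss + box_weight * reg_loss
```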
Data enhancement: to improve the robustness and generalization ability of the model, the visual large model employs a variety of data enhancement techniques, such as random rotation, scaling, cropping and flipping, as well as color space transformation and noise addition.
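The listed augmentations could be assembled with torchvision as below; the specific parameter ranges are illustrative assumptions.

```python
# Data enhancement pipeline: rotation, scaling, cropping, flipping,
# color space transformation, and additive noise.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                # random rotation
    transforms.RandomResizedCrop(256, scale=(0.8, 1.0)),  # scaling + cropping
    transforms.RandomHorizontalFlip(),                    # flipping
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.05),          # color space transform
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),  # noise addition
])
```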
Pre-training model: to accelerate model training and improve segmentation accuracy, the visual large model typically uses a pre-trained image classification model as the initial weights of the encoder so that image features are extracted better; the pre-trained model is pre-trained using MAE and ViT.
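A sketch of initializing such an encoder from an MAE-pretrained ViT, assuming the timm library; the vit_base_patch16_224.mae weight tag is an assumption, not something the patent specifies.

```python
# Load an MAE-pretrained ViT backbone to use as the segmentation encoder.
import timm
import torch

encoder = timm.create_model(
    "vit_base_patch16_224.mae",  # ViT weights pre-trained with MAE (assumed tag)
    pretrained=True,
    num_classes=0,               # drop the classification head, keep features
)
tokens = encoder.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # patch-token features consumed by the decoder
```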
Step b, calculating a result according to the mouse coordinates: the embedding model file is run with ONNX and processed; the corresponding mask is obtained according to the mouse coordinates, decoded into a picture file, and that picture file, which is the target block, is overlaid at the corresponding position of the original image so the target block is displayed on the web page. When the user hovers the mouse, a prompt area is generated; the prompt area is a blue area covering the labeling target, and its size is adjusted by switching it with the mouse wheel. The target blocks matching the mouse coordinates are assembled into an array, and the user switches among the target blocks displayed from the array with the wheel. When satisfied, the user clicks the prompt area to establish the correct target block.
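A sketch of the decoding in step b: the precomputed image embedding and the mouse position are fed to a SAM mask decoder exported to ONNX. The input names follow segment-anything's ONNX export example; the file names are illustrative, and for brevity the click coordinates are assumed to be already transformed into the model's input frame (SamPredictor.transform.apply_coords performs that transform).

```python
# Decode the mask under the mouse with ONNX Runtime and pick the best one.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("sam_decoder.onnx")
embedding = np.load("to_label_embedding.npy")          # (1, 256, 64, 64)

coords = np.array([[[412.0, 230.0], [0.0, 0.0]]], dtype=np.float32)
labels = np.array([[1.0, -1.0]], dtype=np.float32)     # hover point + padding point
masks, scores, _ = session.run(None, {
    "image_embeddings": embedding,
    "point_coords": coords,
    "point_labels": labels,
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
    "orig_im_size": np.array([1080.0, 1920.0], dtype=np.float32),
})
target_block = masks[0, int(np.argmax(scores[0]))] > 0.0  # boolean target block
```

In the labeling tool itself this decoding would run in the browser (for example with onnxruntime-web), which is why only the lightweight decoder is exported.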
Step c, generating a labeling frame: when the user clicks the correct target block, a labeling frame wrapping the target block is generated from the coordinates and the length and width of the displayed target block.
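Step c then reduces to reading the bounding box off the clicked target block's mask, for example:

```python
# Derive the labeling frame (x, y, width, height) from the mask's pixels.
import numpy as np

def labeling_frame(mask: np.ndarray) -> tuple[int, int, int, int]:
    """mask: (H, W) boolean target block."""
    ys, xs = np.nonzero(mask)
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)
```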
Step d, confirming whether the labeling frame meets the requirement; once it does, step b is repeated until all pictures are labeled. If the requirement is not met, the frame is deleted, or the user switches to a selection tool and fine-tunes it.
The above is only a specific embodiment of the present invention, but the technical features of the present invention are not limited thereto. Any simple change, equivalent substitution or modification made on the basis of the present invention to solve substantially the same technical problem and achieve substantially the same technical effect falls within the protection scope of the present invention.

Claims (8)

1. The auxiliary labeling method combined with the visual large model SAM, characterized by comprising the following steps:
step a, picture segmentation:
the user opens a picture labeling tool, the visual large model SAM divides the picture to be labeled into a plurality of image embedding masks, and the masks are integrated to generate a model that can be displayed on the web page;
step b, calculating a result according to the mouse coordinates:
decoding the model to find the target block matching the mouse position, displaying the target block on the web page, the user hovering the mouse to generate a prompt area, and establishing the correct target block by the user clicking the prompt area;
step c, generating a labeling frame:
when the user clicks the correct target block, generating a labeling frame wrapping the target block from the coordinates and the length and width of the displayed target block;
step d, confirming whether the labeling frame meets the requirement, and once it does, repeating step b until all pictures are labeled.
2. The auxiliary labeling method combined with the visual large model SAM according to claim 1, wherein: the visual large model comprises an encoder and a decoder, and the picture segmentation specifically comprises:
(1) extracting image features with the encoder;
(2) restoring the feature map to the original image size with the decoder and generating the segmentation result.
3. The auxiliary labeling method combined with the visual large model SAM according to claim 2, wherein: the visual large model uses a cross-entropy-based multi-task loss function, including a pixel-level classification loss and a bounding-box-level regression loss;
the classification loss measures the class to which each pixel belongs, and the regression loss adjusts the bounding box position of each pixel.
4. The auxiliary labeling method combined with the visual large model SAM according to claim 2, wherein: the visual large model adopts a data enhancement module comprising random rotation, scaling, cropping, flipping, color space transformation and noise addition.
5. The auxiliary labeling method combined with the visual large model SAM according to claim 2, wherein: the visual large model uses a pre-trained model as the initial weights of the encoder, to accelerate model training and improve segmentation accuracy.
6. The auxiliary labeling method combined with the visual large model SAM according to claim 5, wherein: the pre-trained model is pre-trained using MAE and ViT.
7. The auxiliary labeling method combined with the visual large model SAM according to claim 1, wherein: the picture segmented by the visual large model is processed into an embedding model file; the embedding model file is then run with ONNX and processed, the corresponding mask is obtained according to the mouse coordinates, the mask is decoded into a picture file serving as the target block, and the picture file is overlaid at the corresponding position of the original picture.
8. The auxiliary labeling method combined with the visual large model SAM according to claim 1, wherein: the prompt area is a blue area covering the labeling target; the size of the blue area is adjusted by switching it with the mouse wheel; the target blocks matching the mouse coordinates are assembled into an array, and the user switches among the target blocks displayed from the array with the wheel.
CN202310767430.5A 2023-06-26 2023-06-26 Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model) Pending CN116721419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310767430.5A CN116721419A (en) 2023-06-26 2023-06-26 Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310767430.5A CN116721419A (en) 2023-06-26 2023-06-26 Auxiliary labeling method combined with the visual large model SAM (Segment Anything Model)

Publications (1)

Publication Number Publication Date
CN116721419A 2023-09-08

Family

ID=87873172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310767430.5A Pending CN116721419A (en) 2023-06-26 2023-06-26 Auxiliary labeling method combined with SAM (self-contained imaging) of visual large model

Country Status (1)

Country Link
CN (1) CN116721419A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935418A (en) * 2023-09-15 2023-10-24 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system
CN116935418B (en) * 2023-09-15 2023-12-05 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system


Legal Events

Date Code Title Description
PB01 Publication