CN116935418A - Automatic three-dimensional graphic template reorganization method, device and system


Info

Publication number
CN116935418A
Authority
CN
China
Prior art keywords
image
template
text
automatic
dimensional graphic
Prior art date
Legal status
Granted
Application number
CN202311188895.1A
Other languages
Chinese (zh)
Other versions
CN116935418B (en)
Inventor
陈尧森
韩兴
温序铭
Current Assignee
Chengdu Sobey Digital Technology Co Ltd
Original Assignee
Chengdu Sobey Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Sobey Digital Technology Co Ltd filed Critical Chengdu Sobey Digital Technology Co Ltd
Priority to CN202311188895.1A
Publication of CN116935418A
Application granted
Publication of CN116935418B
Legal status: Active

Classifications

    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an automatic three-dimensional image-text template reorganization method, device, and system, comprising the following steps: S1, acquiring an image-text template data set containing each category from a three-dimensional image-text template library; S2, fine-tuning a pre-trained CLIP model with the image-text template data set; S3, performing image region segmentation on the input image; S4, inputting the segmented image regions and the image-text template data set into the fine-tuned CLIP model to obtain the image regions meeting the condition and their corresponding image-text template categories; S5, outputting the image region positions and the corresponding image-text template categories; S6, acquiring the image region control parameters; S7, completing the image-text template reorganization of the input image according to the control parameters, the image region positions, and the corresponding image-text template categories. The application realizes automatic image-text template reorganization and generation, providing higher efficiency, accuracy, and flexibility for fields such as image-text display and report generation.

Description

Automatic three-dimensional graphic template reorganization method, device and system
Technical Field
The application relates to the technical field of computer vision and deep learning, and in particular to an automatic three-dimensional image-text template reorganization method, device, and system.
Background
As deep learning techniques continue to develop, they play an increasingly important role in computer vision tasks. High-level semantic features of image and text data can be learned and extracted automatically by deep learning methods. These features better capture the correlation between images and text, enabling more accurate and efficient recognition of image-text templates. For example, in image processing, deep learning is used for tasks such as object detection, image segmentation, and image generation; in text processing, it is used for text classification, named entity recognition, and semantic understanding.
By combining the detection, segmentation, and OCR technologies of deep learning, regions of interest can be accurately extracted from an input image, providing precise input for classifying and identifying image-text templates and for recognizing key text information in the image. These advances allow an automatic three-dimensional image-text template reorganization method to adapt to diverse scenes and complex image content, providing strong support for automated image analysis and its applications.
Disclosure of Invention
To address the problems in the prior art, an automatic three-dimensional image-text template reorganization method, device, and system are provided, realizing automatic image-text template generation and reorganization and providing higher efficiency, accuracy, and flexibility for fields such as image-text display and report generation.
The first aspect of the application provides an automatic three-dimensional image-text template reorganization method, which comprises the following steps:
S1, acquiring an image-text template data set containing each category from a three-dimensional image-text template library;
S2, fine-tuning a pre-trained CLIP model with the image-text template data set;
S3, performing image region segmentation on the input image;
S4, inputting the segmented image regions and the image-text template data set into the fine-tuned CLIP model to obtain the image regions meeting the condition and their corresponding image-text template categories;
S5, outputting the image region positions and the corresponding image-text template categories;
S6, acquiring the image region control parameters;
S7, completing the image-text template reorganization of the input image according to the control parameters, the image region positions, and the corresponding image-text template categories.
In a preferred embodiment, in step S1, the image-text template data set consists of image-text pairs formed from images and the text of their corresponding image-text template categories.
In a preferred embodiment, the fine-tuning process in step S2 includes: inputting the image-text template data set into the pre-trained CLIP model so that it captures the semantic associations between images and their categories.
In a preferred embodiment, in step S3, all objects in the input image are segmented by region using the SAM model, each segmented image region is cropped to its minimum bounding rectangle, and all segmented image regions are stored.
In a preferred embodiment, the specific substeps of step S4 include:
step S41, inputting the segmented image regions and the text of all image-text template categories into the fine-tuned CLIP model;
step S42, after the CLIP model encodes the images and the texts, computing the cosine similarity between the image and text encodings one by one to obtain the similarity score of each image-text template category;
step S43, saving the positions of the image regions whose similarity scores are higher than the threshold, together with their image-text template categories.
In a preferred embodiment, in step S5, a rectangular frame is used to select each corresponding image region according to the obtained region position and image-text template category, and the corresponding image-text template category is output on the rectangular frame to complete the visualization of the classification result.
In a preferred embodiment, the specific substeps of step S6 include:
step S61, preprocessing the input image;
step S62, locating the text and number regions in the image;
step S63, performing OCR (optical character recognition) on the text and number regions;
step S64, post-processing and correcting the recognized text and numbers;
step S65, obtaining the key text and numeric information in the input image as the control parameters of the image-text template.
In a preferred embodiment, the specific substeps of step S7 include:
step S71, generating a chart of the corresponding type from the three-dimensional image-text template library at the corresponding position according to the acquired image region position and image-text template category;
step S72, reorganizing and customizing the generated chart according to the control parameters to generate a new image-text template.
The second aspect of the application provides an automatic three-dimensional image-text template reorganization device, comprising a processor and a memory, wherein the memory stores a computer program that, when loaded by the processor, executes the automatic three-dimensional image-text template reorganization method described above.
The third aspect of the application provides an automatic three-dimensional image-text template reorganization system, comprising the automatic three-dimensional image-text template reorganization device described above.
Compared with the prior art, the beneficial effects of adopting the technical scheme are as follows:
and (3) automatic image-text template recombination generation: by using the technologies of image-text template recognition, OCR recognition, image segmentation and the like, the application can automatically extract key text, figures and image areas from an input image and recombine the key text, figures and image areas with control parameters to generate a new image-text template. The workload of manually creating the image-text templates is greatly reduced, and the generation efficiency and consistency are improved.
Accuracy and reliability: by applying computer vision and deep learning techniques, the application achieves accurate image segmentation and OCR recognition, thereby providing precise region and text recognition results. This ensures that the generated image-text templates are consistent with the original image content and that the key information is correct and reliable.
Flexibility and personalization: the reorganization step matches and combines the control parameters with the segmented regions, so that the newly generated image-text template is highly flexible and personalized. Image-text templates of various styles and formats can be customized according to different control parameters, meeting users' individual requirements.
Time and cost savings: because image-text templates are reorganized and generated automatically, the application saves the time and cost of manually creating and designing image-text templates. Users do not need to manually process and edit image regions and text content, which greatly improves working efficiency and reduces related costs.
Scalability and adaptability: the application is based on computer vision and deep learning techniques, which are highly scalable and adaptable. As these technologies develop further, the performance of image-text template recognition and reorganization can be improved by updating and optimizing the models, adding training data, and similar means.
Drawings
Fig. 1 is a flow chart of the automatic three-dimensional image-text template reorganization method of the present application.
Fig. 2 is a diagram illustrating CLIP model fine-tuning in an embodiment of the present application.
Fig. 3 is a block diagram of the image segmentation and recognition process in an embodiment of the present application.
Figs. 4 (a)-4 (c) are visualizations of image-text template recognition results in an embodiment of the present application.
Fig. 5 is a block diagram of the three-dimensional image-text template reorganization in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar modules or modules having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. On the contrary, the embodiments of the application include all alternatives, modifications and equivalents as may be included within the spirit and scope of the appended claims.
In order to realize automatic image-text template generation and reorganization, this embodiment provides an automatic three-dimensional image-text template reorganization method that can automatically extract key text, numbers, and image regions from an image and recombine them with control parameters to generate a new image-text template. The specific scheme is as follows:
Referring to fig. 1, the automatic three-dimensional image-text template reorganization method includes:
s1, preparing a data set.
Before reorganization, an image-text template data set containing each category needs to be prepared to support the subsequent steps; the data consists of image-text pairs formed from images and their corresponding image-text template categories. The data set should be diverse and representative, covering the various types and scenes of image-text templates.
In this embodiment, image-text templates of multiple styles are selected from a three-dimensional image-text template library to produce 2000 images in total as the fine-tuning data set: 538 bar charts, 500 pie charts, 484 maps, and 478 line charts. The category of each picture is taken as its caption, and the captions and storage paths are matched one-to-one in a CSV file to obtain the image-text pairs, completing the construction of the data set.
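As a rough sketch of this data-set construction step (the directory layout, folder names, and file format below are illustrative assumptions, not details from the patent), the CSV of image-text pairs could be built as follows:

```python
import csv
from pathlib import Path

# Assumed layout: dataset/bar_chart/*.png, dataset/pie_chart/*.png, etc.
CATEGORIES = ["bar chart", "pie chart", "map", "line chart"]

def build_csv(root: str, out_csv: str) -> None:
    """Pair each template image's storage path with its category caption."""
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["image_path", "caption"])
        for category in CATEGORIES:
            folder = Path(root) / category.replace(" ", "_")
            for img in sorted(folder.glob("*.png")):
                # The caption is simply the template category name.
                writer.writerow([str(img), category])

build_csv("dataset", "template_pairs.csv")
```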
S2, fine-tuning the CLIP model.
Referring to fig. 2, the pre-trained CLIP (Contrastive Language-Image Pretraining) model is fine-tuned with the constructed data set, so that the image features of the image-text templates are effectively matched with the semantic information of their categories, realizing accurate classification of image-text templates.
The CLIP model is an image-text matching deep learning model based on contrastive learning that aligns the image and text representation spaces; it consists of an image encoder and a text encoder. During contrastive training, matched image-text pairs are pulled closer together in the shared embedding space while mismatched pairs are pushed apart. This training helps the model learn better image and text representations, making it perform well on a variety of visual and language tasks, and in particular on image classification, which is why it is used as the classification model for the automatic three-dimensional image-text template reorganization task in this embodiment.
Having been trained on 400 million image-text pairs, the pre-trained CLIP model performs well at recognizing common objects, but its accuracy is lower on the specific task of automatic three-dimensional image-text template reorganization. In this embodiment, the 2000 image-text pairs constructed above are used to fine-tune the pre-trained CLIP model so that it captures the semantic associations between template images and their categories, improving the accuracy of the automatic three-dimensional image-text template reorganization.
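The patent does not name the CLIP variant or training hyperparameters, so the following fine-tuning sketch uses the Hugging Face CLIP implementation with assumed values (checkpoint, batch size, learning rate, epochs) purely for illustration:

```python
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the patent does not specify the CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class TemplatePairs(Dataset):
    """Image-text pairs read from the CSV built in step S1."""
    def __init__(self, csv_path: str):
        self.df = pd.read_csv(csv_path)
    def __len__(self):
        return len(self.df)
    def __getitem__(self, i):
        row = self.df.iloc[i]
        return Image.open(row["image_path"]).convert("RGB"), row["caption"]

def collate(batch):
    images, captions = zip(*batch)
    return processor(text=list(captions), images=list(images),
                     return_tensors="pt", padding=True)

loader = DataLoader(TemplatePairs("template_pairs.csv"), batch_size=32,
                    shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        # return_loss=True computes CLIP's symmetric contrastive loss,
        # pulling matched image-text pairs together and pushing others apart.
        loss = model(**batch, return_loss=True).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```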
S3, image segmentation.
Referring to fig. 3, this step operates on the image to be processed, i.e., the input image. By segmenting the input image, the objects in the image are finely segmented by region, providing accurate image regions for image-text template identification.
Specifically, in this embodiment, segmentation is performed with the image segmentation model SAM (Segment Anything Model). The SAM model comprises an image encoder, a prompt encoder, and a lightweight mask decoder.
Image encoder: a pre-trained Vision Transformer (ViT), minimally adapted to process high-resolution inputs.
Prompt encoder: handles sparse prompts (points, boxes) and dense prompts (masks). Points and boxes are represented by positional encodings added to learned embeddings for each prompt type; dense prompts (masks) are embedded with convolutions and added element-wise to the image embedding.
Mask decoder: predicts segmentation masks from the image and prompt embeddings. It maps the image embedding, the prompt embeddings, and an output token to a mask. All embeddings are updated by decoder blocks that apply self-attention over the prompts and cross-attention in both directions (prompt-to-image embedding and vice versa).
After all objects in the input image are segmented by region with the SAM model, each segmented region is cropped to its minimum bounding rectangle, and the segmented image regions are saved, as sketched below.
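A minimal sketch of this segmentation-and-cropping step using the public segment-anything package might look as follows; the ViT-H variant and the checkpoint file name are assumptions:

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Assumed model variant and checkpoint path.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per segmented object

regions = []
for i, m in enumerate(masks):
    # 'bbox' is the mask's bounding box in XYWH format, i.e. the minimum
    # axis-aligned rectangle enclosing the segmented region.
    x, y, w, h = (int(v) for v in m["bbox"])
    crop = image[y:y + h, x:x + w]
    regions.append(((x, y, w, h), crop))
    cv2.imwrite(f"region_{i}.png", cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
```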
S4, identifying the image-text template.
The segmented image regions and the text of all collected image-text template categories are input into the fine-tuned CLIP model. The CLIP model judges the similarity between each image and the text information to obtain the image region positions meeting the condition and their corresponding image-text template categories. The specific steps are as follows (a code sketch follows the list):
S41, inputting the segmented image regions and the text of all image-text template categories into the fine-tuned CLIP model.
S42, the CLIP model feeds the images and the category texts into its image encoder and text encoder respectively for encoding, and the cosine similarity between the image and text encodings is computed one by one to obtain the similarity score of each image-text template category.
S43, the similarity score of each image region is compared with a set confidence threshold, and the positions of the image regions above the threshold are saved together with their corresponding image-text template categories.
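Steps S41-S43 could be sketched as follows with the fine-tuned Hugging Face CLIP model from step S2; the 0.3 similarity threshold is an illustrative assumption, since the patent leaves the confidence value unspecified:

```python
import torch

CATEGORIES = ["bar chart", "pie chart", "map", "line chart"]
THRESHOLD = 0.3  # assumed confidence threshold

@torch.no_grad()
def classify_regions(model, processor, regions):
    """regions: list of ((x, y, w, h), crop) pairs produced in step S3."""
    text_in = processor(text=CATEGORIES, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_in)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    hits = []
    for bbox, crop in regions:
        img_in = processor(images=crop, return_tensors="pt")
        img_emb = model.get_image_features(**img_in)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        # Cosine similarity between the region and every category text.
        sims = (img_emb @ text_emb.T).squeeze(0)
        score, idx = sims.max(dim=0)
        if score.item() > THRESHOLD:
            hits.append((bbox, CATEGORIES[idx.item()], score.item()))
    return hits
```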
S5, visual output of the classification result.
After the image region positions of the input image and the corresponding image-text template categories have been acquired through the CLIP model, the results need to be output visually.
Specifically, figs. 4 (a)-4 (c) show the visualized classification results for different input images, covering a bar chart, a pie chart, and a line chart: each image region whose score is above the threshold is selected with a rectangular frame, and the corresponding image-text template category is output at the upper left corner of the frame.
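A minimal OpenCV sketch of this visualization (drawing each saved rectangle and printing its category at the upper left corner) might look like this:

```python
import cv2

def visualize(image_bgr, hits, out_path="classified.png"):
    """hits: list of ((x, y, w, h), category, score) tuples from step S4."""
    for (x, y, w, h), category, score in hits:
        # Frame the region and label it with its template category.
        cv2.rectangle(image_bgr, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(image_bgr, f"{category} {score:.2f}", (x, max(y - 8, 12)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imwrite(out_path, image_bgr)
```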
S6, control parameter acquisition.
Referring to fig. 5, in order to make the reorganized image-text template consistent with the original image, control parameters such as the colors, numbers, and text in the original image need to be acquired. The specific steps are as follows (a code sketch follows the list):
S61, preprocessing operations such as denoising and image enhancement are performed on the input image.
S62, a deep-learning-based text detection model is used to locate the text and number regions in the image, i.e., to determine the regions that may contain key text and numeric information.
S63, OCR is performed on the text and number regions using a convolutional-neural-network-based OCR model, converting them into computer-readable text and numbers.
S64, the recognized text and number results are normalized, and post-processing corrections such as removing incorrectly recognized characters are applied.
S65, the obtained key text and numeric information is used as the control parameters of the image-text template.
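As one possible sketch of steps S62-S65, the easyocr package combines deep-learning text detection with CNN-based recognition in a single call; treating every detected number as a data value and the remaining strings as keywords is a simplifying assumption for illustration:

```python
import re
import easyocr

reader = easyocr.Reader(["ch_sim", "en"])  # Chinese + English recognition

def extract_control_parameters(image_path: str):
    results = reader.readtext(image_path)  # (bbox, text, confidence) triples
    keywords, numbers = [], []
    for _bbox, text, conf in results:
        if conf < 0.5:  # drop low-confidence detections (part of S64)
            continue
        cleaned = text.strip()
        # Pull out numeric values (integers, decimals, percentages).
        for num in re.findall(r"\d+(?:\.\d+)?%?", cleaned):
            numbers.append(num)
        label = re.sub(r"[\d.%]+", "", cleaned).strip()
        if label:
            keywords.append(label)
    return {"keywords": keywords, "numbers": numbers}
```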
By applying computer vision and deep learning techniques, accurate image segmentation and OCR recognition can be achieved, thereby providing precise region and text recognition results. This ensures that the generated image-text templates are consistent with the original image content and that the key information is correct and reliable.
S7, reorganizing the image-text template.
The control parameters, the image region positions, and the corresponding image-text template categories are recombined in the three-dimensional image-text template library to generate a new image-text template. Specifically:
S71, a chart of the corresponding type is generated from the three-dimensional image-text template library at the corresponding position according to the image region position and the corresponding image-text template category.
S72, the generated chart is reorganized and customized according to the control parameters to generate a new image-text template. The new template is highly flexible and personalized, and can meet different requirements for image-text display and report generation.
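The three-dimensional template library itself is not public, so the following sketch substitutes matplotlib to illustrate how the detected category and the OCR control parameters could drive chart regeneration in steps S71-S72; the pairing of the i-th keyword with the i-th number is an assumption:

```python
import matplotlib.pyplot as plt

def regenerate(category, params, out_path="new_template.png"):
    """params: the {'keywords': [...], 'numbers': [...]} dict from step S6."""
    labels = params["keywords"]
    values = [float(n.rstrip("%")) for n in params["numbers"]]
    n = min(len(labels), len(values))
    labels, values = labels[:n], values[:n]

    fig, ax = plt.subplots()
    if category == "bar chart":        # S71: chart of the detected type
        ax.bar(labels, values)
    elif category == "pie chart":
        ax.pie(values, labels=labels, autopct="%1.1f%%")
    elif category == "line chart":
        ax.plot(labels, values, marker="o")
    # Map templates are omitted here; they would need geographic assets.
    fig.savefig(out_path, bbox_inches="tight")  # S72: emit the new template
    plt.close(fig)
```

In a full implementation, this stand-in would be replaced by instantiating the matched template from the three-dimensional library and binding the control parameters to its text, number, and color slots.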
By using technologies such as image-text template recognition, OCR, and image segmentation, the application can automatically extract key text, numbers, and image regions from the input image and recombine them with control parameters to generate a new image-text template, greatly reducing the workload of manually creating image-text templates and improving generation efficiency and consistency.
In practical application, the application also provides an automatic three-dimensional image-text template reorganization device, comprising a processor and a memory, wherein the memory stores a computer program that, when loaded by the processor, executes the automatic three-dimensional image-text template reorganization method described above.
In practical application, the application also provides an automatic three-dimensional image-text template reorganization system, comprising the automatic three-dimensional image-text template reorganization device described above.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the automatic three-dimensional graphic template reorganization method described above.
It should be noted that, in the description of the embodiments of the present application, unless explicitly specified and limited otherwise, the terms "disposed," "connected," and "connection" are to be construed broadly; for example, a connection may be fixed, detachable, or integral, and may be direct or indirect through an intermediate medium. The specific meaning of these terms in the present application will be understood by those skilled in the art. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain its principles. The components of the embodiments of the present application, as generally described and illustrated in the figures, may be arranged and designed in a wide variety of configurations.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, substitutions, and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the application.

Claims (10)

1. An automatic three-dimensional image-text template reorganization method, characterized by comprising the following steps:
S1, acquiring an image-text template data set containing each category from a three-dimensional image-text template library;
S2, fine-tuning a pre-trained CLIP model with the image-text template data set;
S3, performing image region segmentation on the input image;
S4, inputting the segmented image regions and the image-text template data set into the fine-tuned CLIP model to obtain the image regions meeting the condition and their corresponding image-text template categories;
S5, outputting the image region positions and the corresponding image-text template categories;
S6, acquiring the image region control parameters;
S7, completing the image-text template reorganization of the input image according to the control parameters, the image region positions, and the corresponding image-text template categories.
2. The automatic three-dimensional image-text template reorganization method according to claim 1, wherein in step S1, the image-text template data set consists of image-text pairs formed from images and the text of their corresponding image-text template categories.
3. The automatic three-dimensional image-text template reorganization method according to claim 1 or 2, wherein the fine-tuning process in step S2 includes: inputting the image-text template data set into the pre-trained CLIP model so that it captures the semantic associations between images and their categories.
4. The automatic three-dimensional image-text template reorganization method according to claim 1, wherein in step S3, all objects in the input image are segmented by region using the SAM model, each segmented image region is cropped to its minimum bounding rectangle, and all segmented image regions are stored.
5. The automatic three-dimensional image-text template reorganization method according to claim 1, wherein the specific substeps of step S4 include:
step S41, inputting the segmented image regions and the text of all image-text template categories into the fine-tuned CLIP model;
step S42, after the CLIP model encodes the images and the texts, computing the cosine similarity between the image and text encodings one by one to obtain the similarity score of each image-text template category;
step S43, saving the positions of the image regions whose similarity scores are higher than the threshold, together with their image-text template categories.
6. The automatic three-dimensional image-text template reorganization method according to claim 1, wherein in step S5, a rectangular frame is used to select each corresponding image region according to the obtained region position and image-text template category, and the corresponding image-text template category is output on the rectangular frame to complete the visualization of the classification result.
7. The automatic three-dimensional image-text template reorganization method according to claim 1, wherein the specific substeps of step S6 include:
step S61, preprocessing the input image;
step S62, locating the text and number regions in the image;
step S63, performing OCR (optical character recognition) on the text and number regions;
step S64, post-processing and correcting the recognized text and numbers;
step S65, obtaining the key text and numeric information in the input image as the control parameters of the image-text template.
8. The automatic three-dimensional image-text template reorganization method according to claim 1, wherein the specific substeps of step S7 include:
step S71, generating a chart of the corresponding type from the three-dimensional image-text template library at the corresponding position according to the acquired image region position and image-text template category;
step S72, reorganizing and customizing the generated chart according to the control parameters to generate a new image-text template.
9. An automatic three-dimensional image-text template reorganization device, comprising a processor and a memory, wherein the memory stores a computer program that, when loaded by the processor, executes the automatic three-dimensional image-text template reorganization method of any one of claims 1-8.
10. An automatic three-dimensional image-text template reorganization system, comprising the automatic three-dimensional image-text template reorganization device according to claim 9.
CN202311188895.1A 2023-09-15 2023-09-15 Automatic three-dimensional graphic template reorganization method, device and system Active CN116935418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311188895.1A CN116935418B (en) 2023-09-15 2023-09-15 Automatic three-dimensional graphic template reorganization method, device and system


Publications (2)

Publication Number Publication Date
CN116935418A true CN116935418A (en) 2023-10-24
CN116935418B CN116935418B (en) 2023-12-05

Family

ID=88390893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311188895.1A Active CN116935418B (en) 2023-09-15 2023-09-15 Automatic three-dimensional graphic template reorganization method, device and system

Country Status (1)

Country Link
CN (1) CN116935418B (en)



Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100119157A1 (en) * 2007-07-20 2010-05-13 Fujifilm Corporation Image processing apparatus, image processing method and computer readable medium
CN105007539A (en) * 2015-07-17 2015-10-28 孙巍 HTML template-based method, equipment and system for releasing graphics and text information via television
US20210374455A1 (en) * 2020-05-29 2021-12-02 Accenture Global Solutions Limited Utilizing machine learning and image filtering techniques to detect and analyze handwritten text
CN113516142A (en) * 2020-11-26 2021-10-19 腾讯科技(深圳)有限公司 Text image matching method, device, equipment and storage medium
CN113485160A (en) * 2021-07-26 2021-10-08 中国核电工程有限公司 Simulation modeling method and device based on pattern matching recognition
CN114005123A (en) * 2021-10-11 2022-02-01 北京大学 System and method for digitally reconstructing layout of print form text
CN114187165A (en) * 2021-11-09 2022-03-15 阿里巴巴云计算(北京)有限公司 Image processing method and device
WO2023134073A1 (en) * 2022-01-11 2023-07-20 平安科技(深圳)有限公司 Artificial intelligence-based image description generation method and apparatus, device, and medium
US20230196716A1 (en) * 2022-03-02 2023-06-22 Beijing Baidu Netcom Science Technology Co., Ltd. Training multi-target image-text matching model and image-text retrieval
CN114565927A (en) * 2022-03-03 2022-05-31 上海恒生聚源数据服务有限公司 Table identification method and device, electronic equipment and storage medium
CN115294150A (en) * 2022-06-22 2022-11-04 华为技术有限公司 Image processing method and terminal equipment
CN115359323A (en) * 2022-08-31 2022-11-18 北京百度网讯科技有限公司 Image text information generation method and deep learning model training method
CN116304307A (en) * 2023-02-10 2023-06-23 武汉理工大学 Graph-text cross-modal retrieval network training method, application method and electronic equipment
CN116452410A (en) * 2023-03-10 2023-07-18 浙江工业大学 Text-guided maskless image editing method based on deep learning
CN116721419A (en) * 2023-06-26 2023-09-08 戈迪斯(杭州)智能技术有限公司 Auxiliary labeling method combined with SAM (self-contained imaging) of visual large model
CN116701637A (en) * 2023-06-29 2023-09-05 中南大学 Zero sample text classification method, system and medium based on CLIP

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIAO ZHOU et al.: "A hybrid approach to detecting technological recombination based on text mining and patent network analysis", Scientometrics, vol. 121, pages 699-737 *
毛宇兆: "Research on automatic image caption generation based on deep learning" (in Chinese), China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 1, pages 138-232 *
邓显奕: "Construction of a multimodal image-text translation generation model" (in Chinese), Shanghai Journal of Translators, no. 3, pages 30-37 *
高欣 et al.: "Research on automatic image caption generation technology based on evolutionary deep learning" (in Chinese), Application Research of Computers, vol. 39, no. 3, pages 911-918 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152646A (en) * 2023-10-27 2023-12-01 武汉大学 Unmanned electric power inspection AI light-weight large model method and system
CN117152646B (en) * 2023-10-27 2024-02-06 武汉大学 Unmanned electric power inspection AI light-weight large model method and system
CN117671688A (en) * 2023-12-07 2024-03-08 北京智源人工智能研究院 Segmentation recognition and text description method and system based on hintable segmentation model

Also Published As

Publication number Publication date
CN116935418B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
CN116935418B (en) Automatic three-dimensional graphic template reorganization method, device and system
CN106980856B (en) Formula identification method and system and symbolic reasoning calculation method and system
CN112232149A (en) Document multi-mode information and relation extraction method and system
CN111737511B (en) Image description method based on self-adaptive local concept embedding
CN112541927A (en) Method, device, equipment and storage medium for training and matting model
CN114596566B (en) Text recognition method and related device
CN112115879B (en) Self-supervision pedestrian re-identification method and system with shielding sensitivity
CN111553350A (en) Attention mechanism text recognition method based on deep learning
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113762269A (en) Chinese character OCR recognition method, system, medium and application based on neural network
CN114092938B (en) Image recognition processing method and device, electronic equipment and storage medium
CN112750071B (en) User-defined expression making method and system
CN112966676B (en) Document key information extraction method based on zero sample learning
CN112200216A (en) Chinese character recognition method, device, computer equipment and storage medium
CN113361530A (en) Image semantic accurate segmentation and optimization method using interaction means
CN115130437B (en) Intelligent document filling method and device and storage medium
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN117115824A (en) Visual text detection method based on stroke region segmentation strategy
US20220398399A1 (en) Optical character recognition systems and methods for personal data extraction
CN114241202A (en) Method and device for training dressing classification model and method and device for dressing classification
CN114155540A (en) Character recognition method, device and equipment based on deep learning and storage medium
CN114241495B (en) Data enhancement method for off-line handwritten text recognition
CN115223171B (en) Text recognition method, device, equipment and storage medium
CN116543389B (en) Character recognition method, device, equipment and medium based on relational network
CN117830537B (en) Weak supervision 3D scene graph generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant