CN117351331A - Method and device for adding adapter for large visual model

Info

Publication number: CN117351331A
Application number: CN202311385817.0A
Authority: CN
Inventors: 吕伊凯; 周吴夏朗; 杜晓祥
Original assignee: Beijing Yunshang Technology Co ltd
Current assignee: Beijing Yunshang Technology Co ltd
Priority date: 2023-10-24
Filing date: 2023-10-24
Publication date: 2024-01-05

Abstract

The application discloses a method and a device for adding adapters to a large visual model, relating to the technical field of deep learning models. The large visual model is extracted on its own from an image-text multimodal large model used in a service scenario, and, while the recognition capability of the original large visual model is preserved, a separate adapter is trained for each service scenario, so that the large visual model gains general recognition capability across scenarios. Centralized or distributed deployment is selected according to the online requirements. In centralized deployment, the large visual model and all adapters share one inference graph, which suits a single-node server. In distributed deployment, the large visual model and the adapters are deployed on different nodes and exchange data through protocol communication, which suits a multi-node server cluster. Deployment cost is thereby reduced while recognition speed is maintained.

Description

Method and device for adding adapter for large visual model

Technical Field

The application relates to the technical field of deep learning models, in particular to a method and a device for adding an adapter for a large visual model.

Background

An image-text multimodal large model is a deep learning model that jointly uses image and text information. By processing image and text data simultaneously, it fuses the two modalities and lets them interact, improving its understanding of and reasoning about complex tasks. An image-text multimodal large model typically contains two main components, a large visual model and a large text model:

Large visual model (Visual Model): the large visual model processes image data and extracts and learns feature representations of images. It may be a model based on a convolutional neural network (CNN), such as ResNet or Inception, used for image feature extraction and representation learning.

Large text model (Text Model): the large text model processes text data and learns feature representations of text. It may be based on a recurrent neural network (RNN), such as LSTM, or on the Transformer architecture, such as BERT, and is used for text encoding and feature extraction.

As the number of parameters in deep learning models and the scale of training data keep growing, image-text multimodal large models have acquired very strong image and text recognition and generation capabilities. Because the large text model supplies rich semantic information related to the images, the large visual model inside an image-text multimodal large model recognizes and generates images better than an ordinary visual model.

However, a larger parameter count in the large visual model also means higher deployment cost and difficulty, and real service scenarios often require different models to be deployed for different service requirements. Deploying several large visual models greatly increases cost and lowers the recognition speed of the online models.

Disclosure of Invention

Therefore, the application provides a method and a device for adding adapters to a large visual model, so as to solve the prior-art problems of poor recognition capability and low recognition speed when a large visual model is used to recognize service scenarios.

In order to achieve the above object, the present application provides the following technical solutions:

In a first aspect, a method of adding an adapter to a large visual model comprises:

Step 1: extracting a large visual model from an original image-text multimodal large model;

Step 2: building a plurality of adapters and training each adapter for a different service scenario;

Step 3: converting the large visual model and the plurality of adapters into the same file format;

Step 4: merging the large visual model and the plurality of adapters and deploying them on one server; or, as required, deploying the large visual model and the plurality of adapters on separate servers, wherein the large visual model acts as the server, the adapters act as clients, and the large visual model and the adapters exchange data through protocol communication.

Preferably, the adapter comprises a multi-scale feature extraction module, a feature interaction module and a classifier module, wherein the multi-scale feature extraction module consists of a plurality of convolutional layers, the feature interaction module consists of a cross-attention layer and a convolutional neural layer, and the classifier module consists of a linear layer.

Preferably, in the step 2, parameters of the visual large model are fixed when a plurality of adapters are trained, and parameters of the adapters are updated according to service related data.

Preferably, in the step 3, the file format is an ONNX exchange format.

Preferably, in the step 4, the NVIDIA Triton deep learning inference engine is used when deploying the large visual model and the plurality of adapters.

Preferably, in the step 4, the protocol communication uses the HTTP/gRPC protocol.

In a second aspect, an apparatus for adding an adapter to a large visual model comprises:

a large visual model extraction module for extracting a large visual model from an original image-text multimodal large model;

an adapter training module for building a plurality of adapters and training each adapter for a different service scenario;

a format conversion module for converting the large visual model and the plurality of adapters into the same file format;

a deployment module for merging the large visual model and the plurality of adapters and deploying them on one server; or, as required, deploying the large visual model and the plurality of adapters on separate servers, wherein the large visual model acts as the server, the adapters act as clients, and the large visual model and the adapters exchange data through protocol communication.

In a third aspect, a computer device comprises a memory storing a computer program and a processor implementing the steps of a method of adding an adapter to a visual large model when the computer program is executed.

In a fourth aspect, a computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of a method of adding an adapter to a visual large model.

Compared with the prior art, the application has the following beneficial effects:

The application provides a method and a device for adding adapters to a large visual model. The large visual model is extracted on its own from an image-text multimodal large model used in a service scenario, and, while the recognition capability of the original large visual model is preserved, a separate adapter is trained for each service scenario, so that the large visual model gains general recognition capability across scenarios. Centralized or distributed deployment is selected according to the online requirements. In centralized deployment, the large visual model and all adapters share one inference graph, which suits a single-node server. In distributed deployment, the large visual model and the adapters are deployed on different nodes and exchange data through protocol communication, which suits a multi-node server cluster. Deployment cost is thereby reduced while recognition speed is maintained.

Drawings

To describe the prior art and the present application more intuitively, exemplary drawings are presented below. It should be understood that the specific shapes and configurations shown in the drawings should not generally be regarded as limiting the practice of the present application; for example, based on the technical concepts and exemplary drawings disclosed herein, those skilled in the art can readily make conventional adjustments or further optimizations to the addition, removal, or grouping of certain units (components), and to their specific shapes, positional relationships, connection modes, dimensional proportions, and the like.

FIG. 1 is a flow chart of a method for adding an adapter to a visual large model according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a method for adding an adapter to a visual large model according to an embodiment of the present application;

FIG. 3 is a schematic diagram of the overall structure of the large visual model plus adapter according to the first embodiment of the present application;

FIG. 4 is a schematic structural diagram of the different modules of an adapter according to an embodiment of the present application.

Detailed Description

The present application is further described in detail below with reference to the attached drawings.

In the description of the present application: unless otherwise indicated, the meaning of "a plurality" is two or more. The terms "first," "second," "third," and the like in this application are intended to distinguish between the referenced objects without a special meaning in terms of technical connotation (e.g., should not be construed as emphasis on degree or order of importance, etc.). The expressions "comprising", "including", "having", etc. also mean "not limited to" (certain units, components, materials, steps, etc.).

The terms such as "upper", "lower", "left", "right", "middle", and the like, as used in this application, are generally used for the purpose of facilitating an intuitive understanding with reference to the drawings and are not intended to be an absolute limitation of the positional relationship in actual products.

Example 1

Referring to FIG. 1, a method for adding an adapter to a large visual model includes:

S1: extracting a large visual model from an original image-text multimodal large model;

Referring to FIG. 2, an image-text multimodal large model generally consists of a large visual model and a large language model. Because the large language model supplies rich text information to the large visual model, the large visual model shows stronger recognition and generation capabilities than an ordinary visual model.

The original image-text multimodal large model is in PyTorch format, and the large visual model is extracted from it separately with an appropriate tool to serve as the base model.

S2: building a plurality of adapters and respectively training the plurality of adapters according to different service scenes;

To preserve the capability of the large visual model, this embodiment keeps the parameters of the large visual model fixed and builds and trains an independent adapter for each business scenario.

The adapters are built with the PyTorch deep learning library. An adapter is composed of several neural network building blocks, such as convolutional layers and Transformer layers.

Specifically, referring to FIG. 3 and FIG. 4, the adapter comprises a multi-scale feature extraction module, a feature interaction module, and a classifier module. The multi-scale feature extraction module consists of a plurality of convolutional layers: the original image is input into this module to obtain a feature pyramid at three resolutions (1/8, 1/16 and 1/32 of the input), and the pyramid levels are flattened and concatenated into a multi-scale feature sequence. The feature interaction module consists of a cross-attention layer and a convolutional neural layer: the multi-scale feature sequence and the output sequence of the large visual model are input into this module, so that multi-scale features are derived from the single-scale output of the large visual model. The classifier module consists of a linear layer: the final extracted sequence is input into this module to obtain the final classification result.
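The three-module structure described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the implementation of the application: the class names, channel widths, number of attention heads, the choice of the multi-scale sequence as the attention query, the residual connections, and the mean pooling before the classifier are all assumptions that the text does not specify.

```python
import torch
from torch import nn


class MultiScaleStem(nn.Module):
    """Strided convolutions producing 1/8, 1/16 and 1/32 feature maps."""

    def __init__(self, dim: int = 64):
        super().__init__()
        # Three stride-2 convolutions bring the input down to 1/8 resolution.
        self.to_8 = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_16 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)
        self.to_32 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)

    def forward(self, image):
        c8 = self.to_8(image)
        c16 = self.to_16(c8)
        c32 = self.to_32(c16)
        # Flatten each level to (batch, tokens, dim), then concatenate tokens.
        tokens = [c.flatten(2).transpose(1, 2) for c in (c8, c16, c32)]
        return torch.cat(tokens, dim=1)


class Adapter(nn.Module):
    """Multi-scale stem + cross-attention/conv interaction + linear classifier."""

    def __init__(self, dim=64, backbone_dim=96, num_classes=5, heads=4):
        super().__init__()
        self.stem = MultiScaleStem(dim)
        # Assumption: multi-scale tokens are the query, backbone tokens are key/value.
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=backbone_dim, vdim=backbone_dim, batch_first=True
        )
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image, backbone_tokens):
        seq = self.stem(image)
        attended, _ = self.cross_attn(seq, backbone_tokens, backbone_tokens)
        seq = seq + attended
        seq = seq + self.conv(seq.transpose(1, 2)).transpose(1, 2)
        # Assumption: average over tokens before the linear classifier.
        return self.classifier(seq.mean(dim=1))


adapter = Adapter()
image = torch.randn(2, 3, 64, 64)
backbone_tokens = torch.randn(2, 10, 96)  # stand-in for the large model's output
logits = adapter(image, backbone_tokens)
```

For a 64×64 input the three pyramid levels are 8×8, 4×4 and 2×2, so the concatenated multi-scale sequence has 64 + 16 + 4 = 84 tokens.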

The adapter is trained as follows. Sample data for the specific task are collected and subjected to conventional augmentations such as rotation, geometric transformation, and color transformation. The augmented samples are then fed through the large visual model and the adapter of the corresponding service for a forward pass, the loss between the model output and the ground-truth labels is computed, and the gradients of the model parameters are obtained by backpropagating the loss. The parameters of the large visual model are fixed and do not participate in updating; only the adapter parameters are updated according to the service-related data. In this way each service trains its own adapter while all services share one large visual model, which improves the generality of the large visual model across service scenarios while preserving its capability. In this embodiment the adapters are trained on an NVIDIA RTX 3090 graphics card.
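The training procedure above, with the large visual model frozen and only the adapter updated, might look like the following sketch. The tiny `backbone` and `adapter` modules, the `augment` helper, the SGD optimizer, the learning rate, and the cross-entropy loss are placeholder assumptions chosen so that the snippet runs on its own; the application does not specify them. For brevity the stand-in adapter consumes only the backbone features.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the large visual model and one business adapter.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))
adapter = nn.Linear(16, 4)

# Freeze the backbone: its parameters get no gradients and are never updated.
for param in backbone.parameters():
    param.requires_grad_(False)
backbone.eval()

# Only the adapter parameters are handed to the optimizer.
optimizer = torch.optim.SGD(adapter.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()


def augment(images):
    # Placeholder augmentations: a 90-degree rotation and a brightness scaling.
    return torch.rot90(images, k=1, dims=(2, 3)) * 1.1


def train_step(images, labels):
    with torch.no_grad():  # the backbone only runs inference
        features = backbone(augment(images))
    logits = adapter(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()  # gradients flow into the adapter only
    optimizer.step()
    return loss.item()


images = torch.randn(6, 3, 8, 8)
labels = torch.randint(0, 4, (6,))
backbone_before = [p.clone() for p in backbone.parameters()]
adapter_before = [p.clone() for p in adapter.parameters()]
loss_value = train_step(images, labels)
```

Because each service has its own adapter and optimizer, the same frozen backbone can be reused unchanged when the loop is repeated for another service's data.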

S3: converting the visual large model and the plurality of adapters into the same file format;

Specifically, the large visual model and the adapters are each converted into the ONNX exchange format.

S4: merging and deploying the large visual model and a plurality of adapters on a server; or the visual large model and the plurality of adapters are deployed on the plurality of servers respectively according to the requirements, wherein the visual large model is a server, the plurality of adapters are clients, and data interaction is carried out between the visual large model and the plurality of adapters through protocol communication.

Specifically, this embodiment may use either centralized deployment or distributed deployment for online serving.

In centralized deployment, the large visual model and the adapters are merged into a single ONNX file with an appropriate tool, and the file is deployed on a single-node server using the NVIDIA Triton deep learning inference engine.

The distributed deployment adopts the NVIDIA Triton inference framework, which is suited to NVIDIA GPU platforms, is convenient to deploy, offers high concurrency and low latency, and integrates a variety of inference performance optimizations. Because the large visual model consumes substantial resources and has high throughput requirements, in distributed deployment it is deployed separately as infrastructure, while each adapter is deployed on the server of its corresponding service; an adapter consumes few resources, infers efficiently, and does not take up too many of the resources needed by other services on that server. In this embodiment, the large visual model and each adapter are compiled from the ONNX exchange format into the TensorRT format supported by NVIDIA Triton. The large visual model acts as the server and each adapter as a client: the adapter requests the inference result of the large visual model through the HTTP/gRPC protocol and, once the result is returned, feeds it into its own downstream inference computation. The adapters can issue requests concurrently without affecting one another, which improves the image recognition efficiency of the overall online service.
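The request flow of the distributed mode can be illustrated with a toy HTTP exchange built only on the Python standard library. It does not use NVIDIA Triton or its client API; the `/infer` path, the JSON payload layout, and the stand-in `backbone_infer` and `adapter_infer` functions are hypothetical and serve only to show adapter-side clients concurrently requesting backbone features and then running their own downstream computation.

```python
import http.client
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def backbone_infer(pixels):
    """Stand-in for the large visual model: returns a 2-value feature vector."""
    return [sum(pixels) / len(pixels), max(pixels)]


class BackboneHandler(BaseHTTPRequestHandler):
    """Server side: receives pixels, replies with backbone features."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"features": backbone_infer(payload["pixels"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def adapter_infer(port, pixels, threshold):
    """Client side: fetch backbone features, then run a toy downstream step."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("POST", "/infer", body=json.dumps({"pixels": pixels}),
                 headers={"Content-Type": "application/json"})
    features = json.loads(conn.getresponse().read())["features"]
    conn.close()
    return "bright" if features[0] > threshold else "dark"


# Port 0 asks the OS for any free port; the server runs in a background thread.
server = ThreadingHTTPServer(("127.0.0.1", 0), BackboneHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Two "adapters" issue their requests concurrently and independently.
inputs = ([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(adapter_infer, server.server_port, px, 0.5) for px in inputs]
    results = [f.result() for f in futures]

server.shutdown()
server.server_close()
```

In a real deployment the server role would be filled by the inference server hosting the large visual model, and each adapter would use that server's own HTTP or gRPC client rather than this hand-written exchange.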

In the method of this embodiment for adding adapters to a large visual model, the large visual model is extracted on its own from an image-text multimodal large model used in a service scenario, and, while the recognition capability of the original large visual model is preserved, a separate adapter is trained for each service scenario, so that the large visual model gains general recognition capability across scenarios. Centralized or distributed deployment is selected according to the online requirements. In centralized deployment, the large visual model and all adapters share one inference graph, which suits a single-node server. In distributed deployment, the large visual model and the adapters are deployed on different nodes and exchange data through protocol communication, which suits a multi-node server cluster. Deployment cost is thereby reduced while recognition speed is maintained.

Example 2

The embodiment provides a device for adding an adapter for a large visual model, which comprises the following components:

Specific limitations regarding an apparatus for adding an adapter to a visual large model may be found in the above description of a method for adding an adapter to a visual large model, and will not be described in detail herein.

Example 3

The present embodiment provides a computer device comprising a memory storing a computer program and a processor implementing the steps of a method of adding an adapter to a visual large model when the computer program is executed.

Example 4

The present embodiment provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a method of adding an adapter to a visual large model.

Any combination of the technical features of the above embodiments may be performed (as long as there is no contradiction between the combination of the technical features), and for brevity of description, all of the possible combinations of the technical features of the above embodiments are not described; these examples, which are not explicitly written, should also be considered as being within the scope of the present description.

Claims

1. A method of adding an adapter to a visual large model, comprising:

2. The method of adding an adapter to a visual large model according to claim 1, wherein the adapter comprises a multi-scale feature extraction module, a feature interaction module and a classifier module, wherein the multi-scale feature extraction module consists of a plurality of convolutional layers, the feature interaction module consists of a cross-attention layer and a convolutional neural layer, and the classifier module consists of a linear layer.

3. The method for adding adapters to a large visual model according to claim 1, wherein parameters of the large visual model are fixed when a plurality of adapters are trained in step 2, and parameters of the adapters are updated according to business-related data.

4. The method of adding an adapter to a visual large model according to claim 1, wherein in step 3, the file format is an ONNX exchange format.

5. The method of adding adapters to a visual large model according to claim 1, wherein in step 4, the NVIDIA Triton deep learning inference engine is used when deploying the visual large model and a plurality of the adapters.

6. The method of adding an adapter to a visual large model according to claim 1, wherein in step 4, the protocol communication uses the HTTP/gRPC protocol.

7. An apparatus for adding an adapter to a visual large model, comprising:

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.