CN117992578A - Method for processing data based on large language model, large language model and electronic equipment


Info

Publication number: CN117992578A
Authority: CN (China)
Prior art keywords: data, layer, language model, integer, user terminal
Legal status: Granted
Application number: CN202410398600.1A
Other languages: Chinese (zh)
Other versions: CN117992578B (English)
Inventors: 王召德, 吕承飞, 杨经邦, 姜霄棠
Assignee: Taobao China Software Co Ltd
Application filed by Taobao China Software Co Ltd
Priority to CN202410398600.1A
Publication of CN117992578A
Application granted; publication of CN117992578B
Status: Active


Abstract

The application discloses a method for processing data based on a large language model, a large language model, an electronic device, a computer readable storage medium, and a computer program product, applied to a user terminal, where the large language model is deployed on the user terminal and the weight parameters of each linear calculation layer of the large language model are quantized in advance into format data of integer data types. The method comprises the following steps: acquiring input data; performing vector conversion on the input data through an embedding layer of the large language model to obtain a floating point query vector of a floating point data type corresponding to the input data; converting the floating point query vector into an integer query vector of an integer data type; and operating the weight parameters of the linear calculation layer with the integer query vector to obtain a query result corresponding to the input data. The scheme provided by the application can run the large language model smoothly on the user terminal, so that the user terminal can serve the user without networking, and the user's privacy is better protected.

Description

Method for processing data based on large language model, large language model and electronic equipment
Technical Field
The present application relates to the field of computer technology, and in particular, to a method for processing data based on a large language model, an electronic device, a computer readable storage medium, and a computer program product.
Background
With the rapid development of computer technology, large language models can play a role in many scenarios due to their superior data processing capability, for example, generating a required image, querying expert knowledge, holding a dialogue, or generating an article that meets given requirements. Because a large language model has a large number of parameters and a large calculation amount, and therefore needs relatively large computing resources, it is usually deployed at a cloud server.
However, when the large language model is deployed at the cloud service end, the user terminal must be connected to a network to use it, and text information such as user questions and developer code needs to be uploaded to the cloud service end, which poses a certain leakage risk to the user's private information; this makes it inconvenient for the user to use the large language model at the user terminal.
Disclosure of Invention
The application provides a method for processing data based on a large language model, a large language model, an electronic device, a computer readable storage medium, and a computer program product, which can run the large language model smoothly in a user terminal, so that the user terminal can provide services for users without networking, and the users' privacy is better protected.
In a first aspect, the present application provides a method for processing data based on a large language model, applied to a user terminal, where the large language model is deployed on the user terminal, and weight parameters of each linear calculation layer of the large language model are quantized into format data of integer data types in advance; the method comprises the following steps:
Acquiring input data;
vector conversion is carried out on the input data through an embedding layer of the large language model, and a floating point query vector of a floating point data type corresponding to the input data is obtained;
converting the floating point query vector into an integer query vector of an integer data type;
And calculating the weight parameter of the linear calculation layer and the integer query vector to obtain a query result corresponding to the input data.
In a second aspect, the present application provides a large language model comprising:
a text conversion (tokenizer) layer for converting the input data into a text data format supported by the large language model;
an embedding layer for converting the text data into vectors;
and a transformer layer, where the transformer layer comprises linear calculation (Linear) layers, the weight parameters of the Linear layers are quantized into format data of integer data types, and the Linear layers are used for performing linear transformation on input vector data to obtain a model operation result.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory, and computer program instructions stored on the memory and executable on the processor; the processor, when executing the computer program instructions, implements the method of any of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions for performing the method of any of the first aspects when executed by a processor.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the first aspects.
Compared with the prior art, the application has the following advantages:
The method for processing data based on the large language model provided by the application is applied to a user terminal; the large language model is deployed on the user terminal, and the weight parameters of each linear calculation layer of the large language model are quantized in advance into format data of integer data types. Compared with weight parameters of floating point data types, weight parameters of integer data types greatly reduce the memory access amount during matrix operations. Since the memory of a user terminal is usually small, using integer weight parameters for matrix operations on input data improves the calculation efficiency of the large language model and reduces its demand on device resources, so that the user terminal can run the large language model more smoothly; at the same time, the integer weight parameters preserve the information of the floating point weight parameters well, ensuring calculation accuracy. When the scheme provided by the application processes data through the large language model, the embedding layer of the large language model first performs vector conversion on the input data to obtain a floating point query vector of a floating point data type corresponding to the input data; the floating point query vector is then converted into an integer query vector of an integer data type, and the weight parameters of the linear calculation layer are operated with the integer query vector to obtain the query result corresponding to the input data. When the linear calculation layer performs matrix multiplication and similar operations on the integer query vector of the input data, the memory access amount is further reduced, thereby improving the calculation efficiency of the user terminal.
Therefore, when the large language model is deployed and run on the user terminal, quantizing the weight parameters of the large language model and the input data entered by the user into integer data greatly reduces the model's memory access amount and calculation amount when running on the user terminal, making the scheme provided by the application better suited to the small memory and limited computing resources of user terminals, so that the user terminal can run the large language model smoothly. In addition, since the distribution differences among the weight parameters of a linear calculation layer of the large language model are usually small, their distributions remain highly similar after quantization to integers; therefore, the calculation accuracy of the large language model is affected very little by quantizing the weight parameters. And since an integer data type usually has a sufficient range to represent the input data of the large language model, the query vector corresponding to the input data can be fully quantized with the integer data type, which better ensures that no information of the input data is lost. Thus, by quantizing the weight parameters and the query vector, the application preserves the calculation accuracy of the large language model running on the user terminal; and since the large language model is deployed on the user terminal, the user terminal can reply to the content queried by the user without networking, which better protects the privacy of the queried content and is more convenient for the user.
Drawings
Fig. 1 is an application scenario schematic diagram of a scheme for processing data based on a large language model provided by the application.
Fig. 2 is a flowchart illustrating an example of a method for processing data based on a large language model according to an embodiment of the present application.
FIG. 3 is an abstract block diagram of a large language model in an embodiment of the application.
FIG. 4 is a diagram illustrating functional blocks in deriving a large language model in accordance with an embodiment of the present application.
FIG. 5 is an exemplary diagram of model operations performed by the W4A8 quantization strategy in an embodiment of the present application.
Fig. 6 is a block diagram of an apparatus for processing data based on a large language model according to an embodiment of the present application.
Fig. 7 is a block diagram of an electronic device provided by the present application.
Detailed Description
In order that those skilled in the art may better understand the technical solutions of the present application, the present application will be clearly and completely described below with reference to the accompanying drawings of the embodiments. The application can be practiced in many ways other than those described below; based on the embodiments provided herein, all other embodiments obtained by one of ordinary skill in the art without creative effort fall within the scope of the application.
It should be noted that the terms "first," "second," "third," and the like in the claims, specification, and drawings of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. The data so used are interchangeable under appropriate circumstances, such that the embodiments of the application described herein can operate in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and their variants are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
To facilitate understanding of the embodiments of the present application, the application background of the embodiments will be described.
With the rapid development of computer technology, a large language model can play its role in many scenes due to its superior data processing capability, for example, generating a required image through the large language model, querying expertise through the large language model, performing a dialogue with the large language model, generating an article meeting the requirements through the large language model, and the like.
A large language model refers to a natural language processing model with a large number of parameters and strong capabilities. These models can utilize their large-scale parameters and training data to generate more accurate, more coherent text output when processing text-related tasks. For example, a large language model based on a neural network (such as a GPT series model or a BERT model) can realize automatic understanding, generation and inference of text by training on massive language data, and is suitable for various tasks such as language translation, text summarization, and dialogue systems.
Because the large language model has large parameter and calculation amount and needs relatively large calculation resources, the large language model is usually deployed at a cloud server.
However, when the large language model is deployed at the cloud service end, the user terminal must be connected to a network to use it, and text information such as user questions and developer code needs to be uploaded to the cloud service end, which poses a certain leakage risk to the user's private information; this makes it inconvenient for the user to use the large language model at the user terminal. In addition, a dedicated cloud service end must be provided to deploy the large language model, which increases the running cost of the large language model.
In order to solve the above problems, embodiments of the present application provide a method, apparatus, electronic device, and computer-readable storage medium for processing data based on a large language model. The method aims to smoothly operate the large language model at the user terminal, so that the user terminal can provide services for the user without networking, the privacy of the user can be better ensured, and the deployment and operation cost of the large language model is reduced.
The method for processing the data based on the large language model can be applied to deployment and operation of the large language model in a user terminal, and the user terminal can be a mobile terminal such as a mobile phone, an intelligent watch, an intelligent VR device, an intelligent vehicle-mounted device, a notebook computer and the like, or can be a desktop computer or other non-mobile user terminals.
In order to facilitate understanding of the method embodiments of the present application, application scenarios thereof are described. Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a solution provided in an embodiment of the present application. The method can be applied to the user terminal 101, and the user terminal 101 can be a mobile phone, a tablet personal computer (pad), a vehicle-mounted device, a wearable device, a smart television, a virtual reality VR device, a notebook computer, a desktop computer and other devices capable of running a large language model. The user terminal 101 is pre-deployed with a large language model whose weight parameters of each linear computation layer are pre-quantized into format data of integer data type.
In this embodiment, after a user inputs query content on the user terminal 101, the user terminal 101 can determine input data input by the user, and perform inference processing on the input data through a large language model on the user terminal 101 to obtain a query result corresponding to the input data.
Example 1
The first embodiment of the present application provides a method for processing data based on a large language model, where the method is applied to a user terminal. Specifically, the execution subject of the method may be an electronic device used by a user and deployed with the large language model, where the electronic device may be a desktop computer, a notebook computer, an intelligent mobile terminal, a client device, or another electronic device that has data processing capabilities and is capable of running large language models. The user terminal is deployed with a large language model, and the weight parameters of each linear calculation layer of the large language model are quantized in advance into format data of integer data types.
It is understood that a large language model (Large Language Model, abbreviated LLM) generally includes a text conversion (tokenizer) layer, an embedding layer, a transformer layer, etc., where the transformer layer includes linear calculation (Linear) layers, which are the portion of the transformer layer that contains weight parameters.
The tokenizer layer described above is used to convert the input text (i.e., the input data in the present application) into a format that the model can understand and process; the main function of the tokenizer is to segment the original text into words, sub-words or characters and map them to corresponding identifiers (tokens) in the vocabulary of the model.
The embedding layer is used to convert text data into vectors to represent the concept of text by vectors.
The Linear layer is used for performing linear transformation on input data, specifically for performing inference operations on the input data through operations such as matrix multiplication and bias addition, so as to obtain a model operation result that meets the requirements. The Linear layer is one of the basic operators of the large language model and forms a basic unit of the large language model. The primary structure of an LLM is the decoder layer of the transformer structure, and the primary weight parameters in the decoder layer are basically concentrated in the Linear layer operators.
Since the weight parameters of the Linear layers are generally represented by floating point data types, for example, by the 32-bit floating point data type fp32, the memory access amount in the calculation process is very large; moreover, the number of Linear layers in a large language model is usually very large, generally tens of Linear layers, and with floating point data types each Linear layer requires a large amount of computing resources and memory access, which is unfavorable for smooth operation on the user terminal. Therefore, the weight parameters of each linear calculation layer can be quantized in advance into format data of integer data types, thereby reducing the memory access amount and the calculation amount in the Linear operation process. Specifically, the weight parameters of the linear calculation layer may be quantized in advance into data in the 4-bit binary integer (int4) format or the 8-bit binary integer (int8) format; since the int4 format has a smaller calculation amount and a smaller memory access amount during calculation, the weight parameters of the linear calculation layer may be quantized into int4 format data in advance.
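As an illustration of the weight quantization described above, the following is a minimal C++ sketch of per-row (per output channel) asymmetric quantization into the int4 range, producing the scale and zero point later needed for dequantization. The function names, the per-row granularity, and storing each int4 value in one byte are illustrative assumptions, not the patent's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedRow {
    std::vector<int8_t> q;  // int4 values, stored one per byte for clarity
    float scale;            // factor mapping the int range back to floats
    int zeroPoint;          // offset aligning the float zero with an integer
};

// Quantize one weight row of length ic into the unsigned int4 range 0..15.
QuantizedRow quantizeRowInt4(const float* w, size_t ic) {
    auto [mn, mx] = std::minmax_element(w, w + ic);
    float lo = *mn, hi = *mx;
    QuantizedRow r;
    r.scale = (hi - lo) / 15.0f;
    if (r.scale == 0.0f) r.scale = 1.0f;  // guard for a constant row
    r.zeroPoint = static_cast<int>(std::lround(-lo / r.scale));
    r.q.resize(ic);
    for (size_t i = 0; i < ic; ++i) {
        int v = static_cast<int>(std::lround(w[i] / r.scale)) + r.zeroPoint;
        r.q[i] = static_cast<int8_t>(std::clamp(v, 0, 15));
    }
    return r;
}
```

In a real deployment two int4 values would be packed per byte to realize the memory saving; one byte per value is used here only to keep the sketch readable.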
The large language model may further include an attention layer, an activation function layer, and the like, and a person skilled in the art may set each data processing layer included in the large language model according to a specific scene.
In the embodiment of the application, the tokenizer layers are used for converting text (string) into integers (int), and the weight parameters of the embedding layers can be represented by floating point data types so as to improve text conversion precision and vector conversion precision, thereby improving the reasoning accuracy of the model.
It will be appreciated that in a large language model, each layer has its own weight parameters, which may be a weight matrix describing the linear mapping between inputs and outputs, which determine the propagation and conversion process of the input data in the model network.
As shown in fig. 3, the abstract structure of the large language model generally includes an embedding layer, respective data transformation layers (block-1, block-2, ...), and a language model layer (Language Model Layer, abbreviated as lm) for evaluating text probability; the embedding layer is used for performing text vectorization conversion, each block performs operations on the output data of the previous layer, and the output result is obtained through the lm layer.
As shown in fig. 2, the method for processing data based on a large language model according to the first embodiment of the present application includes the following steps S110 to S140.
Step S110: input data is acquired.
The input data is the data input by the user on the user terminal, for example, "please compose a picture", "please explain the meaning of a", "please generate code for realizing function a", "please explain the execution logic of code b", "please interpret sentence c", and the like. The input data input by the user is question data with which the user queries the corresponding content, and the user terminal can process and infer the input data through each data processing layer of the large language model to output an output result that matches the input data.
Step S120: and carrying out vector conversion on the input data through an embedding layer of the large language model to obtain a floating point query vector of a floating point data type corresponding to the input data.
To improve model reasoning accuracy, the embedding layer of a large language model generally converts input data into a floating point query vector of a floating point data type when performing vector conversion on the input data, for example, converts the input data into a floating point query vector of fp32 or fp16 floating point type. Wherein fp16 floating point type is less computationally intensive and more computationally efficient than fp32 floating point type, and therefore, input data may be converted into a floating point query vector of fp16 floating point type.
Optionally, step S120 may input the input data into the embedding layer to obtain a floating point query vector of a floating point data type corresponding to the input data.
Step S130: the floating point query vector is converted into an integer query vector of an integer data type.
Specifically, the floating point query vector can be converted into an integer query vector in the 8-bit binary integer (int8) format or in the int4 format. Since the data volume corresponding to the input data is usually relatively small, converting the floating point query vector into an int8 integer query vector improves operation efficiency while still well ensuring the accuracy of the subsequent inference operations.
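A minimal C++ sketch of this runtime activation quantization (the "A8" half of W4A8, step S130), assuming symmetric per-tensor quantization; the function name and layout are illustrative, not the patent's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Dynamically quantize a floating point query vector to int8 at runtime.
// The returned scale is kept for dequantizing the matmul result later.
std::vector<int8_t> quantizeActivationInt8(const std::vector<float>& x,
                                           float* outScale) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    float scale = (amax > 0.0f) ? amax / 127.0f : 1.0f;  // int8 range -127..127
    std::vector<int8_t> q(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        int v = static_cast<int>(std::lround(x[i] / scale));
        q[i] = static_cast<int8_t>(std::clamp(v, -127, 127));
    }
    *outScale = scale;
    return q;
}
```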
Step S140: and calculating the weight parameter of the linear calculation layer and the integer query vector to obtain a query result corresponding to the input data.
In the embodiment of the present application, when the floating point query vector is converted into an int8 integer query vector and the weight parameters of the linear calculation layer are quantized into int4 format data, this quantization mode may be referred to as a W4A8 quantization strategy. Here, W stands for weight, and 4 means 4 bits, indicating that the weight parameters of the linear calculation layers of the large language model are quantized to 4 bits for storage, that is, quantized into int4 data; A stands for activation and refers to the runtime input data (that is, the above floating point query vector), and 8 means that the activation values are dynamically quantized from floating point numbers to 8 bits at runtime, that is, converted into an int8 integer query vector.
Specifically, the matrix multiplication is performed on the weight matrix (i.e., the weight parameter) of the linear calculation layer and the integer query vector, so as to obtain a query result corresponding to the input data.
In one embodiment, the step S140 may be implemented as follows step S141 to step S144.
Step S141: and converting the weight parameters of the linear calculation layer into weight parameters of a target integer type, wherein the target integer type is the same as the integer type corresponding to the integer query vector.
For example, when the integer query vector corresponds to an int8 type, the target integer type is also an int8 type. Step S141 converts the weight matrix of the linear computation layer into a data type identical to the integer query vector type corresponding to the input data, so as to facilitate the subsequent matrix operation.
Step S142: and carrying out matrix multiplication on the integer query vector and the weight parameter of the target integer type to obtain a matrix operation result.
Step S143: and performing inverse quantization on the matrix operation result to obtain an output result of the floating point data type.
Alternatively, the matrix operation result may be dequantized through the following steps S143a to S143b.
Step S143a: obtaining the scaling factor and the zero point used when the weight parameters of the linear calculation layer were quantized, where the scaling factor is the factor that maps floating point numbers to an integer range, and the zero point is the offset that maps the floating point zero value to its integer counterpart.
Step S143b: and according to the scaling factor and the zero point, performing inverse quantization on the matrix operation result to obtain an output result of the floating point data type.
Because the weight parameters used by the other processing layers of the large language model are usually floating point weight parameters, the inverse quantization operation ensures that the other inference layers of the large language model can continue to process the matrix operation result, so that the normal operation of the whole large language model is maintained.
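Under the symmetric-activation and asymmetric-weight assumptions used in the sketches above, the inverse quantization of steps S143a to S143b reduces to one closed-form expression. The following C++ sketch is illustrative and not the patent's exact formula.

```cpp
#include <cstdint>

// With qw[i] = round(w[i]/wScale) + wZero and qx[i] = round(x[i]/xScale),
// acc = sum_i qx[i]*qw[i] implies
//   sum_i x[i]*w[i] ~= xScale*wScale * (acc - wZero * sum_i qx[i]).
// qxSum (the sum of the quantized activations) can be accumulated during
// the matrix multiplication.
float dequantize(int32_t acc, int32_t qxSum,
                 float xScale, float wScale, int wZero) {
    return xScale * wScale *
           (static_cast<float>(acc) - static_cast<float>(wZero) * qxSum);
}
```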
Step S144: and determining a query result corresponding to the input data according to the output result of the floating point data type.
For example, the output result of the floating point data type may be converted into text, and the query result corresponding to the input data may be determined according to the obtained text.
The process of calculating the weight parameter of the linear calculation layer and the integer query vector (i.e., the process of step S141 to step S143) is further explained below by a specific example.
As shown in fig. 5, when the above-mentioned W4A8 quantization strategy is adopted for model operation, the weight parameters of the linear calculation layer are quantized to 4 bits and the input is a floating point query vector (fp16 is taken as an example here). During calculation, the weight parameters of the linear calculation layer quantized to the 4-bit type (i.e., int4) are read from memory, together with the scale and zero point information recorded when those weight parameters were quantized; the int4 weight parameters of the linear calculation layer are converted into int8 weight parameters, the floating point query vector (i.e., the input vector in fig. 5, of fp16 or fp32 floating point type) is quantized into an int8 integer query vector, and matrix multiplication is performed on the int8 integer query vector and the int8 weight parameters to obtain an int32 matrix operation result. The int32 matrix operation result is then dequantized using the scaling factor from quantizing the floating point query vector together with the scale and zero point information from quantizing the weight parameters, to obtain a floating point output result of the floating point data type.
Taking model operation with the above-mentioned W4A8 quantization strategy as an example, the floating point query vector generally needs to pass through multiple decoding layers of the large language model. Taking an LLM model with a 7B parameter scale as an example, linear calculation (matrix multiplication) is performed on the vector in each layer: assuming the input floating point query vector is A and the trained weight parameter of a decoding layer is W, the matrix multiplication A@W needs to be calculated, where @ represents matrix multiplication. If fp32 calculation is used in this process, that is, A and W are both in fp32 format, the CPU needs to read at least the full data of W from memory for each decoding layer's calculation, so each round of calculation requires 28GB of memory access; if the fp16 format is used, at least 14GB of memory access is needed; and if W uses 4 bits as in the W4A8 quantization strategy, each round needs only 3.5GB, so the total memory access can be greatly reduced. Using 8 bits for the activations reduces memory access further while also allowing int8 calculation instructions to be used, greatly improving calculation efficiency.
In one embodiment, before step S140, the following steps S140a to S140c may be further included.
Step S140a: and acquiring the single calculation dimension size corresponding to the calculation instruction supported by the user terminal.
For example, when the calculation instruction supported by the user terminal is the signed integer matrix multiply-accumulate (smmla) instruction, the specific operation of the smmla instruction is [2, 8] @ [2, 8] -> [2, 2], so the smmla instruction corresponds to a single-calculation dimension size of 8. When the calculation instruction supported by the user terminal is the signed dot product (sdot) instruction, the single-calculation dimension size corresponding to the sdot instruction is 4.
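For reference, here is a scalar C++ model of the tile that an smmla-style instruction computes according to the [2, 8] @ [2, 8] -> [2, 2] description above: int8 inputs, int32 accumulation, with the second operand's rows used as columns. This models the instruction's effect only; it is not the hardware intrinsic.

```cpp
#include <cstdint>

// C[2][2] += A[2][8] @ B[2][8]^T, accumulating into 32-bit integers.
// The caller initializes C (e.g., to zero) before the first tile.
void smmlaTile(const int8_t A[2][8], const int8_t B[2][8], int32_t C[2][2]) {
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 8; ++k)
                C[i][j] += static_cast<int32_t>(A[i][k]) *
                           static_cast<int32_t>(B[j][k]);
}
```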
Step S140b: and determining the size of the data block matched with the size of the dimension of the single calculation.
For example, when the single-calculation dimension size is 8, the data block size is also 8, and when the single-calculation dimension size is 4, the data block size is also 4.
Step S140c: and rearranging the data arrangement shapes of the integer query vector and the weight parameters of the linear calculation layer according to the data block size to obtain rearranged query vector and rearranged weight parameters.
Step S140c specifically obtains a rearranged query vector and rearranged weight parameters whose single-calculation dimension size is the same as the single-calculation dimension size corresponding to the calculation instruction.
In a specific embodiment, the data arrangement shape of the integer query vector is a two-dimensional matrix formed by the batch size of the integer query vector and the number of channels of the input data of the large language model, and the data arrangement shape of the weight parameters of the linear calculation layer is a two-dimensional matrix formed by the number of channels of the output feature map and the number of channels of the input feature map in the weight parameters of the linear calculation layer. Specifically, the data arrangement shape of the integer query vector may be [batch, ic] and the data arrangement shape of the weight parameters of the linear calculation layer may be [oc, ic]. Here, batch represents the batch size of the integer query vector, that is, the number of partition units corresponding to the input data; for example, if the input data is divided into 5 word partition units or 5 characters, batch is 5. The ic corresponding to the integer query vector represents the number of channels (i.e., the number of feature dimensions) of the input data of the large language model; for example, if the input data includes four feature vectors, such as time, place, age, etc., the corresponding ic is 4. The ic corresponding to the weight parameters represents the number of channels of the input feature map in the weight parameters of the linear calculation layer, and oc represents the number of channels of the output feature map in the weight parameters of the linear calculation layer.
Accordingly, the above step S140c may be implemented as follows: according to the data block size, performing block rearrangement on the number-of-channels dimension of the integer query vector (the number of channels of the input data of the large language model) to obtain the rearranged query vector; and performing block rearrangement on the number of channels of the output feature map and the number of channels of the input feature map of the weight parameters of the linear calculation layer according to the data block size to obtain the rearranged weight parameters.
Specifically, step S140c may be implemented as the following step S140c-1.
Step S140c-1: rearranging the data arrangement shape of the integer query vector into [ic/pack, batch, pack], and rearranging the data arrangement shape of the weight parameters of the linear calculation layer into [oc/pack, ic/pack, pack, pack].
Where pack represents the data chunk size described above.
Illustratively, the weight parameter size of a Linear layer is [4096,4096], which is rearranged to [512, 512, 8, 8] if the data block size is 8, and rearranged to [1024, 1024, 4, 4] if the data block size is 4.
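A minimal C++ sketch of the weight rearrangement [oc, ic] -> [oc/pack, ic/pack, pack, pack] described above, assuming oc and ic are multiples of pack; the layout and names are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Rearrange a row-major [oc, ic] int8 weight matrix into blocked layout
// [oc/pack, ic/pack, pack, pack] so that one block feeds one SIMD call.
std::vector<int8_t> repackWeights(const std::vector<int8_t>& w,
                                  int oc, int ic, int pack) {
    std::vector<int8_t> out(w.size());
    int ocBlocks = oc / pack, icBlocks = ic / pack;
    for (int ob = 0; ob < ocBlocks; ++ob)
        for (int ib = 0; ib < icBlocks; ++ib)
            for (int oi = 0; oi < pack; ++oi)        // output channel in block
                for (int ii = 0; ii < pack; ++ii) {  // input channel in block
                    int src = (ob * pack + oi) * ic + (ib * pack + ii);
                    int dst = ((ob * icBlocks + ib) * pack + oi) * pack + ii;
                    out[dst] = w[src];
                }
    return out;
}
```

For the [4096, 4096] weight above, repackWeights(w, 4096, 4096, 8) yields the [512, 512, 8, 8] layout.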
For example, when the user asks "please compose a landscape", 14 tokens may be generated through the tokenizer layer, that is, 14 partition units are generated, and the 14 tokens can be converted into a vector of size [14, 4096] through the embedding layer. Assuming the instruction set supported by the user's mobile phone includes an instruction corresponding to a data block size of 8, for example the smmla instruction, then pack = 8 and the input vector can be rearranged to a size of [512, 14, 8]; the weight parameters of the Linear layer have been rearranged in advance into [512, 512, 8, 8]. The calculation logic is thus executed as [512, 512, 8, 8] @ [512, 14, 8], where the first dimension 512 of [512, 512, 8, 8] is a dimension that can be processed by multiple threads in parallel, the first dimension 512 of [512, 14, 8] is the dimension of the loop calculation, 14 in [512, 14, 8] and the third dimension 8 of [512, 512, 8, 8] are the output dimensions, and 8 in [512, 14, 8] and the fourth dimension 8 in [512, 512, 8, 8] are the dimensions used when calculating with Single Instruction Multiple Data (SIMD) instructions.
Accordingly, step S140 may be implemented as follows: and taking the batch size of the integer query vectors and the data block size as the dimension of output, taking the data block size as the calculation dimension when the calculation instruction is calculated once, and calculating the rearranged query vectors and the rearranged weight parameters by using the calculation instruction to obtain a query result corresponding to the input data.
Specifically, step S140 may be implemented as follows step S145.
Step S145: and calculating the rearranged query vector and the rearranged weight parameter through the calculation instruction supported by the user terminal to obtain a query result corresponding to the input data.
Specifically, in step S145, the calculation of the dimension corresponding to oc/pack may be performed in parallel, the calculation of the dimension corresponding to ic/pack may be performed as a loop, batch and pack are taken as the output dimensions, pack is taken as the calculation dimension for a single calculation of the calculation instruction, and the rearranged query vector and the rearranged weight parameters are operated on with the calculation instruction to obtain the query result corresponding to the input data.
After the scheme provided by this embodiment rearranges the data arrangement shapes of the Linear layer weight parameters and the integer query vector corresponding to the input data, for example rearranging [4096, 4096] @ [14, 4096] into [512, 512, 8, 8] @ [512, 14, 8], the calculation logic inside the data loop becomes [8, 8] @ [14, 8] -> [8, 14]. Since the specific operation of the smmla hardware instruction is [2, 8] @ [2, 8] -> [2, 2], the single-calculation amount of the rearranged calculation logic is 8, the same as the single-calculation amount of the smmla instruction, so the [8, 8] @ [14, 8] -> [8, 14] logic can be completed directly by several smmla instructions. This makes it more convenient for the user terminal to perform the data operations; moreover, the rearranged data is read and written contiguously, so the data arrangement is more compact and the matrix operation can be executed more efficiently.
In one embodiment, the step S145 may be implemented as the following steps S145a to S145 b.
Step S145a: and obtaining the available number of the registers corresponding to the user terminal.
Step S145b: determining the number of loops for the calculation instruction according to the available number of registers and the batch size of the integer query vector, and, based on the number of loops, calculating the rearranged query vector and the rearranged weight parameters through the calculation instruction to obtain the query result corresponding to the input data.
When implementing the computation kernel (core function) in assembly language, that is, when using assembly language to generate the computation function of the underlying computation logic supporting the large language model, the number of registers available in the processor architecture (e.g., the ARM architecture) of the user terminal must be considered, so as to make full use of the available registers and improve computation efficiency.
Specifically, taking the example mentioned above, assume the number of input tokens is batch; the calculation scale is then [512, 512, 8, 8] @ [512, batch, 8], and the underlying calculation logic (i.e., the kernel) to be implemented is [8, 8] @ [batch, 8] -> [8, batch]. If this kernel is implemented directly without a loop, it requires at least 5 + batch registers. Based on the number of available registers (e.g., 32 registers on arm64), any batch less than 27 can be calculated without a loop; for example, calculation logic with batch of 12, 10, 8, 4, 2, 1, etc. can be calculated without a loop. When the number of available registers is 10, since at least 5 + batch registers are needed, no loop is required when batch is less than or equal to 5; when batch is greater than 5, a loop must be used to implement the underlying calculation logic [8, 8] @ [batch, 8] -> [8, batch]: for example, when batch is greater than 5 and less than or equal to 10, the number of loops is 2, and when batch is greater than 10 and less than or equal to 15, the number of loops is 3. The number of loops is thus determined so that each single run of the kernel uses as many of the available registers as possible.
In step S145b, the calculation instruction may repeat the calculation task of calculating the rearranged query vector and the rearranged weight parameters for the above number of loops, for example, repeating the calculation task [8, 8] @ [batch, 8] -> [8, batch] for the above number of loops, so as to obtain the query result corresponding to the input data.
The number of loops is determined from the number of registers available in the user terminal, so that a single calculation utilizes as many available registers as possible, further reducing the memory access amount.
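A minimal C++ sketch of this register-budget reasoning, assuming the 5 + batch register requirement from the example above (the overhead constant differs between real kernels):

```cpp
// Split the batch into the fewest kernel runs that fit the register budget.
int computeLoopCount(int batch, int availableRegs) {
    int maxBatchPerRun = availableRegs - 5;  // registers left for batch rows
    if (maxBatchPerRun < 1) maxBatchPerRun = 1;
    return (batch + maxBatchPerRun - 1) / maxBatchPerRun;  // ceiling division
}
```

With availableRegs = 10 this gives 1 loop for batch <= 5, 2 loops for batch <= 10, and 3 loops for batch <= 15, matching the example above.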
In one embodiment, step S145 may be implemented as follows steps S145 c-S145 d.
Step S145c: an available thread in the user terminal is determined.
Step S145d: and when the available threads are smaller than the parallel dimension threshold, calculating the rearranged query vector and the rearranged weight parameter in a parallel mode through each available thread by using the calculation instruction.
Illustratively, since the first dimension of the weight parameters after the data shape rearrangement is parallelizable, the work may be divided along this dimension according to the number of threads for parallel calculation. For example, for the calculation task [512, 512, 8, 8] @ [512, 14, 8] in the example above, where [512, 512, 8, 8] has a first dimension of 512, each thread may perform 128 data calculations; that is, each thread is responsible for the calculation task [128, 512, 8, 8] @ [512, 14, 8], with the parallel tasks divided equally. Alternatively, the amount of each parallel calculation task may be divided unevenly; for example, when the work cannot be divided evenly, the remainder may be assigned to the last thread.
The embodiment adopts a multithread parallel computing mode, so that the multi-core computing capability of the user terminal can be fully utilized, the multi-core performance is improved, and the computing efficiency of the large language model is improved.
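A minimal C++ sketch of dividing the parallelizable first weight dimension (the oc/pack blocks) across threads, with the remainder assigned to the last thread as described above; the thread API usage and names are illustrative.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Run work(begin, end) on numThreads threads over ocBlocks block rows.
void parallelOverOcBlocks(int ocBlocks, int numThreads,
                          const std::function<void(int, int)>& work) {
    int perThread = ocBlocks / numThreads;
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t) {
        int begin = t * perThread;
        // the last thread also takes the remainder of an uneven division
        int end = (t == numThreads - 1) ? ocBlocks : begin + perThread;
        pool.emplace_back(work, begin, end);
    }
    for (auto& th : pool) th.join();
}
```

For the [512, 512, 8, 8] @ [512, 14, 8] task with 4 threads, each thread would handle 128 of the 512 block rows.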
In one embodiment, the processing layers of the large language model except the embedded layer may be loaded in the memory of the user terminal, so that the large language model can perform information reasoning through the memory. The processing layers other than the embedding layer may include a text conversion tokenizer layer, a Linear calculation layer, an attention layer, an activation function layer, a loss function layer, etc., and may include data processing layers other than the embedding layer. The above-mentioned embedded layer can be deployed in the disk of the user terminal, for example, the model file corresponding to the embedded layer can be stored in the disk of the user terminal, and the embedded layer includes the weight parameters of the embedded layer corresponding to each data.
Correspondingly, the step S120 may be implemented as the following steps S121 to S122.
Step S121: and determining a target weight parameter corresponding to the input data from the weight parameters of each embedded layer included in the embedded layer, and loading the target weight parameter into a memory of the user terminal.
Step S122: and converting the input data into floating point query vectors of the corresponding floating point data types through the target weight parameters.
After it is detected that the process of determining the vector corresponding to the input data through the embedding layer is completed, the target weight parameters may be deleted from memory to reduce memory occupation.
For example, if the embedding layer weight consists of 151936 × 4096 floating point numbers, the embedding layer weight parameters corresponding to the token ids may be selected according to the partition unit identifier sequence (token id sequence, with ids in the range 0-151936) input to the embedding layer, where each id corresponds to an embedding weight of 4096 floating point numbers. The common implementation in the related art loads the entire embedding layer weight into memory and fetches the embedding layer weight parameters at the corresponding positions according to the id sequence; this occupies a large amount of user terminal memory and reduces the operation efficiency of the user terminal.
In the embodiment of the application, word embedding is realized by selecting, from the N × H embedding layer weight parameters, the rows corresponding to the ids among the N entries. For example, if the input sequence is [2, 9886, 32], only the floating point numbers corresponding to the three ids 2, 9886 and 32 are loaded from disk according to the offsets corresponding to those ids, that is, 3 × 4096 floating point numbers are loaded rather than 151936 × 4096 floating point numbers; this greatly reduces memory occupation and improves the calculation efficiency of the large language model.
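A minimal C++ sketch of this on-demand lookup, assuming the embedding weights are stored on disk as a flat binary file of [vocab, hidden] float values; the file layout, names, and float storage are illustrative assumptions.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Load only the embedding rows for the given token ids from disk.
std::vector<float> loadEmbeddingRows(const std::string& path,
                                     const std::vector<int>& tokenIds,
                                     size_t hidden /* e.g. 4096 */) {
    std::ifstream file(path, std::ios::binary);
    std::vector<float> out(tokenIds.size() * hidden);
    for (size_t i = 0; i < tokenIds.size(); ++i) {
        // each row is `hidden` floats; seek to the row for this id
        file.seekg(static_cast<std::streamoff>(tokenIds[i]) *
                   static_cast<std::streamoff>(hidden * sizeof(float)));
        file.read(reinterpret_cast<char*>(out.data() + i * hidden),
                  static_cast<std::streamsize>(hidden * sizeof(float)));
    }
    return out;
}
```

For the input sequence [2, 9886, 32] this reads 3 × 4096 floats instead of the full 151936 × 4096 table.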
Alternatively, the 16-bit brain floating point data type bf16 may be used to store the corresponding embedding layer weight parameters, which halves the data volume compared with the original embedding layer weight parameters.
According to the embodiment, the weight parameters of the embedded layer are stored in the disk, all the weight parameters are not required to be loaded into the memory, and the required parts are loaded into the memory in a file reading mode to realize word embedding, so that the operation efficiency of the user terminal can be improved.
In one embodiment, in step S122, the input data may be converted into the floating point query vector of the corresponding floating point data type according to the following steps S122a to S122 b.
Step S122a: the text conversion layer based on the large language model converts the input data into converted data corresponding to a data format supported by the large language model.
Specifically, the input data may be divided into partition units based on the tokenizer layer of the large language model, and the partition units may then be mapped through a vocabulary corresponding to a data format supported by the large language model, where the vocabulary includes the identifier corresponding to each partition unit, so as to obtain the identifier (token) corresponding to each partition unit of the input data; the identifiers corresponding to the partition units are the converted data.
Step S122b: and converting the converted data into floating point query vectors of the corresponding floating point data types through the target weight parameters.
According to this embodiment, the text conversion layer converts the input data into converted data corresponding to a data format supported by the large language model, and the corresponding vector conversion is then performed, so that the large language model can perform inference on different types of input data, expanding the language types supported by the large language model and making it more generally applicable.
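A minimal C++ sketch of the mapping in step S122a, using naive whitespace splitting to stand in for real subword segmentation; the data structures are illustrative assumptions.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Split the input into units and map each unit to its token id via the
// vocabulary. Real tokenizers use subword merging rather than whitespace.
std::vector<int> toTokenIds(const std::string& text,
                            const std::unordered_map<std::string, int>& vocab) {
    std::vector<int> ids;
    std::istringstream stream(text);
    std::string unit;
    while (stream >> unit) {
        auto it = vocab.find(unit);
        if (it != vocab.end()) ids.push_back(it->second);
    }
    return ids;
}
```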
In an embodiment, before step S110, the method may further include the following steps S110a to S110b.
Step S110a: and obtaining a model file corresponding to the large language model and used for being deployed at the user terminal, wherein the model file comprises each text conversion layer file, a linear calculation layer file and an embedded layer file, and each text conversion layer file is represented by a file in a unified text format.
The model files for deployment at the user terminal may be files of the Mobile Neural Network (MNN) type, or other model files suitable for deployment at the user terminal. MNN-type model files are well suited to deploying large language models on mobile user terminals such as mobile phones and tablet computers, improving the convenience of model deployment.
Because different large language models (LLMs) use different tokenizer models in different formats, text-based (such as the tiktoken model), json-based (such as the GPT2Tokenizer model), and protobuf-based (such as the sentencepiece model), supporting these tokenizers directly on the device side would require supporting multiple file formats and a large amount of code logic, increasing the size of the finally deployed binary. Exporting them into files of a unified text (txt) format reduces the code complexity of deploying the large language model and reduces the binary size corresponding to the tokenizer processing logic, so that the large language model is lighter and more convenient to deploy and run on the user terminal.
Step S110b: and loading the text conversion layer file and the linear calculation layer file into a memory of the user terminal, and deploying the embedded layer file into a disk of the user terminal.
According to the scheme of the embodiment, the large language model can be deployed on the user terminal in a lightweight mode.
The embodiment of the application also provides a method for exporting the large language model, which can export the model file and comprises the following steps A-E.
Step A: and partitioning the large language model based on different realized functions to obtain model partitions corresponding to the functions, wherein the model partitions comprise embedded layer partitions and linear calculation layer partitions.
Optionally, the model blocks may further include text conversion layer blocks. Step A abstracts the large language model into a plurality of substructures for block export, and the size of a single model can be reduced so as to be convenient for distribution.
And (B) step (B): and exporting the linear calculation layer blocks and the embedded layer blocks into a model file format which can be identified by a user terminal to obtain the user terminal embedded layer blocks and the user terminal linear calculation layer blocks.
For example, the linear calculation layer blocks and the embedding layer blocks may be exported as MNN model files. In the embodiment of the application, when exporting the model, the large language model may first be exported as a universal ONNX model, and the ONNX model may then be converted into an MNN model through a conversion tool such as mnnconvert; during the conversion process, the MNN model may be quantized into model encodings of data types such as 8-bit, 4-bit, bf16, and fp16.
Step C: and exporting the text conversion layer blocks into unified text format block files to obtain text conversion layer blocks in a text format.
In the process of exporting the text conversion layer blocks into block files of the unified text format, the special representation symbols in the text conversion layer may be converted into the texts of their original meanings, and the text conversion layer blocks may be saved using base64 encoding.
Step D: and quantizing the weight parameters of the linear calculation layer blocks of the user terminal into format data of integer data types to obtain quantized linear calculation layer blocks.
Step E: and determining the user terminal embedded layer block, the quantized linear calculation layer block and the text conversion layer block in the text format as model files corresponding to a large language model and used for deployment at the user terminal.
The execution subject of steps a to E may be the above-mentioned user terminal, in which case a communication connection may be established between the user terminal and the server side where the original large language model is deployed, so as to download and export a model file for deployment at the user terminal from the server side. The execution subject of the steps A-E can also be a server for deploying the original large language model, and the application is not particularly limited.
Alternatively, the step a may be implemented as the following steps a to c.
Step a: and partitioning the large language model based on different realized functions to obtain the to-be-optimized model partition corresponding to each function.
Step b: and replacing codes related to the data shape information in codes corresponding to the model blocks to be optimized with operator codes not directly related to the data shape information, and obtaining replaced blocks.
The large language model may have a dynamic shape problem during export: due to code implementation issues, only a single shape is supported when exporting the large language model to a model in ONNX format, so when the input shape changes at actual runtime, the running result of the large language model may be wrong; some shape-related code needs to be modified to solve this problem. The specific modification is to replace the code related to data shape information in the code corresponding to the model blocks to be optimized with operator code that is not directly related to the data shape information.
For example, some shape-related tensor operators, such as squeeze, unsqueeze, and transpose, may be combined and replaced by operators that are not directly related to shape information.
Step c: and deleting redundant data in the replaced blocks to obtain model blocks corresponding to the functions.
Specifically, operator fusion and/or redundant operator deletion may be performed on the calculation graphs in the replaced blocks, the position constant codes in the replaced blocks may be deleted, and other redundant data may be deleted.
The position constant code mainly refers to the cache used in many LLM implementations in which the position encoding is precomputed according to the maximum length; when exported to ONNX, each model block would generate a large constant (tens of megabytes) for it. Since this part of the data can be calculated from the length of the actual input, the position constant code can be deleted.
In one particular embodiment, when exporting a large language model, as shown in FIG. 4, export of the large language model may be accomplished by the following functional modules.
As shown in fig. 4, the code rewriting module (code refactor) may rewrite the pytorch code corresponding to the original large language model and correct the dynamic shape and redundant data problems mentioned above. Each model block may be abstracted through the block module class architecture, so as to realize unified calculation and export logic for each model block. The hierarchical module (block split) may split and export the decode blocks according to the hierarchical information of the llm, which has multiple layers: for example, the N decode layers are exported as N block models respectively, the norm layer of the llm is added to the last block model, and the linear transformation layer of the llm is combined with the sampling function (such as argmax) and exported as the lm layer. Because ONNX stores the weights separately when a file is larger than 2GB, which separates the calculation graph from the weights, the block-by-block export in this embodiment can avoid the problem of calculation graph and weight separation while reducing the size of a single file, making it convenient to distribute. Various tokenizer models may be rewritten into a unified txt format through the model rewriting module (tokenizer rewrite), and at the same time some special representation symbols are converted into their original meanings and saved using base64 encoding. The whole llm model may be exported block by block according to the abstract structure, or exported together, through the export module (onnx export). The onnx model may be converted into an mnn model through the conversion module (mnnconvert), and calculation graph optimization may be performed through the graph optimization module (graph optimize). The weight quantization module may quantize the floating point weight parameters of the model into 4-bit or 8-bit weight parameters. Based on the processing logic executed by these functional modules, the MNN model for deployment at the mobile terminal and the tokenizer txt in the unified text format corresponding to the tokenizer layer can be obtained. Since the execution of each functional module for model export in fig. 4 has been described above, the specific implementation of each functional module when exporting the large language model will not be repeated here.
When the scheme provided by the embodiment of the application is used to run the large language model on the user terminal, the large language model needs to be converted into the unified text file corresponding to the tokenizer (tokenizer text for short) and the mnn model files. While the large language model runs, the files in the mnn model other than the embedding layer file need to be loaded into memory, and the embedding layer file is stored on the disk of the user terminal. The inference framework of the user terminal is based on MNN, and the MNN framework loads the model and computes the output.
In other words, when the user terminal performs a data query through the large language model, the converted tokenizer file and mnn model files are stored on the user terminal; both kinds of files are loaded into memory when the large language model starts, except that the embedding part of the mnn model files does not need to be loaded and remains stored on the disk.
For example, after a user inputs a piece of text on the large language model interface of the user terminal to perform a data query, the user terminal may perform the inference query through the following steps 1 to 3.
Step 1: converting the array of integers to a vector by tokenizer converting the array of integers to an array of integers using Embedding;
Step 2: taking the vector in the step 1 as the input of an MNN model corresponding to a decoding block of a decoding layer, and performing reasoning of the model by MNN to obtain output; outputting an input execution inference serving as a next layer block, and sequentially executing processing layers of all models; the output of the last layer is transmitted to an lm layer for reasoning to obtain the token id of the next word;
Step 3: converting the token id into text through tokenizer, if streaming output is used, displaying the text in the generated content, continuing to acquire the next text by the current token id through the processes of the steps 1 and 2 until the terminal text is encountered, and ending the process to obtain a query result corresponding to the text input by the user.
The method for processing data based on a large language model described above is applied to a user terminal on which the large language model is deployed, with the weight parameters of each linear calculation layer of the large language model quantized in advance into format data of an integer data type. Compared with weight parameters of a floating point data type, weight parameters of an integer data type greatly reduce the memory access volume during matrix operations. Since the memory of a user terminal is usually small, performing matrix operations on the input data with integer weight parameters improves the computation efficiency of the large language model and reduces its demand on device resources, so that the user terminal can run the large language model more smoothly; at the same time, the integer weight parameters retain the information of the floating point weight parameters well, so the computation accuracy is ensured. When data is processed through the large language model in the scheme provided by the application, the embedding layer of the large language model first performs vector conversion on the input data to obtain a floating point query vector of a floating point data type corresponding to the input data; the floating point query vector is then converted into an integer query vector of an integer data type, and the weight parameters of the linear calculation layer are operated on with the integer query vector to obtain the query result corresponding to the input data. When the integer query vector of the input data participates in matrix multiplication and similar operations in the linear calculation layer, the memory access volume is further reduced, thereby improving the computation efficiency of the user terminal.
Therefore, when the large language model is deployed to run on the user terminal, quantizing the weight parameters of the large language model and the input data entered by the user into data of integer data types greatly reduces the model's memory access volume and computation load on the user terminal, so the scheme provided by the application is well suited to the small memory and limited computing resources of a user terminal, enabling the user terminal to run the large language model smoothly. In addition, since the distribution differences among the weight parameters of a linear calculation layer of a large language model are usually small, the distributions remain highly similar after each weight parameter is quantized to an integer; therefore, quantizing the weight parameters has little effect on the computation accuracy of the large language model. Moreover, since integer data types usually have a sufficient range to represent the input data of the large language model, the query vector corresponding to the input data can be fully quantized through the integer data type, which better ensures that no information of the input data is lost. Hence, by quantizing the weight parameters and the query vector, the application preserves the computation accuracy of the large language model running on the user terminal; and since the large language model is deployed on the user terminal, the user terminal can answer the user's queries without networking, which better protects the privacy of the content queried by the user and is more convenient for the user.
Embodiment Two
The second embodiment of the present application also provides an apparatus for processing data based on a large language model, corresponding to the method embodiment provided in the first embodiment. The apparatus is applied to a user terminal on which a large language model is deployed, with the weight parameters of each linear calculation layer of the large language model quantized in advance into format data of integer data types. Since the apparatus embodiment is substantially similar to the method embodiment, the description is relatively simple; for details of the relevant technical features and the effects to be achieved, refer to the corresponding description of the method embodiment for processing data based on a large language model provided above. As shown in fig. 6, the apparatus for processing data based on a large language model according to this embodiment includes:
an acquisition unit 201 for acquiring input data;
The vector conversion unit 202 is configured to perform vector conversion on the input data through an embedding layer of the large language model, so as to obtain a floating point query vector of a floating point data type corresponding to the input data; converting the floating point query vector into an integer query vector of an integer data type;
and the operation unit 203 is configured to obtain a query result corresponding to the input data by performing an operation on the weight parameter of the linear calculation layer and the integer query vector.
Optionally, the weight parameter of the linear computation layer is quantized into 4-bit binary integer int4 format data in advance;
The vector conversion unit is specifically configured to: the floating point query vector is converted into an integer query vector in the format of an 8-bit binary integer int 8.
Optionally, the operation unit is specifically configured to: converting the weight parameters of the linear calculation layer into weight parameters of a target integer type, wherein the target integer type is the same as the integer type corresponding to the integer query vector; matrix multiplication is carried out on the integer query vector and the weight parameter of the target integer type, so that a matrix operation result is obtained; performing inverse quantization on the matrix operation result to obtain an output result of the floating point data type; and determining a query result corresponding to the input data according to the output result of the floating point data type.
Optionally, the operation unit is specifically configured to: obtaining a scaling factor and a zero point when the weight parameters of the linear calculation layer are quantized, wherein the scaling factor is a scaling factor for mapping floating point numbers to an integer range, and the zero point is an offset for mapping zero values of the floating point numbers to zero values of the integers; and according to the scaling factor and the zero point, performing inverse quantization on the matrix operation result to obtain an output result of the floating point data type.
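The scaling factor and zero point relationship described here can be sketched as follows; the per-tensor granularity and numpy representation are illustrative assumptions, not the actual MNN implementation (int4 values would additionally be packed two to a byte).

```python
# A sketch of asymmetric quantization: scale maps the float range onto the
# integer range, and zero is the integer offset where float zero lands.
import numpy as np

def quantize(w, bits=8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)        # float range -> int range
    zero = int(round(qmin - w.min() / scale))          # offset of float zero
    q = np.clip(np.round(w / scale) + zero, qmin, qmax).astype(np.int8)
    return q, scale, zero

def dequantize(q, scale, zero):
    # Inverse quantization back to floating point, as applied to the
    # matrix operation result described above.
    return (q.astype(np.float32) - zero) * scale
```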
Optionally, the apparatus may further include:
a rearrangement unit, configured to obtain a single calculation dimension size corresponding to a calculation instruction supported by the user terminal; determining a data chunk size matching the single-computation dimension size; rearranging the data arrangement shapes of the integer query vector and the weight parameters of the linear calculation layer according to the data block size to obtain a rearranged query vector and rearranged weight parameters;
the operation unit is specifically configured to: and operating the rearranged query vector and the rearranged weight parameter through the calculation instruction to obtain a query result corresponding to the input data.
Optionally, the data arrangement shape of the integer query vector is [ batch, ic ], the data arrangement shape of the weight parameter of the linear computation layer is [ oc, ic ], where batch represents the batch size of the integer query vector, ic corresponding to the integer query vector represents the number of channels of the input data of the large language model, ic corresponding to the weight parameter represents the number of channels of the input feature map in the weight parameter of the linear computation layer, and oc represents the number of channels of the output feature map in the weight parameter of the linear computation layer;
The rearrangement unit is specifically configured to: rearrange the data arrangement shape of the integer query vector into [ ic/pack, batch, pack ], and rearrange the data arrangement shape of the weight parameters of the linear calculation layer into [ oc/pack, ic/pack, pack, pack ], wherein pack represents the data block size;
The operation unit is specifically configured to: run the computation over the dimension corresponding to oc/pack in parallel, compute the dimension corresponding to ic/pack in a loop, take batch and pack as the output dimensions and pack as the computation dimension of a single execution of the calculation instruction, and compute the rearranged query vector with the rearranged weight parameters using the calculation instruction to obtain the query result corresponding to the input data.
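The pack rearrangement of the [ batch, ic ] query vector and the [ oc, ic ] weight matrix can be sketched with numpy as follows; for brevity it is assumed that ic and oc divide evenly by pack.

```python
# Rearranging data into pack-sized blocks so one calculation instruction
# consumes a contiguous block; layouts follow the shapes described above.
import numpy as np

def repack_input(x, pack):
    # [batch, ic] -> [ic/pack, batch, pack]
    batch, ic = x.shape
    return x.reshape(batch, ic // pack, pack).transpose(1, 0, 2)

def repack_weight(w, pack):
    # [oc, ic] -> [oc/pack, ic/pack, pack, pack]
    oc, ic = w.shape
    return w.reshape(oc // pack, pack, ic // pack, pack).transpose(0, 2, 1, 3)
```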
Optionally, the operation unit is specifically configured to: acquiring the available number of registers corresponding to the user terminal; and determining the circulation times of calculation of the calculation instruction according to the available number of the registers and the batch size of the integer query vector, and calculating the rearranged query vector and the rearranged weight parameter through the calculation instruction based on the circulation times to obtain a query result corresponding to the input data.
Optionally, the operation unit is specifically configured to: determining available threads in the user terminal; the rearranged query vector and the rearranged weight parameters are operated in parallel through each of the available threads by using the calculation instruction.
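As a toy illustration of this multi-threaded execution, the oc/pack blocks can be distributed across the available threads; the Python thread pool and the compute_block stand-in below only illustrate the work split, not the MNN runtime's actual threading.

```python
# Splitting the oc/pack dimension across the terminal's available threads;
# compute_block stands in for one calculation-instruction kernel call.
from concurrent.futures import ThreadPoolExecutor
import os

def run_parallel(x_packed, w_packed, compute_block):
    n_threads = os.cpu_count() or 1            # the "available threads"
    oc_blocks = range(w_packed.shape[0])       # one task per oc/pack block
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(
            lambda ob: compute_block(x_packed, w_packed[ob]), oc_blocks))
    return results
```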
Optionally, processing layers of the large language model except for an embedding layer are loaded in a memory of the user terminal, the embedding layer is deployed in a disk of the user terminal, and the embedding layer comprises an embedding layer weight parameter corresponding to each data respectively;
The vector conversion unit is specifically configured to: determining a target weight parameter corresponding to the input data from all weight parameters included in the embedded layer, and loading the target weight parameter into the memory;
And converting the input data into floating point query vectors of corresponding floating point data types through the target weight parameters.
Optionally, the vector conversion unit is specifically configured to: converting the input data into converted data corresponding to a data format supported by the large language model based on a text conversion layer of the large language model; and converting the converted data into floating point query vectors of corresponding floating point data types through the target weight parameters.
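A minimal sketch of this on-demand embedding lookup is shown below, assuming the embedding layer file on disk is a flat binary of float32 rows indexed by token id; the file name and layout are illustrative assumptions, not the MNN file format.

```python
# Reading only the embedding rows needed for the current input from disk,
# so the full embedding table never has to be loaded into memory.
import numpy as np

def load_embedding_rows(path, token_ids, hidden_size):
    bytes_per_row = hidden_size * 4                    # float32 rows
    rows = []
    with open(path, "rb") as f:
        for tid in token_ids:
            f.seek(tid * bytes_per_row)                # jump to this token's row
            rows.append(np.frombuffer(f.read(bytes_per_row), dtype=np.float32))
    return np.stack(rows)                              # [len(token_ids), hidden_size]
```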
Optionally, the apparatus further comprises:
the model deployment unit is used for acquiring model files corresponding to the large language model and used for being deployed at the user terminal, wherein the model files comprise text conversion layer files, linear calculation layer files and embedded layer files, and the text conversion layer files are represented by files in a unified text format; and loading the text conversion layer file and the linear calculation layer file in the memory of the user terminal, and deploying the embedded layer file in a disk of the user terminal.
Optionally, the apparatus further comprises:
The export unit is used for partitioning the large language model based on different realized functions to obtain model partitions corresponding to the functions, wherein the model partitions comprise embedded layer partitions, text conversion layer partitions and linear calculation layer partitions; exporting the linear calculation layer blocks and the embedded layer blocks into a model file format which can be identified by a user terminal, and obtaining the user terminal embedded layer blocks and the user terminal linear calculation layer blocks; exporting each text conversion layer block into a unified text format block file to obtain a text conversion layer block in a text format; and determining the user terminal embedded layer block, the user terminal linear calculation layer block and the text conversion layer block in the text format as model files corresponding to the large language model and used for deployment at the user terminal.
Optionally, the export unit is specifically configured to: partition the large language model based on the different functions realized to obtain to-be-optimized model blocks corresponding to the functions; replace code related to data shape information in the code corresponding to the to-be-optimized model blocks with operator code not directly related to data shape information to obtain replaced blocks; and delete redundant data in the replaced blocks to obtain the model blocks corresponding to the functions.
Optionally, the export unit is specifically configured to perform at least one of:
performing operator merging and/or redundant operator deleting on the calculation graphs in the replaced sub-blocks;
and deleting the position constant codes in the replaced partition.
The third embodiment of the present application further provides an electronic device embodiment corresponding to the method for processing data based on a large language model provided in the first embodiment, where the electronic device is a user terminal, and a large language model is deployed on the user terminal, and weight parameters of each linear calculation layer of the large language model are quantized into format data of an integer data type in advance; the following description of the embodiments of the electronic device is merely illustrative. The electronic device embodiment is as follows:
The above-mentioned electronic device may be understood with reference to fig. 7, which is a schematic diagram of the electronic device. The electronic device provided in this embodiment includes: a processor 1001, a memory 1002, a communication bus 1003, and a communication interface 1004;
The memory 1002 is used for storing computer instructions for data processing which, when read and executed by the processor 1001, perform the steps of:
Acquiring input data;
vector conversion is carried out on the input data through an embedding layer of the large language model, and a floating point query vector of a floating point data type corresponding to the input data is obtained;
converting the floating point query vector into an integer query vector of an integer data type;
And calculating the weight parameter of the linear calculation layer and the integer query vector to obtain a query result corresponding to the input data.
The fourth embodiment of the present application also provides a computer-readable storage medium for implementing the method described in the first embodiment. The embodiment of the computer-readable storage medium is described relatively simply; for relevant details, refer to the corresponding description of the method embodiment above. The embodiment described below is merely illustrative.
The computer readable storage medium provided by the embodiment stores computer instructions, and the computer readable storage medium can be applied to a user terminal, wherein a large language model is deployed on the user terminal, and weight parameters of each linear calculation layer of the large language model are quantized into format data of integer data types in advance; the instructions, when executed by the processor, perform the steps of:
Acquiring input data;
vector conversion is carried out on the input data through an embedding layer of the large language model, and a floating point query vector of a floating point data type corresponding to the input data is obtained;
converting the floating point query vector into an integer query vector of an integer data type;
And calculating the weight parameter of the linear calculation layer and the integer query vector to obtain a query result corresponding to the input data.
The fifth embodiment of the present application also provides a computer program product for implementing the method described in the first embodiment. The embodiment of the computer program product is described relatively simply; for relevant details, refer to the corresponding description of the method embodiment above. The embodiment described below is merely illustrative.
The computer program product provided by the present embodiment comprises a computer program which, when executed by a processor, performs the steps of:
Acquiring input data;
vector conversion is carried out on the input data through an embedding layer of the large language model, and a floating point query vector of a floating point data type corresponding to the input data is obtained;
converting the floating point query vector into an integer query vector of an integer data type;
And calculating the weight parameter of the linear calculation layer and the integer query vector to obtain a query result corresponding to the input data.
The sixth embodiment of the present application also provides a large language model including:
A text conversion tokenizer layer for converting the input data into a text data format supported by the large language model;
an embedding embedding layer for converting the text data into vectors;
and a transformation transformer layer, wherein the transformer layer comprises linear computation Linear layers, the weight parameters of the Linear layers are quantized into format data of integer data types, and the Linear layers are used for performing linear transformation on the input vector data to obtain a model operation result.
The various computational layers of the large language model provided in this embodiment have been described in detail in the first embodiment and will not be described in detail here.
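For illustration only, the composition of the three layers can be sketched as follows; every class name is hypothetical, and QuantLinear's int8 storage with a single scale and zero point is an assumed quantization scheme rather than the patent's exact format.

```python
# A toy composition of tokenizer layer, embedding layer, and transformer
# layer whose Linear weights are stored as integers.
import numpy as np

class QuantLinear:
    def __init__(self, q_weight, scale, zero):
        self.q_weight, self.scale, self.zero = q_weight, scale, zero  # int8 weights

    def __call__(self, x):
        w = (self.q_weight.astype(np.float32) - self.zero) * self.scale  # dequantize
        return x @ w.T                                 # linear transformation

class LargeLanguageModel:
    def __init__(self, tokenizer, embedding, blocks):
        self.tokenizer = tokenizer    # text -> token ids
        self.embedding = embedding    # token ids -> vectors
        self.blocks = blocks          # transformer layers containing QuantLinear

    def __call__(self, text):
        x = self.embedding(self.tokenizer(text))
        for block in self.blocks:
            x = block(x)
        return x                      # model operation result
```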
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While the application has been described in terms of preferred embodiments, it is not intended to be limiting, but rather, it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims (17)

1. A method for processing data based on a large language model, characterized in that the method is applied to a user terminal, the user terminal is deployed with a large language model, and weight parameters of each linear calculation layer of the large language model are quantized in advance into format data of integer data types; the method comprises the following steps:
Acquiring input data;
vector conversion is carried out on the input data through an embedding layer of the large language model, and a floating point query vector of a floating point data type corresponding to the input data is obtained;
converting the floating point query vector into an integer query vector of an integer data type;
And calculating the weight parameter of the linear calculation layer and the integer query vector to obtain a query result corresponding to the input data.
2. The method according to claim 1, wherein the calculating the weight parameter of the linear calculation layer and the integer query vector to obtain the query result corresponding to the input data comprises:
converting the weight parameters of the linear calculation layer into weight parameters of a target integer type, wherein the target integer type is the same as the integer type corresponding to the integer query vector;
matrix multiplication is carried out on the integer query vector and the weight parameter of the target integer type, so that a matrix operation result is obtained;
Performing inverse quantization on the matrix operation result to obtain an output result of the floating point data type;
And determining a query result corresponding to the input data according to the output result of the floating point data type.
3. The method according to claim 2, wherein said dequantizing said matrix operation result to obtain an output result of a floating point data type comprises:
obtaining a scaling factor and a zero point when the weight parameters of the linear calculation layer are quantized, wherein the scaling factor is a scaling factor for mapping floating point numbers to an integer range, and the zero point is an offset for mapping zero values of the floating point numbers to zero values of the integers;
and according to the scaling factor and the zero point, performing inverse quantization on the matrix operation result to obtain an output result of the floating point data type.
4. The method of claim 1, wherein before the calculating the weight parameter of the linear calculation layer and the integer query vector, the method further comprises:
Acquiring a single calculation dimension corresponding to a calculation instruction supported by the user terminal;
determining a data chunk size matching the single-computation dimension size;
Rearranging the data arrangement shapes of the integer query vector and the weight parameters of the linear calculation layer according to the data block size to obtain a rearranged query vector and rearranged weight parameters;
The operation is performed on the weight parameter of the linear calculation layer and the integer query vector to obtain a query result corresponding to the input data, including:
and operating the rearranged query vector and the rearranged weight parameter through the calculation instruction to obtain a query result corresponding to the input data.
5. The method according to claim 4, wherein the data arrangement shape of the integer query vector is a two-dimensional matrix formed by the batch size of the integer query vector and the number of channels of the input data of the large language model, and the data arrangement shape of the weight parameter of the linear computation layer is a two-dimensional matrix formed by the number of channels of the output feature map in the weight parameter of the linear computation layer and the number of channels of the input feature map in the weight parameter of the linear computation layer;
Rearranging the data arrangement shape of the integer query vector and the weight parameter of the linear calculation layer according to the data block size to obtain a rearranged query vector and a rearranged weight parameter, wherein the rearranging step comprises the following steps:
According to the data block size, carrying out block rearrangement on the channel number of the input data of the large language model of the integer query vector to obtain a rearranged query vector; according to the data block size, carrying out block rearrangement on the channel number of the output characteristic diagram of the weight parameter of the linear calculation layer and the channel number of the input characteristic diagram of the linear calculation layer to obtain rearranged weight parameters;
The operation on the rearranged query vector and the rearranged weight parameter through the calculation instruction is performed to obtain a query result corresponding to the input data, and the operation comprises the following steps:
And taking the batch size of the integer query vectors and the data block size as the dimension of output, taking the data block size as the calculation dimension when the calculation instruction is calculated once, and calculating the rearranged query vectors and the rearranged weight parameters by using the calculation instruction to obtain a query result corresponding to the input data.
6. The method of claim 5, wherein the computing the rearranged query vector and the rearranged weight parameter using the computing instruction to obtain the query result corresponding to the input data comprises:
acquiring the available number of registers corresponding to the user terminal;
And determining the circulation times of calculation of the calculation instruction according to the available number of the registers and the batch size of the integer query vector, and calculating the rearranged query vector and the rearranged weight parameter through the calculation instruction based on the circulation times to obtain a query result corresponding to the input data.
7. The method of claim 5, wherein the operating the rearranged query vector with the rearranged weight parameters using the calculation instructions comprises:
Determining available threads in the user terminal;
the rearranged query vector and the rearranged weight parameters are operated in parallel through each of the available threads by using the calculation instruction.
8. The method according to any one of claims 1 to 7, wherein processing layers of the large language model other than an embedding layer are loaded in a memory of the user terminal, the embedding layer is deployed in a disk of the user terminal, and the embedding layer includes an embedding layer weight parameter corresponding to each data;
The vector conversion is carried out on the input data through the embedding layer of the large language model to obtain a floating point query vector of a floating point data type corresponding to the input data, which comprises the following steps:
determining a target weight parameter corresponding to the input data from all weight parameters included in the embedded layer, and loading the target weight parameter into the memory;
And converting the input data into floating point query vectors of corresponding floating point data types through the target weight parameters.
9. The method of claim 8, wherein said converting the input data into a floating point query vector of a corresponding floating point data type by the target weight parameter comprises:
converting the input data into converted data corresponding to a data format supported by the large language model based on a text conversion layer of the large language model;
And converting the converted data into floating point query vectors of corresponding floating point data types through the target weight parameters.
10. The method of claim 9, wherein prior to the acquiring input data, the method further comprises:
the method comprises the steps that model files corresponding to the large language model and used for being deployed at a user terminal are obtained, wherein the model files comprise text conversion layer files, linear calculation layer files and embedded layer files, and the text conversion layer files are represented by files in a unified text format;
and loading the text conversion layer file and the linear calculation layer file in the memory of the user terminal, and deploying the embedded layer file in a disk of the user terminal.
11. The method according to claim 10, wherein the model file is derived by:
Partitioning the large language model based on different realized functions to obtain model partitions corresponding to the functions, wherein the model partitions comprise embedded layer partitions, text conversion layer partitions and linear calculation layer partitions;
exporting the linear calculation layer blocks and the embedded layer blocks into a model file format which can be identified by a user terminal, and obtaining the user terminal embedded layer blocks and the user terminal linear calculation layer blocks;
Exporting each text conversion layer block into a unified text format block file to obtain a text conversion layer block in a text format;
And determining the user terminal embedded layer block, the user terminal linear calculation layer block and the text conversion layer block in the text format as model files corresponding to the large language model and used for deployment at the user terminal.
12. The method according to claim 11, wherein the partitioning the large language model based on the implemented functions to obtain model partitions corresponding to the functions includes:
partitioning the large language model based on different realized functions to obtain to-be-optimized model partitions corresponding to the functions;
replacing codes related to the data shape information in codes corresponding to the to-be-optimized model blocks with operator codes not directly related to the data shape information to obtain replaced blocks;
And deleting redundant data in the replaced blocks to obtain model blocks corresponding to the functions.
13. The method of claim 12, wherein the deleting redundant data in the replaced partition comprises at least one of:
performing operator merging and/or redundant operator deleting on the calculation graphs in the replaced sub-blocks;
and deleting the position constant codes in the replaced partition.
14. A large language model, comprising:
A text conversion tokenizer layer for converting the input data into a text data format supported by the large language model;
an embedding embedding layer for converting the text data into vectors;
and a transformation transformer layer, wherein the transformer layer comprises linear computation Linear layers, the weight parameters of the Linear layers are quantized into format data of integer data types, and the Linear layers are used for performing linear transformation on the input vector data to obtain a model operation result.
15. An electronic device, comprising: a processor, a memory, and computer program instructions stored on the memory and executable on the processor; the processor, when executing the computer program instructions, implements the method of any of the preceding claims 1-13.
16. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any of the preceding claims 1-13.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-13.
CN202410398600.1A 2024-04-02 2024-04-02 Method for processing data based on large language model, large language model and electronic equipment Active CN117992578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410398600.1A CN117992578B (en) 2024-04-02 2024-04-02 Method for processing data based on large language model, large language model and electronic equipment

Publications (2)

Publication Number Publication Date
CN117992578A true CN117992578A (en) 2024-05-07
CN117992578B CN117992578B (en) 2024-07-02

Family

ID=90887867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410398600.1A Active CN117992578B (en) 2024-04-02 2024-04-02 Method for processing data based on large language model, large language model and electronic equipment

Country Status (1)

Country Link
CN (1) CN117992578B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345939A (en) * 2017-01-25 2018-07-31 微软技术许可有限责任公司 Neural network based on fixed-point calculation
CN114118347A (en) * 2020-08-28 2022-03-01 辉达公司 Fine-grained per-vector scaling for neural network quantization
CN113569193A (en) * 2021-01-25 2021-10-29 腾讯科技(深圳)有限公司 Matrix vector operation method, equipment and storage medium for neural network model
CN113011571A (en) * 2021-03-03 2021-06-22 华南理工大学 INT8 offline quantization and integer inference method based on Transformer model
CN114090740A (en) * 2021-11-19 2022-02-25 北京有竹居网络技术有限公司 Intention recognition method and device, readable medium and electronic equipment
CN113835758A (en) * 2021-11-25 2021-12-24 之江实验室 Winograd convolution implementation method based on vector instruction accelerated computation
CN114626516A (en) * 2022-03-24 2022-06-14 南京大学 Neural network acceleration system based on floating point quantization of logarithmic block
US20230409882A1 (en) * 2022-06-17 2023-12-21 Ibrahim Ahmed Efficient processing of transformer based models
CN114936619A (en) * 2022-06-21 2022-08-23 上海西井信息科技有限公司 Model quantization method, device, equipment and storage medium
CN116992965A (en) * 2023-09-27 2023-11-03 之江实验室 Reasoning method, device, computer equipment and storage medium of transducer large model
CN117371508A (en) * 2023-09-28 2024-01-09 北京百度网讯科技有限公司 Model compression method, device, electronic equipment and storage medium
CN117574970A (en) * 2023-10-31 2024-02-20 重庆觉晓科技有限公司 Inference acceleration method, system, terminal and medium for large-scale language model
CN117669498A (en) * 2023-12-29 2024-03-08 联想(北京)有限公司 Information processing method and device
CN117574976A (en) * 2024-01-16 2024-02-20 北京大学 Large language model software and hardware collaborative quantization acceleration calculation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN Yanli; YE Jiongyao: "Convolutional Neural Network Compression Method Based on Pruning and Quantization", Computer Science, vol. 47, no. 08, 31 August 2020 (2020-08-31) *

Also Published As

Publication number Publication date
CN117992578B (en) 2024-07-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant