CN117271780B - Method and system for compressing context based on large language model - Google Patents

Method and system for compressing context based on large language model

Info

Publication number
CN117271780B
CN117271780B (application CN202311546547.7A)
Authority
CN
China
Prior art keywords
text
compressed
language model
large language
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311546547.7A
Other languages
Chinese (zh)
Other versions
CN117271780A (en)
Inventor
曹自强
高俊
曹敏
付国宏
施屹然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202311546547.7A priority Critical patent/CN117271780B/en
Publication of CN117271780A publication Critical patent/CN117271780A/en
Application granted granted Critical
Publication of CN117271780B publication Critical patent/CN117271780B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of large language models and discloses a method and a system for compressing context based on a large language model itself. The method comprises the following steps: acquiring a text to be compressed, and adding a task description, separators and compression slots; when GPU resources are scarce, compressing the text to be compressed with an existing large language model and additionally training only a projection layer, and when GPU resources are abundant, pre-training the large language model itself to compress the text; and performing inference with the trained large language model to generate text replies. In the method and system provided by the invention, the task prompt participates in the compression process, so the compressed virtual characters are more purposeful and better outputs are generated. The method can reuse an existing large language model or deploy compression in the pre-training stage, so the context is compressed by the large language model itself without training an additional compressor.

Description

Method and system for compressing context based on large language model
Technical Field
The invention relates to the technical field of large language models, in particular to a method and a system for compressing a context based on a large language model.
Background
The currently common context compression methods can be summarized as extraction-based compression and soft-prompt-based compression.
Extraction-based compression means that, given the context to be compressed, a model directly extracts key information from it, thereby reducing the input length of the original context.
Soft-prompt-based compression means that, given the context to be compressed, a model compresses it into a set of virtual characters (a soft prompt).
Disadvantages of the current technology:
1. Extraction-based compression increases the perplexity of the language model because the compression result is not a complete sentence. Although the extracted context reduces the input length, the increased perplexity also leads to performance degradation.
2. Traditional soft-prompt-based compression mainly trains a compressor to compress the context into a soft prompt. However, the compressor usually needs a large amount of training data, training it demands considerable compute because of the huge number of parameters of a large language model, and the approach cannot exploit the knowledge the language model learned during pre-training. In addition, conventional soft-prompt-based compression only supports full compression, i.e., compressing the given context entirely, from the beginning, into a soft prompt whose position is fixed. In many tasks, however, such as question answering and query-focused summarization, the compression should be conditioned on the specific question or query.
Compression serves two purposes: to let the model take in more content and thus obtain additional information, and to reuse compression results to speed up inference and save compute. Current soft-prompt-based methods achieve only the reuse and acceleration effects; because they support only full compression, the model input is merely a short soft prompt, which wastes a large part of the input window and runs counter to the first purpose of compression.
Disclosure of Invention
The present invention has been made in view of the above-described problems.
Therefore, the technical problem solved by the invention is as follows: traditional soft-prompt-based compression methods need a large amount of data to train a compressor with a huge number of training parameters, cannot exploit pre-training knowledge, and only support full compression, and therefore fail to serve the purpose of compression, namely making it possible to input ultra-long text that otherwise could not be fed to the model.
In order to solve the above technical problems, the invention provides the following technical scheme: a method for compressing a context based on a large language model itself, comprising: acquiring a text to be compressed, and adding a task description, separators and compression slots.
When GPU resources are scarce, the text to be compressed is compressed with an existing large language model and a projection layer is additionally trained; when GPU resources are abundant, the large language model itself is pre-trained to compress the text to be compressed.
Inference is performed with the trained large language model to generate text replies.
As a preferred embodiment of the method for compressing a context based on a large language model according to the present invention, the method comprises: the step of adding the task description, separators and compression slots comprises splicing the task description, the text to be compressed and a continuous mask sequence into a new sequence:
sequence = (x_p, x_c, x_m)
x_m = [M][M]…[M]
wherein x_p represents the task description, x_c represents the text to be compressed, [M] represents a compression slot, and x_m represents the continuous mask sequence.
As a preferred embodiment of the method for compressing a context based on a large language model according to the present invention, the method comprises: the step of compressing the text to be compressed with the existing large language model and additionally training a projection layer comprises generating compressed virtual characters of the compressed text by using the large language model itself as the compressor.
The sequence is fed into the parameter-frozen large language model LLM to perform the encoding operation, and the last hidden layer corresponding to x_m in the encoder is expressed as:
H = (h_1, h_2, …, h_K)
wherein h_i represents the hidden-layer state corresponding to the i-th compression slot.
The hidden layer contains the refined, summarized context information, expressed as:
(_, _, H) = LLM-ENC(x_p, x_c, x_m)
where _ indicates that the corresponding output is discarded.
A linear projection layer W_p is established, and H is fed into the projection layer; through the linear transformation, it is projected from the encoding output representation space into the input representation space of the large language model and converted into compressed virtual characters that the large language model can understand:
q_i = W_p h_i
wherein q_i represents the i-th compressed virtual character.
As a preferred embodiment of the method for compressing a context based on a large language model according to the present invention, the method comprises: compressing the text to be compressed with the existing large language model further comprises using the large language model itself as the decoder to decode according to the compressed virtual characters and the uncompressed context, generating a loss function against the gold-standard text.
The gold-standard text is expressed as:
x_g = (w_1, w_2, …, w_N)
wherein w_i represents the i-th word of the gold-standard text.
w_i is input into the large language model LLM to perform the decoding operation, obtaining the last hidden layer:
h'_i = LLM-DEC(w_i)
The final probability output is obtained through the softmax function:
p_t = p(w_i | w_{<i}) = softmax(h'_i)
wherein w_{<i} represents the previously input words.
The loss function L is expressed as:
L(θ) = −(1/N) Σ_{t=1}^{N} log p_t
wherein N represents the length of the gold-standard text, θ represents the model parameters, w_t represents the t-th word of the gold-standard text, and p_t represents the t-th probability output.
The partial derivative of the loss function L with respect to the model parameters θ is taken to obtain the gradient g:
g = ∇_θ L(θ)
With the mini-batch gradient descent method, the gradient g_i of each sample in a mini-batch is computed and the mean of the per-sample gradients is calculated:
ḡ = (1/b) Σ_{i=1}^{b} g_i
The mini-batch average gradient ḡ is multiplied by the learning rate α and used to update the model parameters θ:
θ ← θ − α·ḡ
wherein b represents the batch size and α represents the learning rate.
As a preferred embodiment of the method for compressing a context based on a large language model according to the present invention, the method comprises: pre-training the large language model itself to compress the text to be compressed comprises pre-training the whole large language model, with the following steps:
The sequence is fed into the large language model LLM, whose parameters are not frozen, for pre-training.
The encoding operation is performed, and the last hidden layer corresponding to x_m in the encoder is expressed as:
H' = (h'_1, h'_2, …, h'_K)
wherein h'_i represents the hidden-layer state corresponding to the i-th compression slot.
The hidden layer contains the refined, summarized context information, expressed as:
(_, _, H') = LLM-ENC(x'_p, x'_c, x'_m)
The decoding operation is performed: decoding is carried out according to the compressed virtual characters and the uncompressed context to obtain the text reply, and a cross-entropy loss function against the gold-standard text is generated.
The gold-standard text is expressed as:
x'_g = (w'_1, w'_2, …, w'_N)
wherein w'_i represents the i-th word of the gold-standard text.
w'_i is input into the large language model LLM to perform the decoding operation, obtaining the last hidden layer:
h''_i = LLM-DEC(w'_i)
The final probability output is obtained through the softmax function:
p'_t = p'(w'_i | w'_{<i}) = softmax(h''_i)
wherein w'_{<i} represents the previously input words.
The loss function L' is expressed as:
L'(θ') = −(1/N) Σ_{t=1}^{N} log p'_t
wherein θ' represents the model parameters, w'_t represents the t-th word of the gold-standard text, and p'_t represents the t-th probability output.
The partial derivative of the loss function L' with respect to the model parameters θ' is taken to obtain the gradient g':
g' = ∇_{θ'} L'(θ')
With the mini-batch gradient descent method, the gradient g'_i of each sample in a mini-batch is computed and the mean of the per-sample gradients is calculated:
ḡ' = (1/b) Σ_{i=1}^{b} g'_i
The mini-batch average gradient ḡ' is multiplied by the learning rate α and used to update the model parameters θ':
θ' ← θ' − α·ḡ'
wherein b represents the batch size and α represents the learning rate.
As a preferred embodiment of the method for compressing a context based on a large language model according to the present invention, the method comprises: the reasoning includes a dynamic interaction process:
before inputting the text to be compressed, the user selects whether the text marking keyword is self-behaving as task description:
if the user selects self-labeling, the user considers important words and sentences in the labeling text, the system generates corresponding task description based on the user labeling, splices the task description with the text to be compressed, and gives a reply through model reasoning.
If the user selects the keyword which is not marked, matching task description from a preset task description database, splicing the task description with the text to be compressed, and giving a reply through model reasoning.
As a preferred embodiment of the method for compressing a context based on a large language model according to the present invention, the method comprises: the giving a reply through model reasoning comprises the steps of generating compressed virtual characters of compressed text by using a large language model as a compressor and generating a reply according to the compressed virtual characters and an uncompressed context by using the large language model as a decoder.
When generating a reply with the trained compression model, the compressed virtual characters and the generation results of the previous t steps are input, the probability p_t (or p'_t) of step t is computed, and the word with the maximal probability is output, until the model outputs EOS.
Generating the text reply is represented as:
output = LLM(x_p, q, x_k)
wherein q represents all the compressed virtual characters and x_k represents the uncompressed text.
A system for compressing a context based on a large language model itself, comprising:
a preprocessing module, configured to acquire the text to be compressed and add a task description, separators and compression slots;
a model training module, configured to compress the text to be compressed with an existing large language model and additionally train a projection layer when GPU resources are scarce, and to pre-train the large language model itself to compress the text when GPU resources are abundant;
an inference-and-reply-generation module, configured to run inference with the trained large language model to generate text replies.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as described above when the processor executes the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor realizes the steps of the method as described above.
The invention has the following beneficial effects: the task prompt participates in the compression process, so the compressed virtual characters are more purposeful and better outputs are generated. Partial compression, full compression and spliced compression are supported simultaneously, and the number and positions of the compressed virtual characters are not fixed. The method can reuse an existing large language model or deploy compression in the pre-training stage, so the context is compressed by the large language model itself without training an additional compressor. The compressed virtual characters can also measure the similarity between texts, so they can be used for retrieval.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a general flow chart of a method for compressing a context based on a large language model itself according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a Decoder-only model architecture of a method for compressing a context based on a large language model according to a first embodiment of the present invention;
FIG. 3 is a training flowchart of a method for compressing a context based on a large language model itself according to a first embodiment of the present invention;
FIG. 4 is a flowchart of a pre-training large language model of a method for compressing context based on the large language model itself according to a first embodiment of the present invention;
FIG. 5 is a flow chart of reasoning of a method for compressing context based on a large language model itself according to a first embodiment of the present invention;
FIG. 6 is a flow chart of virtual character retrieval for a method of compressing a context based on a large language model itself according to a first embodiment of the present invention;
FIG. 7 is a graph showing the comparison of inference performance of various models of a method for compressing context based on a large language model itself according to a second embodiment of the present invention;
FIG. 8 is a partially compressed example diagram of a method for compressing a context based on a large language model itself, according to a second embodiment of the present invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Example 1
Referring to fig. 1 to 6, for one embodiment of the present invention, a method for compressing a context based on a large language model is provided, including:
s1: and acquiring a text to be compressed, and adding a task description, a separator and a compression slot.
Adding the task description, separators and compression slots includes concatenating the task description, the text to be compressed and a continuous mask sequence into a new sequence:
sequence = (x_p, x_c, x_m)
x_m = [M][M]…[M]
wherein x_p represents the task description, x_c represents the text to be compressed, [M] represents a compression slot, and x_m represents the continuous mask sequence.
It should be noted that the task description is derived from a preset database.
Further, the continuous mask sequence is appended at the end because, owing to its unidirectional (causal) attention mechanism, a decoder-only large language model can summarize the preceding text through mask tokens placed after the context to be compressed. A framework diagram of the decoder-only large language model is shown in FIG. 2.
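The preprocessing step can be illustrated with a short sketch. The snippet below builds the concatenated sequence (x_p, x_c, x_m) for a Hugging Face-style tokenizer; the "[M]" slot token, the newline separators, the example model name and the slot count K are illustrative assumptions rather than values fixed by this disclosure.

```python
# Minimal sketch of the preprocessing step (S1), assuming a Hugging Face-style
# tokenizer. The "[M]" slot token, separators, model name and slot count K are
# illustrative assumptions.
from transformers import AutoTokenizer

def build_compression_input(tokenizer, task_desc, text_to_compress, k=16):
    # Register the compression-slot token so it maps to one dedicated id
    # (the model's embedding table would also need resizing before training).
    tokenizer.add_special_tokens({"additional_special_tokens": ["[M]"]})
    # sequence = (x_p, x_c, x_m): task description, text to compress, K mask tokens.
    prompt = task_desc + "\n" + text_to_compress + "\n" + "[M]" * k
    return tokenizer(prompt, return_tensors="pt")

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")  # example backbone
inputs = build_compression_input(
    tokenizer, "Summarize the main content.", "A long context to be compressed ...", k=16
)
```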
S2: under the condition that GPU resources are short, the text to be compressed is compressed by utilizing the existing large language model, a projection layer is additionally trained, and when the GPU resources are abundant, the pre-trained large language model compresses the text to be compressed.
It should be noted that, when GPU resources are limited, cost can be reduced by directly using an existing large language model: the large language model itself remains frozen throughout training, the only trainable parameters belong to a single learnable projection layer, and the encoding result of the large language model is projected into compressed virtual characters for the large language model to decode, as shown in FIG. 3.
Compressed virtual characters of the compressed text are generated using the large language model itself as a compressor.
The sequence is fed into the parameter-frozen large language model LLM to perform the encoding operation, and the last hidden layer corresponding to x_m in the encoder is expressed as:
H = (h_1, h_2, …, h_K)
wherein h_i represents the hidden-layer state corresponding to the i-th compression slot.
The hidden layer contains the refined, summarized context information, expressed as:
(_, _, H) = LLM-ENC(x_p, x_c, x_m)
where _ indicates that the corresponding output is discarded.
A linear projection layer W_p is established, and H is fed into the projection layer; through the linear transformation, it is projected from the encoding output representation space into the input representation space of the large language model and converted into compressed virtual characters that the large language model can understand:
q_i = W_p h_i
wherein q_i represents the i-th compressed virtual character.
It should be noted that the projection layer may be implemented with various neural network structures such as a linear layer or a multi-layer perceptron. The projection layer maps the last hidden-layer states H of the continuous mask sequence from the encoding output representation space into the input representation space, thereby turning the last hidden states of the K mask tokens into K compressed virtual characters. In this way, the large language model can understand these virtual characters, which carry the information of the preceding text, even without training the large language model LLM itself in advance.
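A minimal sketch of this compression pass is given below, assuming a Hugging Face-style causal language model whose parameters are frozen; the model name and tensor shapes are illustrative assumptions, and only the linear projection layer is trainable.

```python
# Minimal sketch of the frozen-LLM compression pass, assuming a Hugging Face-style
# causal LM. Only the projection layer W_p is trainable; names are illustrative.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
for p in model.parameters():              # the large language model stays frozen
    p.requires_grad = False

d = model.config.hidden_size
proj = nn.Linear(d, d)                    # learnable projection layer W_p

def compress(input_ids, slot_mask):
    """slot_mask: bool tensor of shape [1, seq_len], True at the K [M] positions."""
    with torch.no_grad():
        out = model(input_ids=input_ids, output_hidden_states=True)
    H = out.hidden_states[-1]             # last hidden layer, shape [1, seq_len, d]
    h_slots = H[slot_mask]                # hidden states h_1..h_K of the compression slots
    return proj(h_slots)                  # q_i = W_p h_i: compressed virtual characters
```

Because the projected vectors q_i live in the model's input embedding space, they can later be concatenated with ordinary token embeddings (for example via the inputs_embeds argument) when the same model decodes.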
The large language model itself is used as the decoder to decode according to the compressed virtual characters and the uncompressed context, and a loss function against the gold-standard text is generated.
The gold-standard text is expressed as:
x_g = (w_1, w_2, …, w_N)
wherein w_i represents the i-th word of the gold-standard text.
w_i is input into the large language model LLM to perform the decoding operation, obtaining the last hidden layer:
h'_i = LLM-DEC(w_i)
The final probability output is obtained through the softmax function:
p_t = p(w_i | w_{<i}) = softmax(h'_i)
wherein w_{<i} represents the previously input words.
The loss function L is expressed as:
L(θ) = −(1/N) Σ_{t=1}^{N} log p_t
wherein N represents the length of the gold-standard text, θ represents the model parameters, w_t represents the t-th word of the gold-standard text, and p_t represents the t-th probability output.
The partial derivative of the loss function L with respect to the model parameters θ is taken to obtain the gradient g:
g = ∇_θ L(θ)
With the mini-batch gradient descent method, the gradient g_i of each sample in a mini-batch is computed and the mean of the per-sample gradients is calculated:
ḡ = (1/b) Σ_{i=1}^{b} g_i
The mini-batch average gradient ḡ is multiplied by the learning rate α and used to update the model parameters θ:
θ ← θ − α·ḡ
wherein b represents the batch size and α represents the learning rate.
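The training step described above can be sketched as follows, reusing the assumed model, proj and compress names from the previous sketch. Plain SGD is used so that the update matches θ ← θ − α·ḡ; the handling of the gold-standard text and the omission of the task description and uncompressed context are simplifications for illustration.

```python
# Minimal sketch of one training step: decode the gold-standard text conditioned on
# the compressed virtual characters, take the cross-entropy loss, and apply a
# mini-batch gradient update (theta <- theta - alpha * mean gradient).
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(proj.parameters(), lr=2e-5)      # alpha = learning rate

def train_step(batch):
    optimizer.zero_grad()
    for input_ids, slot_mask, gold_ids in batch:             # b samples per mini-batch
        q = compress(input_ids, slot_mask)                   # [K, d] compressed virtual characters
        gold_emb = model.get_input_embeddings()(gold_ids)    # [1, N, d] gold-text embeddings
        inputs_embeds = torch.cat([q.unsqueeze(0), gold_emb], dim=1)
        logits = model(inputs_embeds=inputs_embeds).logits
        # logits at position i predict token i+1, so positions K-1 .. K+N-2 predict w_1..w_N
        shift_logits = logits[:, q.size(0) - 1 : -1, :]
        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)), gold_ids.reshape(-1)
        ) / len(batch)                                        # average over the mini-batch
        loss.backward()                                       # accumulates the per-sample gradients g_i
    optimizer.step()                                          # theta <- theta - alpha * mean gradient
```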
It should be noted that, when compression is performed with the large language model itself, the batch size and learning rate are set accordingly. During the first 10% of the update steps of training, a linear learning-rate warm-up strategy is adopted, raising the learning rate linearly from 0 to 2e-5, so that a good optimization direction is found during the model's initial trial-and-error. Afterwards, as the direction of the parameter updates stabilizes, the learning rate decays slowly to 3e-6 to prevent catastrophic forgetting. The whole training process can be completed within 4 hours on 8 NVIDIA RTX A5000 (24 GB) GPUs, so the training requirements and cost are very low.
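The learning-rate schedule just described can be expressed as a small helper; the linear shape of the decay phase is an assumption, since the text only states that the rate decays slowly from 2e-5 to 3e-6 after the 10% warm-up.

```python
# Sketch of the schedule above: linear warm-up from 0 to 2e-5 over the first 10%
# of update steps, then a slow decay down to 3e-6 (linear decay is an assumption).
def learning_rate(step, total_steps, peak=2e-5, floor=3e-6, warmup_frac=0.1):
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    if step < warmup_steps:
        return peak * step / warmup_steps                     # linear warm-up phase
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak - (peak - floor) * progress                   # decay from peak to floor
```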
When the large language model itself is pre-trained, no additional projection layer needs to be trained, as shown in FIG. 4.
It should be noted that, since the model is not frozen, the number of trainable parameters increases; and because encoding and decoding come from the same model, no additional projection layer is required. The steps are as follows:
sequences are sequencedThe pre-training is performed in a large language model LLM with parameters that are not frozen.
The encoding operation is performed such that,the last layer of hidden layer in the encoder is expressed as:
wherein,representation and->Hidden layer state corresponding to each compression groove.
The hidden layer contains refined summarized context information, expressed as:
and executing decoding operation, decoding according to the compressed virtual characters and the uncompressed context to obtain text reply, and generating a cross entropy loss function with the gold mark text.
The gold mark text is expressed as:
wherein,the +.f. representing the gold mark text>And (5) personal words.
Will beInputting the hidden layer into a large language model LLM to perform decoding operation to obtain the last hidden layer of the hidden layer:
the final probability output is obtained by softmax function operation:
wherein,before +.>And (5) personal words.
Loss functionExpressed as:
wherein,representing model parameters->The +.f. representing the gold mark text>Personal word (s)/(s)>Represent the firsttAnd a probability output.
For loss functionModel parameters->Is to obtain the gradient ∈>
Calculating gradient of each data in a small batch by adopting a small batch gradient descent methodCalculate the mean value of the individual data gradients +.>
Small batch average gradientAnd learning rate->Multiplication, update to model parameters->And (3) the following steps:
wherein,indicating batch size, +.>Representing the learning rate.
It should be noted that pre-training the large language model requires a higher cost, but yields better performance and efficiency than only training the projection layer on top of an existing large language model. The whole pre-training process takes about 180 hours on 8 NVIDIA Tesla A100 (80 GB) GPUs; the training cost is higher, but model performance improves.
S3: and reasoning the trained large language model to generate a text reply, as shown in fig. 5.
Reasoning includes generating compressed virtual characters of the compressed text using the large language model itself as a compressor, and generating replies based on the compressed virtual characters and the uncompressed context using the large language model itself as a decoder.
It should be noted that during the reasoning process, all model parameters are frozen, do not participate in training, and the model does not calculate the loss function.
When inference is performed with the model obtained by compressing with the existing large language model, the probability p_t is computed at each step and the word with the maximal probability is output, until the model outputs EOS.
It should be noted that EOS is the abbreviation of End Of Sequence, a special token or symbol that marks the end of a sequence (e.g., a text).
When inference is performed with the pre-trained large language model, the probability p'_t is computed at each step and the word with the maximal probability is output, until the model outputs EOS.
Generating the text reply is represented as:
output = LLM(x_p, q, x_k)
wherein q represents all the compressed virtual characters and x_k represents the uncompressed text.
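A minimal sketch of this inference loop is given below, reusing the assumed model and tokenizer from the earlier sketches; greedy selection and the omission of key-value caching are simplifications for illustration.

```python
# Minimal sketch of output = LLM(x_p, q, x_k): greedily emit the most probable
# word at every step and stop when the model outputs its EOS token.
import torch

@torch.no_grad()
def generate_reply(prompt_embeds, max_new_tokens=256):
    """prompt_embeds: [1, L, d] embeddings of the task description x_p, the
    compressed virtual characters q, and the uncompressed text x_k."""
    generated = []
    embeds = prompt_embeds
    for _ in range(max_new_tokens):
        logits = model(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)                     # word with maximal probability p_t
        if next_id.item() == tokenizer.eos_token_id:        # stop once the model outputs EOS
            break
        generated.append(next_id.item())
        next_emb = model.get_input_embeddings()(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_emb], dim=1)       # append the chosen word and continue
    return tokenizer.decode(generated)
```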
Furthermore, because their dimensionality is fixed and they naturally summarize the compressed parts, the compressed virtual characters measure the similarity between compressed contexts well, so they can be applied to retrieval. The main procedure is as follows:
First, the candidate contexts to be retrieved and the current inference text are compressed into virtual characters by the large language model; then the inner-product similarity between the virtual characters of the inference text and those of each candidate context is computed. The k candidate contexts with the highest inner-product similarity are selected using a topk function such as the one in the PyTorch framework (which returns the largest k values and their indices). The retrieval process is shown in FIG. 6.
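The retrieval step can be sketched as follows; flattening the K virtual characters of each text into a single vector before taking the inner product is an illustrative choice, since only inner-product similarity followed by a topk selection is specified above.

```python
# Minimal sketch of the retrieval step: score each candidate context against the
# inference text by the inner product of their compressed virtual characters and
# keep the top-k (torch.topk returns the largest k values and their indices).
import torch

def retrieve(query_chars, candidate_chars, k=3):
    """query_chars: [K, d] virtual characters of the current inference text.
    candidate_chars: [C, K, d] virtual characters of the C candidate contexts."""
    q = query_chars.reshape(-1)                               # [K*d]
    c = candidate_chars.reshape(candidate_chars.size(0), -1)  # [C, K*d]
    scores = c @ q                                            # inner-product similarity per candidate
    values, indices = torch.topk(scores, k)                   # k most similar candidates
    return values, indices
```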
The reasoning also comprises a dynamic interaction process, which comprises the following specific steps:
before inputting the text to be compressed, the user selects whether the text marking keyword is self-behaving as task description:
if the user selects self-labeling, the user considers important words and sentences in the labeling text, the system generates corresponding task description based on the user labeling, splices the task description with the text to be compressed, and gives a reply through model reasoning.
If the user selects the keyword which is not marked, matching task description from a preset task description database, splicing the task description with the text to be compressed, and giving a reply through model reasoning.
It should be noted that a template-based generation method is adopted to produce the corresponding task description from the user's marks; when the user inputs keywords, the generated task description is: "Pay attention to and keep as much as possible the main information related to [keyword]."
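A minimal sketch of this template-based generation is shown below; the preset task-description database is represented by a plain dictionary purely for illustration, and only the keyword template sentence is taken from the description above.

```python
# Sketch of the template-based task-description generation: keywords marked by the
# user are inserted into a fixed template; otherwise a preset description is used.
PRESET_TASK_DESCRIPTIONS = {
    "summarization": "Summarize the main content of the text.",   # assumed entry
}

def make_task_description(keywords=None, task="summarization"):
    if keywords:  # the user marked important words and sentences themselves
        return ("Pay attention to and keep as much as possible the main "
                "information related to [" + ", ".join(keywords) + "].")
    # otherwise match a description from the preset task-description database
    return PRESET_TASK_DESCRIPTIONS.get(task, PRESET_TASK_DESCRIPTIONS["summarization"])
```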
Furthermore, the invention also provides a system for compressing the context based on the large language model itself, which specifically comprises:
a preprocessing module, configured to acquire the text to be compressed and add a task description, separators and compression slots;
a model training module, configured to compress the text to be compressed with an existing large language model and additionally train a projection layer when GPU resources are scarce, and to pre-train the large language model itself to compress the text when GPU resources are abundant;
an inference-and-reply-generation module, configured to run inference with the trained large language model to generate text replies.
The computer device may be a server. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data cluster data of the power monitoring system. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded nonvolatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
Example 2
Referring to fig. 7 and 8, in order to verify the beneficial effects of the present invention, a scientific demonstration is performed through a simulation experiment, which is an embodiment of the present invention.
Comparing common compression models with the present invention, as shown in FIG. 7, it can be seen that the present invention performs better than the common compression models.
To evaluate compressing the entire context with and without a prompt, long instructions were randomly sampled from QFS and QA datasets; the results are shown in Table 1:
TABLE 1 comparison results on DUC, CICERO dataset
As the table shows, providing the task description during compression achieves significant performance improvements on the Query-Focused Summarization task and the question-answering task.
Compared with existing methods that can only compress the text in full, the method provided by the invention supports both partial compression and full compression, making the most of the large language model's capability. As shown in Table 2, Entire denotes full compression and Partial denotes partial compression; partial compression yields higher-quality replies because more textual information is provided.
TABLE 2 Performance of various large language models
On the XSUM and CICERO datasets, Vicuna generates a response based on the actual context, LLM-CC-Entire compresses the entire context into virtual tokens, while LLM-CC-Partial leaves the front part of the context uncompressed and compresses the back part.
As can be seen from Table 2, the present invention uses the large language model itself to produce the compressed virtual characters, without requiring an additional compressor that would constrain the input format; therefore the invention places no restriction on the form of the final input. Since the large language model natively takes real characters as input, its input when generating a reply can mix compressed virtual characters with real characters, providing more information to the model.
When processing ultra-long text, the present invention can provide more information to the model by compressing the truncated portion, thereby improving reply quality. As shown in Table 3, DUC is a QFS dataset with long inputs; common models simply truncate over-long text, which causes information loss. Rec-SUM is a solution for long inputs proposed by OpenAI, and AutoCompressor is currently the best open-source compression model.
TABLE 3 results on DUC dataset
As can be seen from Table 3, the LLM-CC trained by our scheme achieves a large performance improvement over all baselines. In the experiments, the portion that would normally be truncated is given to the model and compressed into virtual characters, which are then spliced with the parts that are not to be compressed to form the model input; the experimental results show that the model trained by our scheme (LLM-CC) produces the highest-quality replies.
The virtual characters compressed by the present invention can better measure semantic similarity for in-context learning, so better in-context prompt examples are selected for the model. As shown in Table 4, LLM-CC denotes examples retrieved with S-BERT, and Rerank denotes examples reranked with our compressed virtual characters. Vicuna 0/1-shot denotes the naive in-context demonstration defined in ICL. LLM-CC 1/2-shot uses compression slots as condensed in-context demonstrations, with the demonstrations retrieved by S-BERT. Rerank-LLM-CC 1/2-shot denotes in-context demonstrations reranked by LLM-CC.
Table 4 results on XSUM dataset
Table 4 shows that the compressed virtual characters summarize the preceding text better, so semantic similarity can be measured more accurately and the retrieval task can be completed.
In addition, with the help of the preprocessing module, the context to be compressed is first spliced onto the prompt and fed into the large language model for encoding to obtain k compressed virtual characters, where the value of k and the positions of the [M] slots are not restricted (for example, compressing the first half places the slots toward the front, and compressing the second half places them at the end); the context that is not to be compressed is then spliced onto the virtual characters and fed into the large language model for decoding. A compression example is shown in FIG. 8.
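As a rough illustration of this flexibility in the value of k and the positions of the [M] slots, the sketch below lays out a front-half-compressed and a back-half-compressed input using the tokenizer helper assumed earlier; the concrete strings and slot counts are examples only.

```python
# Illustrative partial-compression layouts (reusing the assumed tokenizer helper):
# the [M] slots follow whichever part of the context is to be compressed, while the
# remaining part is kept as plain, uncompressed text.
task = "Summarize the main content of the text."
first_half = "first half of a long context ..."
second_half = "second half of a long context ..."

# compress the first half: slots sit right after it, second half stays uncompressed
front_compressed = task + "\n" + first_half + "[M]" * 8 + "\n" + second_half

# compress the second half: first half stays uncompressed, slots come at the end
back_compressed = task + "\n" + first_half + "\n" + second_half + "[M]" * 8

inputs = tokenizer(front_compressed, return_tensors="pt")  # encode as before
```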
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (5)

1. A method for compressing a context based on a large language model itself, comprising:
acquiring a text to be compressed, and adding a task description, separators and compression slots;
under the condition that GPU resources are short, the text to be compressed is compressed by utilizing the existing large language model, a projection layer is additionally trained, and when the GPU resources are abundant, the text to be compressed is compressed by the pre-training large language model;
reasoning the trained large language model to generate text replies;
the adding of the task description, the separator and the compression slot comprises the step of splicing the task description, the text to be compressed and the continuous mask sequence into a new sequence:
sequence = (x_p, x_c, x_m)
x_m = [M][M]…[M]
wherein x_p represents the task description, x_c represents the text to be compressed, [M] represents a compression slot, and x_m represents the continuous mask sequence;
the compressing of the text to be compressed by using the existing large language model and the additional training of a projection layer comprise the step of generating compressed virtual characters of the compressed text by using the large language model itself as the compressor;
the sequence is sent into the parameter-frozen large language model LLM to perform the encoding operation, and the last hidden layer corresponding to x_m in the encoder is expressed as:
H = (h_1, h_2, …, h_K)
wherein h_i represents the hidden-layer state corresponding to the i-th compression slot;
the hidden layer contains the refined, summarized context information, expressed as:
(_, _, H) = LLM-ENC(x_p, x_c, x_m)
wherein _ indicates that the corresponding output is discarded;
a linear projection layer W_p is established, H is fed into the projection layer, and through the linear transformation it is projected from the encoding output representation space into the input representation space of the large language model and converted into compressed virtual characters that the large language model can understand:
q_i = W_p h_i
wherein q_i represents the i-th compressed virtual character;
the compressing of the text to be compressed by using the existing large language model further comprises the step of decoding, by using the large language model as the decoder, according to the compressed virtual characters and the uncompressed context, to generate a loss function against the gold-standard text;
the gold-standard text is expressed as:
x_g = (w_1, w_2, …, w_N)
wherein w_i represents the i-th word of the gold-standard text;
w_i is input into the large language model LLM to perform the decoding operation, obtaining the last hidden layer:
h'_i = LLM-DEC(w_i)
the final probability output is obtained through the softmax function operation:
p_t = p(w_i | w_{<i}) = softmax(h'_i)
wherein w_{<i} represents the previously input words;
the loss function L is expressed as:
L(θ) = −(1/N) Σ_{t=1}^{N} log p_t
wherein N represents the length of the gold-standard text, θ represents the model parameters, w_t represents the t-th word of the gold-standard text, and p_t represents the t-th probability output;
the partial derivative of the loss function L with respect to the model parameters θ is taken to obtain the gradient g:
g = ∇_θ L(θ)
with the mini-batch gradient descent method, the gradient g_i of each sample in a mini-batch is calculated and the mean of the per-sample gradients is computed:
ḡ = (1/b) Σ_{i=1}^{b} g_i
the mini-batch average gradient ḡ is multiplied by the learning rate α to update the model parameters θ:
θ ← θ − α·ḡ
wherein b represents the batch size and α represents the learning rate;
the pre-training of the large language model itself to compress the text to be compressed includes pre-training the whole large language model, and the steps are as follows:
the sequence is sent into a large language model LLM with unfrozen parameters for pre-training;
performing the encoding operation, the last hidden layer corresponding to x_m in the encoder is expressed as:
H' = (h'_1, h'_2, …, h'_K)
wherein h'_i represents the hidden-layer state corresponding to the i-th compression slot;
the hidden layer contains the refined, summarized context information, expressed as:
(_, _, H') = LLM-ENC(x'_p, x'_c, x'_m)
performing the decoding operation, namely decoding according to the compressed virtual characters and the uncompressed context to obtain the text reply, and generating a cross-entropy loss function against the gold-standard text;
the gold-standard text is expressed as:
x'_g = (w'_1, w'_2, …, w'_N)
wherein w'_i represents the i-th word of the gold-standard text;
w'_i is input into the large language model LLM to perform the decoding operation, obtaining the last hidden layer:
h''_i = LLM-DEC(w'_i)
the final probability output is obtained through the softmax function operation:
p'_t = p'(w'_i | w'_{<i}) = softmax(h''_i)
wherein w'_{<i} represents the previously input words;
the loss function L' is expressed as:
L'(θ') = −(1/N) Σ_{t=1}^{N} log p'_t
wherein θ' represents the model parameters, w'_t represents the t-th word of the gold-standard text, and p'_t represents the t-th probability output;
the partial derivative of the loss function L' with respect to the model parameters θ' is taken to obtain the gradient g':
g' = ∇_{θ'} L'(θ')
with the mini-batch gradient descent method, the gradient g'_i of each sample in a mini-batch is calculated and the mean of the per-sample gradients is computed:
ḡ' = (1/b) Σ_{i=1}^{b} g'_i
the mini-batch average gradient ḡ' is multiplied by the learning rate α to update the model parameters θ':
θ' ← θ' − α·ḡ'
wherein b represents the batch size and α represents the learning rate;
the reply is given through model reasoning, wherein the large language model is used as a compressor to generate compressed virtual characters of the compressed text, and the large language model is used as a decoder to generate the reply according to the compressed virtual characters and the uncompressed context;
when generating a reply with the trained compression model, the compressed virtual characters and the generation results of the previous t steps are input, the probability p_t (or p'_t) of step t is calculated, and the word with the maximal probability is output, until the model outputs EOS;
generating a text reply is represented as:
output = LLM(x_p, q, x_k)
wherein q represents all the compressed virtual characters and x_k represents the uncompressed text.
2. The method for compressing a context based on a large language model itself according to claim 1, wherein: the reasoning includes a dynamic interaction process:
before inputting the text to be compressed, the user selects whether to mark keywords in the text themselves to serve as the task description:
if the user selects self-labeling, labeling words and sentences which are considered important by the user in the text, generating corresponding task description based on the user labeling, splicing the task description with the text to be compressed, and giving a reply through model reasoning;
if the user selects the keyword which is not marked, matching task description from a preset task description database, splicing the task description with the text to be compressed, and giving a reply through model reasoning.
3. A system for compressing a context based on a large language model itself using the method of any one of claims 1-2, characterized by:
the preprocessing module is used for acquiring a text to be compressed and adding a task description, separators and compression slots;
the model training module is used for compressing the text to be compressed by utilizing the existing large language model under the condition that GPU resources are short, additionally training a projection layer, and pre-training the large language model to compress the text to be compressed when the GPU resources are abundant;
and the reasoning generation reply module is used for reasoning the trained large language model to generate text replies.
4. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 2 when the computer program is executed.
5. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 2.
CN202311546547.7A 2023-11-20 2023-11-20 Method and system for compressing context based on large language model Active CN117271780B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311546547.7A CN117271780B (en) 2023-11-20 2023-11-20 Method and system for compressing context based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311546547.7A CN117271780B (en) 2023-11-20 2023-11-20 Method and system for compressing context based on large language model

Publications (2)

Publication Number Publication Date
CN117271780A CN117271780A (en) 2023-12-22
CN117271780B true CN117271780B (en) 2024-02-23

Family

ID=89216305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311546547.7A Active CN117271780B (en) 2023-11-20 2023-11-20 Method and system for compressing context based on large language model

Country Status (1)

Country Link
CN (1) CN117271780B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117634459B (en) * 2024-01-24 2024-08-13 山东海量信息技术研究院 Target content generation and model training method, device, system, equipment and medium
CN118261254B (en) * 2024-05-31 2024-08-20 北京深势科技有限公司 Processing method and device for compressing long text

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858736A (en) * 2022-12-29 2023-03-28 华南理工大学 Emotion text generation method based on emotion prompt fine adjustment
CN116702907A (en) * 2023-08-02 2023-09-05 北京大学 Server-unaware large language model reasoning system, method and equipment
CN116910535A (en) * 2023-06-07 2023-10-20 清华大学 Programming-based large language model fine tuning-free pre-training method and device
CN117079671A (en) * 2023-09-26 2023-11-17 Oppo广东移动通信有限公司 Audio processing method, device, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858736A (en) * 2022-12-29 2023-03-28 华南理工大学 Emotion text generation method based on emotion prompt fine adjustment
CN116910535A (en) * 2023-06-07 2023-10-20 清华大学 Programming-based large language model fine tuning-free pre-training method and device
CN116702907A (en) * 2023-08-02 2023-09-05 北京大学 Server-unaware large language model reasoning system, method and equipment
CN117079671A (en) * 2023-09-26 2023-11-17 Oppo广东移动通信有限公司 Audio processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN117271780A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN111694924B (en) Event extraction method and system
CN117271780B (en) Method and system for compressing context based on large language model
CN112417092B (en) Intelligent text automatic generation system based on deep learning and implementation method thereof
CN111626041B (en) Music comment generation method based on deep learning
CN117236410B (en) Trusted electronic file large language model training and reasoning method and device
CN113821635A (en) Text abstract generation method and system for financial field
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN113065349A (en) Named entity recognition method based on conditional random field
CN117271759A (en) Text abstract generation model training method, text abstract generation method and device
CN117558270B (en) Voice recognition method and device and keyword detection model training method and device
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
US11941360B2 (en) Acronym definition network
CN116562275B (en) Automatic text summarization method combined with entity attribute diagram
CN113051904A (en) Link prediction method for small-scale knowledge graph
CN116432755A (en) Weight network reasoning method based on dynamic entity prototype
CN116362242A (en) Small sample slot value extraction method, device, equipment and storage medium
CN113051897B (en) GPT2 text automatic generation method based on Performer structure
CN112800032B (en) FAQ knowledge base automatic construction method and device based on table data
CN111259650A (en) Text automatic generation method based on class mark sequence generation type countermeasure model
CN113139385B (en) Electronic medical record named entity recognition method based on character and word pronunciation fusion feature model
CN113378543B (en) Data analysis method, method for training data analysis model and electronic equipment
CN116842948A (en) BERT-based named entity recognition method in underground engineering field
CN115455141A (en) Method for realizing intention recognition and semantic slot filling in combined manner
CN117195901A (en) Identification method of long text named entity fusing recurrent cells

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant