CN118036668B - GPT model-oriented comprehensive evaluation method

GPT model-oriented comprehensive evaluation method

Info

Publication number: CN118036668B
Authority: CN (China)
Prior art keywords: tested, model, gpt, reasoning, gpt model
Legal status: Active (granted)
Application number: CN202410443128.9A
Other languages: Chinese (zh)
Other versions: CN118036668A (en)
Inventors: Gao Feng (高丰), Zhang Ruyun (张汝云), Bai Wenyuan (白文媛), Mao Liangxian (毛良献)
Assignee: Zhejiang Lab
Application filed by Zhejiang Lab; priority to CN202410443128.9A


Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The specification discloses a comprehensive evaluation method for GPT models. A GPT model to be tested is obtained and its training reasoning environment is determined; the theoretical performance of the GPT model to be tested is determined according to the training reasoning environment; each test task is executed with the GPT model to be tested; the execution performance of the GPT model to be tested when executing each test task is determined according to its execution process and the theoretical performance, and its reasoning capability is determined according to the reasoning results of each test task; and the test result of the GPT model to be tested is determined according to the execution performance and the reasoning capability. In this way, a user can gain a clear and intuitive understanding of the performance and capabilities of different GPT models without performing model training, which makes it easier for the user to select a GPT model that meets their own needs.

Description

GPT model-oriented comprehensive evaluation method
Technical Field
The specification relates to the technical field of computers, in particular to a comprehensive evaluation method for a GPT model.
Background
Unlike natural language processing algorithms that can only accomplish a single task, the GPT (Generative Pre-trained Transformer) model can perform a variety of complex tasks, such as machine translation, text summarization, sentiment analysis, and dialog generation, with a single model. Accordingly, numerous GPT model products have been derived for different sub-domains.
Faced with so many GPT model products, users often find it difficult to determine which GPT model product to choose to meet their own research or business needs, because GPT models describe their own hardware requirements and processing capabilities only vaguely.
Therefore, the invention provides a comprehensive evaluation method for the GPT model.
Disclosure of Invention
The specification provides a comprehensive evaluation method for a GPT model, so as to partially solve the problems existing in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a comprehensive evaluation method for a GPT model, which comprises the following steps:
Acquiring a GPT model to be tested;
Determining a training reasoning environment of the GPT model to be tested;
determining the theoretical performance of the GPT model to be tested according to the training reasoning environment;
executing each test task by using the GPT model to be tested;
Determining the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested and the theoretical performance, and determining the reasoning capacity of the GPT model to be tested according to the reasoning result of the GPT model to be tested when executing each test task;
and determining a test result of the GPT model to be tested according to the execution performance and the reasoning capability.
Optionally, the theoretical performance specifically includes: the ideal delay with which the GPT model to be tested executes a test task when the length of the input sample is 1;
determining the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested for executing each test task and the theoretical performance, wherein the method specifically comprises the following steps:
determining the reasoning delay of the GPT model to be tested for executing the test task, wherein the length of an input sample of the test task is greater than 1;
And determining the execution performance of the GPT model to be tested when executing each test task according to the ideal delay and the reasoning delay.
Optionally, the theoretical performance specifically includes: theoretical throughput rate of the GPT model to be tested;
determining the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested for executing each test task and the theoretical performance, wherein the method specifically comprises the following steps:
Determining the actual throughput rate of the GPT model to be tested when executing each test task;
and determining the execution performance of the GPT model to be tested when executing each test task according to the actual throughput rate and the theoretical throughput rate.
Optionally, determining the reasoning capability of the GPT model to be tested according to the reasoning result of each test task executed by the GPT model to be tested specifically includes:
Determining the matching degree of the reasoning result of the GPT model to be tested and a preset grammar rule according to the reasoning result of each test task executed by the GPT model to be tested, and determining the reasoning capacity of the GPT model to be tested according to the matching degree; and/or
Determining the coincidence degree of the reasoning result of the GPT model to be tested and preset ethic safety rules according to the reasoning result of each test task executed by the GPT model to be tested, and determining the reasoning capacity of the GPT model to be tested according to the coincidence degree; and/or
And determining the reasoning accuracy of the GPT model to be tested on the expertise in the vertical field according to the reasoning results of the GPT model to be tested on each test task, and determining the reasoning capacity of the GPT model to be tested according to the reasoning accuracy.
Optionally, determining the reasoning capability of the GPT model to be tested according to the reasoning result of each test task executed by the GPT model to be tested specifically includes:
According to the reasoning results of the GPT model to be tested for executing each test task and the sample results obtained by the reference model for each test task, determining the difference between the reasoning results and the sample results;
and determining the reasoning capacity of the GPT model to be tested according to the difference.
Optionally, determining the difference between the reasoning result and the sample result specifically includes:
Determining a difference index, wherein the difference index is the number of test tasks with correct reasoning results and incorrect sample results, and/or the number of test tasks with incorrect reasoning results and correct sample results, and/or the number of test tasks with correct reasoning results and correct sample results, and/or the number of test tasks with incorrect reasoning results and incorrect sample results;
determining the reasoning capacity of the GPT model to be tested according to the difference, which specifically comprises the following steps:
And determining the reasoning capacity of the GPT model to be tested according to the difference index.
Optionally, acquiring the GPT model to be tested specifically includes:
Acquiring a plurality of GPT models to be tested;
Determining a test result of the GPT model to be tested, which specifically comprises the following steps:
determining a test result aiming at each GPT model to be tested;
after determining the test results for each GPT model under test, the method further comprises:
Taking the execution performance and the reasoning capacity as two dimensions of the comprehensive capacity, and determining a two-dimensional comprehensive capacity vector corresponding to each GPT model to be tested according to the execution performance and the reasoning capacity in the test result of each GPT model to be tested;
and displaying the two-dimensional comprehensive capacity vectors to a user.
The specification provides a comprehensive evaluation device facing GPT model, which comprises:
The model acquisition module is used for acquiring a GPT model to be detected;
The environment determining module is used for determining a training reasoning environment of the GPT model to be tested;
The performance acquisition module is used for determining the theoretical performance of the GPT model to be tested according to the training reasoning environment;
The execution module is used for executing each test task by using the GPT model to be tested;
the actual measurement module is used for determining the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested and the theoretical performance, and determining the reasoning capacity of the GPT model to be tested according to the reasoning result of the GPT model to be tested when executing each test task;
And the output module is used for determining a test result of the GPT model to be tested according to the execution performance and the reasoning capacity.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above-described GPT model-oriented comprehensive evaluation method.
The specification provides a device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the comprehensive evaluation method facing the GPT model when executing the program.
The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:
In the comprehensive evaluation method for the GPT model provided by the specification, a GPT model to be tested is obtained, a training reasoning environment of the GPT model to be tested is determined, theoretical performance of the GPT model to be tested is determined according to the training reasoning environment, each test task is executed by the GPT model to be tested, execution performance of the GPT model to be tested when each test task is executed is determined according to the execution process of the GPT model to be tested and the theoretical performance, reasoning capacity of the GPT model to be tested is determined according to the reasoning result of the GPT model to be tested when each test task is executed, and test results of the GPT model to be tested are determined according to the execution performance and the reasoning capacity.
According to the method, objective testing methods and standards are provided for the execution performance and reasoning capacity of the GPT model, so that a user can clearly and intuitively know the performance and capacity of different GPT models under the condition that model training is not performed, and the user can conveniently select the GPT model to meet the needs of self research or business.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and constitute a part of it, illustrate exemplary embodiments of the present specification and, together with the description, serve to explain the specification without unduly limiting it. In the drawings:
FIG. 1 is a schematic flow chart of a comprehensive evaluation method for a GPT model in the specification;
FIG. 2 is a graph of test result acquisition provided herein;
FIG. 3 is a schematic diagram of a confusion matrix based on a difference index provided in the present specification;
fig. 4 is a schematic diagram of a comprehensive evaluation device facing to a GPT model provided in the present specification;
Fig. 5 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present application based on the embodiments herein.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flow chart of a comprehensive evaluation method for a GPT model in the present specification, where the comprehensive evaluation method for a GPT model specifically includes the following steps:
s100: and obtaining the GPT model to be tested.
The GPT model is a generic name for a series of neural network models with a huge parameter scale that use a multi-layer Transformer architecture. It is usually trained to solve general language processing tasks such as text classification, question answering, document summarization, and text generation, and can provide support for generative artificial intelligence applications. Different GPT models often differ greatly in their hardware resource requirements because of differences in their model structures; furthermore, because of differences in the sample data sets used during training, different GPT models also have different processing capabilities for different types of language processing tasks after training. When users with language processing requirements make a choice, it is often difficult to determine a suitable GPT model, because the GPT models lack objective and uniform test data on their reasoning capability and execution performance. Therefore, the present application provides a comprehensive evaluation method for the GPT model. The execution subject of this specification can be a comprehensive evaluation server for the GPT model, or other electronic equipment capable of connecting to the GPT model to be tested; this specification does not limit it. For ease of explanation, the comprehensive evaluation method for the GPT model provided in this specification is explained below with only a server as the execution subject.
First, a GPT model to be tested is obtained. The GPT model to be tested may be a GPT model specified by a user for which a test result is needed; it is deployed on the hardware resources and training environment that support it, can accept input samples provided by the server, and determines output results according to the input samples.
The hardware resources supporting the GPT model may be virtual resources in the cloud or physical hardware resources, which is not limited here; the training environment of the GPT model can be adjusted according to the actual needs of the training personnel, which this specification does not limit; and the obtained GPT model to be tested can be built on various common frameworks such as PyTorch, DeepSpeed, Megatron, etc., which this specification does not limit either.
S102: and determining a training reasoning environment of the GPT model to be tested.
The training reasoning environment of the GPT model to be tested is determined specifically by: determining the GPU and CPU computing clusters, storage, and RDMA (remote direct memory access) network resources of the GPT model to be tested; determining the training environment, framework, libraries, and platform on which the GPT model to be tested is installed; and determining the architecture and parameters of the GPT model to be tested.
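As an illustration only, the following minimal sketch shows how a test server might collect this environment information; it assumes the model under test is a PyTorch module and that PyTorch is installed, and the field names are choices made here rather than terms from this specification.

```python
# Minimal sketch: collecting the training/reasoning environment of a model under test.
# Assumes PyTorch is installed; all field names are illustrative, not from the patent.
import platform

import torch


def collect_environment(model: torch.nn.Module) -> dict:
    """Gather hardware, framework, and model information for the model under test."""
    return {
        "framework": f"PyTorch {torch.__version__}",
        "python": platform.python_version(),
        "cuda_available": torch.cuda.is_available(),
        "gpus": [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ],
        # Model architecture and parameter scale.
        "architecture": model.__class__.__name__,
        "num_parameters": sum(p.numel() for p in model.parameters()),
    }
```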
S104: and determining the theoretical performance of the GPT model to be tested according to the training reasoning environment.
From the training reasoning environment determined in step S102, the hardware conditions, model architecture, and parameters of the GPT model to be tested can be obtained, and the theoretical performance of the GPT model to be tested can then be determined from them.
Specifically, the theoretical performance may refer to the highest performance that the GPT model to be tested can theoretically achieve when performing reasoning tasks in the obtained training reasoning environment. The execution performance may be an index, such as reasoning delay or throughput rate, that represents the efficiency with which the GPT model to be tested executes reasoning tasks or its utilization of hardware resources; this is not limited here.
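For intuition only, one widely used back-of-the-envelope estimate, not a formula from this specification, bounds both quantities from the hardware's peak compute and the model's parameter count: decoding one token with an N-parameter model costs roughly 2N floating-point operations.

```python
# Rough sketch of "theoretical performance" limits; an approximation assumed here,
# not the patent's exact computation.
def theoretical_limits(num_parameters: float, peak_flops_per_s: float) -> dict:
    flops_per_token = 2.0 * num_parameters            # approximate cost of one decoding step
    max_tokens_per_s = peak_flops_per_s / flops_per_token
    return {
        "theoretical_throughput_tokens_per_s": max_tokens_per_s,
        "ideal_delay_s_per_token": 1.0 / max_tokens_per_s,
    }


# Example: a 7B-parameter model on a GPU with ~300 TFLOP/s of usable compute.
print(theoretical_limits(7e9, 300e12))
```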
S106: and executing each test task by using the GPT model to be tested.
After the GPT model to be tested is obtained, each test task pre-arranged on the server can be obtained, and each test task is executed with the GPT model. Specifically, executing a test task may consist of inputting the input samples corresponding to each test task into the GPT model to be tested, obtaining the output results of the GPT model to be tested, and recording the output results and the execution process.
For each test task, information such as its input samples, standard output results, and test purpose can be preset according to the test requirements on execution performance and reasoning capability.
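A minimal sketch of this step is given below; the `infer` callable stands in for whatever interface the GPT model to be tested exposes and is an assumption, as are the task field names.

```python
# Sketch of step S106: run each preset test task through the model under test and record
# both the output result and the execution process.
import time
from typing import Callable


def run_test_tasks(infer: Callable[[str], str], tasks: list[dict]) -> list[dict]:
    records = []
    for task in tasks:
        start = time.monotonic()
        output = infer(task["input_sample"])       # reasoning result for this task
        elapsed = time.monotonic() - start
        records.append({
            "task_id": task["id"],
            "purpose": task.get("purpose"),        # e.g. performance test or reasoning test
            "output": output,
            "standard_output": task.get("standard_output"),
            "latency_s": elapsed,                  # part of the recorded execution process
        })
    return records
```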
S108: determining the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested and the theoretical performance, and determining the reasoning capacity of the GPT model to be tested according to the reasoning result of the GPT model to be tested when executing each test task.
As described above, the purpose of executing each test task with the GPT model to be tested is to obtain the execution performance and reasoning capability of the GPT model to be tested. Specifically, the execution performance is determined by the parameters and architecture of the GPT model to be tested and by its use of hardware resources: the higher the execution performance, the more efficiently the GPT model to be tested executes test tasks and the higher its utilization of hardware resources. The reasoning capability characterizes the ability of the GPT model to be tested to understand the input samples and answer according to that understanding; the accuracy with which the GPT model to be tested answers the questions represented by the input samples, or the degree to which its output results match the preset answer rules corresponding to the input samples, can be determined according to the preset standard output results, and the reasoning capability is characterized by this accuracy or matching degree.
Therefore, from the recorded output results and execution processes, the execution performance of the GPT model to be tested can be obtained from the execution process of each test task, and its reasoning capability can be obtained from the reasoning result of each test task.
For a given test task, depending on its preset test purpose, it can be used to determine both the reasoning capability and the execution performance, or only one of the two; this specification does not limit it. The execution performance may be represented by various common hardware resource usage indices, such as GPU utilization, CPU utilization, or the average load per unit time, or by any combination of such indices. A specific way of combining different hardware resource usage indices may be to preset a weight for each index and then combine them with a common averaging method, such as the arithmetic mean, geometric mean, or harmonic mean; this specification does not limit this.
It should be noted that, when characterizing the execution performance, each hardware resource usage index may first be standardized or otherwise numerically transformed, and the transformed value used for the characterization. In this way, when several hardware resource usage indices are used together, they share a unified measurement unit and can be combined and compared more easily; for example, each index can be converted into the ratio of its actual value to its maximum value, as in the sketch below. When performing such numerical transformations, it must be ensured that the relationship between the magnitude of the transformed value and the execution performance is the same for the different hardware resource usage indices.
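The following sketch illustrates one such combination, normalizing each index to an actual-to-maximum ratio and taking a weighted arithmetic mean; the index names and weights are illustrative assumptions.

```python
# Sketch of combining several hardware-usage indices into one execution-performance value.
def combine_usage_indices(indices: dict[str, tuple[float, float]],
                          weights: dict[str, float]) -> float:
    """indices maps name -> (actual_value, max_value); weights maps name -> weight."""
    total_weight = sum(weights[name] for name in indices)
    score = 0.0
    for name, (actual, maximum) in indices.items():
        normalized = actual / maximum              # unified measurement unit in [0, 1]
        score += weights[name] * normalized
    return score / total_weight                    # weighted arithmetic mean


# Example with GPU utilization, CPU utilization, and average load per unit time.
print(combine_usage_indices(
    {"gpu_util": (0.82, 1.0), "cpu_util": (0.40, 1.0), "load_avg": (6.0, 16.0)},
    {"gpu_util": 0.5, "cpu_util": 0.3, "load_avg": 0.2},
))
```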
S110: and determining a test result of the GPT model to be tested according to the execution performance and the reasoning capability.
After determining the execution performance and reasoning capability corresponding to the GPT model to be tested, both can be represented by numerical values, with larger values representing stronger capability. The characterization value of the execution performance and the characterization value of the reasoning capability can be fused by weighting, and the test result of the GPT model to be tested determined from the comprehensive capability value obtained after fusion, so that the larger the comprehensive capability value, the better the test result of the GPT model to be tested. Alternatively, the execution performance and the reasoning capability can be treated as two independent dimensions, and a two-dimensional vector used to represent the test result of the GPT model to be tested.
Specifically, an execution performance standard value and a reasoning capability standard value can be preset. Condition A is that the execution performance characterization value is higher than the preset execution performance standard value, and condition B is that the reasoning capability characterization value is higher than the preset reasoning capability standard value; these two conditions serve as the two test standards for the GPT model to be tested. A graph of the kind shown in fig. 2 is obtained from the test results, and the region in which the GPT model to be tested falls is determined in the graph according to its execution performance characterization value and reasoning capability characterization value. If both condition A and condition B are satisfied, the execution performance and reasoning capability of the GPT model to be tested are both excellent, and the user is recommended to select it; if neither condition A nor condition B is satisfied, the execution performance and reasoning capability of the GPT model to be tested are both poor, and the user is not recommended to select it; if only one of condition A and condition B is satisfied, the GPT model to be tested still has room for further improvement. The test result of the GPT model to be tested is thereby obtained, and after it is obtained, it can be fed back to the user.
The weights used to fuse the execution performance characterization value and the reasoning capability characterization value can be adjusted according to the user's needs, and this specification does not specifically limit them. As for presetting the execution performance standard value and the reasoning capability standard value, a common GPT model can be selected as a comparison model and its execution performance and reasoning capability determined, or the standard values can be determined from the test results of each GPT model already tested; this specification does not limit this either. To make the test result easier for the user to understand, the test results of common GPT models can also be provided as references when the test result is fed back to the user.
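As a sketch under these assumptions, the region of fig. 2 in which a model falls can be decided as follows; the verdict strings are illustrative only.

```python
# Sketch of turning the two characterization values into the test result of fig. 2:
# condition A compares execution performance with its standard value, condition B compares
# reasoning capability with its standard value.
def test_result(perf: float, reasoning: float,
                perf_standard: float, reasoning_standard: float) -> str:
    cond_a = perf > perf_standard            # condition A: execution performance up to standard
    cond_b = reasoning > reasoning_standard  # condition B: reasoning capability up to standard
    if cond_a and cond_b:
        return "excellent: recommended to the user"
    if not cond_a and not cond_b:
        return "poor: not recommended to the user"
    return "room for further improvement in one dimension"
```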
The comprehensive evaluation method for the GPT model shown in fig. 1 provides an objective test method and standard for the execution performance and reasoning capability of GPT models, so that a user can gain a clearer and more intuitive understanding of the performance and capabilities of different GPT models without performing model training, which makes it easier for the user to select a GPT model that meets their own research or business needs.
In addition, the theoretical performance may specifically include the ideal delay with which the GPT model to be tested executes a test task when the length of the input sample is 1. In that case, determining the execution performance of the GPT model to be tested when executing each test task according to its execution process and the theoretical performance specifically includes: determining the reasoning delay of the GPT model to be tested when executing a test task whose input sample length is greater than 1, and determining the execution performance of the GPT model to be tested when executing each test task according to the ideal delay and the reasoning delay.
Reasoning delay refers to the time interval from the model receiving the input sample to producing the output result, and is typically measured in milliseconds (ms) or seconds (s). A model with low reasoning delay is more suitable for scenarios requiring quick responses, such as real-time image processing and speech recognition. In order to reflect how well the GPT model to be tested uses the hardware that supports its execution, rather than merely the performance of the hardware resources themselves, this specification uses the ideal delay and the actual reasoning delay of the GPT model to be tested when executing test tasks to determine its corresponding execution performance, so as to measure its quick-response capability. Specifically, the ratio of the ideal delay to the reasoning delay can be used to characterize the execution performance corresponding to the GPT model to be tested.
The length of the input samples of the test tasks can be adjusted as needed, and the length of an input sample can be measured in the number of semantic units in the text. For different GPT models to be tested, it must be ensured that the test tasks whose reasoning delays are used in the calculation are the same test tasks.
Specifically, when there are several test tasks, the average of the reasoning delays of the test tasks can be used as the finally determined reasoning delay. To give the execution performance stronger characterization capability, the length of the input samples of the test tasks should be made as large as possible, well above 1; in one or more embodiments of this specification, the length of the input samples of the test tasks may be set to 2048 semantic units.
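A minimal sketch of this delay-based indicator, with illustrative numbers, might look as follows; here P1 is the ratio of the ideal delay to the averaged reasoning delay.

```python
# Sketch of the delay-based performance indicator: the ideal delay is the theoretical value
# for an input of length 1, the measured delays come from test tasks with longer inputs.
def delay_performance(ideal_delay_s: float, measured_delays_s: list[float]) -> float:
    inference_delay = sum(measured_delays_s) / len(measured_delays_s)  # mean over tasks
    return ideal_delay_s / inference_delay   # larger ratio: less extra delay vs. the ideal case


# Example: ideal delay 0.02 s, measured reasoning delays on tasks with 2048-unit inputs.
print(delay_performance(0.02, [0.031, 0.028, 0.035]))
```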
In addition, in step S108 shown in fig. 1, the theoretical performance specifically includes: theoretical throughput rate of the GPT model to be tested; determining the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested for executing each test task and the theoretical performance, wherein the method specifically comprises the following steps: and determining the actual throughput rate of the GPT model to be tested when executing each test task, and determining the execution performance of the GPT model to be tested when executing each test task according to the actual throughput rate and the theoretical throughput rate.
The throughput rate is an index for evaluating how fully the model's execution capacity is used while it runs, and the theoretical throughput rate is the upper limit of the model's throughput rate determined from its hardware conditions and model architecture. For interactive tasks with strict requirements, such as chatbots, a model with a high throughput rate responds faster on the same hardware resources. This method uses the actual throughput rate of the GPT model to be tested when executing each test task, together with its theoretical throughput rate, to determine the corresponding execution performance, so as to measure the model's responsiveness under high-load tasks.
Specifically, the average value of the actual throughput rates of all the test tasks can be used as the final actual throughput rate of the GPT model to be tested; the ratio of the actual throughput rate to the theoretical throughput rate can be used for representing the execution performance corresponding to the GPT model to be tested.
In one or more embodiments of this specification, the execution performance of the GPT model to be tested may be determined jointly from its ideal delay and reasoning delay and from its actual throughput rate and theoretical throughput rate. Specifically, the ratio of the ideal delay to the reasoning delay can be taken as P1, and the ratio of the actual throughput rate to the theoretical throughput rate as P2; P1 and P2 are then combined to obtain the execution performance corresponding to the GPT model to be tested. P1 and P2 can be combined with a common weighted combination method; taking t1 as the weight of P1, t2 as the weight of P2, and v1 as the characterization value of the execution performance as an example, v1 = t1 × P1 + t2 × P2. This specification does not limit the specific weighted combination.
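A sketch of that weighted combination, with the equal weights chosen here purely as an example, is:

```python
# Sketch of the weighted combination described above: P1 is the ideal-to-reasoning delay
# ratio, P2 the actual-to-theoretical throughput ratio, t1 and t2 their weights, and v1
# the execution-performance characterization value.
def execution_performance(p1: float, p2: float, t1: float = 0.5, t2: float = 0.5) -> float:
    return t1 * p1 + t2 * p2                 # v1 = t1*P1 + t2*P2


print(execution_performance(0.65, 0.72))
```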
As for the way the reasoning capability is determined, in step S108 shown in fig. 1, the matching degree between the reasoning results of the GPT model to be tested and preset grammar rules is determined from the reasoning results of the GPT model to be tested on each test task, and the reasoning capability of the GPT model to be tested is determined from that matching degree; and/or the conformity of the reasoning results of the GPT model to be tested with preset ethical safety rules is determined from the reasoning results, and the reasoning capability is determined from that conformity; and/or the reasoning accuracy of the GPT model to be tested on vertical-domain expertise is determined from the reasoning results, and the reasoning capability is determined from that reasoning accuracy.
To determine the matching degree between the reasoning results of the GPT model to be tested and preset grammar rules, sentence structures that conform to the grammar rules can be preset; when the output result of the GPT model to be tested for a test task is obtained, whether the output result conforms to the preset sentence structures is checked, and the more test tasks whose output results conform to the sentence structures, the higher the matching degree between the GPT model to be tested and the preset grammar rules. To determine the conformity of the reasoning results of the GPT model to be tested with preset ethical safety rules, prohibited-topic keywords can be preset; when the output result of the GPT model to be tested for a test task is obtained, whether the output result contains prohibited-topic keywords is checked, and the more test tasks whose output results contain no prohibited-topic keywords, the higher the conformity between the GPT model to be tested and the preset ethical safety rules.
Grammar-rule test tasks can be set according to the preset grammar rules; from the output results of the GPT model to be tested on these tasks, the more grammar-rule test tasks whose output results conform to the standard output results, the higher the matching degree. Ethical-safety test tasks can be preset according to the ethical safety rules; the more ethical-safety test tasks whose output results conform to the standard output results, the higher the conformity. Vertical-domain expertise test tasks can be preset according to the vertical-domain expertise; the more such test tasks whose output results conform to the standard output results, the higher the reasoning accuracy.
In one or more embodiments of the present description, to avoid ambiguity, the test task employs standardized choice questions:
Language rule class: lexical test questions, grammar test questions, discourse accuracy test questions, semantic accuracy test questions, logical consistency test questions, style consistency test questions, knowledge accuracy test questions, knowledge richness test questions, knowledge consistency test questions, and the like.
Ethical safety class: test questions covering inflammatory content, unfairness and discrimination, crimes and illegal activities, sensitive topics, physical harm, mental health, privacy and property, ethics and morality, instruction attacks, and the like.
Vertical domain expertise class: besides the general domain, the GPT model often also needs to work in a specific vertical domain. After editing questions related to that domain, the user can conveniently import them into the test tasks and carry out the corresponding tests, as in the sketch below.
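As a sketch of how such standardized choice questions and the ethical-safety keyword check could be scored, assuming simple letter answers and a preset keyword list (both assumptions made here):

```python
# Sketch of scoring the standardized choice questions and the prohibited-keyword check.
def choice_accuracy(answers: list[str], standard_answers: list[str]) -> float:
    """Fraction of language-rule / vertical-domain choice questions answered correctly."""
    correct = sum(a == s for a, s in zip(answers, standard_answers))
    return correct / len(standard_answers)


def safety_conformity(outputs: list[str], prohibited_keywords: list[str]) -> float:
    """Fraction of outputs that contain none of the preset prohibited-topic keywords."""
    clean = sum(all(kw not in out for kw in prohibited_keywords) for out in outputs)
    return clean / len(outputs)


# Example with invented answers and keywords.
print(choice_accuracy(["A", "C", "B"], ["A", "B", "B"]))
print(safety_conformity(["a harmless reply", "another reply"], ["prohibited topic"]))
```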
Further, in step S108 shown in fig. 1, according to the reasoning results of the GPT model to be tested for executing each test task and the sample results obtained by the reference model for each test task, a difference between the reasoning results and the sample results is determined, and the reasoning capability of the GPT model to be tested is determined according to the difference.
As described above, information such as the input samples, standard output results, and test purposes of each test task may be preset according to the test requirements on execution performance and reasoning capability. When setting the standard output results, a common GPT model, such as ChatGPT, can be chosen as the reference model, and the sample results obtained by the reference model for each test task are stored as the standard output results. In this way, after the reasoning results of the GPT model to be tested on each test task are obtained, determining the difference between the reasoning results and the sample results amounts to determining the difference in reasoning capability between the GPT model to be tested and the reference model, from which the reasoning capability of the GPT model to be tested is determined.
Specifically, in step S108 shown in fig. 1, a difference index is determined, where the difference index is the number of test tasks whose reasoning result is correct and whose sample result is incorrect, and/or the number of test tasks whose reasoning result is incorrect and whose sample result is correct, and/or the number of test tasks whose reasoning result is correct and whose sample result is correct, and/or the number of test tasks whose reasoning result is incorrect and whose sample result is incorrect; the reasoning capability of the GPT model to be tested is then determined from the difference index.
The difference between the reasoning results and the sample results, that is, the difference index, can specifically be expressed as at least one of: the number of test tasks q1 whose reasoning result is correct and whose sample result is incorrect, the number of test tasks q2 whose reasoning result is incorrect and whose sample result is correct, the number of test tasks q3 whose reasoning result is correct and whose sample result is correct, and the number of test tasks q4 whose reasoning result is incorrect and whose sample result is incorrect.
The reasoning capability of the GPT model to be tested can be determined from the difference indices in various common ways; for example, the reasoning capability characterization value v2 may be set as v2 = (q2 + q3)/(q1 + q4), or as v2 = (q2 - q3)²/(q2 + q3).
In one or more embodiments of this specification, the confusion matrix shown in fig. 3 may also be established from the above difference indices, and v2 determined with McNemar's test.
The larger the value of v2, the larger the difference between the reasoning results and the sample results, and the stronger the reasoning capability of the GPT model to be tested.
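The following sketch computes the difference indices q1 to q4 and the two example characterizations of v2 given above; the exact confusion-matrix layout of fig. 3 and the McNemar statistic itself are not reproduced here.

```python
# Sketch of the difference indices and the two example forms of v2 from the description.
def difference_indices(reasoning_ok: list[bool],
                       sample_ok: list[bool]) -> tuple[int, int, int, int]:
    q1 = sum(r and not s for r, s in zip(reasoning_ok, sample_ok))          # correct / incorrect
    q2 = sum((not r) and s for r, s in zip(reasoning_ok, sample_ok))        # incorrect / correct
    q3 = sum(r and s for r, s in zip(reasoning_ok, sample_ok))              # correct / correct
    q4 = sum((not r) and (not s) for r, s in zip(reasoning_ok, sample_ok))  # both incorrect
    return q1, q2, q3, q4


def v2_ratio(q1: int, q2: int, q3: int, q4: int) -> float:
    return (q2 + q3) / (q1 + q4) if (q1 + q4) else float("inf")


def v2_squared(q2: int, q3: int) -> float:
    return (q2 - q3) ** 2 / (q2 + q3) if (q2 + q3) else 0.0
```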
In one or more embodiments of this specification, several GPT models to be tested are obtained and a test result is determined for each of them. Taking execution performance and reasoning capability as the two dimensions of comprehensive capability, a two-dimensional comprehensive capability vector corresponding to each GPT model to be tested is determined from the execution performance and reasoning capability in its test result, and the two-dimensional comprehensive capability vectors are displayed to the user.
According to the method provided by the specification, each GPT model which can be obtained by the server is used as a GPT model to be tested, the test result of each GPT model to be tested is determined, and the two-dimensional comprehensive capacity vector corresponding to each GPT model to be tested is determined according to each test result.
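A minimal sketch of assembling and displaying such two-dimensional comprehensive capability vectors, with invented model names and values, is:

```python
# Sketch: each model's execution performance and reasoning capability form a two-dimensional
# comprehensive-capability vector that can be shown to the user.
def capacity_vectors(results: dict[str, dict[str, float]]) -> dict[str, tuple[float, float]]:
    return {
        name: (res["execution_performance"], res["reasoning_capability"])
        for name, res in results.items()
    }


vectors = capacity_vectors({
    "model_a": {"execution_performance": 0.71, "reasoning_capability": 0.64},
    "model_b": {"execution_performance": 0.58, "reasoning_capability": 0.80},
})
for name, (perf, reasoning) in vectors.items():
    print(f"{name}: performance={perf:.2f}, reasoning={reasoning:.2f}")
```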
In one or more embodiments of this specification, the user may also train a GPT model on their own and then test the trained GPT model.
The construction of the GPT model training environment covers four aspects: hardware resources, software tools, data preparation, and model parameter settings:
(1) Hardware resources are obtained. The training of the GPT model requires a large amount of computing power and therefore requires the selection of high performance GPUs, CPUs, storage and networking, etc. GPUs are preferred for training the GPT model because they are more suitable than CPUs for processing a large number of matrix operations in parallel. The size and speed of memory and storage also affect the efficiency and cost of training, requiring reasonable allocation and optimization according to the size of the data set and model. In addition, there is a need to consider how to train with a high performance RDMA network, using multiple devices in a distributed manner to increase the speed and stability of training. This involves how the data and model are partitioned, and how the parameters and gradients are synchronized, etc.
(2) Installing a distributed GPT model training environment. Training a GPT model requires selecting appropriate frameworks, libraries, and platforms, and deciding how to configure, write, and monitor the training scripts. Current large-model training and reasoning frameworks, such as PyTorch, DeepSpeed, and Megatron, provide flexible and efficient programming interfaces as well as rich functionality and community support. In addition, there are libraries specific to GPT models, such as Hugging Face Transformers and Megatron-LM, which provide convenient ways to pre-train and fine-tune GPT models, as well as some of the latest models and datasets. Some cloud computing platforms, such as Alibaba Cloud and Baidu Cloud, also provide one-stop GPT model training services, including hardware resource allocation, software tool installation, and the running and monitoring of training scripts.
(3) Data is prepared. Training of the GPT model requires selection of the appropriate data set, including both public and proprietary, and how to clean, pre-process, and label the data. The quality and number of datasets directly impact the performance and generalization ability of the model, and therefore require screening and sampling according to the goals and domain of the model. In general, the GPT model requires a large amount of text data, which can be obtained from the Internet or other sources, or can be collected and built on its own. The cleansing and preprocessing of the data sets includes removing extraneous content such as HTML tags, advertisements, copyright notices, etc., as well as unifying the format, coding, language, etc. of the data. Tokenization of a dataset includes segmenting text into smaller units, such as words, subwords, etc., and constructing vocabularies, coding schemes, etc.
(4) Setting the model architecture, parameters, adjustment strategies, and the like. The choice of model architecture depends on the task and the complexity of the model; there are many popular model architectures, such as Transformer, BERT, and GPT, which have different hierarchies, attention mechanisms, pre-training objectives, and so on. On the basis of a suitable model architecture, parameters such as the model size, context window, and optimizer are determined. The choice of model size depends on the expressive power required by the model's application and the computational cost of the model itself; in general, the larger the model, the better the performance, but also the more difficult it is to train and deploy. The choice of context window depends on the model's memory capacity and long-range dependence; in general, the larger the context window, the more global information about the text the model can capture, but also the more difficult it is to optimize and parallelize. The choice of optimizer depends on the model's convergence speed and stability; there are many optimization algorithms, such as SGD, Adam, and LAMB, with different learning rates, momentum, regularization, and so on. The adjustment strategy of the model includes selecting suitable loss functions, metrics, validation sets, and test sets, as well as how to perform hyperparameter search, model compression, model interpretation, and so on.
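Purely as an illustration of aspect (4), a training setup of this kind might be recorded as a configuration like the following; every concrete value is an assumption, not a recommendation from this specification.

```python
# Illustrative record of the model architecture and parameter settings described in aspect (4).
training_setup = {
    "architecture": "Transformer (decoder-only GPT)",
    "model_size_parameters": 7_000_000_000,
    "context_window_tokens": 2048,
    "optimizer": {"name": "Adam", "learning_rate": 3e-4, "weight_decay": 0.1},
    "loss_function": "cross_entropy",
    "evaluation": {"validation_set": "held-out split", "metric": "perplexity"},
}
```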
Based on the same idea as the comprehensive evaluation method for the GPT model provided above for one or more embodiments of this specification, this specification further provides a corresponding comprehensive evaluation device for the GPT model, as shown in fig. 4.
Fig. 4 is a schematic diagram of a comprehensive evaluation device for a GPT model provided in the present specification, which specifically includes:
The model acquisition module 400 acquires a GPT model to be measured.
The environment determining module 402 determines a training reasoning environment of the GPT model to be tested.
And the performance acquisition module 404 determines the theoretical performance of the GPT model to be tested according to the training reasoning environment.
And an execution module 406, which executes each test task by using the GPT model to be tested.
The actual measurement module 408 determines the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested and the theoretical performance, and determines the reasoning capability of the GPT model to be tested according to the reasoning result of the GPT model to be tested when executing each test task.
And an output module 410, configured to determine a test result of the GPT model to be tested according to the execution performance and the reasoning capability.
Optionally, the theoretical performance specifically includes: the ideal delay with which the GPT model to be tested executes a test task when the length of the input sample is 1;
The actual measurement module 408 is specifically configured to: determining the reasoning delay of the GPT model to be tested for executing the test task, wherein the length of an input sample of the test task is greater than 1, and determining the execution performance of the GPT model to be tested when executing each test task according to the ideal delay and the reasoning delay.
Optionally, the theoretical performance specifically includes: theoretical throughput rate of the GPT model to be tested;
The actual measurement module 408 is specifically configured to: and determining the actual throughput rate of the GPT model to be tested when executing each test task, and determining the execution performance of the GPT model to be tested when executing each test task according to the actual throughput rate and the theoretical throughput rate.
Optionally, the actual measurement module 408 is specifically configured to: determining the matching degree of the reasoning result of the GPT model to be tested and a preset grammar rule according to the reasoning result of the GPT model to be tested, determining the reasoning capability of the GPT model to be tested according to the matching degree, and/or determining the coincidence degree of the reasoning result of the GPT model to be tested and preset ethic safety rules according to the reasoning result of the GPT model to be tested, determining the reasoning capability of the GPT model to be tested according to the coincidence degree, and/or determining the reasoning accuracy of the GPT model to be tested on the expertise in the vertical field according to the reasoning accuracy according to the reasoning result of the GPT model to be tested.
Optionally, the actual measurement module 408 is specifically configured to: and determining the difference between the reasoning result and the sample result according to the reasoning result of each test task executed by the GPT model to be tested and the sample result obtained by the reference model aiming at each test task, and determining the reasoning capability of the GPT model to be tested according to the difference.
Optionally, the actual measurement module 408 is specifically configured to: determining a difference index, wherein the difference index is the number of test tasks with correct reasoning results and incorrect sample results, and/or the number of test tasks with incorrect reasoning results and correct sample results, and/or the number of test tasks with correct reasoning results and correct sample results, and/or the number of test tasks with incorrect reasoning results and incorrect sample results, and determining the reasoning capacity of the GPT model to be tested according to the difference index.
Optionally, the obtaining module 400 is specifically configured to: acquiring a plurality of GPT models to be tested;
the output module 410 is specifically configured to: determining a test result aiming at each GPT model to be tested;
the output module 410 is further configured to: and determining two-dimensional comprehensive capacity vectors corresponding to each GPT model to be tested according to the execution performance and the reasoning capacity in the test results of the GPT models to be tested by taking the execution performance and the reasoning capacity as two dimensions of the comprehensive capacity, and displaying the two-dimensional comprehensive capacity vectors to a user.
The present specification also provides a computer readable storage medium storing a computer program, where the computer program is configured to execute the comprehensive evaluation method for the GPT model provided in fig. 1.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 5. As shown in fig. 5, at the hardware level, the GPT model-oriented comprehensive evaluation device includes a processor, an internal bus, a network interface, a memory, and a nonvolatile memory, and may of course include hardware required by other services. The processor reads the corresponding computer program from the nonvolatile memory to the memory and then runs the computer program to realize the comprehensive evaluation method facing the GPT model, which is shown in the figure 1. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to the method flow). However, with the development of technology, many improvements of method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained merely by slightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, application specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K30, and Silicon Labs C8051F330. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in purely computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means for performing various functions included in it may also be regarded as structures within the hardware component. Or even the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
This specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present application.

Claims (10)

1. A GPT model-oriented comprehensive evaluation method, characterized by comprising the following steps:
acquiring a GPT model to be tested;
determining a training reasoning environment of the GPT model to be tested;
determining a theoretical performance of the GPT model to be tested according to the training reasoning environment;
executing each test task by using the GPT model to be tested;
determining an execution performance of the GPT model to be tested when executing each test task according to an execution process of the GPT model to be tested and the theoretical performance, and determining a reasoning capability of the GPT model to be tested according to reasoning results of the GPT model to be tested when executing each test task;
and determining a test result of the GPT model to be tested according to the execution performance and the reasoning capability.
2. The method according to claim 1, wherein the theoretical performance specifically comprises: an ideal delay of the GPT model to be tested executing a test task when a length of an input sample is 1;
determining the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested executing each test task and the theoretical performance specifically comprises:
determining a reasoning delay of the GPT model to be tested executing the test task, wherein the length of an input sample of the test task is greater than 1;
and determining the execution performance of the GPT model to be tested when executing each test task according to the ideal delay and the reasoning delay.
3. The method according to claim 2, wherein the theoretical performance specifically comprises: a theoretical throughput rate of the GPT model to be tested;
determining the execution performance of the GPT model to be tested when executing each test task according to the execution process of the GPT model to be tested executing each test task and the theoretical performance specifically comprises:
determining an actual throughput rate of the GPT model to be tested when executing each test task;
and determining the execution performance of the GPT model to be tested when executing each test task according to the actual throughput rate and the theoretical throughput rate.
4. The method according to claim 1, wherein determining the reasoning capability of the GPT model to be tested according to the reasoning results of the GPT model to be tested executing each test task specifically comprises:
determining a matching degree between the reasoning results of the GPT model to be tested and preset grammar rules according to the reasoning results of the GPT model to be tested executing each test task, and determining the reasoning capability of the GPT model to be tested according to the matching degree; and/or
determining a conformity degree between the reasoning results of the GPT model to be tested and preset ethical safety rules according to the reasoning results of the GPT model to be tested executing each test task, and determining the reasoning capability of the GPT model to be tested according to the conformity degree; and/or
determining a reasoning accuracy of the GPT model to be tested on expertise in a vertical field according to the reasoning results of the GPT model to be tested executing each test task, and determining the reasoning capability of the GPT model to be tested according to the reasoning accuracy.
5. The method according to claim 1, wherein determining the reasoning capability of the GPT model to be tested according to the reasoning results of the GPT model to be tested executing each test task specifically comprises:
determining a difference between the reasoning results and sample results according to the reasoning results of the GPT model to be tested executing each test task and the sample results obtained by a reference model for each test task;
and determining the reasoning capability of the GPT model to be tested according to the difference.
6. The method according to claim 5, wherein determining the difference between the reasoning results and the sample results specifically comprises:
determining a difference index, wherein the difference index is the number of test tasks whose reasoning result is correct and whose sample result is incorrect, and/or the number of test tasks whose reasoning result is incorrect and whose sample result is correct, and/or the number of test tasks whose reasoning result and sample result are both correct, and/or the number of test tasks whose reasoning result and sample result are both incorrect;
determining the reasoning capability of the GPT model to be tested according to the difference specifically comprises:
and determining the reasoning capability of the GPT model to be tested according to the difference index.
7. The method according to claim 1, wherein acquiring the GPT model to be tested specifically comprises:
acquiring a plurality of GPT models to be tested;
determining the test result of the GPT model to be tested specifically comprises:
determining a test result for each GPT model to be tested;
after determining the test result for each GPT model to be tested, the method further comprises:
taking the execution performance and the reasoning capability as two dimensions of a comprehensive capability, and determining a two-dimensional comprehensive capability vector corresponding to each GPT model to be tested according to the execution performance and the reasoning capability in the test result of each GPT model to be tested;
and displaying each two-dimensional comprehensive capability vector to a user.
8. A GPT model-oriented comprehensive evaluation device, characterized by comprising:
a model acquisition module, configured to acquire a GPT model to be tested;
an environment determination module, configured to determine a training reasoning environment of the GPT model to be tested;
a performance acquisition module, configured to determine a theoretical performance of the GPT model to be tested according to the training reasoning environment;
an execution module, configured to execute each test task by using the GPT model to be tested;
an actual measurement module, configured to determine an execution performance of the GPT model to be tested when executing each test task according to an execution process of the GPT model to be tested and the theoretical performance, and to determine a reasoning capability of the GPT model to be tested according to reasoning results of the GPT model to be tested when executing each test task;
and an output module, configured to determine a test result of the GPT model to be tested according to the execution performance and the reasoning capability.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.
10. An apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the program.
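
The following non-limiting sketches illustrate possible readings of some of the computations recited in the claims; they are not part of the patented method. The claims do not fix a concrete formula for turning the ideal delay, the measured reasoning delay, and the two throughput rates into an execution-performance value. This first minimal Python sketch shows one reading of claims 2 and 3; the ratio-based scores, the linear scaling of the ideal delay by input length, the equal weighting of the two ratios, and all function and field names are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class TheoreticalPerformance:
    ideal_delay_s: float        # ideal delay for an input sample of length 1 (claim 2)
    peak_throughput_tps: float  # theoretical throughput rate, e.g. tokens per second (claim 3)


def delay_score(ideal_delay_s: float, reasoning_delay_s: float, input_length: int) -> float:
    """Compare the measured reasoning delay (input length > 1) with the ideal delay.

    Scaling the ideal delay linearly with input length is an assumption made only
    for illustration; the claims do not prescribe it.
    """
    if reasoning_delay_s <= 0:
        return 0.0
    return (ideal_delay_s * input_length) / reasoning_delay_s


def throughput_score(peak_throughput_tps: float, actual_throughput_tps: float) -> float:
    """Ratio of the actual throughput rate to the theoretical throughput rate."""
    if peak_throughput_tps <= 0:
        return 0.0
    return actual_throughput_tps / peak_throughput_tps


def execution_performance(theory: TheoreticalPerformance,
                          reasoning_delay_s: float,
                          input_length: int,
                          actual_throughput_tps: float) -> float:
    # Equal weighting of the two ratios is an illustrative choice.
    return 0.5 * delay_score(theory.ideal_delay_s, reasoning_delay_s, input_length) \
        + 0.5 * throughput_score(theory.peak_throughput_tps, actual_throughput_tps)

For example, with an ideal delay of 0.02 s, a measured reasoning delay of 1.1 s for a 48-token input, a theoretical throughput of 1200 tokens/s, and an actual throughput of 950 tokens/s, this sketch yields roughly 0.5 × 0.87 + 0.5 × 0.79 ≈ 0.83.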
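
Claim 4 likewise leaves open how the matching degree with preset grammar rules and the conformity with preset ethical safety rules are measured. The small sketch below assumes, purely for illustration, that grammar rules are given as regular expressions and ethical safety rules as a list of disallowed phrases; neither representation is specified by the patent.

import re
from typing import Iterable


def grammar_match_degree(reasoning_results: Iterable[str], grammar_patterns: Iterable[str]) -> float:
    """Fraction of reasoning results that satisfy every preset grammar rule."""
    results = list(reasoning_results)
    patterns = [re.compile(p) for p in grammar_patterns]
    if not results:
        return 0.0
    ok = sum(1 for text in results if all(p.search(text) for p in patterns))
    return ok / len(results)


def ethics_conformity_degree(reasoning_results: Iterable[str], disallowed_phrases: Iterable[str]) -> float:
    """Fraction of reasoning results that contain none of the disallowed phrases."""
    results = list(reasoning_results)
    phrases = list(disallowed_phrases)
    if not results:
        return 0.0
    ok = sum(1 for text in results if not any(phrase in text for phrase in phrases))
    return ok / len(results)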
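
For claims 5 and 6, the difference between the reasoning results of the model under test and a reference model's sample results can be summarised by the four counts recited in claim 6. The sketch below assumes each test task has already been scored as correct or incorrect for both models; how a single reasoning-capability value is then derived from the counts is again an illustrative assumption.

from typing import Dict, List


def difference_indices(model_correct: List[bool], reference_correct: List[bool]) -> Dict[str, int]:
    """Count the four combinations of correct/incorrect outcomes over the same test tasks."""
    counts = {"model_right_ref_wrong": 0, "model_wrong_ref_right": 0,
              "both_right": 0, "both_wrong": 0}
    for m, r in zip(model_correct, reference_correct):
        if m and not r:
            counts["model_right_ref_wrong"] += 1
        elif r and not m:
            counts["model_wrong_ref_right"] += 1
        elif m and r:
            counts["both_right"] += 1
        else:
            counts["both_wrong"] += 1
    return counts


def reasoning_capability(counts: Dict[str, int]) -> float:
    # One possible score: the share of tasks on which the model under test
    # does at least as well as the reference model.
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["both_right"] + counts["model_right_ref_wrong"]) / total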
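
Finally, claim 7 combines the two test-result dimensions into a two-dimensional comprehensive capability vector for each GPT model under test and displays the vectors to the user. A minimal sketch follows, assuming the per-model scores are already available as plain dictionaries and using a text listing in place of whatever visualisation a real test platform would provide.

from typing import Dict, Tuple


def capability_vectors(execution_performance: Dict[str, float],
                       reasoning_capability: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Build an (execution performance, reasoning capability) vector for each model."""
    return {name: (execution_performance[name], reasoning_capability[name])
            for name in execution_performance if name in reasoning_capability}


def display(vectors: Dict[str, Tuple[float, float]]) -> None:
    # Print one line per model so the two dimensions can be compared side by side.
    for name, (perf, capability) in sorted(vectors.items()):
        print(f"{name:<20} execution performance: {perf:.2f}  reasoning capability: {capability:.2f}")


if __name__ == "__main__":
    display(capability_vectors({"model_a": 0.81, "model_b": 0.65},
                               {"model_a": 0.74, "model_b": 0.88}))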
CN202410443128.9A 2024-04-12 2024-04-12 GPT model-oriented comprehensive evaluation method Active CN118036668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410443128.9A CN118036668B (en) 2024-04-12 2024-04-12 GPT model-oriented comprehensive evaluation method

Publications (2)

Publication Number Publication Date
CN118036668A CN118036668A (en) 2024-05-14
CN118036668B true CN118036668B (en) 2024-06-07

Family

ID=90995367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410443128.9A Active CN118036668B (en) 2024-04-12 2024-04-12 GPT model-oriented comprehensive evaluation method

Country Status (1)

Country Link
CN (1) CN118036668B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021217772A1 (en) * 2020-04-26 2021-11-04 Ping An Technology (Shenzhen) Co., Ltd. AI-based interview corpus classification method and apparatus, computer device and medium
CN116861913A (en) * 2023-05-09 2023-10-10 北京邮电大学 Position detection method based on GPT large model and related equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GPT-FL: Generative Pre-trained Model-Assisted Federated Learning [arXiv]; Zhang, T. et al.; arXiv; 2023-06-30; full text *
A deep-learning-based generation model for ancient Chinese; Huang Shi; Lin Zheng; Electronic Technology & Software Engineering; 2020-02-01 (03); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant