CN113722570B - Method, device and equipment for constructing pre-training corpus and readable medium - Google Patents

Method, device and equipment for constructing pre-training corpus and readable medium

Info

Publication number
CN113722570B
CN113722570B, CN202110932826.1A
Authority
CN
China
Prior art keywords
data set
size
training corpus
quality data
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110932826.1A
Other languages
Chinese (zh)
Other versions
CN113722570A (en)
Inventor
于彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202110932826.1A priority Critical patent/CN113722570B/en
Publication of CN113722570A publication Critical patent/CN113722570A/en
Application granted granted Critical
Publication of CN113722570B publication Critical patent/CN113722570B/en
Legal status: Active (current)
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for constructing a pre-training corpus, which comprises the following steps: judging, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset; if the required dataset is not a small-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide, based on a preset token-count ratio; calculating the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains; and sampling the crawler data based on its weight and sampling each high-quality dataset based on its own weight, so as to obtain the pre-training corpus. The invention also discloses an apparatus for constructing a pre-training corpus, a computer device, and a readable storage medium. By adopting different sampling schemes for pre-training at different scales, the method improves the quality of the pre-training corpus.

Description

Method, device and equipment for constructing pre-training corpus and readable medium
Technical Field
The present invention relates to the field of pre-training language models, and in particular, to a method, an apparatus, a device, and a readable medium for constructing a pre-training corpus.
Background
Pre-trained language models have become a very popular research direction in recent years. A pre-trained language model is trained on a large amount of naturally occurring text so that it learns the probability distribution of each character or word in that text, thereby fitting a model to the text distribution. The labels of a language model are simply its own context, and unlabeled text is far easier to obtain than labeled text. This allows language models to be trained on practically unlimited amounts of unlabeled corpora; such large-scale corpora give pre-trained language models strong learning ability, which in turn yields excellent results on downstream tasks. A pre-trained model also provides better model initialization, which generally leads to better generalization and faster convergence on the target task, and pre-training can further be seen as a form of regularization that helps avoid overfitting on small datasets.
For large-scale pre-training models, preparation and cleaning of the pre-training corpus is an important step. In general, the data fall into two categories: publicly released natural language processing datasets, and crawler data. Publicly released datasets are usually already cleaned, the data are tidy, and their sources (news, encyclopedias, books, archives, question-answer pairs, and so on) are essentially guaranteed; some also carry manually annotated labels, and such datasets are referred to as high-quality datasets. The problem with public datasets is that manual labels also constrain the problem space of natural language processing, which weakens the generalization ability of the model to some extent; in addition, a public dataset is usually targeted at a specific domain, and if the pre-training corpus is concentrated in one domain, the performance of the model on downstream tasks suffers. Crawler data are messier than existing public datasets: the data source must be checked during transcoding and cleaning to keep out material unsuitable for pre-training, and sensitive words, garbled characters, tables, and the like require additional processing. Using crawler data in the pre-training corpus therefore entails far more pre-processing work, and the quality of the processed data is often still inferior to that of public datasets. The advantages of a crawler dataset are that the amount of data is much larger and the sources are much richer.
Once the pre-training corpus is ready, a further important issue is how to sample from it to compose the pre-training dataset and the validation set. If ordinary random sampling is used, the composition of the dataset essentially mirrors the size ratio of the corpora from each source. Because the volume of crawler data is large, it occupies a relatively large share of the training set, while the higher-quality public datasets occupy a relatively small share. According to previous research, repeating a given corpus a limited number of times during sampling does not noticeably affect the pre-training result; nevertheless, for a large-scale pre-training model the corpus should be as large as possible, because a large-scale model overfits more easily on a smaller-scale dataset.
Most current pre-training models, regardless of corpus scale, sample almost at equal proportion (randomly) across the corpora, i.e. the sizes of the different corpus sources determine their proportions in the pre-training dataset. Even where non-equal-proportion sampling is mentioned in some methods, no explicit rule is given for how the sampling ratios are determined.
Disclosure of Invention
In the prior art, the volume of crawler data is far larger than that of other public datasets, so equal-proportion sampling lowers the quality of the training set and thereby affects the pre-training result. There is also no distinct data-sampling scheme for pre-trained models of different scales. For non-equal-proportion sampling schemes, no explicit rule is given, so subsequent work cannot easily build on them.
As the size of pre-training models grows, the scale of the pre-training corpus grows with it, and publicly available high-quality datasets can no longer meet the pre-training requirement. The problems to be solved include: when non-high-quality datasets need to be introduced; how the composition of the training corpus differs for models of different scales; how to guarantee, to the greatest extent, the quality of the training set, validation set, and test set when non-high-quality datasets are introduced into the pre-training corpus; and how to give a quantitative construction scheme that subsequent work can reference.
Therefore, an object of the embodiments of the present invention is to provide a method, apparatus, device, and readable medium for constructing a pre-training corpus that adopt different sampling schemes for pre-training at different scales, so as to control the proportion of crawler data in the pre-training dataset, prevent smaller data sources from being over-learned by the model, and improve the quality of the pre-training corpus.
Based on the above objects, an aspect of the embodiments of the present invention provides a method for constructing a pre-training corpus, including the following steps: judging, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset; if the required dataset is not a small-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide, based on a preset token-count ratio; calculating the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains; and sampling the crawler data based on its weight and sampling each individual high-quality dataset based on its own weight, so as to obtain the pre-training corpus.
In some embodiments, the method further comprises: if the required dataset is a small-scale dataset, randomly sampling from the high-quality datasets to obtain the pre-training corpus.
In some embodiments, if the required dataset is not a small-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide based on the preset token-count ratio comprises: if the required dataset is a medium-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide based on a lower preset token-count ratio; and if the required dataset is a large-scale dataset, calculating them based on a higher preset token-count ratio.
In some embodiments, calculating the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains comprises: judging whether the weight of an individual high-quality dataset exceeds a preset weight; and if it does, setting its weight to the preset weight and recalculating the weights of the other individual high-quality datasets.
In some embodiments, the method further comprises: determining the size of the pre-training corpus to be constructed based on the parameter count of the language model.
In some implementations, determining the size of the pre-training corpus to be constructed based on the parameter count of the language model includes: if the parameter count of the language model does not exceed 1 billion (1B), confirming that the size of the pre-training corpus to be constructed does not exceed 100 gigabytes; if the parameter count exceeds 1 billion but does not exceed 10 billion (10B), confirming that the size of the pre-training corpus to be constructed exceeds 100 gigabytes but does not exceed 1000 gigabytes; and if the parameter count exceeds 10 billion but does not exceed 100 billion (100B), confirming that the size of the pre-training corpus to be constructed exceeds 1 terabyte.
In some embodiments, judging, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset comprises: if the size of the pre-training corpus to be constructed does not exceed 100 gigabytes, confirming that the required dataset is a small-scale dataset; if it exceeds 100 gigabytes but does not exceed 1000 gigabytes, confirming that the required dataset is not a small-scale dataset but a medium-scale dataset; and if it exceeds 1 terabyte, confirming that the required dataset is not a small-scale dataset but a large-scale dataset.
In another aspect of the embodiments of the present invention, there is also provided an apparatus for constructing a pre-training corpus, comprising: a first module configured to judge, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset; a second module configured to calculate, if the required dataset is not a small-scale dataset, the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide, based on a preset token-count ratio; a third module configured to calculate the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains; and a fourth module configured to sample the crawler data based on its weight and sample each individual high-quality dataset based on its own weight, so as to obtain the pre-training corpus.
In still another aspect of the embodiment of the present invention, there is also provided a computer apparatus, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor performing the steps of the method described above.
In yet another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method steps as described above.
The invention has at least the following beneficial technical effects: different sampling schemes are adopted for pre-training at different scales, so that the proportion of crawler data in the pre-training dataset is controlled, smaller data sources are prevented from being over-learned by the model, and the quality of the pre-training corpus is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an embodiment of a method of constructing a pre-training corpus provided by the present invention;
FIG. 2 is a schematic diagram of an embodiment of a device for building a pre-training corpus provided by the present invention;
FIG. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention;
FIG. 4 is a schematic diagram of an embodiment of a computer readable storage medium provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
It should be noted that, in the embodiments of the present invention, the expressions "first" and "second" are used only to distinguish two entities or parameters that share the same name; they are merely for convenience of expression and should not be construed as limiting the embodiments of the present invention, which will not be repeated below.
Based on the above object, in a first aspect of the embodiments of the present invention, an embodiment of a method for building a pre-training corpus is provided. Fig. 1 is a schematic diagram of an embodiment of a method for constructing a pre-training corpus according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
S01, judging, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset;
S02, if the required dataset is not a small-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide, based on a preset token-count ratio;
S03, calculating the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains; and
S04, sampling the crawler data based on its weight and sampling each individual high-quality dataset based on its own weight, so as to obtain the pre-training corpus.
In this embodiment, different sampling ratios are adopted for pre-training at different scales. When the pre-training corpus is at the tens-of-GB scale, crawler data is not suitable and plain random sampling is used. When the pre-training corpus is at the hundred-GB scale, crawler data is kept at no more than 20% and the proportions of the individual high-quality data sources are kept as equal as possible. When the pre-training corpus is at the 1 TB scale, crawler data is kept at no more than 60% and the proportions of the individual high-quality data sources are kept as equal as possible. For data sources of smaller scale, the weight used during training is kept at no more than 4 wherever possible, to avoid the dataset being over-learned.
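As an illustration only, the scale-dependent sampling policy described above can be sketched in Python as follows. The names SamplingPolicy and choose_policy are hypothetical and do not come from the patent; the thresholds simply restate the tens-of-GB / hundred-GB / terabyte brackets of this embodiment.

```python
# Illustrative sketch of the scale-dependent sampling policy (names are hypothetical).
from dataclasses import dataclass

@dataclass
class SamplingPolicy:
    use_crawler_data: bool       # whether crawler data enters the corpus at all
    crawler_token_ratio: float   # upper bound on the crawler share of training tokens
    max_dataset_weight: float    # per-dataset weight cap to avoid over-learning

def choose_policy(corpus_size_gb: float) -> SamplingPolicy:
    """Map the target corpus size (in GB) to the sampling policy of this embodiment."""
    if corpus_size_gb <= 100:    # tens-of-GB scale: high-quality data only
        return SamplingPolicy(use_crawler_data=False, crawler_token_ratio=0.0, max_dataset_weight=4.0)
    if corpus_size_gb <= 1000:   # hundred-GB scale: crawler data capped at 20%
        return SamplingPolicy(use_crawler_data=True, crawler_token_ratio=0.2, max_dataset_weight=4.0)
    # around 1 TB and beyond: crawler data capped at 60%
    return SamplingPolicy(use_crawler_data=True, crawler_token_ratio=0.6, max_dataset_weight=4.0)
```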
In this embodiment, taking a model with parameters at the 10B scale, or a pre-training corpus at the hundred-GB scale, as an example, weights need to be applied during sampling so that the tokens contributed by the crawler data make up no more than 20% of the training set, while the remaining high-quality datasets a_1, a_2, ..., a_n from different sources contribute as nearly equal numbers of tokens as possible. Specifically, suppose the model is to be pre-trained on N tokens, the crawler data b contains B tokens in total, and the high-quality datasets contain A_1, A_2, ..., A_n tokens respectively. If the tokens contributed by the crawler data b are not to exceed 0.2N, the weight used for the crawler data during sampling is w_b = 0.2N / B. If the token shares of the high-quality datasets are to be equal, then in the ideal case the sampling weight of each high-quality dataset is
w_ai = 0.8N / (n · A_i), i = 1, 2, ..., n,
where, to avoid overfitting, w_ai ≤ 4. If one or several of the high-quality datasets are so small that the computed value of w_ai exceeds 4, then w_ai is set to 4 and the remaining sampling weights are adjusted in turn:
w_aj = (0.8N - 4·A_i) / ((n - 1) · A_j), j ≠ i,
and so on.
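The weight calculation just described, including the cap of 4 and the successive redistribution when a dataset is too small, can be sketched as follows. This is a minimal, hypothetical illustration; compute_weights and its parameter names are not taken from the patent.

```python
# Illustrative sketch of the sampling-weight calculation (names are hypothetical).
from typing import List, Tuple

def compute_weights(n_train_tokens: float,
                    crawler_tokens: float,
                    hq_tokens: List[float],
                    crawler_ratio: float = 0.2,
                    weight_cap: float = 4.0) -> Tuple[float, List[float]]:
    """Return (crawler weight, per-dataset weights) such that crawler data supplies at
    most `crawler_ratio` of the training tokens and the rest is split as evenly as
    possible across the high-quality datasets, with no weight exceeding `weight_cap`."""
    w_crawler = crawler_ratio * n_train_tokens / crawler_tokens   # w_b = 0.2N / B
    remaining = (1.0 - crawler_ratio) * n_train_tokens            # token budget for HQ datasets
    weights = [0.0] * len(hq_tokens)
    open_idx = list(range(len(hq_tokens)))                        # datasets not yet capped

    # Cap any dataset whose ideal weight exceeds the limit, then redistribute the
    # leftover token budget evenly over the remaining datasets, and repeat.
    while open_idx:
        per_set = remaining / len(open_idx)                       # equal token share per open dataset
        capped = [i for i in open_idx if per_set / hq_tokens[i] > weight_cap]
        if not capped:
            for i in open_idx:
                weights[i] = per_set / hq_tokens[i]               # w_ai = share / A_i
            break
        for i in capped:
            weights[i] = weight_cap
            remaining -= weight_cap * hq_tokens[i]
            open_idx.remove(i)
        remaining = max(remaining, 0.0)
    return w_crawler, weights
```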
For example, if there are 10 high-quality datasets in addition to the crawler data, each dataset is kept as close as possible to 8% of the tokens, and the crawler data supplies 20%.
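With hypothetical figures matching this example (ten sufficiently large high-quality datasets and a 20% crawler cap), the sketch above reproduces the 20% / 8% split:

```python
# Hypothetical figures: train on N = 1e11 tokens, crawler corpus of 1e12 tokens,
# ten high-quality datasets of 2e10 tokens each.
w_b, w_hq = compute_weights(1.0e11, 1.0e12, [2.0e10] * 10, crawler_ratio=0.2)
# Crawler data supplies 0.2*N tokens (20% of the training set); each high-quality
# dataset supplies 0.8*N/10 = 8e9 tokens (8%), i.e. a weight of 8e9 / 2e10 = 0.4 <= 4.
```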
In this embodiment, taking a model with parameters at the 100B scale, or a pre-training corpus at the 1 TB scale, as an example, weights need to be applied during sampling so that the tokens contributed by the crawler data make up no more than 60% of the training set, while the remaining high-quality datasets from different sources contribute as nearly equal numbers of tokens as possible. Specifically, suppose the model is to be pre-trained on N tokens, the crawler data contains B tokens in total, and the high-quality datasets contain A_1, A_2, ..., A_n tokens respectively. If the tokens contributed by the crawler data b are not to exceed 0.6N, the weight used for the crawler data during sampling is w_b = 0.6N / B. If the token shares of the high-quality datasets are to be equal, then in the ideal case the sampling weight of each high-quality dataset is
w_ai = 0.4N / (n · A_i), i = 1, 2, ..., n,
where, to avoid overfitting, w_ai ≤ 4. If one or several of the high-quality datasets are so small that the computed value of w_ai exceeds 4, then w_ai is set to 4 and the remaining sampling weights are adjusted in turn:
w_aj = (0.4N - 4·A_i) / ((n - 1) · A_j), j ≠ i,
and so on.
For example, if there are 10 high-quality datasets in addition to the crawler data, each dataset is kept at about 4% of the tokens, and the crawler data supplies 60%.
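Likewise, with hypothetical figures for the terabyte-scale case, the same sketch yields the 60% / 4% split:

```python
# Hypothetical figures: train on N = 1e12 tokens, crawler corpus of 5e12 tokens,
# ten high-quality datasets of 3e10 tokens each.
w_b, w_hq = compute_weights(1.0e12, 5.0e12, [3.0e10] * 10, crawler_ratio=0.6)
# Crawler data supplies 0.6*N tokens (60% of the training set); each high-quality
# dataset supplies 0.4*N/10 = 4e10 tokens (4%), i.e. a weight of about 1.33 <= 4.
```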
In some embodiments of the invention, the method further comprises: if the required dataset is a small-scale dataset, randomly sampling from the high-quality datasets to obtain the pre-training corpus.
In this embodiment, when the parameter count is at the 1B scale or below, or the pre-training corpus is at the tens-of-GB scale, only high-quality datasets are used. The high-quality datasets can be sampled randomly to obtain the pre-training training set and validation set, and the proportion of tokens contributed by each dataset is then simply the proportion of its size.
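A minimal sketch of this small-scale case, assuming the high-quality sources are pooled and sampled uniformly so that each contributes in proportion to its size; the function name sample_small_scale is illustrative only.

```python
# Illustrative sketch of plain random sampling for the small-scale case.
import random

def sample_small_scale(hq_datasets, n_samples, seed=0):
    """hq_datasets: list of document lists, one per high-quality source.
    Uniform sampling over the pooled documents, so each source's expected
    contribution is proportional to its size."""
    pool = [doc for ds in hq_datasets for doc in ds]
    rng = random.Random(seed)
    return rng.sample(pool, min(n_samples, len(pool)))
```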
In some embodiments of the present invention, if the required dataset is not a small-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide based on the preset token-count ratio comprises: if the required dataset is a medium-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide based on a lower preset token-count ratio; and if the required dataset is a large-scale dataset, calculating them based on a higher preset token-count ratio.
In this embodiment, when the parameter count is at the 10B scale, or the pre-training corpus is a medium-scale dataset at the hundred-GB scale, weights are applied during sampling so that the tokens contributed by the crawler data b make up no more than 20% of the training set, while the remaining high-quality datasets from different sources contribute as nearly equal numbers of tokens as possible.
In this embodiment, when the parameter count is at the 100B scale, or the pre-training corpus is at the 1 TB scale, weights are applied during sampling so that the tokens contributed by the crawler data make up no more than 60% of the training set, while the remaining high-quality datasets from different sources contribute as nearly equal numbers of tokens as possible.
In some embodiments of the present invention, calculating the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains comprises: judging whether the weight of an individual high-quality dataset exceeds a preset weight; and if it does, setting its weight to the preset weight and recalculating the weights of the other individual high-quality datasets.
In this embodiment, if one or several of the high-quality datasets are so small that the computed weight of that dataset exceeds 4, its weight is set to 4 and the remaining sampling weights are adjusted in turn.
In some embodiments of the invention, the method further comprises: determining the size of the pre-training corpus to be constructed based on the parameter count of the language model.
In this embodiment, according to the relation between the pre-training corpus size D (in tokens) and the model parameter count N proposed by OpenAI, for an autoregressive model with a structure similar to GPT: D ≥ (5 × 10^3) · N^0.74. For other model structures this quantitative relationship may differ somewhat. The unit of D here is the number of tokens in the data; when converted into GB, the result varies slightly with token length and with whether the data is Chinese or English.
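A short worked computation of the quoted relation, assuming D is measured in tokens and N in parameters; the function name is illustrative and the printed values are approximate.

```python
# Worked example of D >= 5e3 * N**0.74 (D in tokens, N in parameters); values are approximate.
def min_corpus_tokens(n_params: float) -> float:
    return 5.0e3 * n_params ** 0.74

print(f"{min_corpus_tokens(1e9):.2e}")    # ~2.3e10 tokens for a 1B-parameter model
print(f"{min_corpus_tokens(1e10):.2e}")   # ~1.3e11 tokens for a 10B-parameter model
print(f"{min_corpus_tokens(1e11):.2e}")   # ~6.9e11 tokens for a 100B-parameter model
```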
In some embodiments of the present invention, determining the size of the pre-training corpus to be constructed based on the parameter count of the language model comprises: if the parameter count of the language model does not exceed 1 billion (1B), confirming that the size of the pre-training corpus to be constructed does not exceed 100 gigabytes; if the parameter count exceeds 1 billion but does not exceed 10 billion (10B), confirming that the size of the pre-training corpus to be constructed exceeds 100 gigabytes but does not exceed 1000 gigabytes; and if the parameter count exceeds 10 billion but does not exceed 100 billion (100B), confirming that the size of the pre-training corpus to be constructed exceeds 1 terabyte.
In some embodiments of the present invention, judging, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset comprises: if the size of the pre-training corpus to be constructed does not exceed 100 gigabytes, confirming that the required dataset is a small-scale dataset; if it exceeds 100 gigabytes but does not exceed 1000 gigabytes, confirming that the required dataset is not a small-scale dataset but a medium-scale dataset; and if it exceeds 1 terabyte, confirming that the required dataset is not a small-scale dataset but a large-scale dataset.
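Combining the two mappings above, a simple classifier from model parameter count to dataset scale could look as follows; the thresholds restate the brackets of this embodiment and the label strings are illustrative.

```python
# Illustrative mapping from model parameter count to the dataset-scale bracket above.
def dataset_scale(n_params: float) -> str:
    if n_params <= 1e9:       # up to 1B parameters -> corpus up to about 100 GB
        return "small"
    if n_params <= 1e10:      # up to 10B parameters -> corpus of 100 to 1000 GB
        return "medium"
    return "large"            # around 100B parameters -> corpus beyond 1 TB
```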
In this embodiment, the current consensus in practice is that the larger the corpus, the more beneficial it is to the quality of the pre-trained model, and that a larger model overfits more easily when data are insufficient. For both reasons it is recommended to prepare as large a corpus as possible for pre-training. Typically, if a model at the hundred-billion (100B) parameter scale is trained, a corpus on the order of 1 TB is usually prepared; a model with 1B or 10B parameters can make do with a corpus on the order of 100 GB; for smaller scales the corpus can be scaled down in an approximately linear fashion. The quantitative relationship between the pre-training corpus and the model parameters can be determined according to the model structure and experience.
In some embodiments of the invention, the method may be extended to unlabeled image data and multimodal data.
It should be noted that, in the above embodiments of the method for constructing a pre-training corpus, the steps may be interleaved, replaced, added, or deleted, so that methods for constructing a pre-training corpus obtained by such reasonable permutations and combinations also fall within the scope of the present invention, and the scope of the invention should not be limited to the embodiments.
Based on the above object, a second aspect of the embodiments of the present invention provides a device for constructing a pre-training corpus. FIG. 2 is a schematic diagram of an embodiment of the device for constructing a pre-training corpus provided by the present invention. As shown in FIG. 2, the device according to an embodiment of the present invention comprises the following modules: a first module S11 configured to judge, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset; a second module S12 configured to calculate, if the required dataset is not a small-scale dataset, the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide, based on a preset token-count ratio; a third module S13 configured to calculate the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains; and a fourth module S14 configured to sample the crawler data based on its weight and sample each individual high-quality dataset based on its own weight, so as to obtain the pre-training corpus.
Based on the above object, a third aspect of the embodiments of the present invention proposes a computer device. Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in fig. 3, the computer device according to the embodiment of the present invention includes the following means: at least one processor S21; and a memory S22, the memory S22 storing computer instructions S23 executable on the processor, which when executed by the processor, implement the steps of the above method.
The invention also provides a computer readable storage medium. Fig. 4 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 4, the computer-readable storage medium S31 stores a computer program S32 that, when executed by a processor, performs the method as described above.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing related hardware. The program of the method for constructing a pre-training corpus may be stored in a computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like. The computer program embodiments described above may achieve the same or similar effects as any of the corresponding method embodiments described above.
Furthermore, the method disclosed according to the embodiment of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. The above-described functions defined in the methods disclosed in the embodiments of the present invention are performed when the computer program is executed by a processor.
Furthermore, the above-described method steps and system units may also be implemented using a controller and a computer-readable storage medium storing a computer program for causing the controller to implement the above-described steps or unit functions.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one location to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general purpose or special purpose computer or general purpose or special purpose processor. Further, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, digital Versatile Disc (DVD), floppy disk, blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims (9)

1. A method of constructing a pre-training corpus, comprising the steps of:
judging, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset;
if the required dataset is not a small-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide, based on a preset token-count ratio;
calculating the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains; and
sampling the crawler data based on its weight and sampling each individual high-quality dataset based on its own weight, so as to obtain the pre-training corpus;
if the required dataset is a small-scale dataset, randomly sampling from the high-quality datasets to obtain the pre-training corpus;
wherein judging, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset comprises:
if the size of the pre-training corpus to be constructed does not exceed 100 gigabytes, confirming that the required dataset is a small-scale dataset.
2. The method of constructing a pre-training corpus according to claim 1, wherein, if the required dataset is not a small-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide based on the preset token-count ratio comprises:
if the required dataset is a medium-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide based on a lower preset token-count ratio;
if the required dataset is a large-scale dataset, calculating the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide based on a higher preset token-count ratio.
3. The method of constructing a pre-training corpus according to claim 1, wherein calculating the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains comprises:
judging whether the weight of an individual high-quality dataset exceeds a preset weight;
if the weight of the individual high-quality dataset exceeds the preset weight, setting its weight to the value of the preset weight and recalculating the weights of the other individual high-quality datasets.
4. The method of building a pre-training corpus of claim 1, further comprising:
determining the size of the pre-training corpus to be constructed based on the parameter count of the language model.
5. The method of constructing a pre-training corpus according to claim 4, wherein determining the size of the pre-training corpus to be constructed based on the parameter count of the language model comprises:
if the parameter count of the language model does not exceed 1 billion (1B), confirming that the size of the pre-training corpus to be constructed does not exceed 100 gigabytes;
if the parameter count of the language model exceeds 1 billion but does not exceed 10 billion (10B), confirming that the size of the pre-training corpus to be constructed exceeds 100 gigabytes but does not exceed 1000 gigabytes;
if the parameter count of the language model exceeds 10 billion but does not exceed 100 billion (100B), confirming that the size of the pre-training corpus to be constructed exceeds 1 terabyte.
6. The method of constructing a pre-training corpus according to claim 1, wherein judging, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset comprises:
if the size of the pre-training corpus to be constructed exceeds 100 gigabytes but does not exceed 1000 gigabytes, confirming that the required dataset is not a small-scale dataset but a medium-scale dataset;
if the size of the pre-training corpus to be constructed exceeds 1 terabyte, confirming that the required dataset is not a small-scale dataset but a large-scale dataset.
7. A device for constructing a pre-training corpus, comprising:
a first module configured to judge, based on the size of the pre-training corpus to be constructed, whether the required dataset is a small-scale dataset;
a second module configured to calculate, if the required dataset is not a small-scale dataset, the weight of the crawler data and the number of tokens that each high-quality dataset needs to provide, based on a preset token-count ratio;
a third module configured to calculate the weight of each individual high-quality dataset based on the number of tokens it needs to provide and the number of tokens it contains;
a fourth module configured to sample the crawler data based on its weight and sample each individual high-quality dataset based on its own weight, so as to obtain the pre-training corpus; and
a fifth module configured to, if the required dataset is a small-scale dataset, randomly sample from the high-quality datasets to obtain the pre-training corpus;
wherein the first module is further configured to:
if the size of the pre-training corpus to be constructed does not exceed 100 gigabytes, confirm that the required dataset is a small-scale dataset.
8. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method of any one of claims 1-6.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1-6.
CN202110932826.1A 2021-08-13 2021-08-13 Method, device and equipment for constructing pre-training corpus and readable medium Active CN113722570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110932826.1A CN113722570B (en) 2021-08-13 2021-08-13 Method, device and equipment for constructing pre-training corpus and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110932826.1A CN113722570B (en) 2021-08-13 2021-08-13 Method, device and equipment for constructing pre-training corpus and readable medium

Publications (2)

Publication Number Publication Date
CN113722570A CN113722570A (en) 2021-11-30
CN113722570B (en) 2023-07-18

Family

ID=78675848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110932826.1A Active CN113722570B (en) 2021-08-13 2021-08-13 Method, device and equipment for constructing pre-training corpus and readable medium

Country Status (1)

Country Link
CN (1) CN113722570B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079447A (en) * 2020-03-23 2020-04-28 深圳智能思创科技有限公司 Chinese-oriented pre-training method and system
CN112668671A (en) * 2021-03-15 2021-04-16 北京百度网讯科技有限公司 Method and device for acquiring pre-training model


Also Published As

Publication number Publication date
CN113722570A (en) 2021-11-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant