WO2021151292A1 - Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium - Google Patents

Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium

Info

Publication number
WO2021151292A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
corpus
discriminator
generator
trained
Prior art date
Application number
PCT/CN2020/117434
Other languages
French (fr)
Chinese (zh)
Inventor
邓悦
郑立颖
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021151292A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical field of intelligent decision-making, and in particular to a corpus detection method, device, computer equipment and medium based on a mask language model.
  • Anomaly detection of log files plays an important role in the management of modern large-scale distributed systems, where logs are widely used to record system runtime information.
  • the inventor realizes that operation and maintenance personnel usually use keyword search and rule matching to check and match logs.
  • However, as workload and business requirements increase, the time required for manual inspection grows correspondingly, becoming more time-consuming and labor-intensive.
  • To reduce manual workload and improve detection accuracy, deep-learning-based log anomaly detection methods are increasingly applied to anomaly detection.
  • This application provides an intelligent decision-making method, device, computer equipment and medium for corpus detection based on a mask language model, which effectively improves the efficiency of model training, and can efficiently and accurately judge abnormal conditions of log files.
  • In a first aspect, this application provides a corpus detection method based on a masked language model; the masked language model includes a generator and a discriminator, and the method includes:
  • inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector;
  • the state of the corpus word to be trained is detected according to the context vector; the category of the corpus word includes a log file category.
  • this application also provides a corpus detection device, which includes:
  • the first training module is used to input the corpus words to be trained into the generator for training, and obtain the probability distribution corresponding to the corpus words;
  • the second training module is configured to input the probability distribution to the discriminator for training to obtain a prediction result corresponding to the probability distribution, and the prediction result includes whether the corpus word has been replaced;
  • An adjustment module configured to input a classification label into the discriminator according to the category of the corpus word, and adjust the prediction result based on the classification label and the corpus word by the discriminator to obtain a context vector;
  • the detection module is used to detect the state of the corpus word to be trained according to the context vector.
  • In a third aspect, the present application also provides a computer device including a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program and, when executing it, implement the following method:
  • inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector;
  • the state of the corpus word to be trained is detected according to the context vector.
  • the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor implements the following method:
  • inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector;
  • the state of the corpus word to be trained is detected according to the context vector.
  • In this application, the training time of the model is greatly reduced and the model trains more efficiently, so that anomaly detection results are obtained more efficiently and quickly.
  • Fig. 1 is an architecture diagram of a masked language model provided by an embodiment of the present application;
  • Fig. 2 is an architecture diagram of a generator provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an input word vector of a generator provided by an embodiment of the present application.
  • Figure 4 is a structural diagram of a discriminator provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a corpus detection method based on a mask language model provided by an embodiment of the present application
  • FIG. 6 is a structural diagram of the discriminator for inspecting log files provided by an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of another corpus detection device provided by an embodiment of the application.
  • FIG. 8 is a schematic block diagram of the structure of a computer device provided by an embodiment of the application.
  • The technical solution of this application can be applied to the fields of artificial intelligence, blockchain and/or big data technology, and the data involved, such as prediction results, can be stored in a database or in a blockchain, for example via distributed blockchain storage; this application is not limited in this respect.
  • the embodiments of the present application provide a corpus detection method, device, computer equipment, and storage medium based on a mask language model.
  • the corpus detection method based on the mask language model effectively improves the efficiency of model training, and can efficiently and accurately determine the abnormal situation of the log file.
  • FIG. 1 is a training architecture of a mask language model provided by an embodiment of the present application.
  • The training architecture of the mask language model includes a generator and a discriminator; during training the generator and the discriminator are trained together, while during detection only the discriminator is used, which effectively improves training efficiency.
  • the generator architecture is shown in Figure 2.
  • the generator includes a series of encoders and corresponding inputs and outputs.
  • the input of the generator is a word vector
  • The output of the generator is a probability distribution over the word at the current position, that is, a probability for each word, and the most likely word can be selected by taking the maximum probability.
  • The input of the generator's encoder is the vectors w_1, w_2, ..., w_n corresponding to each word.
  • Each word vector is formed by superposing three partial vectors, which can include a word dimension vector, a sentence dimension vector, and a position dimension vector.
  • The generator's task is simple in nature and does not require particularly many parameters; its architecture exists to feed more complex tasks to the discriminator.
  • FIG. 5 is a schematic flowchart of a corpus detection method based on a mask language model provided by an embodiment of the present application.
  • the corpus detection method based on the masked language model can be applied to the masked language model in Figure 1, effectively improving the efficiency of model training, and can efficiently and accurately determine the abnormal situation of the log file.
  • the corpus detection method based on the mask language model specifically includes steps S101 to S104.
  • the entire sentence is used as the corpus word to be trained and input into the generator of the mask language model for training, which specifically includes:
  • The input of the generator's encoder is the vectors w_1, w_2, ..., w_n corresponding to each word.
  • Each word vector is formed by superposing three partial vectors, which can include a word dimension vector, a sentence dimension vector, and a position dimension vector.
  • The corpus words to be trained are sentences. Taking "the value is high" and "it is urgent now" as examples, these sentences serve as the generator's input, and the input includes the word dimension, sentence dimension and position dimension.
  • The generator superposes the three dimensions of the input word vector (word, sentence and position) to obtain the superposed word vector, which is then fed into the generator's encoder for encoding; the encoder has many layers, and the word vectors of each dimension are obtained through layer-by-layer encoding.
  • The generator can be a pre-trained model, or it can be trained as input is provided.
  • For the word dimension, for example, the total length of the word vector can be 768; if the input corpus words to be trained correspond to 6 words, the generator model's corresponding output word dimension is (6, 768).
  • For the sentence dimension, the generator can add a different word embedding to each sentence, so the corresponding first sentence has dimension (1, 768) and the second sentence has dimension (2, 768).
  • For the position dimension, when the corpus words to be trained contain the same word at different positions, the word's position information must be taken into account. For example, if the input sentence is "I come and I watch", then for the generator the two occurrences of "I" are not the same.
  • For position information, the generator uses sinusoidal encoding, in which:
  • pos is the index of the position, representing the position of the word in the sequence
  • i is the index in the vector
  • d is the dimension of the generator model
  • the generator model uses 768 dimensions.
  • This encoding writes position information into the even positions of the vector with a sine function and into the odd positions with a cosine function, so that each dimension of the position encoding vector is a waveform with a different frequency and every value lies between -1 and 1, yielding the position dimension.
  • When the generator performs mask processing, it replaces part of the corpus words to be trained according to preset rules as they are input, for example replacing the token at each [mask] position. At output time, the generator predicts the masked words from the context of the unmasked words in the sentence. For example, as shown in Figure 1, when the input corpus words to be trained are "the value is high" and the [mask] positions correspond to "the" and "value", the masked words are "the" and "value", the unmasked context is "is" and "high", and "is" and "high" are used to predict "the" and "value".
  • When masking, a preset replacement rule is used.
  • For example, the preset replacement rule is that the generator randomly selects 20% of the input for [mask] processing, and within that 20% applies the following rules: 10% of the words are replaced with an arbitrary word, 10% are left unchanged, and 80% are replaced with [mask].
  • After some words in the dimensional word vectors are randomly replaced according to the preset replacement rule, the probability distribution corresponding to each dimensional word vector is obtained.
  • Inside the generator's encoder, an attention mechanism is used.
  • The purpose of the attention mechanism is, when processing a single word, to find the more relevant words in the sentence containing it and merge their information into the word being processed, achieving a better encoding.
  • The attention mechanism in this case is a multi-head attention mechanism.
  • With the encoder set to 16 layers, the corresponding attention mechanism is applied 16 times, and the final output is obtained after a linear mapping.
  • Through the multi-head attention mechanism, different positions are attended to, capturing information from different dimensions of the sentence.
  • the method may further include:
  • the loss function of the generator is calculated, and the generator is adjusted according to the loss function of the generator.
  • The generator's loss function measures whether the words that are [mask]ed are correctly predicted from context, as in formula 3:
  • L_MLM is the loss function of the generator
  • x is the sample
  • x_masked is the sample masked during the embedding process
  • θ_G is the parameter of the generator
  • (x_i | x_masked) is the conditional distribution of sample x_i when the masked sample x_masked is known.
  • the generator performs word vector superposition and encoding on the corpus words to be trained to realize the mask, and outputs each word vector and the probability distribution corresponding to the word vector.
  • When a user performs business operations, the business system usually generates corresponding log files, which are applied to the mask language model of this solution during log detection. If the category of the corpus words to be trained is the log file category, then before the log file to be trained is input into the generator for training to obtain the probability distribution corresponding to the corpus words, the method may include:
  • Preprocessing the log file, which may be converting uppercase text in the log file to lowercase, filtering out fixed text with the same structure, and replacing unimportant information (addresses/times/dates).
  • the processed log file is input into the generator for training.
  • Different corpus words to be trained require different preprocessing; this case uses log files as an example, but it is not limited to log files.
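  • As a rough illustration of this preprocessing (not taken from the application), the following Python sketch lowercases a log line and replaces addresses, times and dates with placeholder tokens; the regular expressions and placeholder names are assumptions.

```python
import re

def preprocess_log_line(line: str) -> str:
    """Normalize one log line before feeding it to the generator.

    Lowercases the text and replaces unimportant, highly variable
    fields (addresses, times, dates) with fixed placeholder tokens.
    Filtering of fixed boilerplate text is omitted for brevity.
    """
    line = line.lower()
    # Replace IPv4 addresses (pattern and token are illustrative).
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<addr>", line)
    # Replace times such as 12:34:56.
    line = re.sub(r"\b\d{2}:\d{2}:\d{2}\b", "<time>", line)
    # Replace dates such as 2020-08-28.
    line = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "<date>", line)
    return line

print(preprocess_log_line("2020-08-28 09:15:01 ERROR 10.0.0.7 Connection refused"))
# -> "<date> <time> error <addr> connection refused"
```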
  • Inputting the probability distribution into the discriminator for training to obtain the prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word has been replaced, can include:
  • the word vector corresponding to the probability distribution is replaced by the discriminator according to a preset replacement probability to predict whether the word vector corresponding to the probability distribution has been replaced, and a prediction result is obtained.
  • After the discriminator receives the output of the generator, it also replaces the input word vectors according to a certain probability in order to predict whether each word output by the generator has been replaced; specifically, O_1, ..., O_n correspond to the outputs from the generator.
  • These inputs also go through the Embedding (word vector) layer and are fed into the Encoder structure.
  • Unlike the generator, the discriminator's encoder adds a Classifier layer at the output to determine whether each word is original or has been replaced; the corresponding outputs are R_1, ..., R_n.
  • The prediction result includes whether the word vector has been replaced, that is, the two possible predictions "replaced" and "not replaced".
  • the corpus word to be trained is "the value is high”.
  • The generator performs the three-dimension superposition and partial masking on the input and then outputs "the key is high".
  • Here "value" is masked, and the "the key is high" output by the generator is then fed into the discriminator for discrimination.
  • The discriminator, applying the preset replacement probability, determines that "the", "is" and "high" are all original (original state) while "key" has been replaced (replaced state); that is, the discriminator identifies the specific word that was replaced.
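  • To make the replaced-token prediction concrete, here is a minimal PyTorch sketch of a discriminator in the spirit of Figure 4: an embedding layer, a transformer encoder, and a per-token classifier that outputs the probability that each word was replaced. All sizes and layer counts are illustrative assumptions, not values fixed by the application.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, vocab_size=30000, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Per-token classifier: probability that the token was replaced.
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, token_ids):                 # (batch, seq_len)
        h = self.encoder(self.embedding(token_ids))
        return torch.sigmoid(self.classifier(h)).squeeze(-1)  # (batch, seq_len)

disc = Discriminator()
tokens = torch.randint(0, 30000, (1, 4))          # e.g. "the key is high"
print(disc(tokens))                               # replacement probability per token
```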
  • the method further includes:
  • the loss function of the discriminator is calculated, and the discriminator is adjusted according to the loss function of the discriminator.
  • the loss function of the discriminator is as formula 4:
  • L Disc is the loss function of the discriminator
  • l(x) is the indicator function
  • x is the sample
  • t is the time step
  • x_corrupt is the sample after being replaced
  • θ_D is the parameters of the discriminator
  • D is the discriminator.
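  • Based on the symbol list above, and assuming formula 4 matches the replaced-token-detection loss of an ELECTRA-style discriminator, it can be written as:

$$L_{Disc}(x,\theta_D) = \mathbb{E}\left(\sum_{t=1}^{n} -l\left(x_t^{corrupt} = x_t\right)\log D\left(x^{corrupt}, t\right) - l\left(x_t^{corrupt} \neq x_t\right)\log\left(1 - D\left(x^{corrupt}, t\right)\right)\right)$$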
  • the loss function of the generator and the loss function of the discriminator are superimposed to obtain an overall loss function to adjust the mask language model.
  • Model training can be made more efficient through parameter sharing. Moreover, during training the generator and the discriminator are trained together, while in use only the discriminator is deployed, so the model needs fewer parameters and achieves better training efficiency.
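  • If the two losses are superimposed as a weighted sum, the joint objective takes the following form; the weight λ is an assumption (ELECTRA-style models commonly use λ = 50), as the source states only that the losses are superimposed:

$$\min_{\theta_G,\,\theta_D}\; \sum_{x \in X} L_{MLM}(x,\theta_G) + \lambda\, L_{Disc}(x,\theta_D)$$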
  • the above prediction result can also be stored in a node of a blockchain.
  • When the corpus word is a log file, the log file is preprocessed first, and the preprocessed log file is then input into the generator and discriminator for training.
  • At detection time, the words corresponding to each log text are input into the trained discriminator of the model.
  • Obtaining the context vector can include:
  • The length of the input can be set to 512, and at input time the beginning O_1 of each sentence is replaced with the [CLS] classification label, which corresponds to whether the log is abnormal or not.
  • The vector corresponding to the classification label is input into the two-class neural network for training, and the context vector is output, with the first position of the context vector corresponding to the classification label.
  • Passing through the encoder, each layer yields a vector.
  • The label vector is directly input into a two-class neural network, corresponding to the classifier in the figure above, which is used to judge whether the log is abnormal.
  • The output result is 0/1, the judgment result corresponding to abnormal or non-abnormal.
  • A multi-class neural network can also replace the above classifier, using the SoftMax logistic regression function to obtain the probability of each class and assigning the input to the class with the maximum probability, thereby completing the classification and obtaining the classification result.
  • Detecting the state of the corpus words to be trained according to the context vector may include: judging the abnormal condition of the log file according to the first position of the context vector.
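  • A minimal sketch of this classification step, assuming the discriminator yields one 768-dimensional context vector per position: the vector at the first ([CLS]) position passes through a small two-class network whose 0/1 output marks the log as normal or abnormal. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Two-class head over the [CLS] position of the context vectors.
cls_head = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, 2),   # logits for "normal" vs "abnormal"
)

context = torch.randn(1, 512, 768)     # (batch, seq_len=512, d_model)
cls_vector = context[:, 0, :]          # first position carries the [CLS] label
logits = cls_head(cls_vector)
prediction = logits.argmax(dim=-1)     # 0 = normal, 1 = abnormal
probs = torch.softmax(logits, dim=-1)  # SoftMax variant for multi-class output
print(prediction, probs)
```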
  • Fig. 6 is a diagram of the discriminator architecture for detecting abnormal log files.
  • The input is the sentence corresponding to each log file, the length of the input is set to 512, and a [CLS] classification label is placed at the start.
  • This reduces the CPU and memory load on the server that determines abnormal information for operation and maintenance; at the same time, anomaly inspection results are produced more efficiently and quickly, and for daily detection tasks the detection speed is greatly improved.
  • The above embodiments provide a corpus detection method based on a masked language model, using a brand-new masked language model that includes a generator and a discriminator.
  • The corpus words to be trained are input into the generator for training to obtain the probability distribution corresponding to the corpus words; the probability distribution is then input into the discriminator for training to obtain the prediction result corresponding to the probability distribution, which determines the prediction result of the masked language model.
  • The prediction result includes whether the corpus word has been replaced. Because the corpus words to be trained are input to the generator in full, the generator and the discriminator are trained together, and the parameters of the discriminator and the generator are shared with each other, the training time of the model drops further and training efficiency is higher.
  • At detection time, only the discriminator is used, with a classification label input according to the corpus word category; therefore the efficiency of model testing is greatly improved and test time is effectively reduced.
  • After the context vector is obtained, it is used to detect the state of the corpus words to be trained, such as detecting anomalies in the log files of an operation and maintenance server, making anomaly results more efficient and rapid; for daily detection tasks, detection speed is greatly improved.
  • FIG. 7 is a schematic block diagram of a corpus detection device provided by an embodiment of the present application.
  • the corpus detection device is configured to execute the aforementioned mask language model-based corpus detection method.
  • the corpus detection device can be configured in a terminal or a server.
  • the corpus detection device 400 includes: a first training module 401, a second training module 402, an adjustment module 403, and a detection module 404.
  • the first training module 401 is configured to input corpus words to be trained into the generator for training, and obtain the probability distribution corresponding to the corpus words;
  • the second training module 402 is configured to input the probability distribution to the discriminator for training to obtain a prediction result corresponding to the probability distribution, and the prediction result includes whether the corpus word has been replaced;
  • the adjustment module 403 is configured to input a classification label into the discriminator according to the category of the corpus word, and adjust the prediction result based on the classification label and the corpus word by the discriminator to obtain a context vector;
  • the detection module 404 is configured to detect the state of the corpus word to be trained according to the context vector.
  • the above-mentioned apparatus can be implemented in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 8.
  • FIG. 8 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
  • the computer device may be a server.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any corpus detection method based on the mask language model.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for the operation of the computer program in the non-volatile storage medium.
  • the processor can execute any corpus detection method based on the mask language model.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied.
  • A specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • The processor may be a central processing unit (CPU), and it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the processor is used to run a computer program stored in a memory to implement the following steps:
  • inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector;
  • the state of the corpus word to be trained is detected according to the context vector.
  • When the processor implements inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, this includes:
  • inputting the word vector corresponding to each corpus word to be trained into the generator, where the word vector includes a word dimension, a sentence dimension, and a position dimension;
  • some words in the dimensional word vector are randomly replaced, and the probability distribution corresponding to each dimensional word vector is obtained.
  • When the category of the corpus words to be trained is the log file category, before the processor inputs the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the method includes:
  • the method includes:
  • the loss function of the generator is calculated, and the generator is adjusted according to the loss function of the generator.
  • the method includes:
  • the loss function of the discriminator is calculated, and the discriminator is adjusted according to the loss function of the discriminator.
  • In some embodiments, the processor further implements:
  • the loss function of the generator and the loss function of the discriminator are superimposed to obtain an overall loss function to adjust the mask language model.
  • When the processor further implements inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word has been replaced, this includes:
  • replacing, by the discriminator, the word vector corresponding to the probability distribution according to the preset replacement probability to predict whether that word vector has been replaced, obtaining the prediction result, which is stored in a blockchain node.
  • When the processor further implements inputting a classification label into the discriminator according to the category of the corpus word and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector, this includes:
  • the vector corresponding to the classification label is input to the two-class neural network for training, and the context vector is output.
  • The first position of the context vector corresponds to the classification label.
  • the detecting the state of the corpus word to be trained according to the context vector includes:
  • the abnormal condition of the log file is judged according to the first position of the context vector.
  • the embodiments of the present application also provide a computer-readable storage medium, the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the processor executes the program instructions to implement the present application Any of the corpus detection methods based on the mask language model provided in the embodiments.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, for example, the hard disk or memory of the computer device.
  • The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
  • the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • The computer-readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required by at least one function, etc., and the storage data area may store data created from the use of blockchain nodes, etc.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

A corpus monitoring method based on a mask language model, a corpus monitoring apparatus, a device, and a medium. The method comprises: inputting a corpus word to be trained into a generator for training, so as to obtain a probability distribution corresponding to the corpus word (S101); inputting the probability distribution into a discriminator for training, so as to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word has been replaced (S102), and the prediction result is stored in a blockchain node; inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by means of the discriminator and on the basis of the classification label and the corpus word, so as to obtain a context vector (S103); and monitoring, according to the context vector, the state of the corpus word to be trained (S104). The method effectively improves the training efficiency of a model, and can determine an abnormal situation of a log file efficiently and accurately.

Description

Corpus detection method, device, equipment and medium based on mask language model
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on August 28, 2020, with application number 202010888877.4 and invention title "Corpus detection method, device, equipment and medium based on mask language model", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the technical field of intelligent decision-making, and in particular to a corpus detection method, device, computer equipment and medium based on a mask language model.
Background art
In text processing, anomaly detection of log files plays an important role in the management of modern large-scale distributed systems, and logs are widely used to record system runtime information. At present, the inventor realizes that operation and maintenance personnel usually use keyword search and rule matching to check and match logs. However, as workload and business requirements increase, the time required for manual inspection grows correspondingly, becoming more time-consuming and labor-intensive. To reduce manual workload and improve current detection accuracy, deep-learning-based log anomaly detection methods are increasingly applied to anomaly detection.
In the course of the inventor's research, it was found that the currently popular text processing models are masked pre-trained language models; however, because such models demand substantial computing resources, training cost and running time limit their modification and training.
Summary of the invention
This application provides an intelligent-decision corpus detection method, device, computer equipment and medium based on a mask language model, which effectively improves model training efficiency and can efficiently and accurately judge abnormal conditions of log files.
In a first aspect, this application provides a corpus detection method based on a masked language model; the method is applied to a masked language model that includes a generator and a discriminator. The method includes:
inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result by the discriminator based on the classification label and the corpus words to obtain a context vector;
detecting the state of the corpus words to be trained according to the context vector, where the category of the corpus words includes a log file category.
In a second aspect, this application also provides a corpus detection device, which includes:
a first training module, used to input the corpus words to be trained into the generator for training and obtain the probability distribution corresponding to the corpus words;
a second training module, used to input the probability distribution into the discriminator for training and obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
an adjustment module, used to input a classification label into the discriminator according to the category of the corpus words and adjust the prediction result by the discriminator based on the classification label and the corpus words to obtain a context vector;
a detection module, used to detect the state of the corpus words to be trained according to the context vector.
In a third aspect, the present application also provides a computer device; the computer device includes a memory and a processor. The memory is used to store a computer program, and the processor is used to execute the computer program and, when executing it, implement the following method:
inputting the corpus words to be trained into the generator of the mask language model for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator of the mask language model for training to obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result by the discriminator based on the classification label and the corpus words to obtain a context vector;
detecting the state of the corpus words to be trained according to the context vector.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to implement the following method:
inputting the corpus words to be trained into the generator of the mask language model for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator of the mask language model for training to obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result by the discriminator based on the classification label and the corpus words to obtain a context vector;
detecting the state of the corpus words to be trained according to the context vector.
In this application, the training time of the model is greatly reduced and the model trains more efficiently, making anomaly detection results more efficient and rapid.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit this application.
Description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative work.
Fig. 1 is an architecture diagram of a masked language model provided by an embodiment of this application;
Fig. 2 is an architecture diagram of a generator provided by an embodiment of this application;
Fig. 3 is a schematic diagram of the input word vectors of a generator provided by an embodiment of this application;
Fig. 4 is a structural diagram of a discriminator provided by an embodiment of this application;
Fig. 5 is a schematic flowchart of a corpus detection method based on a mask language model provided by an embodiment of this application;
Fig. 6 is a structural diagram of the discriminator for inspecting log files provided by an embodiment of this application;
Fig. 7 is a schematic block diagram of another corpus detection device provided by an embodiment of this application;
Fig. 8 is a schematic block diagram of the structure of a computer device provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The technical solution of this application can be applied to the fields of artificial intelligence, blockchain and/or big data technology, and the data involved, such as prediction results, can be stored in a database or in a blockchain, for example via distributed blockchain storage; this application is not limited in this respect.
The flowcharts shown in the drawings are merely illustrative; they need not include all contents and operations/steps, nor be executed in the order described. For example, some operations/steps can be decomposed, combined or partially merged, so the actual execution order may change according to actual conditions.
It should be understood that the terms used in this specification are for the purpose of describing specific embodiments only and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items.
The embodiments of this application provide a corpus detection method, device, computer equipment and storage medium based on a mask language model. The corpus detection method based on the mask language model effectively improves model training efficiency and can efficiently and accurately judge abnormal conditions of log files.
Some embodiments of this application are described in detail below with reference to the accompanying drawings. Where no conflict arises, the following embodiments and the features in them can be combined with each other.
Referring to Fig. 1, Fig. 1 shows a training architecture of a mask language model provided by an embodiment of this application. The training architecture includes a generator and a discriminator: during training the generator and the discriminator are trained together, while during detection only the discriminator is used, which effectively improves training efficiency.
The generator architecture is shown in Fig. 2. The generator includes a series of encoders with corresponding inputs and outputs. The generator's input is word vectors, and its output is a probability distribution over the word at the current position, that is, a probability for each word, from which the most likely word can be selected by taking the maximum probability.
Specifically, as shown in Fig. 3, the input to the generator's encoder is the vectors w_1, w_2, ..., w_n corresponding to each word; each word vector is formed by superposing three partial vectors, which can include a word dimension vector, a sentence dimension vector, and a position dimension vector.
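As a rough illustration of this superposition (not part of the application), the following PyTorch sketch sums the three partial vectors; the vocabulary size and token ids are made up, and the position part is stubbed with random values, since the application computes it with the sinusoidal formulas given below.

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_sents = 768, 30000, 2

word_emb = nn.Embedding(vocab_size, d_model)   # word dimension vectors
sent_emb = nn.Embedding(max_sents, d_model)    # sentence dimension vectors

token_ids = torch.tensor([[101, 2054, 2003, 2152]])  # one 4-word input (ids illustrative)
sent_ids = torch.zeros_like(token_ids)               # all words belong to sentence 0

# Position dimension vectors: the application derives these from the
# sinusoidal formulas below, so random values stand in here.
pos_vec = torch.randn(1, token_ids.size(1), d_model)

# Each input word vector is the superposition (sum) of the three parts.
x = word_emb(token_ids) + sent_emb(sent_ids) + pos_vec
print(x.shape)  # torch.Size([1, 4, 768])
```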
The generator's task is simple in nature and does not require particularly many parameters; its architecture exists to feed more complex tasks to the discriminator.
The discriminator architecture, shown in Fig. 4, is broadly similar to the generator architecture and also comprises a series of encoders, inputs and outputs. In Fig. 4, O_1, ..., O_n are the outputs from the generator; these inputs likewise pass through the discriminator's word-vector (Embedding) layer into its Encoder structure. Unlike the generator, the discriminator's encoder adds a Classifier layer at the output to judge whether each word has been replaced; the corresponding outputs R_1, ..., R_n are probabilities, judged as a 0/1 classification, i.e., whether the word has been replaced.
Based on the above mask language model architecture, a corpus detection method based on the mask language model is proposed.
Referring to Fig. 5, Fig. 5 is a schematic flowchart of a corpus detection method based on a mask language model provided by an embodiment of this application. The method can be applied to the mask language model of Fig. 1; it effectively improves model training efficiency and can efficiently and accurately judge abnormal conditions of log files.
As shown in Fig. 5, the corpus detection method based on the mask language model specifically includes steps S101 to S104.
S101. Input the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words.
When a user needs to partially mask a sentence, the entire sentence is input into the mask language model; the entire sentence serves as the corpus words to be trained and is fed into the generator of the mask language model for training, which specifically includes:
S11. Input the word vector corresponding to each corpus word to be trained into the generator, where the word vector includes a word dimension, a sentence dimension, and a position dimension.
As shown in Fig. 3, the input to the generator's encoder is the vectors w_1, w_2, ..., w_n corresponding to each word; each word vector is formed by superposing three partial vectors, which can include a word dimension vector, a sentence dimension vector, and a position dimension vector.
As shown in Fig. 4, the corpus words to be trained are sentences. Taking "the value is high" and "it is urgent now" as examples, these sentences serve as the generator's input, and the input includes the word dimension, sentence dimension and position dimension.
S12. The generator inputs the word vector obtained by superposing the word dimension, sentence dimension and position dimension into its encoder for encoding, obtaining word vectors of each dimension, where there are multiple encoders.
The generator superposes the three dimensions of the input word vector (word, sentence and position) to obtain the superposed word vector, which is then fed into the generator's encoder for encoding to obtain the word vectors of each dimension; the encoder has many layers, and the word vectors of each dimension are obtained through the generator's layer-by-layer encoding.
The generator can be a pre-trained model, or it can be trained as input is provided. For the word dimension, for example, the total length of the word vector can be 768; if the input corpus words to be trained correspond to 6 words, the generator model's corresponding output word dimension is (6, 768).
In some embodiments, for the sentence dimension, when the input corpus words to be trained comprise two sentences, the generator can add a different word embedding to each sentence, so the corresponding first sentence has dimension (1, 768) and the second sentence has dimension (2, 768).
In some embodiments, for the position dimension, when the corpus words to be trained include the same word at different positions, the word's position information must also be taken into account. For example, if the input sentence is "I come and I watch", then for the generator the two occurrences of "I" are not the same. For position information, the generator uses sinusoidal encoding, with the following formulas:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
Here pos is the position index, representing the word's position in the sequence; i is the index within the vector; and d is the dimension of the generator model, which uses 768 dimensions. These formulas encode position information with a sine function at the even positions of the vector and with a cosine function at the odd positions, so that each dimension of the position encoding vector is a waveform with a different frequency and every value lies between -1 and 1, yielding the position dimension.
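A small numeric sketch of these formulas (an illustration, not from the application): it fills the even vector indices with the sine term and the odd indices with the cosine term, assuming the conventional base of 10000 from the standard transformer encoding.

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d: int = 768) -> np.ndarray:
    """Position encoding: sine at even indices, cosine at odd indices."""
    pe = np.zeros((seq_len, d))
    pos = np.arange(seq_len)[:, None]              # word position in the sequence
    i = np.arange(0, d, 2)[None, :]                # even index pairs in the vector
    angle = pos / np.power(10000.0, i / d)
    pe[:, 0::2] = np.sin(angle)                    # even positions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd positions: cosine
    return pe                                      # every value lies in [-1, 1]

print(sinusoidal_encoding(6).shape)  # (6, 768)
```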
S13、按照预设替换规则将所述维度词向量中的部分单词随机替换掉,且得到各个维度词向量对应的概率分布。S13. Randomly replace some words in the dimensional word vector according to a preset replacement rule, and obtain a probability distribution corresponding to each dimensional word vector.
在生成器进行掩码处理时,生成器在待训练的语料单词输入时,根据预设规则,将待训练的语料单词的部分单词替换掉,例如,将在[mask]位置的token替换掉,生成器输出时,通过句子中没有被掩码单词上下文来预测被掩盖的单词,比如,如图1所示,在输入的待训练的语料单词为“the value is high”中,[mask]位置对应为“the”及“value”,那么,则掩码单词为“the”及“value”,没有被掩码单词的上下文为“is”及“high”,通过“is”及“high”来预测“the”及“value”。When the generator performs mask processing, the generator replaces part of the words in the corpus to be trained according to preset rules when inputting words in the corpus to be trained, for example, replace the token at the position of [mask], When the generator outputs, it predicts the words that are masked based on the context of the words that are not masked in the sentence. For example, as shown in Figure 1, in the input corpus word to be trained for "the value is high", the position of [mask] Corresponds to "the" and "value", then the masked words are "the" and "value", and the contexts of unmasked words are "is" and "high", and use "is" and "high" to Predict "the" and "value".
In some embodiments, a preset replacement rule is used during masking. For example, the preset replacement rule may be that, in addition to randomly selecting 20% of the input as [mask] positions, the generator applies the following rules within that 20% (a sketch of the procedure follows the list):
1. 10% are replaced with an arbitrary word;
2. 10% of the words are left unchanged;
3. 80% of the words are replaced with [mask].
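By way of a non-limiting illustration, the following Python sketch applies this replacement rule to a token list, assuming a 20% per-token selection rate and a small illustrative vocabulary `vocab` for the random replacements:

```python
import random

def apply_mask_rule(tokens, vocab, select_rate=0.2):
    # Roughly 20% of positions are selected; of those, 80% become
    # "[MASK]", 10% become a random word, and 10% stay unchanged.
    masked = list(tokens)
    labels = [None] * len(tokens)  # remember originals at selected positions
    for pos, tok in enumerate(tokens):
        if random.random() >= select_rate:
            continue
        labels[pos] = tok
        r = random.random()
        if r < 0.8:
            masked[pos] = "[MASK]"
        elif r < 0.9:
            masked[pos] = random.choice(vocab)
        # else: leave the token unchanged (the remaining 10%)
    return masked, labels

tokens = "the value is high".split()
print(apply_mask_rule(tokens, vocab=["key", "low", "cost"]))
```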
After some of the words in the dimensional word vectors are randomly replaced according to the preset replacement rule, the probability distribution corresponding to each dimensional word vector is obtained.
It should be understood that the generator's Encoder internally uses an attention mechanism. The purpose of the attention mechanism is, when processing a single word, to find the more relevant words in the sentence containing that word and fuse them into the word being processed, thereby achieving a better encoding effect. The attention mechanism in this case is a multi-head attention mechanism: on the basis of self-attention, with the Encoder set to 16 layers, the corresponding attention mechanism is applied 16 times, and the final output is obtained after a linear mapping. Through the multi-head attention mechanism, different positions of the model are attended to, capturing information from different dimensions of the sentence.
By using the multi-head attention mechanism and this new word embedding method, encoding information in three dimensions (position dimension, sentence dimension, word dimension) is introduced, so that words are understood along more dimensions.
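By way of a non-limiting illustration, the following Python sketch shows the scaled-dot-product form of multi-head self-attention described above, assuming 16 heads and, for brevity, identity Q/K/V projections (a real encoder layer learns separate projection weights per head, plus the final linear mapping):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, n_heads=16):
    # x: (seq_len, d_model) token embeddings; each head scores how
    # relevant every word pair is, then fuses the relevant words.
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        q = k = v = x[:, h * d_head:(h + 1) * d_head]   # (seq_len, d_head)
        scores = q @ k.T / np.sqrt(d_head)              # pairwise relevance
        heads.append(softmax(scores) @ v)               # fuse relevant words
    return np.concatenate(heads, axis=-1)               # final linear map omitted

x = np.random.randn(5, 768)          # stand-in embeddings for 5 tokens
print(multi_head_self_attention(x).shape)  # (5, 768)
```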
In some embodiments, after the corpus words to be trained are input into the generator for training and the probability distribution corresponding to the corpus words is obtained, the method may further include:
Calculating the loss function of the generator, and adjusting the generator according to the generator's loss function.
The generator's loss function measures whether the words masked by [mask] are correctly predicted from their context, as given by formula 3:
$\mathcal{L}_{MLM}(x, \theta_G) = \mathbb{E}\Big(\sum_{i \in masked} -\log p_G\big(x_i \mid x^{masked}\big)\Big)$
Here, L_MLM is the generator's loss function, x is a sample, x_masked is the sample masked during the embedding process, θ_G denotes the generator's parameters, and p_G(x_i | x_masked) is the conditional distribution of the sample x_i given the masked input.
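By way of a non-limiting illustration, the following Python sketch evaluates formula 3 on toy inputs, assuming `probs` holds the generator's per-position distributions over the vocabulary:

```python
import numpy as np

def generator_mlm_loss(probs, masked_positions, original_ids):
    # probs[i] is p_G(x_i | x_masked) over the vocabulary at position i;
    # the loss sums -log p of the true token over the masked positions.
    total = 0.0
    for pos, true_id in zip(masked_positions, original_ids):
        total += -np.log(probs[pos, true_id] + 1e-12)
    return total

probs = np.full((4, 5), 0.2)                       # toy uniform distribution
print(generator_mlm_loss(probs, [0, 1], [3, 2]))   # two masked tokens, ~3.22
```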
The generator superimposes and encodes the word vectors of the corpus words to be trained to implement the masking, and outputs each word vector together with its corresponding probability distribution.
In some embodiments, when a user performs business operations, the business system usually generates corresponding log files, and the mask language model of this solution is applied during log detection. If the category of the corpus words to be trained is the log file category, then before the log files to be trained are input into the generator for training to obtain the probability distribution corresponding to the corpus words, the method may include:
Preprocessing the log files to be trained.
Specifically, the preprocessing may be to convert uppercase text in the log files to lowercase, filter out fixed text with identical structure, and replace unimportant information (address/time/date).
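By way of a non-limiting illustration, the following Python sketch performs such preprocessing; the regular expressions for dates, times, and addresses are assumptions, since real log formats vary, and the filtering of fixed-structure text is format-specific and omitted here:

```python
import re

def preprocess_log_line(line):
    # Lowercase first, then replace unimportant fields with placeholders.
    line = line.lower()
    line = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", line)   # e.g. 2020-08-28
    line = re.sub(r"\d{2}:\d{2}:\d{2}", "<time>", line)   # e.g. 12:34:56
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b",
                  "<addr>", line)                          # IP[:port]
    return line

print(preprocess_log_line("2020-08-28 12:34:56 ERROR 10.0.0.1:8080 Timeout"))
# <date> <time> error <addr> timeout
```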
After the log files are preprocessed, the processed log files are input into the generator for training.
Different corpus words to be trained require different preprocessing first; this case takes log files as an example, but is not limited to log files.
S102. Input the probability distribution into the discriminator for training to obtain the prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word has been replaced.
Specifically, as shown in Figure 4, inputting the probability distribution into the discriminator for training to obtain the prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word has been replaced, can include:
Replacing, by the discriminator, the word vectors corresponding to the probability distribution according to a preset replacement probability, so as to predict whether the word vectors corresponding to the probability distribution have been replaced, and obtaining the prediction result.
In some embodiments, after the discriminator receives the generator's output, it likewise replaces the input word vectors according to a certain probability to predict whether the words output by the generator have been replaced. Specifically, O1 ... On correspond to the outputs from the generator. These inputs likewise pass through an Embedding (word vector) layer and are fed into an Encoder structure; unlike the generator, a Classifier layer is added at the end in order to judge whether each word is original or replaced, and the corresponding outputs are R1 ... RN. The prediction result indicates whether the word vector has been replaced, i.e., it covers the two outcomes replaced and not replaced.
For example, the corpus words to be trained, "the value is high", are input into the generator, which performs the three-dimensional superposition and partial masking and outputs "the key is high"; clearly, "value" has been masked. The generator's output "the key is high" is then input into the discriminator for discrimination. During discrimination, replacement is performed according to the preset replacement probability, yielding "the", "is", and "high" as original and "key" as replaced; that is, the discriminator identifies the specific word that has been replaced.
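By way of a non-limiting illustration, the following Python sketch derives the gold original/replaced labels for this example by comparing the generator's output with the original tokens:

```python
def discriminator_labels(generator_output, original_tokens):
    # Gold label per token: "original" if unchanged, "replaced" otherwise.
    return ["original" if g == o else "replaced"
            for g, o in zip(generator_output, original_tokens)]

print(discriminator_labels("the key is high".split(),
                           "the value is high".split()))
# ['original', 'replaced', 'original', 'original']
```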
In some embodiments, after the state of the corpus words to be trained is detected according to the context vector, the method further includes:
Calculating the loss function of the discriminator, and adjusting the discriminator according to the discriminator's loss function. The discriminator's loss function is given by formula 4:
$\mathcal{L}_{Disc}(x, \theta_D) = \mathbb{E}\Big(\sum_{t=1}^{n} -\mathbb{1}\big(x_t^{corrupt} = x_t\big)\log D\big(x^{corrupt}, t\big) - \mathbb{1}\big(x_t^{corrupt} \neq x_t\big)\log\big(1 - D\big(x^{corrupt}, t\big)\big)\Big)$
Here, L_Disc is the discriminator's loss function, 1(·) is the indicator function, x is a sample, t is the time step, x_corrupt is the sample after replacement, θ_D denotes the discriminator's parameters, and D is the discriminator.
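By way of a non-limiting illustration, the following Python sketch evaluates formula 4, assuming `d_probs[t]` is the discriminator output D(x_corrupt, t), the predicted probability that the token at step t is original:

```python
import numpy as np

def discriminator_loss(d_probs, corrupt_ids, original_ids):
    # Binary cross-entropy over time steps; the indicator function
    # selects which log term applies at each step t.
    loss = 0.0
    for t, p in enumerate(d_probs):
        if corrupt_ids[t] == original_ids[t]:      # token is original
            loss += -np.log(p + 1e-12)
        else:                                      # token was replaced
            loss += -np.log(1.0 - p + 1e-12)
    return loss

# "the key is high" vs. "the value is high": step 1 was replaced.
print(discriminator_loss([0.9, 0.2, 0.8, 0.9], [0, 7, 2, 3], [0, 1, 2, 3]))
```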
The generator's loss function and the discriminator's loss function are superimposed to obtain an overall loss function, which is used to adjust the mask language model.
Specifically, superimposing the generator's loss function and the discriminator's loss function yields the model's overall loss function, formula 5:
$\min_{\theta_G,\,\theta_D} \sum_{x \in X} \mathcal{L}_{MLM}(x, \theta_G) + \lambda\, \mathcal{L}_{Disc}(x, \theta_D)$
Since the generator and the discriminator have exactly the same structure, the parameters contained in both can be shared, which makes model training more efficient. Moreover, during training the generator and the discriminator are trained together, whereas in use only the discriminator is deployed; the model therefore requires fewer parameters and trains more efficiently.
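By way of a non-limiting illustration, the superposition of formula 5 can be sketched as follows; the weighting factor `lam` is an assumption (it follows the convention of the ELECTRA work cited in the search report), and a plain superposition corresponds to lam = 1:

```python
def total_loss(l_mlm, l_disc, lam=1.0):
    # Formula 5: superimpose the generator loss L_MLM and the
    # discriminator loss L_Disc; lam is an assumed weighting factor
    # (plain addition, as described above, corresponds to lam = 1).
    return l_mlm + lam * l_disc

# Example: combine toy loss values from the two sketches above.
print(total_loss(3.2, 1.4))  # 4.6
```

During training, both θ_G and θ_D are updated to minimize this sum; at inference only the discriminator is kept, as described above.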
It should be emphasized that, to further ensure the privacy and security of the above prediction result, the prediction result can also be stored in a node of a blockchain.
S103. Input a classification label into the discriminator according to the category of the corpus word, and adjust the prediction result, by the discriminator, based on the classification label and the corpus word to obtain a context vector.
In some embodiments, when the corpus words are log files, the log files are first preprocessed and then input into the generator and the discriminator for training. At detection time, the prediction result obtained from training is input into the model's discriminator, and the input to the model is the words corresponding to each log text.
Specifically, when the category of the corpus word is the log file category, inputting a classification label into the discriminator according to the category of the corpus word and adjusting the prediction result, by the discriminator, based on the classification label and the corpus word to obtain a context vector can include:
S31. In the discriminator, replace the first word corresponding to the log file with a classification label.
Specifically, the input length can be set to 512, and at input time the beginning O1 of each sentence is replaced with the [CLS] classification label, corresponding to whether the log is abnormal or not.
S32. After all the words corresponding to the log file are input into the encoder for training, input the vector corresponding to the classification label into a binary classification neural network for training, and output a context vector, the first position of the context vector corresponding to the classification label.
Specifically, considering that anomaly detection is a binary classification task, after layer-by-layer encoder training, each layer produces a vector, whose length can be set to 768. The vector of the [CLS] classification label is input directly into a binary classification neural network, corresponding to the classifier in the figure above, which judges whether the log is abnormal; the output is 0/1, i.e., the judgment result of abnormal or not abnormal.
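By way of a non-limiting illustration, the following Python sketch applies a binary classifier to the 768-dimensional [CLS] vector; the weight vector `w` and bias `b` are hypothetical learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_cls(cls_vector, w, b):
    # Binary head over the 768-d [CLS] vector: returns 1 (abnormal)
    # or 0 (normal), matching the 0/1 output described above.
    p_abnormal = sigmoid(cls_vector @ w + b)
    return int(p_abnormal >= 0.5)

cls_vec = np.random.randn(768)          # stand-in for the encoder output
w, b = np.random.randn(768), 0.0        # hypothetical learned parameters
print(classify_cls(cls_vec, w, b))      # 0 or 1
```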
In some embodiments, if the detection is a multi-classification task, a multi-class neural network can be used in place of the above classifier, and the SoftMax logistic regression function is used to obtain the probability of each class; the input is assigned to the class with the maximum probability, which completes the classification and yields the classification result.
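By way of a non-limiting illustration, the multi-class variant replaces the binary head with a SoftMax over n_classes outputs; `W` and `b` are again hypothetical learned parameters:

```python
import numpy as np

def softmax_classify(cls_vector, W, b):
    # W: (n_classes, 768), b: (n_classes,). SoftMax yields per-class
    # probabilities; the input is assigned to the maximum-probability class.
    logits = W @ cls_vector + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs))

label = softmax_classify(np.random.randn(768),
                         np.random.randn(4, 768), np.zeros(4))
print(label)  # index of the predicted class, 0..3
```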
S104. Detect the state of the corpus words to be trained according to the context vector; the categories of the corpus words include the log file category.
In some embodiments, when the category of the corpus word is the log file category, detecting the state of the corpus words to be trained according to the context vector may include: judging the abnormal condition of the log file according to the first position of the context vector.
When detecting abnormal log files, since the vector of the [CLS] classification label is input directly into a binary classification neural network, only the output vector whose first position is [CLS] is taken as the context vector.
As shown in Figure 6, which is an architecture diagram of the discriminator for detecting abnormal log files: for log anomaly detection, the input is the sentence corresponding to each log file, with the input length set to 512. At the same time, the beginning of each sentence is replaced with [CLS] (the classification label) at input time, yielding the abnormal or not-abnormal detection result of the corresponding log.
Since only the discriminator is needed at detection time, the CPU and memory load of the server performing operation-and-maintenance anomaly judgment is reduced; at the same time, anomaly check results are obtained more efficiently and quickly, and for daily detection tasks the detection speed is greatly improved.
The above embodiments provide a corpus detection method based on a mask language model, using a new mask language model that includes a generator and a discriminator. During training, the corpus words to be trained are input into the generator for training to obtain the probability distribution corresponding to the corpus words; the probability distribution is then input into the discriminator for training to obtain the prediction result corresponding to the probability distribution, thereby determining the prediction result of the mask language model, where the prediction result includes whether the corpus words have been replaced. Since all the corpus words to be trained are input into the generator, the generator and the discriminator are trained together, and the parameters of the discriminator and the generator are shared, the model's training time drops substantially and training is more efficient. When the model is used, only the discriminator is used, with classification labels input according to the category of the corpus words, so the efficiency of testing the model is greatly improved and the test time is effectively reduced. After the context vector is obtained, the state of the corpus words to be trained is detected according to the context vector, for example detecting anomalies in the log files of an operation-and-maintenance server, which makes anomaly checking more efficient and rapid; for daily detection tasks, the detection speed is greatly improved.
Please refer to FIG. 7, which is a schematic block diagram of a corpus detection apparatus provided by an embodiment of the present application. The corpus detection apparatus is configured to execute the aforementioned corpus detection method based on the mask language model, and can be configured in a terminal or a server.
As shown in FIG. 7, the corpus detection apparatus 400 includes a first training module 401, a second training module 402, an adjustment module 403, and a detection module 404.
The first training module 401 is configured to input the corpus words to be trained into the generator for training, and obtain the probability distribution corresponding to the corpus words;
The second training module 402 is configured to input the probability distribution into the discriminator for training, and obtain the prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus words have been replaced;
The adjustment module 403 is configured to input a classification label into the discriminator according to the category of the corpus words, and adjust the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector;
The detection module 404 is configured to detect the state of the corpus words to be trained according to the context vector.
It should be noted that, as those skilled in the art can clearly understand, for convenience and brevity of description, the specific working processes of the apparatus and modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The above apparatus can be implemented in the form of a computer program, and the computer program can run on the computer device shown in FIG. 8.
Please refer to FIG. 8, which is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application. The computer device may be a server.
Referring to FIG. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions that, when executed, cause the processor to execute any corpus detection method based on the mask language model.
The processor is used to provide computing and control capabilities and to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to execute any corpus detection method based on the mask language model.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
Inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
Inputting the probability distribution into the discriminator for training to obtain the prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
Inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector;
Detecting the state of the corpus words to be trained according to the context vector.
In some embodiments, when the processor implements inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, this includes:
Inputting the word vectors corresponding to the corpus words to be trained into the generator, where the word vectors include a word dimension, a sentence dimension, and a position dimension;
Inputting, by the generator, the word vector obtained by superimposing the word dimension, sentence dimension, and position dimension into the encoder of the generator for encoding, to obtain word vectors of each dimension, where there are multiple encoders;
Randomly replacing some of the words in the dimensional word vectors according to a preset replacement rule, and obtaining the probability distribution corresponding to each dimensional word vector.
In some embodiments, the category of the corpus words to be trained is the log file category, and before inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the processor implements:
Preprocessing the log files to be trained.
In some embodiments, after the processor implements inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, this includes:
Calculating the loss function of the generator, and adjusting the generator according to the generator's loss function.
In some embodiments, after the processor implements detecting the state of the corpus words to be trained according to the context vector, this includes:
Calculating the loss function of the discriminator, and adjusting the discriminator according to the discriminator's loss function.
In some embodiments, the processor's implementation includes:
Superimposing the generator's loss function and the discriminator's loss function to obtain an overall loss function, so as to adjust the mask language model.
In some embodiments, the processor further implements inputting the probability distribution into the discriminator for training to obtain the prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced, which includes:
Replacing, by the discriminator, the word vectors corresponding to the probability distribution according to a preset replacement probability, so as to predict whether the word vectors corresponding to the probability distribution have been replaced, and obtaining the prediction result, where the prediction result is stored in a blockchain node.
In some embodiments, the processor further implements inputting a classification label into the discriminator according to the category of the corpus words and adjusting the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector, which includes:
In the discriminator, replacing the first word corresponding to the log file with a classification label;
After all the words corresponding to the log file are input into the encoder for training, inputting the vector corresponding to the classification label into a binary classification neural network for training, and outputting a context vector, the first position of the context vector corresponding to the classification label;
The detecting of the state of the corpus words to be trained according to the context vector includes:
Judging the abnormal condition of the log file according to the first position of the context vector.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program; the computer program includes program instructions, and a processor executes the program instructions to implement any corpus detection method based on the mask language model provided by the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall be covered within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A corpus detection method based on a mask language model, wherein the method is applied to a mask language model, the mask language model comprising a generator and a discriminator; the method comprises:
    Inputting corpus words to be trained into the generator for training to obtain a probability distribution corresponding to the corpus words;
    Inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
    Inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector;
    Detecting the state of the corpus words to be trained according to the context vector.
  2. The method according to claim 1, wherein inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words comprises:
    Inputting the word vectors corresponding to the corpus words to be trained into the generator, where the word vectors include a word dimension, a sentence dimension, and a position dimension;
    Inputting, by the generator, the word vector obtained by superimposing the word dimension, sentence dimension, and position dimension into the encoder of the generator for encoding, to obtain word vectors of each dimension, where there are multiple encoders;
    Randomly replacing some of the words in the dimensional word vectors according to a preset replacement rule, and obtaining the probability distribution corresponding to each dimensional word vector.
  3. The method according to claim 1, wherein the category of the corpus words to be trained is a log file category, and before inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the method comprises:
    Preprocessing the log files to be trained.
  4. The method according to claim 1, wherein after inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the method further comprises:
    Calculating a loss function of the generator, and adjusting the generator according to the generator's loss function;
    After detecting the state of the corpus words to be trained according to the context vector, the method further comprises:
    Calculating a loss function of the discriminator, and adjusting the discriminator according to the discriminator's loss function.
  5. The method according to claim 4, wherein the method further comprises:
    Superimposing the generator's loss function and the discriminator's loss function to obtain an overall loss function, so as to adjust the mask language model.
  6. The method according to claim 1, wherein inputting the probability distribution into the discriminator for training to obtain the prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced, comprises:
    Replacing, by the discriminator, the word vectors corresponding to the probability distribution according to a preset replacement probability, so as to predict whether the word vectors corresponding to the probability distribution have been replaced, and obtaining the prediction result, where the prediction result is stored in a blockchain.
  7. The method according to claim 1, wherein the category of the corpus words to be trained is a log file category, and inputting a classification label into the discriminator according to the category of the corpus words and adjusting the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector comprises:
    In the discriminator, replacing the first word corresponding to the log file with a classification label;
    After all the words corresponding to the log file are input into the encoder for training, inputting the vector corresponding to the classification label into a binary classification neural network for training, and outputting a context vector, the first position of the context vector corresponding to the classification label;
    The detecting of the state of the corpus words to be trained according to the context vector comprises:
    Judging the abnormal condition of the log file according to the first position of the context vector.
  8. A corpus detection apparatus, comprising:
    A first training module, configured to input corpus words to be trained into a generator of a mask language model for training, and obtain a probability distribution corresponding to the corpus words;
    A second training module, configured to input the probability distribution into a discriminator of the mask language model for training, and obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
    An adjustment module, configured to input a classification label into the discriminator according to the category of the corpus words, and adjust the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector;
    A detection module, configured to detect the state of the corpus words to be trained according to the context vector.
  9. A computer device, wherein the computer device comprises a memory and a processor;
    The memory is configured to store a computer program;
    The processor is configured to execute the computer program and, when executing the computer program, implement the following steps:
    Inputting corpus words to be trained into a generator of a mask language model for training to obtain a probability distribution corresponding to the corpus words;
    Inputting the probability distribution into a discriminator of the mask language model for training to obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
    Inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector;
    Detecting the state of the corpus words to be trained according to the context vector.
  10. The computer device according to claim 9, wherein when inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the following steps are specifically implemented:
    Inputting the word vectors corresponding to the corpus words to be trained into the generator, where the word vectors include a word dimension, a sentence dimension, and a position dimension;
    Inputting, by the generator, the word vector obtained by superimposing the word dimension, sentence dimension, and position dimension into the encoder of the generator for encoding, to obtain word vectors of each dimension, where there are multiple encoders;
    Randomly replacing some of the words in the dimensional word vectors according to a preset replacement rule, and obtaining the probability distribution corresponding to each dimensional word vector.
  11. The computer device according to claim 9, wherein the category of the corpus words to be trained is a log file category, and before inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the processor is further configured to execute the computer program to implement the following step:
    Preprocessing the log files to be trained.
  12. The computer device according to claim 9, wherein after inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the processor is further configured to execute the computer program to implement the following steps:
    Calculating a loss function of the generator, and adjusting the generator according to the generator's loss function;
    After detecting the state of the corpus words to be trained according to the context vector, the following step is further implemented:
    Calculating a loss function of the discriminator, and adjusting the discriminator according to the discriminator's loss function.
  13. The computer device according to claim 9, wherein when inputting the probability distribution into the discriminator for training to obtain the prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced, the following step is specifically implemented:
    Replacing, by the discriminator, the word vectors corresponding to the probability distribution according to a preset replacement probability, so as to predict whether the word vectors corresponding to the probability distribution have been replaced, and obtaining the prediction result, where the prediction result is stored in a blockchain.
  14. The computer device according to claim 9, wherein the category of the corpus words to be trained is a log file category, and when inputting a classification label into the discriminator according to the category of the corpus words and adjusting the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector, the following steps are specifically implemented:
    In the discriminator, replacing the first word corresponding to the log file with a classification label;
    After all the words corresponding to the log file are input into the encoder for training, inputting the vector corresponding to the classification label into a binary classification neural network for training, and outputting a context vector, the first position of the context vector corresponding to the classification label;
    When detecting the state of the corpus words to be trained according to the context vector, the following step is specifically implemented:
    Judging the abnormal condition of the log file according to the first position of the context vector.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor implements the following steps:
    Inputting corpus words to be trained into a generator of a mask language model for training to obtain a probability distribution corresponding to the corpus words;
    Inputting the probability distribution into a discriminator of the mask language model for training to obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced;
    Inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector;
    Detecting the state of the corpus words to be trained according to the context vector.
  16. The computer-readable storage medium according to claim 15, wherein when inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the following steps are specifically implemented:
    Inputting the word vectors corresponding to the corpus words to be trained into the generator, where the word vectors include a word dimension, a sentence dimension, and a position dimension;
    Inputting, by the generator, the word vector obtained by superimposing the word dimension, sentence dimension, and position dimension into the encoder of the generator for encoding, to obtain word vectors of each dimension, where there are multiple encoders;
    Randomly replacing some of the words in the dimensional word vectors according to a preset replacement rule, and obtaining the probability distribution corresponding to each dimensional word vector.
  17. The computer-readable storage medium according to claim 15, wherein the category of the corpus words to be trained is a log file category, and before inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the following step is implemented:
    Preprocessing the log files to be trained.
  18. The computer-readable storage medium according to claim 15, wherein after inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words, the computer program, when executed by the processor, further causes the processor to implement the following steps:
    Calculating a loss function of the generator, and adjusting the generator according to the generator's loss function;
    After detecting the state of the corpus words to be trained according to the context vector, the computer program, when executed by the processor, further causes the processor to implement the following step:
    Calculating a loss function of the discriminator, and adjusting the discriminator according to the discriminator's loss function.
  19. The computer-readable storage medium according to claim 15, wherein when inputting the probability distribution into the discriminator for training to obtain the prediction result corresponding to the probability distribution, the prediction result including whether the corpus words have been replaced, the following step is specifically implemented:
    Replacing, by the discriminator, the word vectors corresponding to the probability distribution according to a preset replacement probability, so as to predict whether the word vectors corresponding to the probability distribution have been replaced, and obtaining the prediction result, where the prediction result is stored in a blockchain.
  20. The computer-readable storage medium according to claim 15, wherein the category of the corpus words to be trained is a log file category, and when inputting a classification label into the discriminator according to the category of the corpus words and adjusting the prediction result, by the discriminator, based on the classification label and the corpus words to obtain a context vector, the following steps are specifically implemented:
    In the discriminator, replacing the first word corresponding to the log file with a classification label;
    After all the words corresponding to the log file are input into the encoder for training, inputting the vector corresponding to the classification label into a binary classification neural network for training, and outputting a context vector, the first position of the context vector corresponding to the classification label;
    When detecting the state of the corpus words to be trained according to the context vector, the following step is specifically implemented:
    Judging the abnormal condition of the log file according to the first position of the context vector.
PCT/CN2020/117434 2020-08-28 2020-09-24 Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium WO2021151292A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010888877.4A CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model
CN202010888877.4 2020-08-28

Publications (1)

Publication Number Publication Date
WO2021151292A1 true WO2021151292A1 (en) 2021-08-05

Family

ID=73660536

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117434 WO2021151292A1 (en) 2020-08-28 2020-09-24 Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN112069795B (en)
WO (1) WO2021151292A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011177B (en) * 2021-03-15 2023-09-29 北京百度网讯科技有限公司 Model training and word vector determining method, device, equipment, medium and product
CN113094482B (en) * 2021-03-29 2023-10-17 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN117332038B (en) * 2023-09-19 2024-07-02 鹏城实验室 Text information detection method, device, equipment and storage medium
CN117786104B (en) * 2023-11-17 2024-06-21 中信建投证券股份有限公司 Model training method and device, electronic equipment and storage medium


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8589163B2 (en) * 2009-12-04 2013-11-19 At&T Intellectual Property I, L.P. Adapting language models with a bit mask for a subset of related words
CN108009628B (en) * 2017-10-30 2020-06-05 杭州电子科技大学 Anomaly detection method based on generation countermeasure network
CN108734276B (en) * 2018-04-28 2021-12-31 同济大学 Simulated learning dialogue generation method based on confrontation generation network
CN111028206A (en) * 2019-11-21 2020-04-17 万达信息股份有限公司 Prostate cancer automatic detection and classification system based on deep learning
CN111414772B (en) * 2020-03-12 2023-09-26 北京小米松果电子有限公司 Machine translation method, device and medium
CN111241291B (en) * 2020-04-24 2023-01-03 支付宝(杭州)信息技术有限公司 Method and device for generating countermeasure sample by utilizing countermeasure generation network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196894A (en) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 The training method and prediction technique of language model
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv.org, 9 April 2019 (2019-04-09), XP002796973, retrieved from the Internet <URL:https://arxiv.org/abs/1810.04805> *
Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning: "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators", arXiv.org, Cornell University Library, Ithaca, NY 14853, 23 March 2020 (2020-03-23), XP081627769 *
Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar: "DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning", Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17), ACM Press, New York, NY, USA, 1 January 2017 (2017-01-01) - 3 November 2017 (2017-11-03), pages 1285-1298, XP055583401, ISBN: 978-1-4503-4946-8, DOI: 10.1145/3133956.3134015 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657104A (en) * 2021-08-31 2021-11-16 平安医疗健康管理股份有限公司 Text extraction method and device, computer equipment and storage medium
CN114049662A (en) * 2021-10-18 2022-02-15 天津大学 Facial feature transfer learning-based expression recognition network structure and method
CN114049662B (en) * 2021-10-18 2024-05-28 天津大学 Facial feature transfer learning-based expression recognition network device and method
CN114723073A (en) * 2022-06-07 2022-07-08 阿里健康科技(杭州)有限公司 Language model pre-training method, language model pre-training device, language model searching device and computer equipment
CN114723073B (en) * 2022-06-07 2023-09-05 阿里健康科技(杭州)有限公司 Language model pre-training method, product searching method, device and computer equipment
CN114936327A (en) * 2022-07-22 2022-08-23 腾讯科技(深圳)有限公司 Element recognition model obtaining method and device, computer equipment and storage medium
CN115495314A (en) * 2022-09-30 2022-12-20 中国电信股份有限公司 Log template identification method and device, electronic equipment and readable medium
CN116662579A (en) * 2023-08-02 2023-08-29 腾讯科技(深圳)有限公司 Data processing method, device, computer and storage medium
CN116662579B (en) * 2023-08-02 2024-01-26 腾讯科技(深圳)有限公司 Data processing method, device, computer and storage medium

Also Published As

Publication number Publication date
CN112069795B (en) 2023-05-30
CN112069795A (en) 2020-12-11

Similar Documents

Publication Publication Date Title
WO2021151292A1 (en) Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium
Nedelkoski et al. Self-attentive classification-based anomaly detection in unstructured logs
CN110781294B (en) Training corpus refinement and incremental update
EP3821359A1 (en) Open source vulnerability prediction with machine learning ensemble
US11669687B1 (en) Systems and methods for natural language processing (NLP) model robustness determination
CN112231431B (en) Abnormal address identification method and device and computer readable storage medium
CN113434858B (en) Malicious software family classification method based on disassembly code structure and semantic features
CN110321430B (en) Domain name recognition and domain name recognition model generation method, device and storage medium
Zhou et al. Deepsyslog: Deep anomaly detection on syslog using sentence embedding and metadata
CN115357904B (en) Multi-class vulnerability detection method based on program slicing and graph neural network
US20230078134A1 (en) Classification of erroneous cell data
CN111190967B (en) User multidimensional data processing method and device and electronic equipment
Xie et al. Logm: Log analysis for multiple components of hadoop platform
CN116841779A (en) Abnormality log detection method, abnormality log detection device, electronic device and readable storage medium
Yu et al. Empower text-attributed graphs learning with large language models (llms)
AU2020200629A1 (en) Method and system for reducing incident alerts
WO2022116443A1 (en) Sentence discrimination method and apparatus, and device and storage medium
WO2016093839A1 (en) Structuring of semi-structured log messages
Karlsen et al. Benchmarking Large Language Models for Log Analysis, Security, and Interpretation
Rehman et al. Software design level vulnerability classification model
Farzad et al. Log message anomaly detection with fuzzy C-means and MLP
Gomes et al. Bert-based feature extraction for long-lived bug prediction in floss: a comparative study
Wu et al. Text sentiment classification based on layered attention network
Bodyanskiy et al. Semantic annotation of text documents using evolving neural network based on principle “Neurons at Data Points”
Chen et al. Malicious behaviour identification for Android based on an RBF neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20917047

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20917047

Country of ref document: EP

Kind code of ref document: A1